# ArXiv

A Python wrapper for the arXiv API that provides programmatic access to arXiv's database of over 1,000,000 academic papers in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, and Statistics. The library offers a clean, object-oriented interface with comprehensive search capabilities, rate limiting, retry logic, and convenient download methods.

## Package Information

- **Package Name**: arxiv
- **Language**: Python
- **Installation**: `pip install arxiv`
- **Python Version**: >= 3.7

## Core Imports

```python
import arxiv
```

All classes and enums are available directly from the main module:

```python
from arxiv import Client, Search, Result, SortCriterion, SortOrder
```

For type annotations, the package uses:

```python
from typing import List, Optional, Generator, Dict
from datetime import datetime
import feedparser
```

Internal constants:

```python
_DEFAULT_TIME = datetime.min  # Default datetime for Result objects
```

## Basic Usage

```python
import os

import arxiv

# Create a search query
search = arxiv.Search(
    query="quantum computing",
    max_results=10,
    sort_by=arxiv.SortCriterion.SubmittedDate,
    sort_order=arxiv.SortOrder.Descending
)

# Use default client to get results
client = arxiv.Client()
results = client.results(search)

# Iterate through results
for result in results:
    print(f"Title: {result.title}")
    print(f"Authors: {', '.join(author.name for author in result.authors)}")
    print(f"Published: {result.published}")
    print(f"Summary: {result.summary[:200]}...")
    print(f"PDF URL: {result.pdf_url}")
    print("-" * 80)

# Download the first paper's PDF; create the target directory first,
# since the download writes directly to the given path
os.makedirs("./downloads", exist_ok=True)
first_result = next(client.results(search))
first_result.download_pdf(dirpath="./downloads/", filename="paper.pdf")
```

## Architecture

The arxiv package uses a three-layer architecture:

- **Search**: Query specification with parameters like keywords, ID lists, result limits, and sorting
- **Client**: HTTP client managing API requests, pagination, rate limiting, and retry logic
- **Result**: Paper metadata with download capabilities, containing nested Author and Link objects

This design separates query construction from execution and provides reusable clients for efficient API usage across multiple searches.

## Capabilities

### Search Construction

Build queries using arXiv's search syntax with support for field-specific searches, boolean operators, and ID-based lookups.

```python { .api }
class Search:
    def __init__(
        self,
        query: str = "",
        id_list: List[str] = [],
        max_results: int | None = None,
        sort_by: SortCriterion = SortCriterion.Relevance,
        sort_order: SortOrder = SortOrder.Descending
    ):
        """
        Constructs an arXiv API search with the specified criteria.

        Parameters:
        - query: Search query string (unencoded). Use syntax like "au:author AND ti:title"
        - id_list: List of arXiv article IDs to limit search to
        - max_results: Maximum number of results (None for all available, API limit: 300,000)
        - sort_by: Sort criterion (Relevance, LastUpdatedDate, SubmittedDate)
        - sort_order: Sort order (Ascending, Descending)
        """

    def results(self, offset: int = 0) -> Generator[Result, None, None]:
        """
        Executes search using default client.

        DEPRECATED after 2.0.0: Use Client.results() instead.
        This method will emit a DeprecationWarning.
        """
```

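The field prefixes used in this document (`au:`, `ti:`, `cat:`) can also be combined programmatically. The sketch below is a minimal, hypothetical helper (`build_query` is not part of the arxiv package) that joins whichever filters are supplied with `AND`. It assumes the values need no escaping and performs no validation.

```python
from typing import Optional


def build_query(
    author: Optional[str] = None,
    title: Optional[str] = None,
    category: Optional[str] = None,
) -> str:
    """Join the supplied field filters with AND (hypothetical helper)."""
    clauses = []
    if author:
        clauses.append(f"au:{author}")
    if title:
        clauses.append(f"ti:{title}")
    if category:
        clauses.append(f"cat:{category}")
    return " AND ".join(clauses)


query = build_query(author="del_maestro", title="checkerboard")
print(query)  # au:del_maestro AND ti:checkerboard
```

The resulting string can be passed as the `query` argument of `arxiv.Search`.
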
### Client Configuration

Configure API client behavior including pagination, rate limiting, and retry strategies.

```python { .api }
class Client:
    query_url_format: str = "https://export.arxiv.org/api/query?{}"

    def __init__(
        self,
        page_size: int = 100,
        delay_seconds: float = 3.0,
        num_retries: int = 3
    ):
        """
        Constructs an arXiv API client with specified options.

        Note: the default parameters should provide a robust request strategy
        for most use cases. Extreme page sizes, delays, or retries risk
        violating the arXiv API Terms of Use, brittle behavior, and inconsistent results.

        Parameters:
        - page_size: Results per API request (max: 2000, smaller is faster but more requests)
        - delay_seconds: Seconds between requests (arXiv ToU requires ≥3 seconds)
        - num_retries: Retry attempts before raising exception
        """

    def results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        """
        Fetches search results using pagination, yielding Result objects.

        Parameters:
        - search: Search specification
        - offset: Skip leading records (when >= max_results, returns empty)

        Returns:
        Generator yielding Result objects until max_results reached or no more results

        Raises:
        - HTTPError: Non-200 response after all retries
        - UnexpectedEmptyPageError: Empty non-first page after all retries
        """
```

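To get a feel for how `page_size`, `delay_seconds`, and `offset` interact, the following sketch models pagination with plain arithmetic. `page_requests` and `min_duration_seconds` are hypothetical helpers written for illustration, not part of the package. The model is simplified: it assumes one request per page, a delay only between consecutive requests, and no retries.

```python
import math
from typing import Iterator, Tuple


def page_requests(
    max_results: int, page_size: int = 100, offset: int = 0
) -> Iterator[Tuple[int, int]]:
    """Yield (start, size) pairs covering records offset..max_results-1."""
    start = offset
    while start < max_results:
        size = min(page_size, max_results - start)
        yield start, size
        start += size


def min_duration_seconds(
    max_results: int, page_size: int = 100, delay_seconds: float = 3.0
) -> float:
    """Lower bound on time spent waiting between page requests."""
    requests = math.ceil(max_results / page_size)
    return max(requests - 1, 0) * delay_seconds


print(list(page_requests(250, page_size=100)))  # [(0, 100), (100, 100), (200, 50)]
print(list(page_requests(10, offset=10)))       # [] because offset >= max_results
print(min_duration_seconds(250, page_size=100)) # 6.0
```

Under these assumptions, a larger `page_size` reduces the number of requests and therefore the total time spent in mandatory delays, at the cost of larger individual responses.
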
### Result Data and Downloads

Access paper metadata and download PDFs or source archives with customizable paths and filenames.

```python { .api }
class Result:
    entry_id: str  # URL like "https://arxiv.org/abs/2107.05580v1"
    updated: datetime  # When result was last updated
    published: datetime  # When result was originally published
    title: str  # Paper title
    authors: List["Result.Author"]  # List of Author objects
    summary: str  # Paper abstract
    comment: Optional[str]  # Authors' comment if present
    journal_ref: Optional[str]  # Journal reference if present
    doi: Optional[str]  # DOI URL if present
    primary_category: str  # Primary arXiv category
    categories: List[str]  # All categories
    links: List["Result.Link"]  # Associated URLs
    pdf_url: Optional[str]  # PDF download URL if available

    def __init__(
        self,
        entry_id: str,
        updated: datetime = _DEFAULT_TIME,
        published: datetime = _DEFAULT_TIME,
        title: str = "",
        authors: List["Result.Author"] = [],
        summary: str = "",
        comment: str = "",
        journal_ref: str = "",
        doi: str = "",
        primary_category: str = "",
        categories: List[str] = [],
        links: List["Result.Link"] = []
    ):
        """
        Constructs an arXiv search result item.

        In most cases, prefer creating Result objects from API responses
        using the arxiv Client rather than constructing them manually.
        """

    def get_short_id(self) -> str:
        """
        Returns short ID extracted from entry_id.

        Examples:
        - "https://arxiv.org/abs/2107.05580v1" → "2107.05580v1"
        - "https://arxiv.org/abs/quant-ph/0201082v1" → "quant-ph/0201082v1"
        """

    def download_pdf(
        self,
        dirpath: str = "./",
        filename: str = "",
        download_domain: str = "export.arxiv.org"
    ) -> str:
        """
        Downloads PDF to specified directory with optional custom filename.

        Parameters:
        - dirpath: Target directory path
        - filename: Custom filename (auto-generated if empty)
        - download_domain: Domain for download (for testing/mirroring)

        Returns:
        Path to downloaded file
        """

    def download_source(
        self,
        dirpath: str = "./",
        filename: str = "",
        download_domain: str = "export.arxiv.org"
    ) -> str:
        """
        Downloads source tarfile (.tar.gz) to specified directory.

        Parameters:
        - dirpath: Target directory path
        - filename: Custom filename (auto-generated with .tar.gz if empty)
        - download_domain: Domain for download (for testing/mirroring)

        Returns:
        Path to downloaded file
        """
```

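The `get_short_id` examples above suggest that the short ID is the part of `entry_id` that follows `arxiv.org/abs/`. The sketch below works under that assumption; `short_id` and `safe_filename` are hypothetical standalone helpers, not the package's own code. It also shows why old-style IDs need care when used as filenames: they contain a `/`.

```python
def short_id(entry_id: str) -> str:
    """Return everything after 'arxiv.org/abs/' in an entry URL."""
    return entry_id.split("arxiv.org/abs/")[-1]


def safe_filename(entry_id: str, extension: str = "pdf") -> str:
    """Build a filename from the short ID, replacing '/' with '_'."""
    return f"{short_id(entry_id).replace('/', '_')}.{extension}"


print(short_id("https://arxiv.org/abs/2107.05580v1"))            # 2107.05580v1
print(short_id("https://arxiv.org/abs/quant-ph/0201082v1"))      # quant-ph/0201082v1
print(safe_filename("https://arxiv.org/abs/quant-ph/0201082v1")) # quant-ph_0201082v1.pdf
```
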
### Author and Link Information

Access structured metadata about paper authors and associated links.

```python { .api }
class Result.Author:
    """Inner class representing a paper's author."""

    name: str  # Author's name

    def __init__(self, name: str):
        """
        Constructs Author with specified name.
        Prefer using Result.Author._from_feed_author() for API parsing.
        """


class Result.Link:
    """Inner class representing a paper's associated links."""

    href: str  # Link URL
    title: Optional[str]  # Link title
    rel: str  # Relationship to Result
    content_type: str  # HTTP content type

    def __init__(
        self,
        href: str,
        title: Optional[str] = None,
        rel: Optional[str] = None,
        content_type: Optional[str] = None
    ):
        """
        Constructs Link with specified metadata.
        Prefer using Result.Link._from_feed_link() for API parsing.
        """
```

### Sort Configuration

Control result ordering using predefined sort criteria and order options.

```python { .api }
from enum import Enum

class SortCriterion(Enum):
    """
    Properties by which search results can be sorted.
    """
    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"

class SortOrder(Enum):
    """
    Order in which search results are sorted according to SortCriterion.
    """
    Ascending = "ascending"
    Descending = "descending"
```

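The enum values above match the strings the arXiv API accepts for its `sortBy` and `sortOrder` query parameters. As an illustration of how a search might translate into a request URL, the sketch below re-declares the two enums locally (so it runs without the package installed) and fills in the `query_url_format` template shown earlier. `build_url` is a hypothetical helper; how the package's `Client` assembles its URLs internally is not shown in this document.

```python
from enum import Enum
from urllib.parse import urlencode


class SortCriterion(Enum):
    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"


class SortOrder(Enum):
    Ascending = "ascending"
    Descending = "descending"


QUERY_URL_FORMAT = "https://export.arxiv.org/api/query?{}"


def build_url(
    query: str,
    start: int = 0,
    page_size: int = 100,
    sort_by: SortCriterion = SortCriterion.Relevance,
    sort_order: SortOrder = SortOrder.Descending,
) -> str:
    """Assemble an arXiv API query URL (hypothetical helper)."""
    params = {
        "search_query": query,
        "start": start,
        "max_results": page_size,
        "sortBy": sort_by.value,       # enum member -> API string
        "sortOrder": sort_order.value,
    }
    return QUERY_URL_FORMAT.format(urlencode(params))


url = build_url("cat:cs.AI", sort_by=SortCriterion.SubmittedDate)
print(url)
```
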
### Error Handling

Handle API errors, network issues, and data parsing problems with specific exception types.

```python { .api }
class ArxivError(Exception):
    """
    Base exception class for arxiv package errors.
    """
    url: str  # Feed URL that could not be fetched
    retry: int  # Request try number (0 for initial, 1+ for retries)
    message: str  # Error description

    def __init__(self, url: str, retry: int, message: str):
        """
        Constructs ArxivError for specified URL and retry attempt.
        """

class HTTPError(ArxivError):
    """
    Non-200 HTTP status encountered while fetching results.
    """
    status: int  # HTTP status code

    def __init__(self, url: str, retry: int, status: int):
        """
        Constructs HTTPError for specified status code and URL.
        """

class UnexpectedEmptyPageError(ArxivError):
    """
    Error when a non-first page of results is unexpectedly empty.
    Usually resolved by retries due to arXiv API brittleness.
    """
    raw_feed: feedparser.FeedParserDict  # Raw feedparser output for diagnostics

    def __init__(self, url: str, retry: int, raw_feed: feedparser.FeedParserDict):
        """
        Constructs UnexpectedEmptyPageError for specified URL and feed.
        """

class Result.MissingFieldError(Exception):
    """
    Error indicating entry cannot be parsed due to missing required fields.
    This is a nested exception class inside Result.
    """
    missing_field: str  # Required field missing from entry
    message: str  # Error description

    def __init__(self, missing_field: str):
        """
        Constructs MissingFieldError for specified missing field.

        Parameters:
        - missing_field: The name of the required field that was missing
        """
```

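The `retry` attribute counts attempts starting from 0 for the initial request. A generic retry loop with that numbering might look like the sketch below. This is an illustration only, not the package's implementation: `FetchError` and `fetch_with_retries` are hypothetical names, and the fixed delay between attempts is an assumption.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class FetchError(Exception):
    """Hypothetical stand-in for a failed request."""


def fetch_with_retries(
    fetch: Callable[[], T], num_retries: int = 3, delay_seconds: float = 3.0
) -> T:
    """Call fetch(); on FetchError, retry up to num_retries more times."""
    for attempt in range(num_retries + 1):  # attempt 0 is the initial try
        try:
            return fetch()
        except FetchError:
            if attempt == num_retries:
                raise  # out of retries: surface the last error
            time.sleep(delay_seconds)
    raise ValueError("num_retries must be >= 0")


# Demo: a function that fails twice, then succeeds on the third call.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise FetchError("temporary failure")
    return "ok"


print(fetch_with_retries(flaky, num_retries=3, delay_seconds=0.0))  # ok
```
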
## Advanced Usage Examples

### Complex Search Queries

```python
import arxiv

# Author and title search
search = arxiv.Search(query="au:del_maestro AND ti:checkerboard")

# Category-specific search with date range
# (the arXiv API documents date bounds as YYYYMMDDHHMM timestamps in GMT)
search = arxiv.Search(
    query="cat:cs.AI AND submittedDate:[202301010000 TO 202312312359]",
    max_results=50,
    sort_by=arxiv.SortCriterion.SubmittedDate
)

# Multiple specific papers by ID
search = arxiv.Search(id_list=["1605.08386v1", "2107.05580", "quant-ph/0201082"])

client = arxiv.Client()
for result in client.results(search):
    print(f"{result.get_short_id()}: {result.title}")
```

### Custom Client Configuration

```python
import arxiv

# High-throughput client (be careful with rate limits)
fast_client = arxiv.Client(
    page_size=2000,     # Maximum page size
    delay_seconds=3.0,  # Minimum required by arXiv ToU
    num_retries=5       # More retries for reliability
)

# Conservative client for fragile networks
safe_client = arxiv.Client(
    page_size=50,       # Smaller pages
    delay_seconds=5.0,  # Extra delay
    num_retries=10      # Many retries
)

search = arxiv.Search(query="machine learning", max_results=1000)

# Use specific client
results = list(fast_client.results(search))
print(f"Retrieved {len(results)} papers")
```

### Batch Downloads

```python
import arxiv
import os

# Create download directory
os.makedirs("./papers", exist_ok=True)

search = arxiv.Search(
    query="cat:cs.LG AND ti:transformer",
    max_results=20,
    sort_by=arxiv.SortCriterion.SubmittedDate,
    sort_order=arxiv.SortOrder.Descending
)

client = arxiv.Client()

for i, result in enumerate(client.results(search)):
    try:
        # Download PDF with custom filename ("/" in old-style IDs is replaced)
        filename = f"{i:02d}_{result.get_short_id().replace('/', '_')}.pdf"
        path = result.download_pdf(dirpath="./papers", filename=filename)
        print(f"Downloaded: {path}")

        # Also download source if available
        src_filename = f"{i:02d}_{result.get_short_id().replace('/', '_')}.tar.gz"
        src_path = result.download_source(dirpath="./papers", filename=src_filename)
        print(f"Downloaded source: {src_path}")

    except Exception as e:
        print(f"Failed to download {result.entry_id}: {e}")
```

### Error Handling

```python
import arxiv
import logging

# Enable debug logging to see API calls
logging.basicConfig(level=logging.DEBUG)

client = arxiv.Client(num_retries=2)
search = arxiv.Search(query="invalid:query:syntax", max_results=10)

try:
    results = list(client.results(search))
    print(f"Found {len(results)} results")

except arxiv.HTTPError as e:
    print(f"HTTP error {e.status} after {e.retry} retries: {e.message}")
    print(f"URL: {e.url}")

except arxiv.UnexpectedEmptyPageError as e:
    print(f"Empty page after {e.retry} retries: {e.message}")
    print(f"Raw feed info: {e.raw_feed.bozo_exception if e.raw_feed.bozo else 'No bozo exception'}")

except Exception as e:
    print(f"Unexpected error: {e}")
```