Tessl Tile for pypi/arxiv@2.2.0

# ArXiv

A Python wrapper for the arXiv API that provides programmatic access to arXiv's database of over 1,000,000 academic papers in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, and Statistics. The library offers a clean, object-oriented interface with comprehensive search capabilities, rate limiting, retry logic, and convenient download methods.

## Package Information

- **Package Name**: arxiv
- **Language**: Python
- **Installation**: `pip install arxiv`
- **Python Version**: >= 3.7

## Core Imports

```python
import arxiv
```

All classes and enums are available directly from the main module:

```python
from arxiv import Client, Search, Result, SortCriterion, SortOrder
```

For type annotations, the package uses:

```python
from typing import List, Optional, Generator, Dict
from datetime import datetime
import feedparser
```

Internal constants:

```python
_DEFAULT_TIME = datetime.min  # Default datetime for Result objects
```
36

## Basic Usage

```python
import os

import arxiv

# Create a search query
search = arxiv.Search(
    query="quantum computing",
    max_results=10,
    sort_by=arxiv.SortCriterion.SubmittedDate,
    sort_order=arxiv.SortOrder.Descending
)

# Use default client to get results
client = arxiv.Client()
results = client.results(search)

# Iterate through results
for result in results:
    print(f"Title: {result.title}")
    print(f"Authors: {', '.join(author.name for author in result.authors)}")
    print(f"Published: {result.published}")
    print(f"Summary: {result.summary[:200]}...")
    print(f"PDF URL: {result.pdf_url}")
    print("-" * 80)

# Make sure the download directory exists before writing to it
os.makedirs("./downloads/", exist_ok=True)

# Download the first paper's PDF
# (calling client.results(search) again issues a new request)
first_result = next(client.results(search))
first_result.download_pdf(dirpath="./downloads/", filename="paper.pdf")
```

## Architecture

The arxiv package uses a three-layer architecture:

- **Search**: Query specification with parameters like keywords, ID lists, result limits, and sorting
- **Client**: HTTP client managing API requests, pagination, rate limiting, and retry logic
- **Result**: Paper metadata with download capabilities, containing nested Author and Link objects

This design separates query construction from execution and provides reusable clients for efficient API usage across multiple searches.

## Capabilities

### Search Construction

Build queries using arXiv's search syntax with support for field-specific searches, boolean operators, and ID-based lookups.

```python { .api }
class Search:
    def __init__(
        self,
        query: str = "",
        id_list: List[str] = [],
        max_results: int | None = None,
        sort_by: SortCriterion = SortCriterion.Relevance,
        sort_order: SortOrder = SortOrder.Descending
    ):
        """
        Constructs an arXiv API search with the specified criteria.

        Parameters:
        - query: Search query string (unencoded). Use syntax like "au:author AND ti:title"
        - id_list: List of arXiv article IDs to limit search to
        - max_results: Maximum number of results (None for all available, API limit: 300,000)
        - sort_by: Sort criterion (Relevance, LastUpdatedDate, SubmittedDate)
        - sort_order: Sort order (Ascending, Descending)
        """

    def results(self, offset: int = 0) -> Generator[Result, None, None]:
        """
        Executes search using default client.

        DEPRECATED after 2.0.0: Use Client.results() instead.
        This method will emit a DeprecationWarning.
        """
```
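
The field-prefixed syntax described above can also be assembled programmatically. The sketch below uses a hypothetical helper, `build_query`, which is not part of the arxiv package: it joins `prefix:term` clauses with a boolean operator and wraps multi-word terms in double quotes. The prefixes shown (`au`, `ti`, `cat`) are the ones that appear in this document's examples.

```python
from typing import Iterable, Tuple


def build_query(clauses: Iterable[Tuple[str, str]], operator: str = "AND") -> str:
    """Join (prefix, term) pairs into one arXiv-style query string.

    Hypothetical helper for illustration; not part of the arxiv package.
    """
    if operator not in ("AND", "OR", "ANDNOT"):
        raise ValueError(f"unsupported operator: {operator}")
    parts = []
    for prefix, term in clauses:
        # Wrap multi-word terms in double quotes so they are treated as a phrase.
        if " " in term:
            term = f'"{term}"'
        parts.append(f"{prefix}:{term}")
    return f" {operator} ".join(parts)


query = build_query([("au", "del_maestro"), ("ti", "checkerboard")])
print(query)  # au:del_maestro AND ti:checkerboard
```

The resulting string can be passed as the `query` argument of `arxiv.Search`.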

### Client Configuration

Configure API client behavior including pagination, rate limiting, and retry strategies.

```python { .api }
class Client:
    query_url_format: str = "https://export.arxiv.org/api/query?{}"

    def __init__(
        self,
        page_size: int = 100,
        delay_seconds: float = 3.0,
        num_retries: int = 3
    ):
        """
        Constructs an arXiv API client with specified options.

        Note: the default parameters should provide a robust request strategy
        for most use cases. Extreme page sizes, delays, or retries risk
        violating the arXiv API Terms of Use, brittle behavior, and inconsistent results.

        Parameters:
        - page_size: Results per API request (max: 2000, smaller is faster but more requests)
        - delay_seconds: Seconds between requests (arXiv ToU requires ≥3 seconds)
        - num_retries: Retry attempts before raising exception
        """

    def results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        """
        Fetches search results using pagination, yielding Result objects.

        Parameters:
        - search: Search specification
        - offset: Skip leading records (when >= max_results, returns empty)

        Returns:
        Generator yielding Result objects until max_results reached or no more results

        Raises:
        - HTTPError: Non-200 response after all retries
        - UnexpectedEmptyPageError: Empty non-first page after all retries
        """
```
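
To make the trade-off between `page_size` and `delay_seconds` concrete, here is a rough sketch of the arithmetic. `plan_requests` is a hypothetical helper, not the library's implementation. It assumes one request per page and one `delay_seconds` pause between consecutive requests, and it ignores retries and network time.

```python
import math
from typing import List, Tuple


def plan_requests(
    max_results: int,
    page_size: int = 100,
    delay_seconds: float = 3.0,
    offset: int = 0,
) -> Tuple[List[int], float]:
    """Return each page's start index and the minimum total delay in seconds.

    Hypothetical helper for illustration; not part of the arxiv package.
    """
    # An offset at or beyond max_results leaves nothing to fetch.
    remaining = max(max_results - offset, 0)
    num_pages = math.ceil(remaining / page_size)
    starts = [offset + i * page_size for i in range(num_pages)]
    # Pauses happen between requests, so there is one fewer pause than pages.
    min_delay = max(num_pages - 1, 0) * delay_seconds
    return starts, min_delay


starts, wait = plan_requests(max_results=250, page_size=100)
print(starts)  # [0, 100, 200]
print(wait)    # 6.0
```

Under these assumptions, a larger `page_size` reduces the number of pauses, while each individual response takes longer to arrive.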

### Result Data and Downloads

Access paper metadata and download PDFs or source archives with customizable paths and filenames.

```python { .api }
class Result:
    entry_id: str                   # URL like "https://arxiv.org/abs/2107.05580v1"
    updated: datetime               # When result was last updated
    published: datetime             # When result was originally published
    title: str                      # Paper title
    authors: List["Result.Author"]  # List of Author objects
    summary: str                    # Paper abstract
    comment: Optional[str]          # Authors' comment if present
    journal_ref: Optional[str]      # Journal reference if present
    doi: Optional[str]              # DOI URL if present
    primary_category: str           # Primary arXiv category
    categories: List[str]           # All categories
    links: List["Result.Link"]      # Associated URLs
    pdf_url: Optional[str]          # PDF download URL if available

    def __init__(
        self,
        entry_id: str,
        updated: datetime = _DEFAULT_TIME,
        published: datetime = _DEFAULT_TIME,
        title: str = "",
        authors: List["Result.Author"] = [],
        summary: str = "",
        comment: str = "",
        journal_ref: str = "",
        doi: str = "",
        primary_category: str = "",
        categories: List[str] = [],
        links: List["Result.Link"] = []
    ):
        """
        Constructs an arXiv search result item.

        In most cases, prefer creating Result objects from API responses
        using the arxiv Client rather than constructing them manually.
        """

    def get_short_id(self) -> str:
        """
        Returns short ID extracted from entry_id.

        Examples:
        - "https://arxiv.org/abs/2107.05580v1" → "2107.05580v1"
        - "https://arxiv.org/abs/quant-ph/0201082v1" → "quant-ph/0201082v1"
        """

    def download_pdf(
        self,
        dirpath: str = "./",
        filename: str = "",
        download_domain: str = "export.arxiv.org"
    ) -> str:
        """
        Downloads PDF to specified directory with optional custom filename.

        Parameters:
        - dirpath: Target directory path
        - filename: Custom filename (auto-generated if empty)
        - download_domain: Domain for download (for testing/mirroring)

        Returns:
        Path to downloaded file
        """

    def download_source(
        self,
        dirpath: str = "./",
        filename: str = "",
        download_domain: str = "export.arxiv.org"
    ) -> str:
        """
        Downloads source tarfile (.tar.gz) to specified directory.

        Parameters:
        - dirpath: Target directory path
        - filename: Custom filename (auto-generated with .tar.gz if empty)
        - download_domain: Domain for download (for testing/mirroring)

        Returns:
        Path to downloaded file
        """
```
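
The two `get_short_id` examples in the docstring suggest that the short ID is whatever follows `arxiv.org/abs/` in `entry_id`. The sketch below reproduces that behavior with standalone functions. Both `short_id_from_entry_id` and `safe_filename` are hypothetical names chosen for illustration, and the code is meant to be consistent with the documented examples rather than a copy of the library's own implementation.

```python
def short_id_from_entry_id(entry_id: str) -> str:
    """Extract the part of an arXiv abstract URL that follows 'arxiv.org/abs/'."""
    marker = "arxiv.org/abs/"
    _, found, tail = entry_id.partition(marker)
    if not found:
        raise ValueError(f"not an arXiv abstract URL: {entry_id}")
    return tail


def safe_filename(short_id: str, extension: str = "pdf") -> str:
    """Build a filename from a short ID, replacing '/' so old-style IDs stay in one directory."""
    return f"{short_id.replace('/', '_')}.{extension}"


print(short_id_from_entry_id("https://arxiv.org/abs/2107.05580v1"))        # 2107.05580v1
print(short_id_from_entry_id("https://arxiv.org/abs/quant-ph/0201082v1"))  # quant-ph/0201082v1
print(safe_filename("quant-ph/0201082v1"))                                 # quant-ph_0201082v1.pdf
```

Old-style identifiers such as `quant-ph/0201082v1` contain a slash, which is why a custom `filename` passed to `download_pdf` usually needs this kind of replacement.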

### Author and Link Information

Access structured metadata about paper authors and associated links.

```python { .api }
class Result.Author:
    """Inner class representing a paper's author."""

    name: str  # Author's name

    def __init__(self, name: str):
        """
        Constructs Author with specified name.
        Prefer using Result.Author._from_feed_author() for API parsing.
        """


class Result.Link:
    """Inner class representing a paper's associated links."""

    href: str             # Link URL
    title: Optional[str]  # Link title
    rel: str              # Relationship to Result
    content_type: str     # HTTP content type

    def __init__(
        self,
        href: str,
        title: Optional[str] = None,
        rel: Optional[str] = None,
        content_type: Optional[str] = None
    ):
        """
        Constructs Link with specified metadata.
        Prefer using Result.Link._from_feed_link() for API parsing.
        """
```

### Sort Configuration

Control result ordering using predefined sort criteria and order options.

```python { .api }
from enum import Enum

class SortCriterion(Enum):
    """
    Properties by which search results can be sorted.
    """
    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"

class SortOrder(Enum):
    """
    Order in which search results are sorted according to SortCriterion.
    """
    Ascending = "ascending"
    Descending = "descending"
```
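
As an illustration of how these enum values can end up in a request, the sketch below builds a query URL from the `Client.query_url_format` string and the arXiv API's `search_query`, `start`, `max_results`, `sortBy`, and `sortOrder` parameters. The enums are redeclared locally so the snippet runs without the package installed, and `build_url` is a hypothetical helper. The library's internal URL construction may differ.

```python
from enum import Enum
from urllib.parse import urlencode


class SortCriterion(Enum):
    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"


class SortOrder(Enum):
    Ascending = "ascending"
    Descending = "descending"


QUERY_URL_FORMAT = "https://export.arxiv.org/api/query?{}"


def build_url(query: str, start: int, page_size: int,
              sort_by: SortCriterion, sort_order: SortOrder) -> str:
    """Hypothetical helper: encode one page request as an arXiv API URL."""
    params = {
        "search_query": query,
        "start": start,
        "max_results": page_size,
        # The enum *values* are the strings the API expects.
        "sortBy": sort_by.value,
        "sortOrder": sort_order.value,
    }
    return QUERY_URL_FORMAT.format(urlencode(params))


url = build_url("cat:cs.AI", 0, 100, SortCriterion.SubmittedDate, SortOrder.Descending)
print(url)
```

Passing the enum members rather than raw strings keeps typos such as `"submitteddate"` from reaching the API.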

### Error Handling

Handle API errors, network issues, and data parsing problems with specific exception types.

```python { .api }
class ArxivError(Exception):
    """
    Base exception class for arxiv package errors.
    """
    url: str      # Feed URL that could not be fetched
    retry: int    # Request try number (0 for initial, 1+ for retries)
    message: str  # Error description

    def __init__(self, url: str, retry: int, message: str):
        """
        Constructs ArxivError for specified URL and retry attempt.
        """

class HTTPError(ArxivError):
    """
    Non-200 HTTP status encountered while fetching results.
    """
    status: int  # HTTP status code

    def __init__(self, url: str, retry: int, status: int):
        """
        Constructs HTTPError for specified status code and URL.
        """

class UnexpectedEmptyPageError(ArxivError):
    """
    Error when a non-first page of results is unexpectedly empty.
    Usually resolved by retries due to arXiv API brittleness.
    """
    raw_feed: feedparser.FeedParserDict  # Raw feedparser output for diagnostics

    def __init__(self, url: str, retry: int, raw_feed: feedparser.FeedParserDict):
        """
        Constructs UnexpectedEmptyPageError for specified URL and feed.
        """

class Result.MissingFieldError(Exception):
    """
    Error indicating entry cannot be parsed due to missing required fields.
    This is a nested exception class inside Result.
    """
    missing_field: str  # Required field missing from entry
    message: str        # Error description

    def __init__(self, missing_field: str):
        """
        Constructs MissingFieldError for specified missing field.

        Parameters:
        - missing_field: The name of the required field that was missing
        """
```
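
The `retry` attribute is easiest to understand next to a retry loop. The following is a simplified sketch, not the library's implementation: `FetchError` stands in for `ArxivError`, `fetch_with_retries` and the `fetch` callable are hypothetical, and the pause is routed through an injectable `sleep` function so the example runs instantly.

```python
import time
from typing import Callable, Optional


class FetchError(Exception):
    """Stand-in for ArxivError, carrying the same three attributes."""

    def __init__(self, url: str, retry: int, message: str):
        self.url = url
        self.retry = retry
        self.message = message
        super().__init__(f"{message} ({url}, try {retry})")


def fetch_with_retries(
    fetch: Callable[[str], str],
    url: str,
    num_retries: int = 3,
    delay_seconds: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url), retrying up to num_retries times after the first failure."""
    if num_retries < 0:
        raise ValueError("num_retries must be non-negative")
    last_error: Optional[FetchError] = None
    for attempt in range(num_retries + 1):  # attempt 0 is the initial request
        if attempt > 0:
            sleep(delay_seconds)  # wait before every retry
        try:
            return fetch(url)
        except FetchError as err:
            err.retry = attempt  # record which try failed
            last_error = err
    raise last_error


# A fake fetch that fails twice and then succeeds.
calls = []

def flaky_fetch(url: str) -> str:
    calls.append(url)
    if len(calls) < 3:
        raise FetchError(url, retry=0, message="temporary failure")
    return "<feed/>"


result = fetch_with_retries(
    flaky_fetch,
    "https://export.arxiv.org/api/query?search_query=all:electron",
    sleep=lambda seconds: None,  # skip the real delay in this demo
)
print(result, len(calls))  # <feed/> 3
```

In this sketch an exception that escapes the loop has `retry` equal to `num_retries`, which matches the "0 for initial, 1+ for retries" numbering documented above.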

## Advanced Usage Examples

### Complex Search Queries

```python
import arxiv

# Author and title search
search = arxiv.Search(query="au:del_maestro AND ti:checkerboard")

# Category-specific search with date range
# (timestamps use the YYYYMMDDHHMM format, in GMT)
search = arxiv.Search(
    query="cat:cs.AI AND submittedDate:[202301010000 TO 202312312359]",
    max_results=50,
    sort_by=arxiv.SortCriterion.SubmittedDate
)

# Multiple specific papers by ID
search = arxiv.Search(id_list=["1605.08386v1", "2107.05580", "quant-ph/0201082"])

# Only the last search assigned above is executed here
client = arxiv.Client()
for result in client.results(search):
    print(f"{result.get_short_id()}: {result.title}")
```
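
Date-range clauses are easy to get wrong by hand. As a small sketch, the hypothetical helper below formats two `datetime` values into a `submittedDate:[... TO ...]` clause, assuming the `YYYYMMDDHHMM` (GMT) timestamp format described in the arXiv API user manual. The helper is not part of the arxiv package.

```python
from datetime import datetime


def submitted_date_range(start: datetime, end: datetime) -> str:
    """Format a submittedDate range clause (hypothetical helper)."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    fmt = "%Y%m%d%H%M"  # YYYYMMDDHHMM, interpreted by the API as GMT
    return f"submittedDate:[{start.strftime(fmt)} TO {end.strftime(fmt)}]"


clause = submitted_date_range(datetime(2023, 1, 1), datetime(2023, 12, 31, 23, 59))
print(f"cat:cs.AI AND {clause}")
# cat:cs.AI AND submittedDate:[202301010000 TO 202312312359]
```

The returned clause can be combined with other field clauses using `AND` or `OR` before being passed to `arxiv.Search`.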

### Custom Client Configuration

```python
import arxiv

# High-throughput client (be careful with rate limits)
fast_client = arxiv.Client(
    page_size=2000,      # Maximum page size
    delay_seconds=3.0,   # Minimum required by arXiv ToU
    num_retries=5        # More retries for reliability
)

# Conservative client for fragile networks
safe_client = arxiv.Client(
    page_size=50,        # Smaller pages
    delay_seconds=5.0,   # Extra delay
    num_retries=10       # Many retries
)

search = arxiv.Search(query="machine learning", max_results=1000)

# Use specific client
results = list(fast_client.results(search))
print(f"Retrieved {len(results)} papers")
```

### Batch Downloads

```python
import arxiv
import os

# Create download directory
os.makedirs("./papers", exist_ok=True)

search = arxiv.Search(
    query="cat:cs.LG AND ti:transformer",
    max_results=20,
    sort_by=arxiv.SortCriterion.SubmittedDate,
    sort_order=arxiv.SortOrder.Descending
)

client = arxiv.Client()

for i, result in enumerate(client.results(search)):
    try:
        # Download PDF with custom filename
        filename = f"{i:02d}_{result.get_short_id().replace('/', '_')}.pdf"
        path = result.download_pdf(dirpath="./papers", filename=filename)
        print(f"Downloaded: {path}")

        # Also download source if available
        src_filename = f"{i:02d}_{result.get_short_id().replace('/', '_')}.tar.gz"
        src_path = result.download_source(dirpath="./papers", filename=src_filename)
        print(f"Downloaded source: {src_path}")

    except Exception as e:
        print(f"Failed to download {result.entry_id}: {e}")
```

### Error Handling

```python
import arxiv
import logging

# Enable debug logging to see API calls
logging.basicConfig(level=logging.DEBUG)

client = arxiv.Client(num_retries=2)
search = arxiv.Search(query="invalid:query:syntax", max_results=10)

try:
    results = list(client.results(search))
    print(f"Found {len(results)} results")

except arxiv.HTTPError as e:
    print(f"HTTP error {e.status} after {e.retry} retries: {e.message}")
    print(f"URL: {e.url}")

except arxiv.UnexpectedEmptyPageError as e:
    print(f"Empty page after {e.retry} retries: {e.message}")
    print(f"Raw feed info: {e.raw_feed.bozo_exception if e.raw_feed.bozo else 'No bozo exception'}")

except Exception as e:
    print(f"Unexpected error: {e}")
```
