
# ArXiv

A Python wrapper for the arXiv API that provides programmatic access to arXiv's database of over 1,000,000 academic papers in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, and Statistics. The library offers a clean, object-oriented interface with comprehensive search capabilities, rate limiting, retry logic, and convenient download methods.

## Package Information

- **Package Name**: arxiv
- **Language**: Python
- **Installation**: `pip install arxiv`
- **Python Version**: >= 3.7

## Core Imports

```python
import arxiv
```

All classes and enums are available directly from the main module:

```python
from arxiv import Client, Search, Result, SortCriterion, SortOrder
```

For type annotations, the package uses:

```python
from typing import List, Optional, Generator, Dict
from datetime import datetime
import feedparser
```

Internal constants:

```python
_DEFAULT_TIME = datetime.min  # Default datetime for Result objects
```

## Basic Usage

```python
import os

import arxiv

# Create a search query
search = arxiv.Search(
    query="quantum computing",
    max_results=10,
    sort_by=arxiv.SortCriterion.SubmittedDate,
    sort_order=arxiv.SortOrder.Descending
)

# Use default client to get results
client = arxiv.Client()
results = client.results(search)

# Iterate through results
for result in results:
    print(f"Title: {result.title}")
    print(f"Authors: {', '.join([author.name for author in result.authors])}")
    print(f"Published: {result.published}")
    print(f"Summary: {result.summary[:200]}...")
    print(f"PDF URL: {result.pdf_url}")
    print("-" * 80)

# Download the first paper's PDF (the target directory must exist first)
os.makedirs("./downloads", exist_ok=True)
first_result = next(client.results(search))
first_result.download_pdf(dirpath="./downloads/", filename="paper.pdf")
```

## Architecture

The arxiv package uses a three-layer architecture:

- **Search**: Query specification with parameters like keywords, ID lists, result limits, and sorting
- **Client**: HTTP client managing API requests, pagination, rate limiting, and retry logic
- **Result**: Paper metadata with download capabilities, containing nested Author and Link objects

This design separates query construction from execution and provides reusable clients for efficient API usage across multiple searches.

## Capabilities


### Search Construction

Build queries using arXiv's search syntax with support for field-specific searches, boolean operators, and ID-based lookups.

```python { .api }
class Search:
    def __init__(
        self,
        query: str = "",
        id_list: List[str] = [],
        max_results: int | None = None,
        sort_by: SortCriterion = SortCriterion.Relevance,
        sort_order: SortOrder = SortOrder.Descending
    ):
        """
        Constructs an arXiv API search with the specified criteria.

        Parameters:
        - query: Search query string (unencoded). Use syntax like "au:author AND ti:title"
        - id_list: List of arXiv article IDs to limit search to
        - max_results: Maximum number of results (None for all available, API limit: 300,000)
        - sort_by: Sort criterion (Relevance, LastUpdatedDate, SubmittedDate)
        - sort_order: Sort order (Ascending, Descending)
        """

    def results(self, offset: int = 0) -> Generator[Result, None, None]:
        """
        Executes search using default client.

        DEPRECATED after 2.0.0: Use Client.results() instead.
        This method will emit a DeprecationWarning.
        """
```
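
The query syntax described above (field prefix, colon, value, joined by boolean operators) can be assembled programmatically. The following is a minimal sketch using only the standard library; `build_query` is a hypothetical helper written for illustration and is not part of the arxiv package.

```python
from typing import Mapping


def build_query(fields: Mapping[str, str], operator: str = "AND") -> str:
    """Join field-prefixed terms (e.g. au, ti, cat) with a boolean operator.

    Hypothetical helper for illustration; not provided by the arxiv package.
    """
    if operator not in {"AND", "OR", "ANDNOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    terms = []
    for prefix, value in fields.items():
        # Quote multi-word values so they are searched as a phrase.
        term = f'"{value}"' if " " in value else value
        terms.append(f"{prefix}:{term}")
    return f" {operator} ".join(terms)


query = build_query({"au": "del_maestro", "ti": "checkerboard"})
print(query)  # au:del_maestro AND ti:checkerboard
```

The resulting string can be passed as the `query` argument of `Search`; the library expects it unencoded.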

### Client Configuration

Configure API client behavior including pagination, rate limiting, and retry strategies.

```python { .api }
class Client:
    query_url_format: str = "https://export.arxiv.org/api/query?{}"

    def __init__(
        self,
        page_size: int = 100,
        delay_seconds: float = 3.0,
        num_retries: int = 3
    ):
        """
        Constructs an arXiv API client with specified options.

        Note: the default parameters should provide a robust request strategy
        for most use cases. Extreme page sizes, delays, or retries risk
        violating the arXiv API Terms of Use, brittle behavior, and inconsistent results.

        Parameters:
        - page_size: Results per API request (max: 2000, smaller is faster but more requests)
        - delay_seconds: Seconds between requests (arXiv ToU requires ≥3 seconds)
        - num_retries: Retry attempts before raising exception
        """

    def results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        """
        Fetches search results using pagination, yielding Result objects.

        Parameters:
        - search: Search specification
        - offset: Skip leading records (when >= max_results, returns empty)

        Returns:
        Generator yielding Result objects until max_results reached or no more results

        Raises:
        - HTTPError: Non-200 response after all retries
        - UnexpectedEmptyPageError: Empty non-first page after all retries
        """
```
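
To get a feel for the trade-off between `page_size` and `delay_seconds`, the arithmetic can be worked out directly. This is a rough sketch, assuming one request per page and one delay between consecutive requests; `plan_requests` is a hypothetical helper and does not reflect the client's internals.

```python
import math
from typing import Tuple


def plan_requests(
    total_results: int, page_size: int = 100, delay_seconds: float = 3.0
) -> Tuple[int, float]:
    """Return (number of page requests, minimum seconds spent waiting between them)."""
    if total_results <= 0:
        return 0, 0.0
    num_requests = math.ceil(total_results / page_size)
    # The delay applies between consecutive requests, so the first is not delayed.
    min_wait = (num_requests - 1) * delay_seconds
    return num_requests, min_wait


print(plan_requests(1000))                  # (10, 27.0)
print(plan_requests(1000, page_size=2000))  # (1, 0.0)
```

Under these assumptions, fetching 1000 results with the defaults spends at least 27 seconds waiting, while a single 2000-result page needs no inter-request delay but produces one much larger response.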

### Result Data and Downloads

Access paper metadata and download PDFs or source archives with customizable paths and filenames.

```python { .api }
class Result:
    entry_id: str  # URL like "https://arxiv.org/abs/2107.05580v1"
    updated: datetime  # When result was last updated
    published: datetime  # When result was originally published
    title: str  # Paper title
    authors: List["Result.Author"]  # List of Author objects
    summary: str  # Paper abstract
    comment: Optional[str]  # Authors' comment if present
    journal_ref: Optional[str]  # Journal reference if present
    doi: Optional[str]  # DOI URL if present
    primary_category: str  # Primary arXiv category
    categories: List[str]  # All categories
    links: List["Result.Link"]  # Associated URLs
    pdf_url: Optional[str]  # PDF download URL if available

    def __init__(
        self,
        entry_id: str,
        updated: datetime = _DEFAULT_TIME,
        published: datetime = _DEFAULT_TIME,
        title: str = "",
        authors: List["Result.Author"] = [],
        summary: str = "",
        comment: str = "",
        journal_ref: str = "",
        doi: str = "",
        primary_category: str = "",
        categories: List[str] = [],
        links: List["Result.Link"] = []
    ):
        """
        Constructs an arXiv search result item.

        In most cases, prefer creating Result objects from API responses
        using the arxiv Client rather than constructing them manually.
        """

    def get_short_id(self) -> str:
        """
        Returns short ID extracted from entry_id.

        Examples:
        - "https://arxiv.org/abs/2107.05580v1" → "2107.05580v1"
        - "https://arxiv.org/abs/quant-ph/0201082v1" → "quant-ph/0201082v1"
        """

    def download_pdf(
        self,
        dirpath: str = "./",
        filename: str = "",
        download_domain: str = "export.arxiv.org"
    ) -> str:
        """
        Downloads PDF to specified directory with optional custom filename.

        Parameters:
        - dirpath: Target directory path
        - filename: Custom filename (auto-generated if empty)
        - download_domain: Domain for download (for testing/mirroring)

        Returns:
        Path to downloaded file
        """

    def download_source(
        self,
        dirpath: str = "./",
        filename: str = "",
        download_domain: str = "export.arxiv.org"
    ) -> str:
        """
        Downloads source tarfile (.tar.gz) to specified directory.

        Parameters:
        - dirpath: Target directory path
        - filename: Custom filename (auto-generated with .tar.gz if empty)
        - download_domain: Domain for download (for testing/mirroring)

        Returns:
        Path to downloaded file
        """
```
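
The two `get_short_id` examples above suggest that the short ID is whatever follows `arxiv.org/abs/` in `entry_id`. The sketch below reproduces that mapping with plain string handling; `short_id` and `base_id` are hypothetical stand-alone functions, and stripping the version suffix is an extra illustration rather than something the listing documents.

```python
import re


def short_id(entry_id: str) -> str:
    """Return the part of an arXiv abstract URL that follows 'arxiv.org/abs/'."""
    return entry_id.split("arxiv.org/abs/")[-1]


def base_id(entry_id: str) -> str:
    """Return the short ID with a trailing version suffix such as 'v1' removed."""
    return re.sub(r"v\d+$", "", short_id(entry_id))


print(short_id("https://arxiv.org/abs/2107.05580v1"))        # 2107.05580v1
print(short_id("https://arxiv.org/abs/quant-ph/0201082v1"))  # quant-ph/0201082v1
print(base_id("https://arxiv.org/abs/quant-ph/0201082v1"))   # quant-ph/0201082
```

Note that old-style IDs contain a slash, which matters when the short ID is used to build a filename.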

### Author and Link Information

Access structured metadata about paper authors and associated links.

```python { .api }
class Result.Author:
    """Inner class representing a paper's author."""

    name: str  # Author's name

    def __init__(self, name: str):
        """
        Constructs Author with specified name.
        Prefer using Result.Author._from_feed_author() for API parsing.
        """


class Result.Link:
    """Inner class representing a paper's associated links."""

    href: str  # Link URL
    title: Optional[str]  # Link title
    rel: str  # Relationship to Result
    content_type: str  # HTTP content type

    def __init__(
        self,
        href: str,
        title: Optional[str] = None,
        rel: Optional[str] = None,
        content_type: Optional[str] = None
    ):
        """
        Constructs Link with specified metadata.
        Prefer using Result.Link._from_feed_link() for API parsing.
        """
```
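
As an illustration of how the link attributes can be used, the sketch below picks a PDF URL out of a list of links. It uses a stand-in `Link` dataclass with the same four attributes rather than the real `Result.Link`, and it assumes the PDF link is the one whose `title` is `"pdf"`; both the dataclass and `find_pdf_url` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Link:
    """Stand-in for Result.Link with the same four attributes."""
    href: str
    title: Optional[str] = None
    rel: Optional[str] = None
    content_type: Optional[str] = None


def find_pdf_url(links: List[Link]) -> Optional[str]:
    """Return the href of the first link titled 'pdf', or None if there is none."""
    for link in links:
        if link.title == "pdf":
            return link.href
    return None


links = [
    Link(href="https://arxiv.org/abs/2107.05580v1", rel="alternate", content_type="text/html"),
    Link(href="https://arxiv.org/pdf/2107.05580v1", title="pdf", rel="related",
         content_type="application/pdf"),
]
print(find_pdf_url(links))  # https://arxiv.org/pdf/2107.05580v1
```

In normal use `Result.pdf_url` already exposes this value, so a lookup like this is only needed when working with the raw `links` list.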

### Sort Configuration

Control result ordering using predefined sort criteria and order options.

```python { .api }
from enum import Enum

class SortCriterion(Enum):
    """
    Properties by which search results can be sorted.
    """
    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"

class SortOrder(Enum):
    """
    Order in which search results are sorted according to SortCriterion.
    """
    Ascending = "ascending"
    Descending = "descending"
```
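
The enum values are the strings the arXiv API expects in its `sortBy` and `sortOrder` query parameters. The sketch below shows how such values could end up in a request URL; the enums are re-declared locally so the block runs without the package, and `query_url` is a hypothetical helper, not the client's actual URL-building code.

```python
from enum import Enum
from urllib.parse import urlencode


class SortCriterion(Enum):
    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"


class SortOrder(Enum):
    Ascending = "ascending"
    Descending = "descending"


def query_url(query: str, sort_by: SortCriterion, sort_order: SortOrder,
              start: int = 0, max_results: int = 10) -> str:
    """Build an arXiv API query URL (illustrative helper, not part of the package)."""
    params = {
        "search_query": query,
        "start": start,
        "max_results": max_results,
        "sortBy": sort_by.value,        # enum value is sent as-is
        "sortOrder": sort_order.value,
    }
    # urlencode percent-encodes the query, e.g. ':' becomes '%3A'.
    return "https://export.arxiv.org/api/query?{}".format(urlencode(params))


url = query_url("cat:cs.AI", SortCriterion.SubmittedDate, SortOrder.Descending)
print(url)
```

Keeping the API strings inside enums means callers cannot pass a misspelled sort key; an invalid name fails at attribute lookup rather than at request time.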

### Error Handling

Handle API errors, network issues, and data parsing problems with specific exception types.

```python { .api }
class ArxivError(Exception):
    """
    Base exception class for arxiv package errors.
    """
    url: str  # Feed URL that could not be fetched
    retry: int  # Request try number (0 for initial, 1+ for retries)
    message: str  # Error description

    def __init__(self, url: str, retry: int, message: str):
        """
        Constructs ArxivError for specified URL and retry attempt.
        """

class HTTPError(ArxivError):
    """
    Non-200 HTTP status encountered while fetching results.
    """
    status: int  # HTTP status code

    def __init__(self, url: str, retry: int, status: int):
        """
        Constructs HTTPError for specified status code and URL.
        """

class UnexpectedEmptyPageError(ArxivError):
    """
    Error when a non-first page of results is unexpectedly empty.
    Usually resolved by retries due to arXiv API brittleness.
    """
    raw_feed: feedparser.FeedParserDict  # Raw feedparser output for diagnostics

    def __init__(self, url: str, retry: int, raw_feed: feedparser.FeedParserDict):
        """
        Constructs UnexpectedEmptyPageError for specified URL and feed.
        """

class Result.MissingFieldError(Exception):
    """
    Error indicating entry cannot be parsed due to missing required fields.
    This is a nested exception class inside Result.
    """
    missing_field: str  # Required field missing from entry
    message: str  # Error description

    def __init__(self, missing_field: str):
        """
        Constructs MissingFieldError for specified missing field.

        Parameters:
        - missing_field: The name of the required field that was missing
        """
```
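
The `retry` field (0 for the initial attempt, 1+ for retries) implies a loop that makes up to `num_retries + 1` attempts before giving up. The following is a generic sketch of that pattern, not the library's implementation; `FetchError`, `fetch_with_retries`, and `flaky` are hypothetical names, with `FetchError` standing in for `ArxivError`.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class FetchError(Exception):
    """Stand-in for ArxivError, carrying the same url/retry/message fields."""

    def __init__(self, url: str, retry: int, message: str):
        super().__init__(f"{message} ({url}, try {retry})")
        self.url = url
        self.retry = retry
        self.message = message


def fetch_with_retries(fetch: Callable[[int], T], num_retries: int = 3,
                       delay_seconds: float = 3.0) -> T:
    """Call fetch(try_index) until it succeeds or the retries are used up."""
    if num_retries < 0:
        raise ValueError("num_retries must be non-negative")
    last_error: Optional[FetchError] = None
    for try_index in range(num_retries + 1):  # try 0 is the initial attempt
        try:
            return fetch(try_index)
        except FetchError as err:
            last_error = err
            if try_index < num_retries:
                time.sleep(delay_seconds)  # wait before the next attempt
    assert last_error is not None
    raise last_error


def flaky(try_index: int) -> str:
    # Fails on the initial attempt and the first retry, then succeeds.
    if try_index < 2:
        raise FetchError("https://example.invalid/feed", try_index, "empty page")
    return "ok"


print(fetch_with_retries(flaky, num_retries=3, delay_seconds=0))  # ok
```

When every attempt fails, the last error is re-raised, so its `retry` value tells the caller how many retries were made.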

## Advanced Usage Examples


### Complex Search Queries

```python
import arxiv

# Author and title search
search = arxiv.Search(query="au:del_maestro AND ti:checkerboard")

# Category-specific search with date range
# (the arXiv API manual gives range bounds as YYYYMMDDHHMM, in GMT)
search = arxiv.Search(
    query="cat:cs.AI AND submittedDate:[202301010000 TO 202312312359]",
    max_results=50,
    sort_by=arxiv.SortCriterion.SubmittedDate
)

# Multiple specific papers by ID
search = arxiv.Search(id_list=["1605.08386v1", "2107.05580", "quant-ph/0201082"])

# Each assignment above replaces the previous one, so only the ID search runs here
client = arxiv.Client()
for result in client.results(search):
    print(f"{result.get_short_id()}: {result.title}")
```

### Custom Client Configuration

```python
import arxiv

# High-throughput client (be careful with rate limits)
fast_client = arxiv.Client(
    page_size=2000,     # Maximum page size
    delay_seconds=3.0,  # Minimum required by arXiv ToU
    num_retries=5       # More retries for reliability
)

# Conservative client for fragile networks
safe_client = arxiv.Client(
    page_size=50,       # Smaller pages
    delay_seconds=5.0,  # Extra delay
    num_retries=10      # Many retries
)

search = arxiv.Search(query="machine learning", max_results=1000)

# Use specific client
results = list(fast_client.results(search))
print(f"Retrieved {len(results)} papers")
```

### Batch Downloads

```python
import arxiv
import os

# Create download directory
os.makedirs("./papers", exist_ok=True)

search = arxiv.Search(
    query="cat:cs.LG AND ti:transformer",
    max_results=20,
    sort_by=arxiv.SortCriterion.SubmittedDate,
    sort_order=arxiv.SortOrder.Descending
)

client = arxiv.Client()

for i, result in enumerate(client.results(search)):
    try:
        # Download PDF with custom filename
        filename = f"{i:02d}_{result.get_short_id().replace('/', '_')}.pdf"
        path = result.download_pdf(dirpath="./papers", filename=filename)
        print(f"Downloaded: {path}")

        # Also download source if available
        src_filename = f"{i:02d}_{result.get_short_id().replace('/', '_')}.tar.gz"
        src_path = result.download_source(dirpath="./papers", filename=src_filename)
        print(f"Downloaded source: {src_path}")

    except Exception as e:
        print(f"Failed to download {result.entry_id}: {e}")
```
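
The filename logic in the batch example is repeated for the PDF and the source archive, and it only replaces slashes. A small helper can centralize it and be checked offline. This is a sketch under the assumption that any character outside letters, digits, dot, dash, and underscore should be replaced; `download_filename` is a hypothetical name, not a library function.

```python
import re


def download_filename(index: int, short_id: str, extension: str = "pdf") -> str:
    """Build a zero-padded, filesystem-safe filename from an arXiv short ID."""
    # Old-style IDs such as "quant-ph/0201082v1" contain a slash, which would
    # otherwise be read as a directory separator.
    safe_id = re.sub(r"[^A-Za-z0-9._-]", "_", short_id)
    return f"{index:02d}_{safe_id}.{extension}"


print(download_filename(0, "quant-ph/0201082v1"))     # 00_quant-ph_0201082v1.pdf
print(download_filename(7, "2107.05580v1", "tar.gz")) # 07_2107.05580v1.tar.gz
```

The returned name can be passed as the `filename` argument of `download_pdf` or `download_source`.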

### Error Handling

```python
import arxiv
import logging

# Enable debug logging to see API calls
logging.basicConfig(level=logging.DEBUG)

client = arxiv.Client(num_retries=2)
search = arxiv.Search(query="invalid:query:syntax", max_results=10)

try:
    results = list(client.results(search))
    print(f"Found {len(results)} results")

except arxiv.HTTPError as e:
    print(f"HTTP error {e.status} after {e.retry} retries: {e.message}")
    print(f"URL: {e.url}")

except arxiv.UnexpectedEmptyPageError as e:
    print(f"Empty page after {e.retry} retries: {e.message}")
    print(f"Raw feed info: {e.raw_feed.bozo_exception if e.raw_feed.bozo else 'No bozo exception'}")

except Exception as e:
    print(f"Unexpected error: {e}")
```