tessl/pypi-arxiv

Python wrapper for the arXiv API enabling search, fetch, and download of academic papers.

Workspace: tessl
Visibility: Public
Describes: pypipkg:pypi/arxiv@2.2.x

To install, run

npx @tessl/cli install tessl/pypi-arxiv@2.2.0


ArXiv

A Python wrapper for the arXiv API that provides programmatic access to arXiv's database of over 1,000,000 academic papers in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, and Statistics. The library offers a clean, object-oriented interface with comprehensive search capabilities, rate limiting, retry logic, and convenient download methods.

Package Information

  • Package Name: arxiv
  • Language: Python
  • Installation: pip install arxiv
  • Python Version: >= 3.7

Core Imports

import arxiv

All classes and enums are available directly from the main module:

from arxiv import Client, Search, Result, SortCriterion, SortOrder

The signatures below rely on these imports for type annotations:

from typing import List, Optional, Generator, Dict
from datetime import datetime
import feedparser

Internal constants:

_DEFAULT_TIME = datetime.min  # Default datetime for Result objects

Basic Usage

import os

import arxiv

# Create a search query
search = arxiv.Search(
    query="quantum computing",
    max_results=10,
    sort_by=arxiv.SortCriterion.SubmittedDate,
    sort_order=arxiv.SortOrder.Descending
)

# Use default client to get results
client = arxiv.Client()
results = client.results(search)

# Iterate through results (results is a generator and can be consumed only once)
for result in results:
    print(f"Title: {result.title}")
    print(f"Authors: {', '.join(author.name for author in result.authors)}")
    print(f"Published: {result.published}")
    print(f"Summary: {result.summary[:200]}...")
    print(f"PDF URL: {result.pdf_url}")
    print("-" * 80)

# Download the first paper's PDF; this runs the search again,
# and the target directory must exist before downloading
os.makedirs("./downloads/", exist_ok=True)
first_result = next(client.results(search))
first_result.download_pdf(dirpath="./downloads/", filename="paper.pdf")

Architecture

The arxiv package uses a three-layer architecture:

  • Search: Query specification with parameters like keywords, ID lists, result limits, and sorting
  • Client: HTTP client managing API requests, pagination, rate limiting, and retry logic
  • Result: Paper metadata with download capabilities, containing nested Author and Link objects

This design separates query construction from execution and provides reusable clients for efficient API usage across multiple searches.

Capabilities

Search Construction

Build queries using arXiv's search syntax with support for field-specific searches, boolean operators, and ID-based lookups.

class Search:
    def __init__(
        self,
        query: str = "",
        id_list: List[str] = [],
        max_results: int | None = None,
        sort_by: SortCriterion = SortCriterion.Relevance,
        sort_order: SortOrder = SortOrder.Descending
    ):
        """
        Constructs an arXiv API search with the specified criteria.
        
        Parameters:
        - query: Search query string (unencoded). Use syntax like "au:author AND ti:title"
        - id_list: List of arXiv article IDs to limit search to  
        - max_results: Maximum number of results (None for all available, API limit: 300,000)
        - sort_by: Sort criterion (Relevance, LastUpdatedDate, SubmittedDate)
        - sort_order: Sort order (Ascending, Descending)
        """

    def results(self, offset: int = 0) -> Generator[Result, None, None]:
        """
        Executes search using default client. 
        
        DEPRECATED after 2.0.0: Use Client.results() instead.
        This method will emit a DeprecationWarning.
        """

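Because the query syntax is built from field prefixes joined by boolean operators, a small helper can assemble query strings. The sketch below is illustrative only: `build_query` is a hypothetical function, not part of the arxiv package, and it assumes the field prefixes (`au`, `ti`, `cat`, and so on) and operators (`AND`, `OR`, `ANDNOT`) of the arXiv API query syntax.

```python
from typing import Dict


def build_query(fields: Dict[str, str], operator: str = "AND") -> str:
    """Join field-prefixed terms, e.g. {"au": "del_maestro"} -> "au:del_maestro"."""
    if operator not in {"AND", "OR", "ANDNOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    terms = []
    for prefix, value in fields.items():
        # Quote multi-word values so they are searched as a phrase.
        term = f'"{value}"' if " " in value else value
        terms.append(f"{prefix}:{term}")
    return f" {operator} ".join(terms)


query = build_query({"au": "del_maestro", "ti": "checkerboard"})
print(query)  # au:del_maestro AND ti:checkerboard
```

The resulting string can then be passed as the `query` argument of `Search`.
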
Client Configuration

Configure API client behavior including pagination, rate limiting, and retry strategies.

class Client:
    query_url_format: str = "https://export.arxiv.org/api/query?{}"

    def __init__(
        self,
        page_size: int = 100,
        delay_seconds: float = 3.0,
        num_retries: int = 3
    ):
        """
        Constructs an arXiv API client with specified options.
        
        Note: the default parameters should provide a robust request strategy
        for most use cases. Extreme page sizes, delays, or retries risk
        violating the arXiv API Terms of Use and can cause brittle behavior
        and inconsistent results.
        
        Parameters:
        - page_size: Results per API request (API limit: 2000; smaller pages return faster but need more requests)
        - delay_seconds: Seconds to wait between requests (arXiv's Terms of Use ask for at least 3 seconds)
        - num_retries: Number of retries for a failed request before raising an exception
        """

    def results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
        """
        Fetches search results using pagination, yielding Result objects.
        
        Parameters:
        - search: Search specification
        - offset: Number of leading records to skip (if offset >= search.max_results, nothing is yielded)
        
        Returns:
        Generator yielding Result objects until max_results reached or no more results
        
        Raises:
        - HTTPError: Non-200 response after all retries
        - UnexpectedEmptyPageError: Empty non-first page after all retries  
        """

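The trade-off between `page_size` and `delay_seconds` comes down to simple arithmetic. The sketch below estimates how many requests a search needs and the minimum time spent waiting between them. It assumes one request per page and one `delay_seconds` pause between consecutive requests, and it ignores retries and network latency. `estimate_requests` is a hypothetical helper, not part of the arxiv package.

```python
import math
from typing import Tuple


def estimate_requests(
    max_results: int,
    page_size: int = 100,
    delay_seconds: float = 3.0,
) -> Tuple[int, float]:
    """Return (number of page requests, minimum total delay in seconds)."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    if max_results <= 0:
        return 0, 0.0
    pages = math.ceil(max_results / page_size)
    # No pause is needed before the first request, so there are pages - 1 pauses.
    return pages, (pages - 1) * delay_seconds


print(estimate_requests(1000))                  # (10, 27.0)
print(estimate_requests(1000, page_size=2000))  # (1, 0.0)
```

Under these assumptions, fetching 1000 results with the defaults takes 10 requests and at least 27 seconds of delay, while a single page of 2000 needs only one request.
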
Result Data and Downloads

Access paper metadata and download PDFs or source archives with customizable paths and filenames.

class Result:
    entry_id: str          # URL like "https://arxiv.org/abs/2107.05580v1"
    updated: datetime      # When result was last updated
    published: datetime    # When result was originally published  
    title: str            # Paper title
    authors: List["Result.Author"] # List of Author objects
    summary: str          # Paper abstract
    comment: Optional[str]   # Authors' comment if present
    journal_ref: Optional[str] # Journal reference if present
    doi: Optional[str]       # DOI URL if present
    primary_category: str # Primary arXiv category
    categories: List[str] # All categories
    links: List["Result.Link"]     # Associated URLs
    pdf_url: Optional[str]   # PDF download URL if available

    def __init__(
        self,
        entry_id: str,
        updated: datetime = _DEFAULT_TIME,
        published: datetime = _DEFAULT_TIME,
        title: str = "",
        authors: List["Result.Author"] = [],
        summary: str = "",
        comment: str = "",
        journal_ref: str = "",
        doi: str = "",
        primary_category: str = "",
        categories: List[str] = [],
        links: List["Result.Link"] = []
    ):
        """
        Constructs an arXiv search result item.
        
        In most cases, prefer creating Result objects from API responses
        using the arxiv Client rather than constructing them manually.
        """


    def get_short_id(self) -> str:
        """
        Returns short ID extracted from entry_id.
        
        Examples:
        - "https://arxiv.org/abs/2107.05580v1" → "2107.05580v1"
        - "https://arxiv.org/abs/quant-ph/0201082v1" → "quant-ph/0201082v1"
        """

    def download_pdf(
        self,
        dirpath: str = "./",
        filename: str = "",
        download_domain: str = "export.arxiv.org"
    ) -> str:
        """
        Downloads PDF to specified directory with optional custom filename.
        
        Parameters:
        - dirpath: Target directory path
        - filename: Custom filename (auto-generated if empty)
        - download_domain: Domain for download (for testing/mirroring)
        
        Returns:
        Path to downloaded file
        """

    def download_source(
        self,
        dirpath: str = "./",
        filename: str = "",
        download_domain: str = "export.arxiv.org"
    ) -> str:
        """
        Downloads source tarfile (.tar.gz) to specified directory.
        
        Parameters:
        - dirpath: Target directory path  
        - filename: Custom filename (auto-generated with .tar.gz if empty)
        - download_domain: Domain for download (for testing/mirroring)
        
        Returns:
        Path to downloaded file
        """

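The short ID examples above suggest a simple way to derive filenames from an entry URL. The following is a minimal sketch, not the package's own code: `short_id_from_entry_id` and `safe_filename` are hypothetical helpers that reproduce the documented examples and replace the slash in old-style IDs so the result is a single filename.

```python
def short_id_from_entry_id(entry_id: str) -> str:
    """Return everything after 'arxiv.org/abs/' in an entry URL."""
    marker = "arxiv.org/abs/"
    _, found, tail = entry_id.partition(marker)
    if not found:
        raise ValueError(f"not an arXiv abstract URL: {entry_id}")
    return tail


def safe_filename(entry_id: str, extension: str = "pdf") -> str:
    """Build a filename from the short ID, e.g. 'quant-ph_0201082v1.pdf'."""
    # Old-style IDs contain "/", which would otherwise be read as a directory separator.
    short_id = short_id_from_entry_id(entry_id).replace("/", "_")
    return f"{short_id}.{extension}"


print(short_id_from_entry_id("https://arxiv.org/abs/2107.05580v1"))  # 2107.05580v1
print(safe_filename("https://arxiv.org/abs/quant-ph/0201082v1"))     # quant-ph_0201082v1.pdf
```

A name produced this way can be passed as the `filename` argument of `download_pdf` or `download_source`.
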
Author and Link Information

Access structured metadata about paper authors and associated links.

class Author:  # Nested inside Result; referenced as Result.Author
    """Inner class representing a paper's author."""
    
    name: str  # Author's name

    def __init__(self, name: str):
        """
        Constructs Author with specified name.
        Prefer using Result.Author._from_feed_author() for API parsing.
        """


class Link:  # Nested inside Result; referenced as Result.Link
    """Inner class representing a paper's associated links."""
    
    href: str              # Link URL
    title: Optional[str]      # Link title  
    rel: str              # Relationship to Result  
    content_type: str     # HTTP content type

    def __init__(
        self,
        href: str,
        title: Optional[str] = None,
        rel: Optional[str] = None,
        content_type: Optional[str] = None
    ):
        """
        Constructs Link with specified metadata.
        Prefer using Result.Link._from_feed_link() for API parsing.
        """

Sort Configuration

Control result ordering using predefined sort criteria and order options.

from enum import Enum

class SortCriterion(Enum):
    """
    Properties by which search results can be sorted.
    """
    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate" 
    SubmittedDate = "submittedDate"

class SortOrder(Enum):
    """
    Order in which search results are sorted according to SortCriterion.
    """
    Ascending = "ascending"
    Descending = "descending"
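
The enum values are the literal strings used in API requests. The sketch below shows how they could map onto a query URL. It redefines the two enums locally as stand-ins so it runs without the package, and it assumes the arXiv API parameter names `search_query`, `start`, `max_results`, `sortBy`, and `sortOrder`. `build_query_url` is a hypothetical helper and may differ from how `Client` builds its requests internally.

```python
from enum import Enum
from urllib.parse import urlencode


class SortCriterion(Enum):  # local stand-in mirroring the enum above
    Relevance = "relevance"
    LastUpdatedDate = "lastUpdatedDate"
    SubmittedDate = "submittedDate"


class SortOrder(Enum):  # local stand-in mirroring the enum above
    Ascending = "ascending"
    Descending = "descending"


QUERY_URL_FORMAT = "https://export.arxiv.org/api/query?{}"


def build_query_url(
    query: str,
    start: int = 0,
    max_results: int = 10,
    sort_by: SortCriterion = SortCriterion.Relevance,
    sort_order: SortOrder = SortOrder.Descending,
) -> str:
    """Encode the search parameters into an API query URL."""
    params = {
        "search_query": query,
        "start": start,
        "max_results": max_results,
        "sortBy": sort_by.value,       # the enum value is the string sent to the API
        "sortOrder": sort_order.value,
    }
    return QUERY_URL_FORMAT.format(urlencode(params))


url = build_query_url("cat:cs.AI", sort_by=SortCriterion.SubmittedDate)
print(url)
```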

Error Handling

Handle API errors, network issues, and data parsing problems with specific exception types.

class ArxivError(Exception):
    """
    Base exception class for arxiv package errors.
    """
    url: str      # Feed URL that could not be fetched
    retry: int    # Request try number (0 for initial, 1+ for retries)
    message: str  # Error description

    def __init__(self, url: str, retry: int, message: str):
        """
        Constructs ArxivError for specified URL and retry attempt.
        """

class HTTPError(ArxivError):  
    """
    Non-200 HTTP status encountered while fetching results.
    """
    status: int  # HTTP status code

    def __init__(self, url: str, retry: int, status: int):
        """
        Constructs HTTPError for specified status code and URL.
        """

class UnexpectedEmptyPageError(ArxivError):
    """
    Error when a non-first page of results is unexpectedly empty.
    Usually resolved by retries due to arXiv API brittleness. 
    """
    raw_feed: feedparser.FeedParserDict  # Raw feedparser output for diagnostics

    def __init__(self, url: str, retry: int, raw_feed: feedparser.FeedParserDict):
        """
        Constructs UnexpectedEmptyPageError for specified URL and feed.
        """

class MissingFieldError(Exception):  # Nested inside Result; referenced as Result.MissingFieldError
    """
    Error indicating entry cannot be parsed due to missing required fields.
    This is a nested exception class inside Result.
    """
    missing_field: str  # Required field missing from entry
    message: str       # Error description

    def __init__(self, missing_field: str):
        """
        Constructs MissingFieldError for specified missing field.
        
        Parameters:
        - missing_field: The name of the required field that was missing
        """

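The `retry` attribute counts tries, with 0 for the initial request and 1 and above for retries. A generic retry loop with that numbering might look like the sketch below. It is a stand-alone illustration, not the package's implementation: `TransientError` and `call_with_retries` are hypothetical names, and the demo uses a zero delay so it finishes immediately.

```python
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure (e.g. an unexpectedly empty page)."""


def call_with_retries(
    fetch: Callable[[], T],
    num_retries: int = 3,
    delay_seconds: float = 3.0,
) -> T:
    """Call fetch() once, then up to num_retries more times on TransientError."""
    if num_retries < 0:
        raise ValueError("num_retries must be >= 0")
    last_error: Optional[TransientError] = None
    for attempt in range(num_retries + 1):  # attempt 0 is the initial try
        if attempt > 0:
            time.sleep(delay_seconds)  # wait before every retry
        try:
            return fetch()
        except TransientError as err:
            last_error = err
    assert last_error is not None  # the loop always runs at least once
    raise last_error


# Demo: a fake fetch that fails twice, then succeeds on the second retry.
calls: List[int] = []


def flaky_fetch() -> str:
    calls.append(len(calls))
    if len(calls) < 3:
        raise TransientError("empty page")
    return "page contents"


page = call_with_retries(flaky_fetch, num_retries=3, delay_seconds=0.0)
print(page, "after", len(calls), "calls")  # page contents after 3 calls
```
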
Advanced Usage Examples

Complex Search Queries

import arxiv

# Author and title search
search = arxiv.Search(query="au:del_maestro AND ti:checkerboard")

# Category-specific search with a date range (GMT timestamps in YYYYMMDDHHMM form)
search = arxiv.Search(
    query="cat:cs.AI AND submittedDate:[202301010000 TO 202312312359]",
    max_results=50,
    sort_by=arxiv.SortCriterion.SubmittedDate
)

# Multiple specific papers by ID (only this last search is executed below)
search = arxiv.Search(id_list=["1605.08386v1", "2107.05580", "quant-ph/0201082"])

client = arxiv.Client()
for result in client.results(search):
    print(f"{result.get_short_id()}: {result.title}")

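Date-range clauses are easy to mistype by hand. The arXiv API user manual documents the `submittedDate` range format as `YYYYMMDDTTTT`, with the time given in 24-hour GMT to the minute. The sketch below builds such a clause from `datetime` objects. `submitted_date_range` is a hypothetical helper, not part of the arxiv package, and it assumes the datetimes passed in are already in GMT.

```python
from datetime import datetime


def submitted_date_range(start: datetime, end: datetime) -> str:
    """Format a submittedDate range clause from two GMT datetimes."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    fmt = "%Y%m%d%H%M"  # YYYYMMDD followed by 24-hour HHMM
    return f"submittedDate:[{start.strftime(fmt)} TO {end.strftime(fmt)}]"


clause = submitted_date_range(datetime(2023, 1, 1), datetime(2023, 12, 31, 23, 59))
query = f"cat:cs.AI AND {clause}"
print(query)  # cat:cs.AI AND submittedDate:[202301010000 TO 202312312359]
```
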
Custom Client Configuration

import arxiv

# High-throughput client (be careful with rate limits)
fast_client = arxiv.Client(
    page_size=2000,      # Maximum page size
    delay_seconds=3.0,   # Minimum required by arXiv ToU
    num_retries=5        # More retries for reliability
)

# Conservative client for fragile networks
safe_client = arxiv.Client(
    page_size=50,        # Smaller pages
    delay_seconds=5.0,   # Extra delay
    num_retries=10       # Many retries
)

search = arxiv.Search(query="machine learning", max_results=1000)

# Use specific client
results = list(fast_client.results(search))
print(f"Retrieved {len(results)} papers")

Batch Downloads

import arxiv
import os

# Create download directory
os.makedirs("./papers", exist_ok=True)

search = arxiv.Search(
    query="cat:cs.LG AND ti:transformer",
    max_results=20,
    sort_by=arxiv.SortCriterion.SubmittedDate,
    sort_order=arxiv.SortOrder.Descending
)

client = arxiv.Client()

for i, result in enumerate(client.results(search)):
    # Old-style IDs such as "quant-ph/0201082v1" contain "/", so replace it for use in filenames
    base_name = f"{i:02d}_{result.get_short_id().replace('/', '_')}"
    try:
        # Download PDF with custom filename
        path = result.download_pdf(dirpath="./papers", filename=f"{base_name}.pdf")
        print(f"Downloaded: {path}")

        # Also download the source archive
        src_path = result.download_source(dirpath="./papers", filename=f"{base_name}.tar.gz")
        print(f"Downloaded source: {src_path}")

    except Exception as e:
        print(f"Failed to download {result.entry_id}: {e}")

Error Handling

import arxiv
import logging

# Enable debug logging to see API calls
logging.basicConfig(level=logging.DEBUG)

client = arxiv.Client(num_retries=2)
search = arxiv.Search(query="invalid:query:syntax", max_results=10)

try:
    results = list(client.results(search))
    print(f"Found {len(results)} results")
    
except arxiv.HTTPError as e:
    print(f"HTTP error {e.status} after {e.retry} retries: {e.message}")
    print(f"URL: {e.url}")
    
except arxiv.UnexpectedEmptyPageError as e:
    print(f"Empty page after {e.retry} retries: {e.message}")
    print(f"Raw feed info: {e.raw_feed.bozo_exception if e.raw_feed.bozo else 'No bozo exception'}")
    
except Exception as e:
    print(f"Unexpected error: {e}")