A Python interface to archive.org for programmatic access to the Internet Archive's digital library
npx @tessl/cli install tessl/pypi-internetarchive@5.5.00
# Internet Archive Python Library

A comprehensive Python interface to archive.org for programmatic access to the Internet Archive's vast digital library. This library enables developers to search, download, upload, and manage items in the Internet Archive through both a Python API and command-line tools.

## Package Information

- **Package Name**: internetarchive
- **Language**: Python
- **Installation**: `pip install internetarchive`
- **Version**: 5.5.0
- **License**: AGPL-3.0

## Core Imports

```python
import internetarchive
```

Common imports for specific functionality:

```python
from internetarchive import get_item, search_items, get_session
from internetarchive import Item, Search, ArchiveSession
```

## Basic Usage

```python
import internetarchive

# Get an item from the Internet Archive
item = internetarchive.get_item('govlawgacode20071')
print(f"Item exists: {item.exists}")
print(f"Item title: {item.metadata.get('title')}")

# Download files from an item
item.download()

# Search for items; request the title field explicitly, since results
# may otherwise carry only the identifier
search = internetarchive.search_items('collection:nasa',
                                      fields=['identifier', 'title'])
for result in search:
    print(f"Found: {result['identifier']} - {result.get('title', 'No title')}")

# Upload files to create or update an item
internetarchive.upload('my-item-identifier',
                       files=['local-file.txt'],
                       metadata={'title': 'My Item', 'creator': 'Your Name'})
```

## Architecture

The Internet Archive Python library follows a layered architecture:

- **ArchiveSession**: Core session management with persistent configuration and authentication
- **Item/Collection**: Object-oriented representation of Archive.org items and collections
- **File**: Individual file objects within items with download and management capabilities
- **Search**: Powerful search interface with result iteration and filtering
- **Catalog**: Task management system for Archive.org operations
- **CLI Tools**: Comprehensive command-line interface via the `ia` command

This design enables both high-level convenience functions and low-level session-based access patterns, supporting everything from simple file downloads to complex metadata operations and bulk processing workflows.

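To make the layering concrete, the following is a minimal sketch that walks from a session to an item to its files. `summarize_item` and `main` are hypothetical helpers, not part of the library. The sketch assumes the session object exposes `get_item`, and that the returned item exposes `identifier`, `exists`, and a `get_files()` method yielding objects with a `name` attribute.

```python
def summarize_item(session, identifier):
    """Walk the layers: session -> item -> files (hypothetical helper)."""
    item = session.get_item(identifier)  # the session hands out Item objects
    return {
        "identifier": item.identifier,
        "exists": item.exists,
        # Each File object represents one file stored in the item
        "files": sorted(f.name for f in item.get_files()),
    }


def main():
    # Imported here so summarize_item stays usable without the package installed
    from internetarchive import get_session

    session = get_session()
    print(summarize_item(session, "govlawgacode20071"))
```

Calling `main()` issues a live metadata request to archive.org, so it is left uncalled here. Passing the session in as a parameter keeps the helper easy to exercise with a stand-in object.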
## Capabilities

### Session Management

Create and manage persistent sessions with configuration, authentication, and HTTP adapter customization for efficient bulk operations.

```python { .api }
def get_session(config=None, config_file=None, debug=False, http_adapter_kwargs=None):
    """
    Return a new ArchiveSession object for persistent configuration across tasks.

    Args:
        config (dict, optional): Configuration dictionary
        config_file (str, optional): Path to configuration file
        debug (bool): Enable debug logging
        http_adapter_kwargs (dict, optional): HTTP adapter keyword arguments

    Returns:
        ArchiveSession: Session object for API interactions
    """
```

[Session Management](./session-management.md)

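As an illustration of passing configuration at session creation, here is a hedged sketch. `build_config` and `open_session` are hypothetical helpers. The nested `{"s3": {"access": ..., "secret": ...}}` layout and the `max_retries` adapter option are assumptions about how the configuration dictionary and `http_adapter_kwargs` are interpreted, so check them against the session documentation before relying on them.

```python
def build_config(access_key, secret_key):
    """Build a config dict; the 's3' section layout is an assumption."""
    return {"s3": {"access": access_key, "secret": secret_key}}


def open_session(access_key, secret_key, max_retries=3):
    # Imported lazily so build_config can be used without the package installed
    from internetarchive import get_session

    return get_session(
        config=build_config(access_key, secret_key),
        # Assumed to be forwarded to the underlying HTTP adapter
        http_adapter_kwargs={"max_retries": max_retries},
    )
```

Reusing the object returned by `open_session` across many calls avoids rebuilding configuration for each request, which is the point of a persistent session in bulk work.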
### Item Operations

Access, download, upload, and manage Archive.org items with comprehensive metadata support and file filtering capabilities.

```python { .api }
def get_item(identifier, config=None, config_file=None, archive_session=None, debug=False, http_adapter_kwargs=None, request_kwargs=None):
    """
    Get an Item object by Archive.org identifier.

    Args:
        identifier (str): The globally unique Archive.org item identifier
        config (dict, optional): Configuration dictionary
        config_file (str, optional): Path to configuration file
        archive_session (ArchiveSession, optional): Existing session object
        debug (bool): Enable debug logging
        http_adapter_kwargs (dict, optional): HTTP adapter kwargs
        request_kwargs (dict, optional): Request kwargs

    Returns:
        Item: Item object for the specified identifier
    """

def upload(identifier, files, metadata=None, headers=None, access_key=None, secret_key=None, queue_derive=None, verbose=False, verify=False, checksum=False, delete=False, retries=None, retries_sleep=None, debug=False, validate_identifier=False, request_kwargs=None, **get_item_kwargs):
    """
    Upload files to an Archive.org item (creates item if it doesn't exist).

    Args:
        identifier (str): Item identifier to upload to
        files (list): List of file paths or file-like objects to upload
        metadata (dict, optional): Item metadata
        headers (dict, optional): HTTP headers
        Various authentication and upload options...

    Returns:
        list: List of Request/Response objects from upload operations
    """

def download(identifier, files=None, formats=None, glob_pattern=None, dry_run=False, verbose=False, ignore_existing=False, checksum=False, checksum_archive=False, destdir=None, no_directory=False, retries=None, item_index=None, ignore_errors=False, on_the_fly=False, return_responses=False, no_change_timestamp=False, timeout=None, **get_item_kwargs):
    """
    Download files from an Archive.org item with extensive filtering options.

    Args:
        identifier (str): Item identifier to download from
        files (list, optional): Specific files to download
        formats (list, optional): File formats to download
        glob_pattern (str, optional): Glob pattern for file selection
        Various download configuration options...

    Returns:
        list: List of Request/Response objects from download operations
    """
```

[Item Operations](./item-operations.md)

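A filtered download might be sketched as follows. `download_options` and `fetch_matching` are hypothetical helpers; the keyword names come from the `download` signature above, while their exact behavior (for example, that `dry_run` reports what would be fetched instead of writing files) is an assumption to verify.

```python
def download_options(pattern, destdir, dry_run=True):
    """Collect keyword arguments for download() (hypothetical helper)."""
    return {
        "glob_pattern": pattern,   # select files by glob, e.g. "*.pdf"
        "destdir": destdir,        # directory to place downloads in
        "dry_run": dry_run,        # assumed: report instead of downloading
        "ignore_existing": True,   # assumed: skip files already present
    }


def fetch_matching(identifier, pattern="*.pdf", destdir="downloads", dry_run=True):
    # Imported lazily so download_options works without the package installed
    from internetarchive import download

    return download(identifier, **download_options(pattern, destdir, dry_run))
```

Keeping the options in one dictionary makes it straightforward to log or review exactly what will be requested before switching `dry_run` off.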
### Search Operations

Search the Internet Archive with advanced query syntax, field selection, sorting, and full-text search capabilities.

```python { .api }
def search_items(query, fields=None, sorts=None, params=None, full_text_search=False, dsl_fts=False, archive_session=None, config=None, config_file=None, http_adapter_kwargs=None, request_kwargs=None, max_retries=None):
    """
    Search for items on Archive.org with advanced filtering options.

    Args:
        query (str): Search query string
        fields (list, optional): Fields to return in results
        sorts (list, optional): Sort criteria
        params (dict, optional): Additional search parameters
        full_text_search (bool): Enable full-text search
        dsl_fts (bool): Enable DSL full-text search
        Various session and request options...

    Returns:
        Search: Search object for iterating over results
    """
```

[Search Operations](./search-operations.md)

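The sketch below shows one way to compose a query and take only the first few results. `build_query`, `first_identifiers`, and `search_collection` are hypothetical helpers, and the `field:value AND field:value` query form is assumed to be accepted by the search endpoint. Because the `Search` object is iterated lazily, `itertools.islice` stops early rather than consuming every result.

```python
from itertools import islice


def build_query(**terms):
    """Join field:value pairs with AND (hypothetical helper)."""
    return " AND ".join(f"{field}:{value}" for field, value in terms.items())


def first_identifiers(results, limit=5):
    """Take the first `limit` identifiers from any iterable of result dicts."""
    return [r["identifier"] for r in islice(results, limit)]


def search_collection(collection, mediatype, limit=5):
    # Imported lazily so the helpers above work without the package installed
    from internetarchive import search_items

    query = build_query(collection=collection, mediatype=mediatype)
    return first_identifiers(search_items(query, fields=["identifier"]), limit)
```

For example, `search_collection("nasa", "image")` would run the query `collection:nasa AND mediatype:image` against the live service.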
### File Management

Access and manage individual files within Archive.org items, including download, deletion, and metadata access.

```python { .api }
def get_files(identifier, files=None, formats=None, glob_pattern=None, exclude_pattern=None, on_the_fly=False, **get_item_kwargs):
    """
    Get File objects from an item with optional filtering.

    Args:
        identifier (str): Item identifier
        files (list, optional): Specific files to retrieve
        formats (list, optional): File formats to filter by
        glob_pattern (str, optional): Glob pattern for file selection
        exclude_pattern (str, optional): Glob pattern for exclusion
        on_the_fly (bool): Include on-the-fly files

    Returns:
        list: List of File objects
    """

def delete(identifier, files=None, formats=None, glob_pattern=None, cascade_delete=False, access_key=None, secret_key=None, verbose=False, debug=False, **kwargs):
    """
    Delete files from an Archive.org item.

    Args:
        identifier (str): Item identifier
        files (list, optional): Specific files to delete
        formats (list, optional): File formats to delete
        glob_pattern (str, optional): Glob pattern for file selection
        cascade_delete (bool): Delete derived files
        Various authentication and request options...

    Returns:
        list: List of Request/Response objects from delete operations
    """
```

[File Management](./file-management.md)

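As a small illustration of working with the returned `File` objects, this sketch totals the size of an item's text files. `total_size` and `text_file_report` are hypothetical helpers, and the `size` attribute on `File` objects is an assumption, which is why the code tolerates it being absent or `None`.

```python
def total_size(files):
    """Sum file sizes in bytes, treating a missing or None size as zero."""
    return sum(int(getattr(f, "size", None) or 0) for f in files)


def text_file_report(identifier):
    # Imported lazily so total_size works without the package installed
    from internetarchive import get_files

    files = list(get_files(identifier, glob_pattern="*.txt"))
    return {"count": len(files), "bytes": total_size(files)}
```

A report like this is a cheap check to run before a bulk `delete` or download, since it only reads item metadata.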
### Metadata Operations

View and modify item metadata with support for appending, targeting specific metadata sections, and batch operations.

```python { .api }
def modify_metadata(identifier, metadata, target=None, append=False, append_list=False, priority=0, access_key=None, secret_key=None, debug=False, request_kwargs=None, **get_item_kwargs):
    """
    Modify metadata of an existing Archive.org item.

    Args:
        identifier (str): Item identifier
        metadata (dict): Metadata changes to apply
        target (str, optional): Target metadata section
        append (bool): Append to existing metadata
        append_list (bool): Append to metadata lists
        priority (int): Task priority
        Various authentication and request options...

    Returns:
        Request or Response: Metadata modification result
    """
```

[Metadata Operations](./metadata-operations.md)

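Since `modify_metadata` takes a dictionary of changes, one reasonable pattern is to send only the fields that actually differ. The sketch below assumes `item.metadata` behaves like a dictionary; `metadata_patch` and `sync_metadata` are hypothetical helpers rather than library functions.

```python
def metadata_patch(current, desired):
    """Return only the keys whose desired value differs from the current one."""
    return {key: value for key, value in desired.items() if current.get(key) != value}


def sync_metadata(identifier, desired):
    # Imported lazily so metadata_patch works without the package installed
    from internetarchive import get_item, modify_metadata

    patch = metadata_patch(get_item(identifier).metadata, desired)
    if not patch:
        return None  # nothing to change, so no write request is sent
    return modify_metadata(identifier, metadata=patch)
```

Skipping the call when the patch is empty avoids submitting a metadata write that would change nothing.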
### Task Management

Manage Archive.org catalog tasks including derive operations, item processing, and task monitoring.

```python { .api }
def get_tasks(identifier="", params=None, config=None, config_file=None, archive_session=None, http_adapter_kwargs=None, request_kwargs=None):
    """
    Get tasks from the Archive.org catalog system.

    Args:
        identifier (str, optional): Filter tasks by item identifier
        params (dict, optional): Additional task query parameters
        Various session and request options...

    Returns:
        set: Set of CatalogTask objects
    """
```

[Task Management](./task-management.md)

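For monitoring, the returned set of `CatalogTask` objects can be tallied by task type. This is a rough sketch: `count_by_command` and `task_summary` are hypothetical helpers, and the `cmd` attribute on `CatalogTask` is an assumption, so the code falls back to `"unknown"` when it is missing.

```python
from collections import Counter


def count_by_command(tasks):
    """Tally tasks by their `cmd` attribute (attribute name is an assumption)."""
    return Counter(getattr(task, "cmd", "unknown") for task in tasks)


def task_summary(identifier):
    # Imported lazily so count_by_command works without the package installed
    from internetarchive import get_tasks

    return count_by_command(get_tasks(identifier))
```

`task_summary` queries the live catalog, so it needs network access and valid credentials where the catalog requires them.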
### Configuration and Authentication

Configure the library with Archive.org credentials and retrieve user information.

```python { .api }
def configure(username="", password="", config_file="", host="archive.org"):
    """
    Configure internetarchive with Archive.org credentials.

    Args:
        username (str): Archive.org username
        password (str): Archive.org password
        config_file (str): Path to config file
        host (str): Archive.org host

    Returns:
        str: Path to configuration file
    """

def get_username(access_key, secret_key):
    """
    Get Archive.org username from IA-S3 key pair.

    Args:
        access_key (str): IA-S3 access key
        secret_key (str): IA-S3 secret key

    Returns:
        str: Archive.org username
    """

def get_user_info(access_key, secret_key):
    """
    Get detailed user information from IA-S3 key pair.

    Args:
        access_key (str): IA-S3 access key
        secret_key (str): IA-S3 secret key

    Returns:
        dict: User information dictionary
    """
```

[Configuration and Authentication](./configuration-auth.md)

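To keep credentials out of source code, `configure` can be fed from the environment. The following sketch uses `MY_IA_USERNAME` and `MY_IA_PASSWORD`, which are hypothetical variable names chosen for illustration and not names the library itself reads; `load_credentials` and `configure_from_env` are likewise hypothetical helpers.

```python
import os


def load_credentials(env):
    """Read username/password from a mapping such as os.environ.

    MY_IA_USERNAME and MY_IA_PASSWORD are hypothetical variable names.
    """
    try:
        return {"username": env["MY_IA_USERNAME"], "password": env["MY_IA_PASSWORD"]}
    except KeyError as missing:
        raise RuntimeError(f"missing environment variable: {missing}") from None


def configure_from_env():
    # Imported lazily so load_credentials works without the package installed
    from internetarchive import configure

    # configure() returns the path of the configuration file
    return configure(**load_credentials(os.environ))
```

Failing early with a clear message when a variable is missing is preferable to passing empty strings through to `configure`.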
### Account Management

Administrative functions for managing Archive.org user accounts. Requires administrative privileges.

**Note:** The Account class is not part of the main public API but can be imported directly from `internetarchive.account`.

```python { .api }
# Import required for Account class
from internetarchive.account import Account

class Account:
    """
    Administrative interface for managing Archive.org user accounts.

    Note: Requires administrative privileges.
    """

    @classmethod
    def from_account_lookup(cls, identifier_type: str, identifier: str, session=None):
        """
        Factory method to get Account by identifier type and value.

        Args:
            identifier_type (str): Type of identifier ('email', 'screenname', 'itemname')
            identifier (str): The identifier value (e.g., 'user@example.com')
            session (ArchiveSession, optional): Session object to use

        Returns:
            Account: Account object with user information

        Raises:
            AccountAPIError: If account lookup fails or access denied
        """

    def lock(self, comment: str):
        """Lock the account with a comment."""

    def unlock(self, comment: str):
        """Unlock the account with a comment."""

    def to_dict(self):
        """Convert account data to dictionary."""
```

[Account Management](./account-management.md)

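An account lookup with basic input checking could be sketched as below. `check_identifier_type` and `lookup_account` are hypothetical helpers. The three identifier types are taken from the docstring above, and importing `AccountAPIError` from `internetarchive.exceptions` is an assumption about where that exception lives.

```python
VALID_IDENTIFIER_TYPES = ("email", "screenname", "itemname")


def check_identifier_type(identifier_type):
    """Reject lookup types other than the three listed in the docstring above."""
    if identifier_type not in VALID_IDENTIFIER_TYPES:
        raise ValueError(f"unsupported identifier type: {identifier_type!r}")
    return identifier_type


def lookup_account(identifier_type, identifier):
    # Imported lazily so check_identifier_type works without the package installed
    from internetarchive.account import Account
    from internetarchive.exceptions import AccountAPIError  # assumed location

    check_identifier_type(identifier_type)
    try:
        return Account.from_account_lookup(identifier_type, identifier).to_dict()
    except AccountAPIError:
        return None  # lookup failed or the caller lacks admin privileges
```

Validating the identifier type locally catches typos before any request is made; the actual lookup still requires administrative privileges.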
### Command Line Interface

Comprehensive command-line tools accessible through the `ia` command for all major Archive.org operations.

```python { .api }
# CLI Commands (accessed via command line):
# ia configure - Configure credentials
# ia upload - Upload files to items
# ia download - Download files from items
# ia delete - Delete files from items
# ia metadata - View/modify item metadata
# ia search - Search Archive.org
# ia list - List item files
# ia tasks - Manage catalog tasks
# ia copy - Copy files between items
# ia move - Move files between items
# ia account - Account management
# ia reviews - Manage item reviews
# ia flag - Flag items for review
```

[Command Line Interface](./cli-interface.md)

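When the `ia` command needs to be driven from a script, building the argument list separately from running it keeps the invocation easy to inspect. In this sketch, `ia_download_argv` and `run_ia` are hypothetical helpers, and the `--glob` and `--dry-run` flags are assumptions that should be confirmed with `ia download --help` for the installed version.

```python
import subprocess


def ia_download_argv(identifier, pattern, dry_run=True):
    """Build the argument list for `ia download` (flags are assumptions)."""
    argv = ["ia", "download", identifier, f"--glob={pattern}"]
    if dry_run:
        argv.append("--dry-run")
    return argv


def run_ia(argv):
    # check=True raises CalledProcessError if `ia` exits with a non-zero status
    completed = subprocess.run(argv, check=True, capture_output=True, text=True)
    return completed.stdout
```

Passing the arguments as a list rather than a shell string avoids quoting problems with glob patterns such as `*.pdf`.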
## Types

```python { .api }
class ArchiveSession:
    """Main session class for Internet Archive operations."""

class Item:
    """Represents an Archive.org item."""

class Collection:
    """Represents an Archive.org collection (extends Item)."""

class File:
    """Represents a file within an Archive.org item."""

class Search:
    """Represents a search query and results."""

class Catalog:
    """Interface to Archive.org catalog/tasks system."""

class CatalogTask:
    """Represents a catalog task."""

class Account:
    """Account management interface (requires admin privileges)."""

# Package metadata
__version__: str
"""Current version of the internetarchive package (5.5.0)."""

# Exceptions
class AuthenticationError(Exception):
    """Authentication failed."""

class ItemLocateError(Exception):
    """Item cannot be located (dark or non-existent)."""

class InvalidChecksumError(Exception):
    """File corrupt, checksums don't match."""

class AccountAPIError(Exception):
    """Account API-related errors."""
```