Streaming WARC (and ARC) IO library for reading and writing web archive files
npx @tessl/cli install tessl/pypi-warcio@1.7.00
# warcio

warcio is a Python library for reading and writing WARC (Web ARChive) and ARC (ARChive) files. It provides streaming I/O with automatic format detection, compression handling, and HTTP traffic capture, and serves as a foundation for web archiving and digital preservation workflows.

## Package Information

- **Package Name**: warcio
- **Language**: Python
- **Installation**: `pip install warcio`

## Core Imports

```python
from warcio import StatusAndHeaders, ArchiveIterator, WARCWriter
```

Individual components:

```python
from warcio.statusandheaders import StatusAndHeaders, StatusAndHeadersParser
from warcio.archiveiterator import ArchiveIterator, WARCIterator, ARCIterator
from warcio.warcwriter import WARCWriter, BufferWARCWriter
from warcio.recordbuilder import RecordBuilder
from warcio.capture_http import capture_http
from warcio.utils import Digester, BUFF_SIZE
from warcio.exceptions import ArchiveLoadFailed
from warcio.indexer import Indexer
from warcio.checker import Checker
from warcio.extractor import Extractor
from warcio.recompressor import Recompressor
```

## Basic Usage

```python
import io

from warcio import ArchiveIterator, WARCWriter, StatusAndHeaders
from warcio.recordbuilder import RecordBuilder
from warcio.capture_http import capture_http
import requests  # import requests after capture_http so its connections are patched

# Reading WARC files
with open('example.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'response':
            print(f"URL: {record.rec_headers.get_header('WARC-Target-URI')}")
            print(f"Status: {record.http_headers.get_statuscode()}")
            print(f"Content-Type: {record.http_headers.get_header('Content-Type')}")
            # Access decompressed content
            content = record.content_stream().read()

# Writing WARC files manually
output_buffer = io.BytesIO()
writer = WARCWriter(output_buffer)
builder = RecordBuilder()

# Create a response record; the protocol is needed to form a valid HTTP status line
record = builder.create_warc_record(
    uri='http://example.com',
    record_type='response',
    payload=io.BytesIO(b'Hello, World!'),
    http_headers=StatusAndHeaders('200 OK', [('Content-Type', 'text/plain')],
                                  protocol='HTTP/1.0')
)
writer.write_record(record)

# HTTP capture (common usage)
with capture_http('example.warc.gz') as writer:
    requests.get('https://example.com/')  # Automatically captured to WARC
```

## Architecture

warcio follows a layered architecture designed for streaming processing:

- **ArchiveIterator**: Provides sequential access to records with automatic format detection and decompression
- **RecordBuilder**: Creates new WARC records with proper headers and digests
- **WARCWriter**: Handles serialization and compression for output files
- **StatusAndHeaders**: Manages HTTP-style headers with case-insensitive access
- **Stream Processing**: Buffered readers with compression support and digest verification (BufferedReader, LimitReader, DigestVerifyingReader)
- **HTTP Capture**: Records live traffic by monkey-patching `http.client`
- **CLI Tools**: Command-line utilities for indexing, checking, extraction, and recompression
- **Time Utilities**: Timestamp parsing and formatting for web archive formats
- **Exception Handling**: Specialized exceptions for archive loading and parsing errors

This design allows large archive files to be processed without loading their entire contents into memory, and supports both WARC 1.0/1.1 and the legacy ARC format.

## Capabilities

### Archive Reading and Iteration

Core functionality for reading and iterating through WARC and ARC files with automatic format detection, decompression, and record parsing.

```python { .api }
class ArchiveIterator:
    def __init__(self, fileobj, no_record_parse=False, verify_http=False,
                 arc2warc=False, ensure_http_headers=False,
                 block_size=16384, check_digests=False): ...

    def __iter__(self): ...
    def __next__(self): ...
    def close(self): ...
    def get_record_offset(self): ...
    def get_record_length(self): ...

class WARCIterator(ArchiveIterator):
    def __init__(self, *args, **kwargs): ...

class ARCIterator(ArchiveIterator):
    def __init__(self, *args, **kwargs): ...
```

[Archive Reading](./archive-reading.md)

### WARC Writing and Record Creation

Functionality for creating and writing WARC files, including record building, header management, and compression.

```python { .api }
class WARCWriter:
    def __init__(self, filebuf, gzip=True, warc_version=None, header_filter=None): ...
    def write_record(self, record, params=None): ...
    def write_request_response_pair(self, req, resp, params=None): ...

class BufferWARCWriter(WARCWriter):
    def __init__(self, gzip=True, warc_version=None, header_filter=None): ...
    def get_contents(self): ...
    def get_stream(self): ...

class RecordBuilder:
    def __init__(self, warc_version=None, header_filter=None): ...
    def create_warc_record(self, uri, record_type, payload=None, length=None,
                           warc_content_type='', warc_headers_dict=None,
                           warc_headers=None, http_headers=None): ...
    def create_revisit_record(self, uri, digest, refers_to_uri, refers_to_date,
                              http_headers=None, warc_headers_dict=None): ...
    def create_warcinfo_record(self, filename, info): ...
```

[WARC Writing](./warc-writing.md)

### HTTP Headers and Status Management

Comprehensive HTTP header parsing, manipulation, and formatting with support for status lines and case-insensitive access.

```python { .api }
class StatusAndHeaders:
    def __init__(self, statusline, headers, protocol='', total_len=0,
                 is_http_request=False): ...
    def get_header(self, name, default_value=None): ...
    def add_header(self, name, value): ...
    def replace_header(self, name, value): ...
    def remove_header(self, name): ...
    def get_statuscode(self): ...

class StatusAndHeadersParser:
    def __init__(self, statuslist, verify=True): ...
    def parse(self, stream, full_statusline=None): ...
```

[HTTP Headers](./http-headers.md)

### HTTP Traffic Capture

Live HTTP traffic recording capabilities that capture requests and responses directly to WARC format.

```python { .api }
def capture_http(warc_writer=None, filter_func=None, append=True,
                 record_ip=True, **kwargs): ...
```

[HTTP Capture](./http-capture.md)

### Stream Processing and Utilities

Advanced stream processing with compression, digest verification, and buffered reading capabilities.

```python { .api }
class BufferedReader:
    def __init__(self, stream, block_size=16384, decomp_type=None,
                 starting_data=None, read_all_members=False): ...
    def read(self, length=None): ...
    def readline(self, length=None): ...

class LimitReader:
    def __init__(self, stream, limit): ...
    def read(self, length=None): ...
    def readline(self, length=None): ...

class DigestVerifyingReader:
    def __init__(self, stream, limit, digest_checker, record_type=None,
                 payload_digest=None, block_digest=None, segment_number=None): ...
```

[Stream Processing](./stream-processing.md)

### Time and Date Utilities

Comprehensive time handling for web archive timestamps with support for multiple date formats and timezone handling.

```python { .api }
def iso_date_to_datetime(string, tz_aware=False): ...
def http_date_to_datetime(string, tz_aware=False): ...
def datetime_to_http_date(the_datetime): ...
def datetime_to_iso_date(the_datetime, use_micros=False): ...
def timestamp_now(): ...
def timestamp_to_datetime(string, tz_aware=False): ...
```

[Time Utilities](./time-utilities.md)

### Command Line Tools

Built-in command line utilities for indexing, checking, extracting, and recompressing WARC/ARC files.

```python { .api }
class Indexer:
    def __init__(self, fields, inputs, output, verify_http=False): ...
    def process_all(self): ...

class Checker:
    def __init__(self, cmd): ...
    def process_all(self): ...

class Extractor:
    def __init__(self, filename, offset): ...
    def extract(self, payload_only, headers_only): ...

class Recompressor:
    def __init__(self, filename, output, verbose=False): ...
    def recompress(self): ...
```

[Command Line Tools](./cli-tools.md)

## Types

```python { .api }
class ArcWarcRecord:
    """Represents a parsed WARC/ARC record."""
    def __init__(self, format, rec_type, rec_headers, raw_stream,
                 http_headers=None, content_type=None, length=None,
                 payload_length=-1, digest_checker=None): ...
    def content_stream(self): ...

class Digester:
    """Hash digest calculator."""
    def __init__(self, type_='sha1'): ...
    def update(self, buff): ...
    def __str__(self): ...

class DigestChecker:
    """Digest validation checker."""
    def __init__(self, kind=None): ...
    @property
    def passed(self): ...
    @property
    def problems(self): ...

# Exception Classes
class ArchiveLoadFailed(Exception):
    """Exception for archive loading failures."""
    def __init__(self, reason): ...

class ChunkedDataException(Exception):
    """Exception for chunked data parsing errors."""
    def __init__(self, msg, data=b''): ...

class StatusAndHeadersParserException(Exception):
    """Exception for status/headers parsing errors."""
    def __init__(self, msg, statusline): ...

# Constants
BUFF_SIZE = 16384
```