# warcio

A comprehensive Python library for reading and writing WARC (Web ARChive) and ARC (ARChive) files. warcio provides streaming I/O capabilities with automatic format detection, compression handling, and HTTP traffic capture functionality, serving as the foundation for web archiving and digital preservation workflows.

## Package Information

- **Package Name**: warcio
- **Language**: Python
- **Installation**: `pip install warcio`

## Core Imports

```python
from warcio import StatusAndHeaders, ArchiveIterator, WARCWriter
```

Individual components:

```python
from warcio.statusandheaders import StatusAndHeaders, StatusAndHeadersParser
from warcio.archiveiterator import ArchiveIterator, WARCIterator, ARCIterator
from warcio.warcwriter import WARCWriter, BufferWARCWriter
from warcio.recordbuilder import RecordBuilder
from warcio.capture_http import capture_http
from warcio.utils import Digester, BUFF_SIZE
from warcio.exceptions import ArchiveLoadFailed
from warcio.indexer import Indexer
from warcio.checker import Checker
from warcio.extractor import Extractor
from warcio.recompressor import Recompressor
```

## Basic Usage

```python
from warcio import ArchiveIterator, WARCWriter, StatusAndHeaders
from warcio.recordbuilder import RecordBuilder
from warcio.capture_http import capture_http
import requests  # must be imported after capture_http
import io

# Reading WARC files
with open('example.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'response':
            print(f"URL: {record.rec_headers.get_header('WARC-Target-URI')}")
            print(f"Status: {record.http_headers.get_statuscode()}")
            print(f"Content-Type: {record.http_headers.get_header('Content-Type')}")
            # Access decompressed content
            content = record.content_stream().read()

# Writing WARC files manually
output_buffer = io.BytesIO()
writer = WARCWriter(output_buffer)
builder = RecordBuilder()

# Create a response record; the protocol is passed separately so the
# serialized status line reads "HTTP/1.0 200 OK"
record = builder.create_warc_record(
    uri='http://example.com',
    record_type='response',
    payload=io.BytesIO(b'Hello, World!'),
    http_headers=StatusAndHeaders('200 OK', [('Content-Type', 'text/plain')],
                                  protocol='HTTP/1.0')
)
writer.write_record(record)

# HTTP capture (common usage)
with capture_http('example.warc.gz') as writer:
    requests.get('https://example.com/')  # Automatically captured to WARC
```

## Architecture

warcio follows a layered architecture designed for streaming processing:

- **ArchiveIterator**: Provides sequential access to records with automatic format detection and decompression
- **RecordBuilder**: Creates new WARC records with proper headers and digests
- **WARCWriter**: Handles serialization and compression for output files
- **StatusAndHeaders**: Manages HTTP-style headers with case-insensitive access
- **Stream Processing**: Buffered readers with compression support and digest verification (BufferedReader, LimitReader, DigestVerifyingReader)
- **HTTP Capture**: Live traffic recording with monkey-patching of http.client
- **CLI Tools**: Command-line utilities for indexing, checking, extraction, and recompression
- **Time Utilities**: Comprehensive timestamp handling for web archive formats
- **Exception Handling**: Specialized exceptions for archive loading and parsing errors

This design enables efficient processing of large archive files without loading entire contents into memory, supporting both WARC 1.0/1.1 and legacy ARC formats.

## Capabilities

### Archive Reading and Iteration

Core functionality for reading and iterating through WARC and ARC files with automatic format detection, decompression, and record parsing.

```python { .api }
class ArchiveIterator:
    def __init__(self, fileobj, no_record_parse=False, verify_http=False,
                 arc2warc=False, ensure_http_headers=False,
                 block_size=16384, check_digests=False): ...

    def __iter__(self): ...
    def __next__(self): ...
    def close(self): ...
    def get_record_offset(self): ...
    def get_record_length(self): ...

class WARCIterator(ArchiveIterator):
    def __init__(self, *args, **kwargs): ...

class ARCIterator(ArchiveIterator):
    def __init__(self, *args, **kwargs): ...
```

[Archive Reading](./archive-reading.md)

### WARC Writing and Record Creation

Functionality for creating and writing WARC files, including record building, header management, and compression.

```python { .api }
class WARCWriter:
    def __init__(self, filebuf, gzip=True, warc_version=None, header_filter=None): ...
    def write_record(self, record, params=None): ...
    def write_request_response_pair(self, req, resp, params=None): ...

class BufferWARCWriter(WARCWriter):
    def __init__(self, gzip=True, warc_version=None, header_filter=None): ...
    def get_contents(self): ...
    def get_stream(self): ...

class RecordBuilder:
    def __init__(self, warc_version=None, header_filter=None): ...
    def create_warc_record(self, uri, record_type, payload=None, length=None,
                           warc_content_type='', warc_headers_dict=None,
                           warc_headers=None, http_headers=None): ...
    def create_revisit_record(self, uri, digest, refers_to_uri, refers_to_date,
                              http_headers=None, warc_headers_dict=None): ...
    def create_warcinfo_record(self, filename, info): ...
```

[WARC Writing](./warc-writing.md)

### HTTP Headers and Status Management

Comprehensive HTTP header parsing, manipulation, and formatting with support for status lines and case-insensitive access.

```python { .api }
class StatusAndHeaders:
    def __init__(self, statusline, headers, protocol='', total_len=0,
                 is_http_request=False): ...
    def get_header(self, name, default_value=None): ...
    def add_header(self, name, value): ...
    def replace_header(self, name, value): ...
    def remove_header(self, name): ...
    def get_statuscode(self): ...

class StatusAndHeadersParser:
    def __init__(self, statuslist, verify=True): ...
    def parse(self, stream, full_statusline=None): ...
```

[HTTP Headers](./http-headers.md)

### HTTP Traffic Capture

Live HTTP traffic recording capabilities that capture requests and responses directly to WARC format.

```python { .api }
def capture_http(warc_writer=None, filter_func=None, append=True,
                 record_ip=True, **kwargs): ...
```

[HTTP Capture](./http-capture.md)

### Stream Processing and Utilities

Advanced stream processing with compression, digest verification, and buffered reading capabilities.

```python { .api }
class BufferedReader:
    def __init__(self, stream, block_size=16384, decomp_type=None,
                 starting_data=None, read_all_members=False): ...
    def read(self, length=None): ...
    def readline(self, length=None): ...

class LimitReader:
    def __init__(self, stream, limit): ...
    def read(self, length=None): ...
    def readline(self, length=None): ...

class DigestVerifyingReader:
    def __init__(self, stream, limit, digest_checker, record_type=None,
                 payload_digest=None, block_digest=None, segment_number=None): ...
```

[Stream Processing](./stream-processing.md)

### Time and Date Utilities

Comprehensive time handling for web archive timestamps with support for multiple date formats and timezone handling.

```python { .api }
def iso_date_to_datetime(string, tz_aware=False): ...
def http_date_to_datetime(string, tz_aware=False): ...
def datetime_to_http_date(the_datetime): ...
def datetime_to_iso_date(the_datetime, use_micros=False): ...
def timestamp_now(): ...
def timestamp_to_datetime(string, tz_aware=False): ...
```

[Time Utilities](./time-utilities.md)

### Command Line Tools

Built-in command line utilities for indexing, checking, extracting, and recompressing WARC/ARC files.

```python { .api }
class Indexer:
    def __init__(self, fields, inputs, output, verify_http=False): ...
    def process_all(self): ...

class Checker:
    def __init__(self, cmd): ...
    def process_all(self): ...

class Extractor:
    def __init__(self, filename, offset): ...
    def extract(self, payload_only, headers_only): ...

class Recompressor:
    def __init__(self, filename, output, verbose=False): ...
    def recompress(self): ...
```

[Command Line Tools](./cli-tools.md)

## Types

```python { .api }
class ArcWarcRecord:
    """Represents a parsed WARC/ARC record."""
    def __init__(self, format, rec_type, rec_headers, raw_stream,
                 http_headers=None, content_type=None, length=None,
                 payload_length=-1, digest_checker=None): ...
    def content_stream(self): ...

class Digester:
    """Hash digest calculator."""
    def __init__(self, type_='sha1'): ...
    def update(self, buff): ...
    def __str__(self): ...

class DigestChecker:
    """Digest validation checker."""
    def __init__(self, kind=None): ...
    @property
    def passed(self): ...
    @property
    def problems(self): ...

# Exception Classes
class ArchiveLoadFailed(Exception):
    """Exception for archive loading failures."""
    def __init__(self, reason): ...

class ChunkedDataException(Exception):
    """Exception for chunked data parsing errors."""
    def __init__(self, msg, data=b''): ...

class StatusAndHeadersParserException(Exception):
    """Exception for status/headers parsing errors."""
    def __init__(self, msg, statusline): ...

# Constants
BUFF_SIZE = 16384
```