# Utilities and CLI Tools

Command-line utilities and functions for converting XML dumps to various formats, validating revision documents, and normalizing data structures. These tools provide additional processing capabilities beyond the core streaming API.

## Capabilities

### Dump to Revision Documents Conversion

Converts MediaWiki XML dumps to page-partitioned sequences of revision JSON documents for easier processing and analysis.

```python { .api }
def dump2revdocs(dump, verbose=False):
    """
    Converts XML dumps to page-partitioned sequences of revision JSON documents.

    This function processes each page in the dump and yields a JSON
    representation of every revision. The documents contain all revision
    metadata and content in a structured format suitable for further
    processing or storage.

    Parameters:
    - dump: mwxml.Dump object to process
    - verbose: Print progress information to stderr (bool, default: False)
               Shows page titles and revision progress dots when enabled

    Yields: JSON-serializable revision documents (the result of revision.to_json())
    """
```

**Usage Example:**

```python
import json

import mwxml
from mwxml.utilities import dump2revdocs

# Collect revision documents while printing progress to stderr.
# Assumption: each yielded document is the JSON-serializable dict returned
# by revision.to_json(); if your version yields strings instead, parse them
# with json.loads() first.
revision_docs = []
with open("dump.xml") as dump_file:
    dump = mwxml.Dump.from_file(dump_file)
    for revision_doc in dump2revdocs(dump, verbose=True):
        revision_docs.append(revision_doc)

        # Process individual revision document
        print(f"Revision {revision_doc['id']} on page {revision_doc['page']['title']}")

# Save to file as JSON Lines; the dump is reopened because it is read as a stream
with open("dump.xml") as dump_file, open("revisions.jsonl", "w") as f:
    dump = mwxml.Dump.from_file(dump_file)
    for revision_doc in dump2revdocs(dump):
        f.write(json.dumps(revision_doc) + "\n")
```
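
Because the output is page-partitioned, all documents for one page arrive consecutively, so they can be regrouped without holding the whole dump in memory. The following is a minimal sketch of that idea using `itertools.groupby`; the helper name `group_by_page` and the sample documents are hypothetical stand-ins, and the `page.id` field layout follows the usage example above.

```python
from itertools import groupby


def group_by_page(revision_docs):
    """Yield (page_id, [revision_doc, ...]) for each run of consecutive docs.

    Hypothetical helper: relies on the documents being page-partitioned,
    i.e. all revisions of a page appear next to each other in the stream.
    """
    for page_id, docs in groupby(revision_docs, key=lambda doc: doc["page"]["id"]):
        yield page_id, list(docs)


# Hypothetical sample documents standing in for dump2revdocs() output
sample_docs = [
    {"id": 10, "page": {"id": 1, "title": "Foo"}},
    {"id": 11, "page": {"id": 1, "title": "Foo"}},
    {"id": 20, "page": {"id": 2, "title": "Bar"}},
]

revision_counts = {page_id: len(docs) for page_id, docs in group_by_page(sample_docs)}
print(revision_counts)
```

`groupby` only merges adjacent items, which is exactly what a page-partitioned stream guarantees; on an unordered stream the same page could show up in several groups.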

### Document Validation

Compares a stream of revision documents against a schema to ensure data integrity and format compliance.

```python { .api }
def validate(docs, schema, verbose=False):
    """
    Compares a stream of revision documents against a JSON schema.

    Validates revision documents to ensure they conform to expected
    structure and data types using jsonschema validation. Documents
    that fail validation will raise a ValidationError.

    Parameters:
    - docs: Iterable of revision document objects (parsed JSON)
    - schema: JSON schema definition for validation (dict)
    - verbose: Print progress information (bool, default: False)

    Yields: Validated revision documents that pass schema validation
    Raises: jsonschema.ValidationError if document doesn't match schema
    """
```

**Usage Example:**

```python
import jsonschema
import mwxml
from mwxml.utilities import validate, dump2revdocs

# Generate revision documents
dump = mwxml.Dump.from_file(open("dump.xml"))
docs = list(dump2revdocs(dump))

# Define expected schema (example)
schema = {
    "type": "object",
    "required": ["id", "timestamp", "page"],
    "properties": {
        "id": {"type": "integer"},
        "timestamp": {"type": "string"},
        "page": {
            "type": "object",
            "required": ["id", "title"],
            "properties": {
                "id": {"type": "integer"},
                "title": {"type": "string"}
            }
        }
    }
}

# validate() is a generator: it yields each document that passes and
# raises jsonschema.ValidationError on the first one that fails
try:
    valid_count = sum(1 for _ in validate(docs, schema))
    print(f"All {valid_count} documents are valid")
except jsonschema.ValidationError as error:
    print(f"Validation failed: {error.message}")
```

### Document Normalization

Converts a stream of old revision documents to documents that validate against the current schema format.

```python { .api }
def normalize(rev_docs, verbose=False):
    """
    Converts a stream of old revision documents to current schema format.

    Updates revision documents from older formats to ensure compatibility
    with current processing pipelines and schema requirements.

    Parameters:
    - rev_docs: Iterable of revision documents in old format
    - verbose: Print progress information (bool, default: False)

    Yields: Normalized revision documents in current format
    """
```

**Usage Example:**

```python
import json

from mwxml.utilities import normalize

# Load old format documents, parsing each JSON line into a dict
# (assumes normalize() works on parsed documents, like validate())
with open("old_revisions.jsonl") as f:
    old_docs = [json.loads(line) for line in f if line.strip()]

# Normalize to current format
normalized_docs = list(normalize(old_docs))

# Save normalized documents as JSON Lines
with open("normalized_revisions.jsonl", "w") as f:
    for doc in normalized_docs:
        f.write(json.dumps(doc) + "\n")

print(f"Normalized {len(normalized_docs)} documents")
```

### Document Inflation

Converts a stream of flat revision documents to standard revision documents with full structure.

```python { .api }
def inflate(flat_jsons, verbose=False):
    """
    Converts flat revision documents to standard hierarchical revision documents.

    Expands compressed or flattened revision document formats by converting
    underscore-separated keys (e.g., 'page_title') into nested dictionary
    structures (e.g., {'page': {'title': ...}}).

    Parameters:
    - flat_jsons: Iterable of flat revision document objects (with underscore keys)
    - verbose: Print progress information (bool, default: False)

    Yields: Inflated revision documents with full hierarchical structure
    """
```

**Usage Example:**

```python
import json

from mwxml.utilities import inflate

# Load flat documents, parsing each JSON line into a dict
with open("flat_revisions.jsonl") as f:
    flat_docs = [json.loads(line) for line in f if line.strip()]

# Inflate to full structure
inflated_docs = list(inflate(flat_docs))

# Process inflated documents (already parsed, so no json.loads() is needed)
for doc in inflated_docs:
    print(f"Revision {doc['id']}: {doc['page']['title']}")

    # Access full structure
    if 'slots' in doc and 'main' in doc['slots']:
        text = doc['slots']['main'].get('text')
        text_length = len(text) if text else 0
        print(f"  Text length: {text_length}")
```
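
To make the key expansion concrete, here is a minimal sketch of the underscore-to-nesting transformation described above. It is an illustration, not the library's implementation: the function name `inflate_keys` and the sample document are hypothetical, and the sketch does not handle conflicts (for example, a scalar `page` key alongside `page_title`) or field names that themselves contain underscores.

```python
def inflate_keys(flat_doc):
    """Expand underscore-separated keys into nested dictionaries.

    Hypothetical helper for illustration only; every underscore is treated
    as a nesting boundary.
    """
    nested = {}
    for flat_key, value in flat_doc.items():
        # Everything before the last underscore names the enclosing dicts
        *parents, leaf = flat_key.split("_")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


# Hypothetical flat document
flat = {"id": 42, "page_id": 7, "page_title": "Example", "user_text": "Alice"}
inflated = inflate_keys(flat)
print(inflated)
```

With this input, `page_id` and `page_title` end up together under a single `page` dictionary, and `user_text` becomes `{'user': {'text': 'Alice'}}`.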

## Command Line Interface

The mwxml package provides a command-line interface for accessing utilities directly from the shell. The CLI is installed automatically with the package and accessible via the `mwxml` command.

### Main CLI Entry Point

```bash
# Access help
mwxml --help

# Available subcommands:
# - dump2revdocs: XML dumps to revision documents (XML → JSON)
# - validate: Compare revision documents against schema
# - normalize: Convert old revision documents to current schema
# - inflate: Convert flat revision documents to standard format
```

**CLI Architecture:**

The CLI uses a router-based architecture where each utility function has its own subcommand. All subcommands support:
- Input from stdin or file paths
- Multithreaded processing for multiple input files
- Optional output compression (bz2 by default)
- Verbose progress reporting
- Debug logging

### dump2revdocs Command

Converts XML dumps to revision JSON documents with various output options.

```bash
# Basic usage
mwxml dump2revdocs input.xml > output.jsonl

# Multiple files with threading
mwxml dump2revdocs dump1.xml dump2.xml dump3.xml --threads=4

# Output to directory with compression
mwxml dump2revdocs *.xml --output=/path/to/output --compress=bz2

# Verbose progress output
mwxml dump2revdocs large_dump.xml --verbose

# Help for specific command
mwxml dump2revdocs --help
```

**Parameters:**
- `input-file`: Path to MediaWiki XML dump file(s) (default: stdin)
- `--threads=<num>`: Number of processor threads for multiple files (default: CPU count)
- `--output=<path>`: Output directory with one file per input (default: stdout)
- `--compress=<type>`: Compression format for output files (default: bz2)
- `--verbose`: Print progress information to stderr (shows page titles and dots)
- `--debug`: Print debug logs
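
Since files written with `--output` are bz2-compressed by default, reading them back in Python needs a decompression step. The sketch below shows one way to do that with the standard library, assuming the output holds one JSON document per line; the helper `read_revdocs` and the sample file are hypothetical, and the sample file is written here only so the example runs on its own.

```python
import bz2
import json
import os
import tempfile


def read_revdocs(path):
    """Yield one parsed revision document per non-empty line of a bz2 file."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Build a small stand-in file; in practice `path` would point at a file
# produced by `mwxml dump2revdocs ... --output=<path> --compress=bz2`
sample = [
    {"id": 1, "page": {"id": 7, "title": "Example"}},
    {"id": 2, "page": {"id": 7, "title": "Example"}},
]
path = os.path.join(tempfile.mkdtemp(), "sample.jsonl.bz2")
with bz2.open(path, "wt", encoding="utf-8") as f:
    for doc in sample:
        f.write(json.dumps(doc) + "\n")

loaded = list(read_revdocs(path))
print(len(loaded), loaded[0]["page"]["title"])
```

Reading line by line keeps memory use flat, which matters for dumps that expand to many gigabytes of JSON.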

### validate Command

Validates a stream of JSON revision documents against a schema to ensure data integrity.

```bash
# Validate revision documents against schema
mwxml validate revisions.jsonl --schema=schema.json

# Pipe from dump2revdocs
mwxml dump2revdocs dump.xml | mwxml validate --schema=schema.json

# Multiple files with threading
mwxml validate doc1.jsonl doc2.jsonl --schema=schema.json --threads=2

# Help
mwxml validate --help
```

**Parameters:**
- `input-file`: Path to file containing JSON revision documents (default: stdin)
- `--schema=<path>`: Path to JSON schema file (required)
- `--threads=<num>`: Number of processor threads for multiple files
- `--output=<path>`: Output directory for validated documents
- `--compress=<type>`: Compression format for output (default: bz2)
- `--verbose`: Print progress information
- `--debug`: Print debug logs

### normalize Command

Converts old revision document formats to current schema-compliant format.

```bash
# Normalize old format documents
mwxml normalize old_revisions.jsonl > normalized.jsonl

# With compression
mwxml normalize old_revisions.jsonl --output=./normalized/ --compress=bz2

# Multiple files
mwxml normalize old1.jsonl old2.jsonl --threads=2

# Help
mwxml normalize --help
```

**Parameters:**
- `input-file`: Path to file containing old format revision documents (default: stdin)
- `--threads=<num>`: Number of processor threads for multiple files
- `--output=<path>`: Output directory for normalized documents
- `--compress=<type>`: Compression format for output (default: bz2)
- `--verbose`: Print progress information (shows ! for changed docs, . for unchanged)
- `--debug`: Print debug logs

### inflate Command

Converts flat revision documents (with underscore-separated keys) to hierarchical format.

```bash
# Inflate flat documents
mwxml inflate flat_revisions.jsonl > full_revisions.jsonl

# With output directory
mwxml inflate flat_revisions.jsonl --output=./inflated/

# Multiple files with threading
mwxml inflate flat1.jsonl flat2.jsonl --threads=2 --verbose

# Help
mwxml inflate --help
```

**Parameters:**
- `input-file`: Path to file containing flat revision documents (default: stdin)
- `--threads=<num>`: Number of processor threads for multiple files
- `--output=<path>`: Output directory for inflated documents
- `--compress=<type>`: Compression format for output (default: bz2)
- `--verbose`: Print progress information
- `--debug`: Print debug logs

## Integration Examples

### Processing Pipeline

```python
import jsonschema
import mwxml
from mwxml.utilities import dump2revdocs, validate, normalize

# Complete processing pipeline
def process_dump_pipeline(xml_file, schema):
    """Complete dump processing with validation and normalization."""

    # Step 1: Load dump
    dump = mwxml.Dump.from_file(open(xml_file))

    # Step 2: Convert to JSON documents
    print("Converting to JSON documents...")
    json_docs = list(dump2revdocs(dump, verbose=True))

    # Step 3: Validate documents. validate() is a generator, so it has to be
    # consumed; it raises on the first document that fails the schema
    print("Validating documents...")
    try:
        valid_docs = list(validate(json_docs, schema))
    except jsonschema.ValidationError as error:
        print(f"Validation failed: {error.message}")
        return None

    print("All documents valid!")

    # Step 4: Normalize if needed
    print("Normalizing documents...")
    return list(normalize(valid_docs))

# Usage
schema = {"type": "object", "required": ["id", "timestamp"]}
results = process_dump_pipeline("dump.xml", schema)
```

### Batch Processing with CLI

```bash
#!/bin/bash
# Batch processing script

# Convert all XML dumps to JSON
for dump in *.xml; do
    echo "Processing $dump"
    mwxml dump2revdocs "$dump" --compress=bz2 --output=./json_output/
done

# Validate all generated JSON files
for json_file in json_output/*.jsonl.bz2; do
    echo "Validating $json_file"
    bzcat "$json_file" | mwxml validate --schema=revision_schema.json
done

echo "Batch processing complete"
```