Tessl Tile for pypi/mwxml@0.3.0

# Utilities and CLI Tools

Command-line utilities and functions for converting XML dumps to various formats, validating revision documents, and normalizing data structures. These tools provide additional processing capabilities beyond the core streaming API.

## Capabilities

### Dump to Revision Documents Conversion

Converts MediaWiki XML dumps to page-partitioned sequences of revision JSON documents for easier processing and analysis.

```python { .api }
def dump2revdocs(dump, verbose=False):
    """
    Converts XML dumps to page-partitioned sequences of revision JSON documents.

    This function processes each page in the dump and yields JSON representations
    of all revisions. The JSON documents contain all revision metadata and content
    in a structured format suitable for further processing or storage.

    Parameters:
    - dump: mwxml.Dump object to process
    - verbose: Print progress information to stderr (bool, default: False)
              Shows page titles and revision progress dots when enabled

    Yields: JSON strings representing revision documents (calls revision.to_json())
    """
```

**Usage Example:**

```python
import json

import mwxml
from mwxml.utilities import dump2revdocs

# Convert with progress output; the file stays open while the dump is streamed
revision_docs = []
with open("dump.xml") as dump_file:
    dump = mwxml.Dump.from_file(dump_file)
    for json_doc in dump2revdocs(dump, verbose=True):
        revision_doc = json.loads(json_doc)
        revision_docs.append(revision_doc)

        # Process individual revision document
        print(f"Revision {revision_doc['id']} on page {revision_doc['page']['title']}")

# Save to file; the dump is a one-pass stream, so reopen it for a second pass
with open("dump.xml") as dump_file, open("revisions.jsonl", "w") as f:
    dump = mwxml.Dump.from_file(dump_file)
    for json_doc in dump2revdocs(dump):
        f.write(json_doc + "\n")
```
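
For dumps too large to hold in a list, the documents can be consumed lazily instead. The sketch below assumes a hypothetical helper, `iter_parsed`, which is not part of mwxml. It accepts either JSON strings or already-parsed dicts, so downstream code does not depend on which form the producer yields.

```python
import json


def iter_parsed(items):
    """Lazily yield dicts from an iterable of JSON strings or dicts.

    Hypothetical helper for illustration; not part of mwxml.
    """
    for item in items:
        # Decode strings; pass already-parsed documents through unchanged
        yield json.loads(item) if isinstance(item, str) else item


# Stand-in for the output of dump2revdocs(dump)
sample = ['{"id": 1, "page": {"title": "Foo"}}', {"id": 2, "page": {"title": "Bar"}}]
titles = [doc["page"]["title"] for doc in iter_parsed(sample)]
```

Because `iter_parsed` is a generator, only one document is decoded at a time, which keeps memory use flat for long revision histories.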

### Document Validation

Compares a stream of revision documents against a schema to ensure data integrity and format compliance.

```python { .api }
def validate(docs, schema, verbose=False):
    """
    Compares a stream of revision documents against a JSON schema.

    Validates revision documents to ensure they conform to expected
    structure and data types using jsonschema validation. Documents
    that fail validation will raise a ValidationError.

    Parameters:
    - docs: Iterable of revision document objects (parsed JSON)
    - schema: JSON schema definition for validation (dict)
    - verbose: Print progress information (bool, default: False)

    Yields: Validated revision documents that pass schema validation
    Raises: jsonschema.ValidationError if document doesn't match schema
    """
```

**Usage Example:**

```python
import json

import jsonschema
import mwxml
from mwxml.utilities import validate, dump2revdocs

# Generate revision documents; validate() expects parsed JSON, so decode each string
with open("dump.xml") as dump_file:
    dump = mwxml.Dump.from_file(dump_file)
    docs = [json.loads(json_doc) for json_doc in dump2revdocs(dump)]

# Define expected schema (example)
schema = {
    "type": "object",
    "required": ["id", "timestamp", "page"],
    "properties": {
        "id": {"type": "integer"},
        "timestamp": {"type": "string"},
        "page": {
            "type": "object",
            "required": ["id", "title"],
            "properties": {
                "id": {"type": "integer"},
                "title": {"type": "string"}
            }
        }
    }
}

# Validate documents; validate() is a generator, so iterate to run the checks
try:
    valid_docs = list(validate(docs, schema))
    print(f"Validated {len(valid_docs)} documents")
except jsonschema.ValidationError as error:
    print(f"Validation failed: {error.message}")
```
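
The validate-and-yield behaviour described above can be illustrated without any third-party dependency. The following is a deliberately simplified stand-in, not jsonschema and not mwxml's implementation: it checks only `required` keys and the `type` of top-level properties. The names `check_docs` and `TYPE_MAP` are hypothetical.

```python
# Maps a few JSON Schema type names to Python types (simplified assumption)
TYPE_MAP = {"object": dict, "string": str, "integer": int}


def check_docs(docs, schema):
    """Yield each doc that has the required keys and expected top-level types.

    Hypothetical, simplified stand-in; raises ValueError on the first bad doc.
    """
    for doc in docs:
        for key in schema.get("required", []):
            if key not in doc:
                raise ValueError(f"missing required key: {key!r}")
        for key, rule in schema.get("properties", {}).items():
            expected = TYPE_MAP.get(rule.get("type"))
            if key in doc and expected and not isinstance(doc[key], expected):
                raise ValueError(f"wrong type for key: {key!r}")
        yield doc


schema = {"required": ["id"], "properties": {"id": {"type": "integer"}}}
good = list(check_docs([{"id": 1}, {"id": 2}], schema))
```

As with the documented `validate`, nothing is checked until the generator is iterated, and the first failing document stops the stream with an exception.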

### Document Normalization

Converts a stream of old revision documents to documents that validate against the current schema format.

```python { .api }
def normalize(rev_docs, verbose=False):
    """
    Converts a stream of old revision documents to current schema format.

    Updates revision documents from older formats to ensure compatibility
    with current processing pipelines and schema requirements.

    Parameters:
    - rev_docs: Iterable of revision documents in old format
    - verbose: Print progress information (bool, default: False)

    Yields: Normalized revision documents in current format
    """
```

**Usage Example:**

```python
import json

from mwxml.utilities import normalize

# Load old format documents, parsing each JSON line into a dict
with open("old_revisions.jsonl") as f:
    old_docs = [json.loads(line) for line in f if line.strip()]

# Normalize to current format
normalized_docs = list(normalize(old_docs))

# Save normalized documents, one JSON document per line
with open("normalized_revisions.jsonl", "w") as f:
    for doc in normalized_docs:
        f.write(json.dumps(doc) + "\n")

print(f"Normalized {len(normalized_docs)} documents")
```

### Document Inflation

Converts a stream of flat revision documents to standard revision documents with full structure.

```python { .api }
def inflate(flat_jsons, verbose=False):
    """
    Converts flat revision documents to standard hierarchical revision documents.

    Expands compressed or flattened revision document formats by converting
    underscore-separated keys (e.g., 'page_title') into nested dictionary
    structures (e.g., {'page': {'title': ...}}).

    Parameters:
    - flat_jsons: Iterable of flat revision document objects (with underscore keys)
    - verbose: Print progress information (bool, default: False)

    Yields: Inflated revision documents with full hierarchical structure
    """
```

**Usage Example:**

```python
import json

from mwxml.utilities import inflate

# Load flat documents, parsing each JSON line into a dict
with open("flat_revisions.jsonl") as f:
    flat_docs = [json.loads(line) for line in f if line.strip()]

# Inflate to full structure
inflated_docs = list(inflate(flat_docs))

# Process inflated documents
for doc in inflated_docs:
    print(f"Revision {doc['id']}: {doc['page']['title']}")

    # Access full structure
    if 'slots' in doc and 'main' in doc['slots']:
        text = doc['slots']['main']['text']
        text_length = len(text) if text else 0
        print(f"  Text length: {text_length}")
```
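
To make the key expansion concrete, here is a minimal sketch of turning underscore-separated keys into nested dictionaries. `inflate_keys` is a hypothetical function, not mwxml's implementation, and it assumes a split on the first underscore only; the real utility may treat multi-word field names differently.

```python
def inflate_keys(flat_doc):
    """Nest keys of the form 'prefix_rest' under 'prefix'.

    Hypothetical sketch: splits on the first underscore only, so
    'page_title' becomes {'page': {'title': ...}}.
    """
    nested = {}
    for key, value in flat_doc.items():
        prefix, sep, rest = key.partition("_")
        if sep:
            # Key contained an underscore: store the value one level down
            nested.setdefault(prefix, {})[rest] = value
        else:
            # No underscore: keep the key at the top level
            nested[key] = value
    return nested


flat = {"id": 7, "page_id": 3, "page_title": "Foo"}
inflated = inflate_keys(flat)
```

Here `inflated` groups both `page_id` and `page_title` under a single `page` dictionary while leaving `id` at the top level.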

## Command Line Interface

The mwxml package provides a command-line interface for accessing utilities directly from the shell. The CLI is installed automatically with the package and accessible via the `mwxml` command.

### Main CLI Entry Point

```bash
# Access help
mwxml --help

# Available subcommands:
# - dump2revdocs: XML dumps to revision documents (XML → JSON)
# - validate: Compare revision documents against schema
# - normalize: Convert old revision documents to current schema
# - inflate: Convert flat revision documents to standard format
```

**CLI Architecture:**

The CLI uses a router-based architecture where each utility function has its own subcommand. All subcommands support:
- Input from stdin or file paths
- Multithreaded processing for multiple input files
- Optional output compression (bz2 by default)
- Verbose progress reporting
- Debug logging
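
The router idea, one subcommand per utility with a shared set of options, can be sketched with the standard library's `argparse`. This is an illustration only and makes no claim about how mwxml builds its CLI; the program name `mwxml-sketch` and the function `build_parser` are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser with one subcommand per utility (illustrative only)."""
    parser = argparse.ArgumentParser(prog="mwxml-sketch")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name in ("dump2revdocs", "validate", "normalize", "inflate"):
        sub = subparsers.add_parser(name)
        sub.add_argument("input_files", nargs="*")  # empty list means stdin
        sub.add_argument("--threads", type=int)
        sub.add_argument("--output")
        sub.add_argument("--compress", default="bz2")
        sub.add_argument("--verbose", action="store_true")
        sub.add_argument("--debug", action="store_true")
    return parser


args = build_parser().parse_args(["inflate", "flat.jsonl", "--threads=2", "--verbose"])
```

After parsing, `args.command` names the subcommand to dispatch to, and the shared options are available under the same attribute names for every subcommand.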

### dump2revdocs Command

Converts XML dumps to revision JSON documents with various output options.

```bash
# Basic usage
mwxml dump2revdocs input.xml > output.jsonl

# Multiple files with threading
mwxml dump2revdocs dump1.xml dump2.xml dump3.xml --threads=4

# Output to directory with compression
mwxml dump2revdocs *.xml --output=/path/to/output --compress=bz2

# Verbose progress output
mwxml dump2revdocs large_dump.xml --verbose

# Help for specific command
mwxml dump2revdocs --help
```

**Parameters:**
- `input-file`: Path to MediaWiki XML dump file(s) (default: stdin)
- `--threads=<num>`: Number of processor threads for multiple files (default: CPU count)
- `--output=<path>`: Output directory with one file per input (default: stdout)
- `--compress=<type>`: Compression format for output files (default: bz2)
- `--verbose`: Print progress information to stderr (shows page titles and dots)
- `--debug`: Print debug logs
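
Output written with `--compress=bz2` can be read back in Python with the standard `bz2` module. The sketch below assumes the output holds one JSON document per line; the file name and the helper `read_jsonl_bz2` are hypothetical, and the demo writes its own small file so that it runs on its own.

```python
import bz2
import json
import os
import tempfile


def read_jsonl_bz2(path):
    """Yield one parsed document per line of a bz2-compressed JSON Lines file."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Self-contained demo: write a small compressed file, then read it back
path = os.path.join(tempfile.mkdtemp(), "revisions.jsonl.bz2")
with bz2.open(path, "wt", encoding="utf-8") as f:
    f.write(json.dumps({"id": 1}) + "\n")
    f.write(json.dumps({"id": 2}) + "\n")

ids = [doc["id"] for doc in read_jsonl_bz2(path)]
```

Opening the file in text mode (`"rt"`) lets the reader iterate line by line without decompressing the whole file into memory first.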

### validate Command

Validates a stream of JSON revision documents against a schema to ensure data integrity.

```bash
# Validate revision documents against schema
mwxml validate revisions.jsonl --schema=schema.json

# Pipe from dump2revdocs
mwxml dump2revdocs dump.xml | mwxml validate --schema=schema.json

# Multiple files with threading
mwxml validate doc1.jsonl doc2.jsonl --schema=schema.json --threads=2

# Help
mwxml validate --help
```

**Parameters:**
- `input-file`: Path to file containing JSON revision documents (default: stdin)
- `--schema=<path>`: Path to JSON schema file (required)
- `--threads=<num>`: Number of processor threads for multiple files
- `--output=<path>`: Output directory for validated documents
- `--compress=<type>`: Compression format for output (default: bz2)
- `--verbose`: Print progress information
- `--debug`: Print debug logs

### normalize Command

Converts old revision document formats to current schema-compliant format.

```bash
# Normalize old format documents
mwxml normalize old_revisions.jsonl > normalized.jsonl

# With compression
mwxml normalize old_revisions.jsonl --output=./normalized/ --compress=bz2

# Multiple files
mwxml normalize old1.jsonl old2.jsonl --threads=2

# Help
mwxml normalize --help
```

**Parameters:**
- `input-file`: Path to file containing old format revision documents (default: stdin)
- `--threads=<num>`: Number of processor threads for multiple files
- `--output=<path>`: Output directory for normalized documents
- `--compress=<type>`: Compression format for output (default: bz2)
- `--verbose`: Print progress information (shows ! for changed docs, . for unchanged)
- `--debug`: Print debug logs

### inflate Command

Converts flat revision documents (with underscore-separated keys) to hierarchical format.

```bash
# Inflate flat documents
mwxml inflate flat_revisions.jsonl > full_revisions.jsonl

# With output directory
mwxml inflate flat_revisions.jsonl --output=./inflated/

# Multiple files with threading
mwxml inflate flat1.jsonl flat2.jsonl --threads=2 --verbose

# Help
mwxml inflate --help
```

**Parameters:**
- `input-file`: Path to file containing flat revision documents (default: stdin)
- `--threads=<num>`: Number of processor threads for multiple files
- `--output=<path>`: Output directory for inflated documents
- `--compress=<type>`: Compression format for output (default: bz2)
- `--verbose`: Print progress information
- `--debug`: Print debug logs

## Integration Examples

### Processing Pipeline

```python
import json

import jsonschema
import mwxml
from mwxml.utilities import dump2revdocs, validate, normalize

# Complete processing pipeline
def process_dump_pipeline(xml_file, schema):
    """Complete dump processing with validation and normalization."""

    # Steps 1 and 2: Load the dump and convert it to parsed JSON documents
    print("Converting to JSON documents...")
    with open(xml_file) as dump_file:
        dump = mwxml.Dump.from_file(dump_file)
        docs = [json.loads(json_doc) for json_doc in dump2revdocs(dump, verbose=True)]

    # Step 3: Validate documents; validate() raises on the first invalid document
    print("Validating documents...")
    try:
        valid_docs = list(validate(docs, schema))
    except jsonschema.ValidationError as error:
        print(f"Validation failed: {error.message}")
        return None
    print("All documents valid!")

    # Step 4: Normalize if needed
    print("Normalizing documents...")
    return list(normalize(valid_docs))

# Usage
schema = {"type": "object", "required": ["id", "timestamp"]}
results = process_dump_pipeline("dump.xml", schema)
```

### Batch Processing with CLI

```bash
#!/bin/bash
# Batch processing script

# Convert all XML dumps to JSON
for dump in *.xml; do
    echo "Processing $dump"
    mwxml dump2revdocs "$dump" --compress=bz2 --output=./json_output/
done

# Validate all generated JSON files
for json_file in json_output/*.jsonl.bz2; do
    echo "Validating $json_file"
    bzcat "$json_file" | mwxml validate --schema=revision_schema.json
done

echo "Batch processing complete"
```
