A set of utilities for processing MediaWiki XML dump data efficiently with streaming and distributed processing capabilities.
Command-line utilities and functions for converting XML dumps to various formats, validating revision documents, and normalizing data structures. These tools provide additional processing capabilities beyond the core streaming API.
Converts MediaWiki XML dumps to page-partitioned sequences of revision JSON documents for easier processing and analysis.
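Because the output is page-partitioned, all revisions of a page arrive consecutively, so a consumer can group them in a single pass without sorting. A minimal sketch, assuming each document carries a nested `page` object with `id` and `title`; the sample documents are made up for illustration, and `group_by_page` is a hypothetical helper, not part of mwxml:

```python
import json
from itertools import groupby


def group_by_page(json_docs):
    """Yield (page_title, [revision ids]) for consecutive docs of the same page.

    Relies on the input being page-partitioned: groupby only merges
    adjacent items, so the whole stream never has to be sorted or buffered.
    """
    docs = (json.loads(line) for line in json_docs)
    for _, revisions in groupby(docs, key=lambda doc: doc["page"]["id"]):
        revisions = list(revisions)
        yield revisions[0]["page"]["title"], [rev["id"] for rev in revisions]


# Made-up sample stream standing in for dump2revdocs output
sample = [
    json.dumps({"id": 1, "page": {"id": 10, "title": "Foo"}}),
    json.dumps({"id": 2, "page": {"id": 10, "title": "Foo"}}),
    json.dumps({"id": 3, "page": {"id": 11, "title": "Bar"}}),
]

grouped = list(group_by_page(sample))
for title, rev_ids in grouped:
    print(title, rev_ids)
```

Only one page's revisions are held in memory at a time, which matters for dumps with long page histories.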
def dump2revdocs(dump, verbose=False):
    """
    Converts XML dumps to page-partitioned sequences of revision JSON documents.

    This function processes each page in the dump and yields JSON representations
    of all revisions. The JSON documents contain all revision metadata and content
    in a structured format suitable for further processing or storage.

    Parameters:
    - dump: mwxml.Dump object to process
    - verbose: Print progress information to stderr (bool, default: False)
      Shows page titles and revision progress dots when enabled

    Yields: JSON strings representing revision documents (calls revision.to_json())
    """

Usage Example:
import mwxml
from mwxml.utilities import dump2revdocs
import json

# Process dump to JSON documents
dump = mwxml.Dump.from_file(open("dump.xml"))

# Convert with progress output
revision_docs = []
for json_doc in dump2revdocs(dump, verbose=True):
    revision_doc = json.loads(json_doc)
    revision_docs.append(revision_doc)

    # Process individual revision document
    print(f"Revision {revision_doc['id']} on page {revision_doc['page']['title']}")

# Save to file
with open("revisions.jsonl", "w") as f:
    dump = mwxml.Dump.from_file(open("dump.xml"))
    for json_doc in dump2revdocs(dump):
        f.write(json_doc + "\n")

Compares a stream of revision documents against a schema to ensure data integrity and format compliance.
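The streaming contract involved here (yield each document that passes, raise on the first one that does not) can be sketched without any dependencies. The checker below is a hand-rolled stand-in that only understands `required` and top-level `properties` types; it is neither jsonschema nor mwxml's implementation, and the names `check_doc`, `validate_stream`, and `SchemaError` are hypothetical:

```python
# Simplified mapping from schema type names to Python types (assumption for this sketch)
TYPE_MAP = {"integer": int, "string": str, "object": dict}


class SchemaError(ValueError):
    """Raised when a document does not match the simplified schema."""


def check_doc(doc, schema):
    """Check required keys and simple property types; raise SchemaError on mismatch."""
    for key in schema.get("required", []):
        if key not in doc:
            raise SchemaError(f"missing required key: {key!r}")
    for key, rule in schema.get("properties", {}).items():
        expected = TYPE_MAP.get(rule.get("type"))
        if key in doc and expected and not isinstance(doc[key], expected):
            raise SchemaError(f"{key!r} should be of type {rule['type']}")


def validate_stream(docs, schema):
    """Lazily yield documents that pass; stop with an error at the first failure."""
    for doc in docs:
        check_doc(doc, schema)
        yield doc


schema = {
    "required": ["id", "timestamp"],
    "properties": {"id": {"type": "integer"}, "timestamp": {"type": "string"}},
}
good = {"id": 1, "timestamp": "2020-01-01T00:00:00Z"}
bad = {"id": "1", "timestamp": "2020-01-01T00:00:00Z"}  # id has the wrong type

passed = list(validate_stream([good], schema))

try:
    list(validate_stream([good, bad], schema))
    failed = False
except SchemaError:
    failed = True
```

Because the function is a generator, nothing is checked until the result is iterated, so a caller that never consumes the stream never sees an error.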
def validate(docs, schema, verbose=False):
    """
    Compares a stream of revision documents against a JSON schema.

    Validates revision documents to ensure they conform to expected
    structure and data types using jsonschema validation. Documents
    that fail validation will raise a ValidationError.

    Parameters:
    - docs: Iterable of revision document objects (parsed JSON)
    - schema: JSON schema definition for validation (dict)
    - verbose: Print progress information (bool, default: False)

    Yields: Validated revision documents that pass schema validation
    Raises: jsonschema.ValidationError if document doesn't match schema
    """

Usage Example:
from mwxml.utilities import validate, dump2revdocs
import json
import mwxml

# Generate revision documents; validate expects parsed JSON, so decode each string
dump = mwxml.Dump.from_file(open("dump.xml"))
docs = (json.loads(json_doc) for json_doc in dump2revdocs(dump))

# Define expected schema (example)
schema = {
    "type": "object",
    "required": ["id", "timestamp", "page"],
    "properties": {
        "id": {"type": "integer"},
        "timestamp": {"type": "string"},
        "page": {
            "type": "object",
            "required": ["id", "title"],
            "properties": {
                "id": {"type": "integer"},
                "title": {"type": "string"}
            }
        }
    }
}

# Validate documents; validate() yields each document that passes and
# raises jsonschema.ValidationError on the first one that does not
valid_docs = list(validate(docs, schema))
print(f"Validated {len(valid_docs)} documents")

Converts a stream of old revision documents to documents that validate against the current schema format.
def normalize(rev_docs, verbose=False):
    """
    Converts a stream of old revision documents to current schema format.

    Updates revision documents from older formats to ensure compatibility
    with current processing pipelines and schema requirements.

    Parameters:
    - rev_docs: Iterable of revision documents in old format
    - verbose: Print progress information (bool, default: False)

    Yields: Normalized revision documents in current format
    """

Usage Example:
from mwxml.utilities import normalize
import json

# Load old format documents (one JSON object per line)
with open("old_revisions.jsonl") as f:
    old_docs = [json.loads(line) for line in f if line.strip()]

# Normalize to current format
normalized_docs = list(normalize(old_docs))

# Save normalized documents
with open("normalized_revisions.jsonl", "w") as f:
    for doc in normalized_docs:
        f.write(json.dumps(doc) + "\n")

print(f"Normalized {len(normalized_docs)} documents")

Converts a stream of flat revision documents to standard revision documents with full structure.
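To make the flat-to-nested conversion concrete, here is a minimal sketch of the idea. It is not mwxml's implementation: `inflate_doc` and its `prefixes` argument are hypothetical, and limiting nesting to a known set of prefixes is an assumption made so that keys such as `parent_id` stay flat.

```python
def inflate_doc(flat_doc, prefixes=("page", "user")):
    """Nest keys of the form '<prefix>_<rest>' under '<prefix>'.

    `prefixes` is an assumption for this sketch: only keys starting with one
    of these names are nested, every other key is copied unchanged.
    """
    nested = {}
    for key, value in flat_doc.items():
        prefix, sep, rest = key.partition("_")
        if sep and prefix in prefixes:
            # e.g. 'page_title' becomes nested['page']['title']
            nested.setdefault(prefix, {})[rest] = value
        else:
            nested[key] = value
    return nested


flat = {
    "id": 5,
    "parent_id": 4,
    "page_id": 10,
    "page_title": "Foo",
    "user_text": "Alice",
}
inflated = inflate_doc(flat)
print(inflated)
```

Splitting only on the first underscore keeps multi-word field names (for example a hypothetical `page_some_field`) together under their prefix.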
def inflate(flat_jsons, verbose=False):
    """
    Converts flat revision documents to standard hierarchical revision documents.

    Expands compressed or flattened revision document formats by converting
    underscore-separated keys (e.g., 'page_title') into nested dictionary
    structures (e.g., {'page': {'title': ...}}).

    Parameters:
    - flat_jsons: Iterable of flat revision document objects (with underscore keys)
    - verbose: Print progress information (bool, default: False)

    Yields: Inflated revision documents with full hierarchical structure
    """

Usage Example:
from mwxml.utilities import inflate
import json

# Load flat documents (one JSON object per line)
with open("flat_revisions.jsonl") as f:
    flat_docs = [json.loads(line) for line in f if line.strip()]

# Inflate to full structure
inflated_docs = list(inflate(flat_docs))

# Process inflated documents
for doc in inflated_docs:
    print(f"Revision {doc['id']}: {doc['page']['title']}")

    # Access full structure
    if 'slots' in doc and 'main' in doc['slots']:
        text_length = len(doc['slots']['main']['text']) if doc['slots']['main']['text'] else 0
        print(f"  Text length: {text_length}")

The mwxml package provides a command-line interface for accessing utilities directly from the shell. The CLI is installed automatically with the package and accessible via the mwxml command.
# Access help
mwxml --help
# Available subcommands:
# - dump2revdocs: XML dumps to revision documents (XML → JSON)
# - validate: Compare revision documents against schema
# - normalize: Convert old revision documents to current schema
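# - inflate: Convert flat revision documents to standard format

CLI Architecture:
The CLI uses a router-based architecture where each utility function has its own subcommand. All subcommands accept the same general options (--threads, --output, --compress, --verbose, --debug), which are described per command below.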
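The router idea can be illustrated with the standard library. mwxml's own router is not shown in this document, so the sketch below uses `argparse` subparsers as a stand-in; the option names mirror the parameter lists in this section, while `build_parser` and the defaults chosen here are assumptions.

```python
import argparse


def build_parser():
    """Build a top-level parser with one subparser per utility."""
    parser = argparse.ArgumentParser(prog="mwxml")
    subcommands = parser.add_subparsers(dest="command", required=True)

    # Options shared by every subcommand live on a parent parser
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("input_files", nargs="*", help="input files (default: stdin)")
    common.add_argument("--threads", type=int, default=None)
    common.add_argument("--output", default=None)
    common.add_argument("--compress", default="bz2")
    common.add_argument("--verbose", action="store_true")
    common.add_argument("--debug", action="store_true")

    for name in ("dump2revdocs", "normalize", "inflate"):
        subcommands.add_parser(name, parents=[common])

    # validate additionally requires a schema path
    validate = subcommands.add_parser("validate", parents=[common])
    validate.add_argument("--schema", required=True)
    return parser


args = build_parser().parse_args(
    ["validate", "revisions.jsonl", "--schema=schema.json", "--threads=2"]
)
print(args.command, args.input_files, args.schema, args.threads)
```

Putting the shared options on a parent parser keeps each subcommand's definition down to what is specific to it, which is the main appeal of a router-style layout.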
Converts XML dumps to revision JSON documents with various output options.
# Basic usage
mwxml dump2revdocs input.xml > output.jsonl
# Multiple files with threading
mwxml dump2revdocs dump1.xml dump2.xml dump3.xml --threads=4
# Output to directory with compression
mwxml dump2revdocs *.xml --output=/path/to/output --compress=bz2
# Verbose progress output
mwxml dump2revdocs large_dump.xml --verbose
# Help for specific command
mwxml dump2revdocs --help

Parameters:
- input-file: Path to MediaWiki XML dump file(s) (default: stdin)
- --threads=<num>: Number of processor threads for multiple files (default: CPU count)
- --output=<path>: Output directory with one file per input (default: stdout)
- --compress=<type>: Compression format for output files (default: bz2)
- --verbose: Print progress information to stderr (shows page titles and dots)
- --debug: Print debug logs

Validates a stream of JSON revision documents against a schema to ensure data integrity.
# Validate revision documents against schema
mwxml validate revisions.jsonl --schema=schema.json
# Pipe from dump2revdocs
mwxml dump2revdocs dump.xml | mwxml validate --schema=schema.json
# Multiple files with threading
mwxml validate doc1.jsonl doc2.jsonl --schema=schema.json --threads=2
# Help
mwxml validate --help

Parameters:
- input-file: Path to file containing JSON revision documents (default: stdin)
- --schema=<path>: Path to JSON schema file (required)
- --threads=<num>: Number of processor threads for multiple files
- --output=<path>: Output directory for validated documents
- --compress=<type>: Compression format for output (default: bz2)
- --verbose: Print progress information
- --debug: Print debug logs

Converts old revision document formats to current schema-compliant format.
# Normalize old format documents
mwxml normalize old_revisions.jsonl > normalized.jsonl
# With compression
mwxml normalize old_revisions.jsonl --output=./normalized/ --compress=bz2
# Multiple files
mwxml normalize old1.jsonl old2.jsonl --threads=2
# Help
mwxml normalize --help

Parameters:
- input-file: Path to file containing old format revision documents (default: stdin)
- --threads=<num>: Number of processor threads for multiple files
- --output=<path>: Output directory for normalized documents
- --compress=<type>: Compression format for output (default: bz2)
- --verbose: Print progress information (shows ! for changed docs, . for unchanged)
- --debug: Print debug logs

Converts flat revision documents (with underscore-separated keys) to hierarchical format.
# Inflate flat documents
mwxml inflate flat_revisions.jsonl > full_revisions.jsonl
# With output directory
mwxml inflate flat_revisions.jsonl --output=./inflated/
# Multiple files with threading
mwxml inflate flat1.jsonl flat2.jsonl --threads=2 --verbose
# Help
mwxml inflate --help

Parameters:
- input-file: Path to file containing flat revision documents (default: stdin)
- --threads=<num>: Number of processor threads for multiple files
- --output=<path>: Output directory for inflated documents
- --compress=<type>: Compression format for output (default: bz2)
- --verbose: Print progress information
- --debug: Print debug logs

import mwxml
from mwxml.utilities import dump2revdocs, validate, normalize
import json
import jsonschema

# Complete processing pipeline
def process_dump_pipeline(xml_file, schema):
    """Complete dump processing with validation and normalization."""
    # Step 1: Load dump
    dump = mwxml.Dump.from_file(open(xml_file))

    # Step 2: Convert to JSON documents and parse them for validation
    print("Converting to JSON documents...")
    docs = [json.loads(json_doc) for json_doc in dump2revdocs(dump, verbose=True)]

    # Step 3: Validate documents; validate() yields each passing document
    # and raises jsonschema.ValidationError on the first failure
    print("Validating documents...")
    try:
        valid_docs = list(validate(docs, schema))
    except jsonschema.ValidationError as error:
        print(f"Validation failed: {error.message}")
        return None
    print("All documents valid!")

    # Step 4: Normalize if needed
    print("Normalizing documents...")
    return list(normalize(valid_docs))

# Usage
schema = {"type": "object", "required": ["id", "timestamp"]}
results = process_dump_pipeline("dump.xml", schema)

#!/bin/bash
# Batch processing script
# Convert all XML dumps to JSON
for dump in *.xml; do
echo "Processing $dump"
mwxml dump2revdocs "$dump" --compress=bz2 --output=./json_output/
done
# Validate all generated JSON files
for json_file in json_output/*.jsonl.bz2; do
echo "Validating $json_file"
bzcat "$json_file" | mwxml validate --schema=revision_schema.json
done
echo "Batch processing complete"