tessl/pypi-mwxml

A set of utilities for processing MediaWiki XML dump data efficiently with streaming and distributed processing capabilities.

Utilities and CLI Tools

Name: tessl/pypi-mwxml
Author: tessl

Command-line utilities and functions for converting XML dumps to various formats, validating revision documents, and normalizing data structures. These tools provide additional processing capabilities beyond the core streaming API.

Capabilities

Dump to Revision Documents Conversion

Converts MediaWiki XML dumps to page-partitioned sequences of revision JSON documents for easier processing and analysis.

def dump2revdocs(dump, verbose=False):
    """
    Converts XML dumps to page-partitioned sequences of revision JSON documents.
    
    This function processes each page in the dump and yields JSON representations
    of all revisions. The JSON documents contain all revision metadata and content
    in a structured format suitable for further processing or storage.
    
    Parameters:
    - dump: mwxml.Dump object to process
    - verbose: Print progress information to stderr (bool, default: False)
              Shows page titles and revision progress dots when enabled
    
    Yields: JSON strings representing revision documents (calls revision.to_json())
    """

Usage Example:

import mwxml
from mwxml.utilities import dump2revdocs
import json

# Stream the dump and handle each revision document as it is produced
with open("dump.xml") as dump_file:
    dump = mwxml.Dump.from_file(dump_file)

    # verbose=True writes page titles and progress dots to stderr
    for json_doc in dump2revdocs(dump, verbose=True):
        revision_doc = json.loads(json_doc)
        print(f"Revision {revision_doc['id']} on page {revision_doc['page']['title']}")

# Save to a JSON Lines file; the dump is re-opened because the stream
# above has already been consumed
with open("dump.xml") as dump_file, open("revisions.jsonl", "w") as out:
    dump = mwxml.Dump.from_file(dump_file)
    for json_doc in dump2revdocs(dump):
        out.write(json_doc + "\n")

Document Validation

Compares a stream of revision documents against a schema to ensure data integrity and format compliance.

def validate(docs, schema, verbose=False):
    """
    Compares a stream of revision documents against a JSON schema.
    
    Validates revision documents to ensure they conform to expected
    structure and data types using jsonschema validation. Documents
    that fail validation will raise a ValidationError.
    
    Parameters:
    - docs: Iterable of revision document objects (parsed JSON)
    - schema: JSON schema definition for validation (dict)
    - verbose: Print progress information (bool, default: False)
    
    Yields: Validated revision documents that pass schema validation
    Raises: jsonschema.ValidationError if document doesn't match schema
    """

Usage Example:

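import json

import jsonschema
import mwxml
from mwxml.utilities import validate, dump2revdocs

# Generate revision documents and parse them, since validate() expects parsed JSON
dump = mwxml.Dump.from_file(open("dump.xml"))
docs = (json.loads(json_doc) for json_doc in dump2revdocs(dump))

# Define expected schema (example)
schema = {
    "type": "object",
    "required": ["id", "timestamp", "page"],
    "properties": {
        "id": {"type": "integer"},
        "timestamp": {"type": "string"},
        "page": {
            "type": "object",
            "required": ["id", "title"],
            "properties": {
                "id": {"type": "integer"},
                "title": {"type": "string"}
            }
        }
    }
}

# validate() is a generator: iterate over it to run the checks.
# It yields each document that passes and raises on the first that does not.
try:
    for doc in validate(docs, schema):
        print(f"Revision {doc['id']} is valid")
except jsonschema.ValidationError as error:
    print(f"Validation failed: {error.message}")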
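Because validation raises on the first failing document, it can help to see what a non-raising variant might look like. The sketch below is a standard-library stand-in, not mwxml's or jsonschema's implementation: the names check and partition are hypothetical, and only the "type", "required", and "properties" keywords used in the schema above are handled.

```python
# Map the JSON Schema type names handled here to Python types.
_TYPES = {"object": dict, "string": str, "integer": int, "array": list}


def check(doc, schema, path="$"):
    """Return a list of problems for `doc` under a tiny subset of JSON Schema."""
    expected = schema.get("type")
    if expected in _TYPES:
        matches = isinstance(doc, _TYPES[expected])
        # bool is a subclass of int in Python, but not a JSON integer.
        if expected == "integer" and isinstance(doc, bool):
            matches = False
        if not matches:
            return [f"{path}: expected {expected}"]

    problems = []
    if isinstance(doc, dict):
        for key in schema.get("required", []):
            if key not in doc:
                problems.append(f"{path}: missing required key '{key}'")
        for key, subschema in schema.get("properties", {}).items():
            if key in doc:
                problems.extend(check(doc[key], subschema, f"{path}.{key}"))
    return problems


def partition(docs, schema):
    """Yield (doc, problems) pairs; an empty list means the document passed."""
    for doc in docs:
        yield doc, check(doc, schema)
```

Collecting problems per document, rather than stopping at the first error, makes it possible to report every malformed revision in a large stream in a single pass.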

Document Normalization

Converts a stream of old revision documents to documents that validate against the current schema format.

def normalize(rev_docs, verbose=False):
    """
    Converts a stream of old revision documents to current schema format.
    
    Updates revision documents from older formats to ensure compatibility
    with current processing pipelines and schema requirements.
    
    Parameters:
    - rev_docs: Iterable of revision documents in old format
    - verbose: Print progress information (bool, default: False)
    
    Yields: Normalized revision documents in current format
    """

Usage Example:

from mwxml.utilities import normalize
import json

# Load old-format documents, parsing one JSON document per line
with open("old_revisions.jsonl") as f:
    old_docs = [json.loads(line) for line in f if line.strip()]

# Normalize to current format
normalized_docs = list(normalize(old_docs))

# Save normalized documents, one JSON document per line
with open("normalized_revisions.jsonl", "w") as f:
    for doc in normalized_docs:
        f.write(json.dumps(doc) + "\n")

print(f"Normalized {len(normalized_docs)} documents")

Document Inflation

Converts a stream of flat revision documents to standard revision documents with full structure.

def inflate(flat_jsons, verbose=False):
    """
    Converts flat revision documents to standard hierarchical revision documents.
    
    Expands compressed or flattened revision document formats by converting
    underscore-separated keys (e.g., 'page_title') into nested dictionary
    structures (e.g., {'page': {'title': ...}}).
    
    Parameters:
    - flat_jsons: Iterable of flat revision document objects (with underscore keys)
    - verbose: Print progress information (bool, default: False)
    
    Yields: Inflated revision documents with full hierarchical structure
    """

Usage Example:

from mwxml.utilities import inflate
import json

# Load flat documents, parsing one JSON document per line
with open("flat_revisions.jsonl") as f:
    flat_docs = [json.loads(line) for line in f if line.strip()]

# Inflate to full structure; documents are yielded one at a time
for doc in inflate(flat_docs):
    print(f"Revision {doc['id']}: {doc['page']['title']}")

    # Access nested fields, guarding against missing or empty text
    if 'slots' in doc and 'main' in doc['slots']:
        text = doc['slots']['main'].get('text')
        print(f"  Text length: {len(text) if text else 0}")
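The key-splitting idea described for inflate can be sketched in plain Python. This is a minimal illustration, not mwxml's implementation: the function name inflate_doc and the rule of splitting on every underscore are assumptions, and the real utility may treat some keys differently.

```python
def inflate_doc(flat_doc, sep="_"):
    """Nest a flat dict by splitting each key on `sep` (illustrative only).

    Assumes no key is both a leaf and a prefix (e.g. "page" and "page_id"
    in the same document), which this sketch does not handle.
    """
    nested = {}
    for flat_key, value in flat_doc.items():
        parts = flat_key.split(sep)
        node = nested
        # Walk or create the intermediate dictionaries for all but the last part.
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return nested


# Hypothetical flat revision document used only for illustration.
flat = {"id": 10, "timestamp": "2020-01-01T00:00:00Z", "page_id": 3, "page_title": "Foo"}
print(inflate_doc(flat)["page"]["title"])
```

Splitting on every underscore is the simplest rule, but it would also split a field whose own name contains an underscore, so a production version would likely need a list of known prefixes.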

Command Line Interface

The mwxml package provides a command-line interface for accessing utilities directly from the shell. The CLI is installed automatically with the package and accessible via the mwxml command.

Main CLI Entry Point

# Access help
mwxml --help

# Available subcommands:
# - dump2revdocs: XML dumps to revision documents (XML → JSON)  
# - validate: Compare revision documents against schema
# - normalize: Convert old revision documents to current schema
# - inflate: Convert flat revision documents to standard format

CLI Architecture:

The CLI uses a router-based architecture where each utility function has its own subcommand. All subcommands support:

Input from stdin or file paths
Multithreaded processing for multiple input files
Optional output compression (bz2 by default)
Verbose progress reporting
Debug logging

dump2revdocs Command

Converts XML dumps to revision JSON documents with various output options.

# Basic usage
mwxml dump2revdocs input.xml > output.jsonl

# Multiple files with threading
mwxml dump2revdocs dump1.xml dump2.xml dump3.xml --threads=4

# Output to directory with compression
mwxml dump2revdocs *.xml --output=/path/to/output --compress=bz2

# Verbose progress output
mwxml dump2revdocs large_dump.xml --verbose

# Help for specific command
mwxml dump2revdocs --help

Parameters:

input-file: Path to MediaWiki XML dump file(s) (default: stdin)
--threads=<num>: Number of processor threads for multiple files (default: CPU count)
--output=<path>: Output directory with one file per input (default: stdout)
--compress=<type>: Compression format for output files (default: bz2)
--verbose: Print progress information to stderr (shows page titles and dots)
--debug: Print debug logs

validate Command

Validates a stream of JSON revision documents against a schema to ensure data integrity.

# Validate revision documents against schema
mwxml validate revisions.jsonl --schema=schema.json

# Pipe from dump2revdocs
mwxml dump2revdocs dump.xml | mwxml validate --schema=schema.json

# Multiple files with threading
mwxml validate doc1.jsonl doc2.jsonl --schema=schema.json --threads=2

# Help
mwxml validate --help

Parameters:

input-file: Path to file containing JSON revision documents (default: stdin)
--schema=<path>: Path to JSON schema file (required)
--threads=<num>: Number of processor threads for multiple files
--output=<path>: Output directory for validated documents
--compress=<type>: Compression format for output (default: bz2)
--verbose: Print progress information
--debug: Print debug logs

normalize Command

Converts old revision document formats to current schema-compliant format.

# Normalize old format documents
mwxml normalize old_revisions.jsonl > normalized.jsonl

# With compression
mwxml normalize old_revisions.jsonl --output=./normalized/ --compress=bz2

# Multiple files
mwxml normalize old1.jsonl old2.jsonl --threads=2

# Help  
mwxml normalize --help

Parameters:

input-file: Path to file containing old format revision documents (default: stdin)
--threads=<num>: Number of processor threads for multiple files
--output=<path>: Output directory for normalized documents
--compress=<type>: Compression format for output (default: bz2)
--verbose: Print progress information (shows ! for changed docs, . for unchanged)
--debug: Print debug logs

inflate Command

Converts flat revision documents (with underscore-separated keys) to hierarchical format.

# Inflate flat documents
mwxml inflate flat_revisions.jsonl > full_revisions.jsonl

# With output directory
mwxml inflate flat_revisions.jsonl --output=./inflated/

# Multiple files with threading
mwxml inflate flat1.jsonl flat2.jsonl --threads=2 --verbose

# Help
mwxml inflate --help

Parameters:

input-file: Path to file containing flat revision documents (default: stdin)
--threads=<num>: Number of processor threads for multiple files
--output=<path>: Output directory for inflated documents
--compress=<type>: Compression format for output (default: bz2)
--verbose: Print progress information
--debug: Print debug logs

Integration Examples

Processing Pipeline

import json

import jsonschema
import mwxml
from mwxml.utilities import dump2revdocs, validate, normalize

# Complete processing pipeline
def process_dump_pipeline(xml_file, schema):
    """Convert a dump to revision documents, validate them, then normalize them."""

    # Step 1: Load dump
    with open(xml_file) as dump_file:
        dump = mwxml.Dump.from_file(dump_file)

        # Step 2: Convert to JSON documents and parse them for validate()
        print("Converting to JSON documents...")
        docs = [json.loads(json_doc) for json_doc in dump2revdocs(dump, verbose=True)]

    # Step 3: Validate documents. validate() yields the documents that pass
    # and raises jsonschema.ValidationError on the first one that does not.
    print("Validating documents...")
    try:
        valid_docs = list(validate(docs, schema))
    except jsonschema.ValidationError as error:
        print(f"Validation failed: {error.message}")
        return None
    print("All documents valid!")

    # Step 4: Normalize if needed
    print("Normalizing documents...")
    return list(normalize(valid_docs))

# Usage
schema = {"type": "object", "required": ["id", "timestamp"]}
results = process_dump_pipeline("dump.xml", schema)

Batch Processing with CLI

#!/bin/bash
# Batch processing script

# Convert all XML dumps to JSON
# Create the output directory first in case the tool does not create it
mkdir -p ./json_output/
for dump in *.xml; do
    echo "Processing $dump"
    mwxml dump2revdocs "$dump" --compress=bz2 --output=./json_output/
done

# Validate all generated JSON files
for json_file in json_output/*.jsonl.bz2; do
    echo "Validating $json_file"
    bzcat "$json_file" | mwxml validate --schema=revision_schema.json
done

echo "Batch processing complete"
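The compressed files produced by a batch run like this can be read back from Python with the standard library alone. The sketch below assumes one JSON document per line and a .bz2 suffix on compressed files; read_jsonl and write_jsonl are hypothetical helper names, not part of mwxml.

```python
import bz2
import json


def _opener(path):
    """Pick bz2.open for .bz2 paths and the built-in open otherwise."""
    return bz2.open if str(path).endswith(".bz2") else open


def read_jsonl(path):
    """Yield one parsed document per non-empty line of a JSON Lines file."""
    with _opener(path)(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def write_jsonl(docs, path):
    """Write each document as one line of JSON, compressing if path ends in .bz2."""
    with _opener(path)(path, "wt", encoding="utf-8") as f:
        for doc in docs:
            f.write(json.dumps(doc) + "\n")
```

Both helpers stream line by line, so a file is never loaded into memory as a whole; for example, `for doc in read_jsonl("json_output/dump.jsonl.bz2"): ...` would process one revision at a time (the file name here is only a placeholder).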
