tessl/pypi-mwxml

A set of utilities for processing MediaWiki XML dump data efficiently with streaming and distributed processing capabilities.

  • Workspace: tessl
  • Visibility: Public
  • Describes: pkg:pypi/mwxml@0.3.x

To install, run

npx @tessl/cli install tessl/pypi-mwxml@0.3.0


mwxml

A comprehensive collection of utilities for efficiently processing MediaWiki's XML database dumps, addressing both the performance and the complexity concerns of streaming XML parsing. It enables memory-efficient stream processing through a simple iterator strategy that abstracts the XML structure into logical components: a Dump contains SiteInfo and iterators of Pages/LogItems, and each Page contains metadata and an iterator of Revisions.

Key Features:

  • Memory-efficient streaming XML parsing
  • Iterator-based API for large dump files
  • Multiprocessing support for parallel processing
  • Command-line utilities for common tasks
  • Complete type definitions and error handling
  • Support for both page dumps and log dumps

Package Information

  • Package Name: mwxml
  • Language: Python
  • Installation: pip install mwxml
  • Documentation: https://pythonhosted.org/mwxml

Core Imports

import mwxml

Most common imports for working with XML dumps:

from mwxml import Dump, Page, Revision, SiteInfo, Namespace, LogItem, map

For utilities and processing functions:

from mwxml.utilities import dump2revdocs, validate, normalize, inflate

Basic Usage

import mwxml

# Load and process a MediaWiki XML dump
dump = mwxml.Dump.from_file(open("dump.xml"))

# Access site information
print(dump.site_info.name, dump.site_info.dbname)

# Iterate through pages and revisions
for page in dump:
    print(f"Page: {page.title} (ID: {page.id})")
    for revision in page:
        print(f"  Revision {revision.id} by {revision.user.text if revision.user else 'Anonymous'}")
        if revision.slots and revision.slots.main and revision.slots.main.text:
            print(f"    Text length: {len(revision.slots.main.text)}")

# Alternative: Direct page access
for page in dump.pages:
    for revision in page:
        print(f"Page {page.id}, Revision {revision.id}")

# Process log items if present
for log_item in dump.log_items:
    print(f"Log: {log_item.type} - {log_item.action}")

Architecture

The mwxml library implements a streaming XML parser that transforms complex MediaWiki dump structures into simple Python iterators:

  • Dump: Top-level container with site metadata and item iterators
  • SiteInfo: Site configuration, namespaces, and metadata from <siteinfo> blocks
  • Page: Page metadata with revision iterators for efficient memory usage
  • Revision: Individual revision data with user, timestamp, content, and metadata
  • LogItem: Log entry data for administrative actions and events
  • Distributed Processing: Parallel processing across multiple dump files using multiprocessing

This design enables processing of multi-gigabyte XML dumps with minimal memory footprint while providing simple Python iteration patterns.
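
The streaming pattern this design builds on can be illustrated with the standard library's `xml.etree.ElementTree.iterparse` (a simplified sketch of the general technique, not mwxml's actual implementation; real dumps also carry XML namespaces that are omitted here):

```python
import io
import xml.etree.ElementTree as ET

# A tiny stand-in for a MediaWiki dump (real dumps are multi-gigabyte).
DUMP_XML = """<mediawiki>
  <page><title>Alpha</title><revision><id>1</id></revision></page>
  <page><title>Beta</title><revision><id>2</id></revision></page>
</mediawiki>"""

def iter_pages(fileobj):
    """Yield one <page> worth of data at a time, freeing memory as we go."""
    for event, elem in ET.iterparse(fileobj, events=("end",)):
        if elem.tag == "page":
            yield elem.findtext("title"), [r.findtext("id") for r in elem.iter("revision")]
            elem.clear()  # discard the parsed subtree so memory stays flat

pages = list(iter_pages(io.StringIO(DUMP_XML)))
```

Because each `<page>` subtree is cleared as soon as it has been consumed, peak memory is proportional to the largest single page, not to the dump size.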

Capabilities

Core XML Processing

Essential classes for parsing MediaWiki XML dumps into structured Python objects with streaming iteration support.

class Dump:
    @classmethod
    def from_file(cls, f): ...
    @classmethod
    def from_page_xml(cls, page_xml): ...
    def __iter__(self): ...

class Page:
    def __iter__(self): ...
    @classmethod
    def from_element(cls, element, namespace_map=None): ...

class Revision:
    @classmethod
    def from_element(cls, element): ...

class SiteInfo:
    @classmethod
    def from_element(cls, element): ...
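
The nested-iterator design these classes implement can be sketched in plain Python (hypothetical simplified stand-ins for illustration, not the real mwxml classes):

```python
class Revision:
    def __init__(self, id, text):
        self.id, self.text = id, text

class Page:
    """Page metadata plus a lazy iterator of its revisions."""
    def __init__(self, title, revisions):
        self.title = title
        self._revisions = iter(revisions)
    def __iter__(self):
        return self._revisions

class Dump:
    """Top-level container: site metadata plus a lazy iterator of pages."""
    def __init__(self, site_name, pages):
        self.site_name = site_name
        self._pages = iter(pages)
    def __iter__(self):
        return self._pages

dump = Dump("demo", [Page("Alpha", [Revision(1, "hi")])])
titles = [(page.title, [rev.id for rev in page]) for page in dump]
```

Because each level only holds an iterator, revisions are materialized one at a time as the consumer advances, which is what makes the pattern work for very large dumps.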


Distributed Processing

Parallel processing functionality for handling multiple XML dump files simultaneously using multiprocessing to overcome Python's GIL limitations.

def map(process, paths, threads=None):
    """
    Distributed processing strategy for XML files.
    
    Parameters:
    - process: Function that takes (Dump, path) and yields results
    - paths: Iterable of file paths to process
    - threads: Number of processing threads (optional)
    
    Yields: Results from process function
    """


Utilities and CLI Tools

Command-line utilities and functions for converting XML dumps to various formats and validating/normalizing revision documents.

def dump2revdocs(dump, verbose=False):
    """
    Convert XML dumps to revision JSON documents.
    
    Parameters:
    - dump: mwxml.Dump object to process
    - verbose: Print progress information (bool, default: False)
    
    Yields: JSON strings representing revision documents
    """

def validate(docs, schema, verbose=False): 
    """
    Validate revision documents against schema.
    
    Parameters:
    - docs: Iterable of revision document objects
    - schema: Schema definition for validation
    - verbose: Print progress information (bool, default: False)
    
    Yields: Validated revision documents
    """

def normalize(rev_docs, verbose=False):
    """
    Convert old revision documents to current schema format.
    
    Parameters:
    - rev_docs: Iterable of revision documents in old format
    - verbose: Print progress information (bool, default: False)
    
    Yields: Normalized revision documents
    """

def inflate(flat_jsons, verbose=False):
    """
    Convert flat revision documents to standard format.
    
    Parameters:
    - flat_jsons: Iterable of flat/compressed revision documents
    - verbose: Print progress information (bool, default: False)
    
    Yields: Inflated revision documents with full structure
    """


Types

class SiteInfo:
    """Site metadata from <siteinfo> block."""
    name: str | None
    dbname: str | None  
    base: str | None
    generator: str | None
    case: str | None
    namespaces: list[Namespace] | None

class Namespace:
    """Namespace information."""  
    id: int
    name: str
    case: str | None

class Page:
    """
    Page metadata (inherits from mwtypes.Page).
    Contains page information and revision iterator.
    """
    id: int
    title: str
    namespace: int
    redirect: str | None
    restrictions: list[str]

class Revision:
    """
    Revision metadata and content (inherits from mwtypes.Revision).
    Contains revision information and content slots.
    """
    id: int
    timestamp: Timestamp
    user: User | None
    minor: bool
    parent_id: int | None
    comment: str | None
    deleted: Deleted
    slots: Slots

class LogItem:
    """Log entry for administrative actions (inherits from mwtypes.LogItem)."""
    id: int
    timestamp: Timestamp
    comment: str | None
    user: User | None
    page: Page | None
    type: str | None
    action: str | None
    text: str | None
    params: str | None
    deleted: Deleted

class User:
    """User information (inherits from mwtypes.User)."""
    id: int | None
    text: str | None

class Content:
    """Content metadata and text for revision slots (inherits from mwtypes.Content)."""
    role: str | None
    origin: str | None
    model: str | None
    format: str | None
    text: str | None
    sha1: str | None
    deleted: bool
    bytes: int | None
    id: str | None
    location: str | None

class Slots:
    """Container for revision content slots (inherits from mwtypes.Slots)."""
    main: Content | None
    contents: dict[str, Content]
    sha1: str | None

class Deleted:
    """Deletion status information."""
    comment: bool
    text: bool
    user: bool

class Timestamp:
    """Timestamp type from mwtypes."""
    pass

class MalformedXML(Exception):
    """Thrown when XML dump file is not formatted as expected."""