A set of utilities for processing MediaWiki XML dump data efficiently with streaming and distributed processing capabilities.
```
npx @tessl/cli install tessl/pypi-mwxml@0.3.0
```

mwxml is a comprehensive collection of utilities for efficiently processing MediaWiki's XML database dumps, addressing both the performance and the complexity concerns of streaming XML parsing. It enables memory-efficient stream processing with a simple iterator strategy that abstracts the XML structure into logical components: a Dump contains SiteInfo and iterators of Pages/LogItems, and each Page contains metadata and an iterator of Revisions.
Key Features:

- Memory-efficient streaming of arbitrarily large (multi-gigabyte) XML dumps
- A simple iterator abstraction: a Dump yields Pages and LogItems, and each Page yields Revisions
- Parallel processing of multiple dump files with map
- Utilities for converting, validating, normalizing, and inflating revision documents
Install from PyPI:

```
pip install mwxml
```

```python
import mwxml
```

Most common imports for working with XML dumps:
```python
from mwxml import Dump, Page, Revision, SiteInfo, Namespace, LogItem, map
```

For utilities and processing functions:
```python
from mwxml.utilities import dump2revdocs, validate, normalize, inflate
```

A quick-start example:

```python
import mwxml
# Load and process a MediaWiki XML dump
dump = mwxml.Dump.from_file(open("dump.xml"))
# Access site information
print(dump.site_info.name, dump.site_info.dbname)
# Iterate through pages and revisions
for page in dump:
    print(f"Page: {page.title} (ID: {page.id})")
    for revision in page:
        print(f" Revision {revision.id} by {revision.user.text if revision.user else 'Anonymous'}")
        if revision.slots and revision.slots.main and revision.slots.main.text:
            print(f"  Text length: {len(revision.slots.main.text)}")

# Alternative: Direct page access
for page in dump.pages:
    for revision in page:
        print(f"Page {page.id}, Revision {revision.id}")

# Process log items if present
for log_item in dump.log_items:
    print(f"Log: {log_item.type} - {log_item.action}")
```

The mwxml library implements a streaming XML parser that transforms complex MediaWiki dump structures into simple Python iterators:

- <siteinfo> blocks become SiteInfo objects
- <page> elements become Page objects that iterate over their revisions
- <revision> elements become Revision objects with metadata and content slots
- <logitem> elements become LogItem objects

This design enables processing of multi-gigabyte XML dumps with a minimal memory footprint while providing simple Python iteration patterns.
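Because parsing streams element by element, a dump never has to fit in memory. A minimal sketch of reading a compressed dump directly, assuming the standard-library bz2 module and a hypothetical file name; Dump.from_file only needs a readable file-like object:

```python
import bz2
import mwxml

# Stream a compressed dump without decompressing it to disk first.
# The file name is hypothetical; any readable file-like object works.
with bz2.open("enwiki-latest-pages-meta-history.xml.bz2", mode="rt") as f:
    dump = mwxml.Dump.from_file(f)
    revisions = 0
    for page in dump:
        for revision in page:
            revisions += 1
    print(f"{revisions} revisions processed")
```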
Essential classes for parsing MediaWiki XML dumps into structured Python objects with streaming iteration support.
```python
class Dump:
    @classmethod
    def from_file(cls, f): ...

    @classmethod
    def from_page_xml(cls, page_xml): ...

    def __iter__(self): ...

class Page:
    def __iter__(self): ...

    @classmethod
    def from_element(cls, element, namespace_map=None): ...

class Revision:
    @classmethod
    def from_element(cls, element): ...

class SiteInfo:
    @classmethod
    def from_element(cls, element): ...
```
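Dump.from_page_xml is convenient for tests and small experiments. A minimal sketch, assuming it accepts a string (or file-like object) containing bare <page> XML; the sample content below is made up:

```python
import mwxml

page_xml = """
<page>
  <title>Sandbox</title>
  <ns>0</ns>
  <id>1</id>
  <revision>
    <id>10</id>
    <timestamp>2020-01-01T00:00:00Z</timestamp>
    <contributor><username>Example</username><id>42</id></contributor>
    <model>wikitext</model>
    <format>text/x-wiki</format>
    <text>Hello, world!</text>
  </revision>
</page>
"""

# Wrap the bare <page> element in a synthetic dump and iterate it as usual.
dump = mwxml.Dump.from_page_xml(page_xml)
for page in dump:
    for revision in page:
        print(page.title, revision.id)
```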
Parallel processing functionality for handling multiple XML dump files simultaneously, using multiprocessing to overcome Python's GIL limitations.

```python
def map(process, paths, threads=None):
    """
    Distributed processing strategy for XML files.

    Parameters:
    - process: Function that takes (Dump, path) and yields results
    - paths: Iterable of file paths to process
    - threads: Number of processing threads (optional)

    Yields: Results from process function
    """
```
Command-line utilities and functions for converting XML dumps to various formats and for validating and normalizing revision documents.

```python
def dump2revdocs(dump, verbose=False):
    """
    Convert XML dumps to revision JSON documents.

    Parameters:
    - dump: mwxml.Dump object to process
    - verbose: Print progress information (bool, default: False)

    Yields: JSON strings representing revision documents
    """

def validate(docs, schema, verbose=False):
    """
    Validate revision documents against a schema.

    Parameters:
    - docs: Iterable of revision document objects
    - schema: Schema definition for validation
    - verbose: Print progress information (bool, default: False)

    Yields: Validated revision documents
    """

def normalize(rev_docs, verbose=False):
    """
    Convert old revision documents to the current schema format.

    Parameters:
    - rev_docs: Iterable of revision documents in the old format
    - verbose: Print progress information (bool, default: False)

    Yields: Normalized revision documents
    """

def inflate(flat_jsons, verbose=False):
    """
    Convert flat revision documents to the standard format.

    Parameters:
    - flat_jsons: Iterable of flat/compressed revision documents
    - verbose: Print progress information (bool, default: False)

    Yields: Inflated revision documents with full structure
    """
```
"""Site metadata from <siteinfo> block."""
name: str | None
dbname: str | None
base: str | None
generator: str | None
case: str | None
namespaces: list[Namespace] | None
class Namespace:
"""Namespace information."""
id: int
name: str
case: str | None
class Page:
"""
Page metadata (inherits from mwtypes.Page).
Contains page information and revision iterator.
"""
id: int
title: str
namespace: int
redirect: str | None
restrictions: list[str]
class Revision:
"""
Revision metadata and content (inherits from mwtypes.Revision).
Contains revision information and content slots.
"""
id: int
timestamp: Timestamp
user: User | None
minor: bool
parent_id: int | None
comment: str | None
deleted: Deleted
slots: Slots
class LogItem:
"""Log entry for administrative actions (inherits from mwtypes.LogItem)."""
id: int
timestamp: Timestamp
comment: str | None
user: User | None
page: Page | None
type: str | None
action: str | None
text: str | None
params: str | None
deleted: Deleted
class User:
"""User information (inherits from mwtypes.User)."""
id: int | None
text: str | None
class Content:
"""Content metadata and text for revision slots (inherits from mwtypes.Content)."""
role: str | None
origin: str | None
model: str | None
format: str | None
text: str | None
sha1: str | None
deleted: bool
bytes: int | None
id: str | None
location: str | None
class Slots:
"""Container for revision content slots (inherits from mwtypes.Slots)."""
main: Content | None
contents: dict[str, Content]
sha1: str | None
class Deleted:
"""Deletion status information."""
comment: bool
text: bool
user: bool
class Timestamp:
"""Timestamp type from mwtypes."""
pass
class MalformedXML(Exception):
"""Thrown when XML dump file is not formatted as expected."""