A set of utilities for processing MediaWiki XML dump data efficiently with streaming and distributed processing capabilities.
```
npx @tessl/cli install tessl/pypi-mwxml@0.3.0
```

mwxml is a comprehensive collection of utilities for efficiently processing MediaWiki's XML database dumps, addressing both the performance and the complexity concerns of streaming XML parsing. It enables memory-efficient stream processing with a simple iterator strategy that abstracts the XML structure into logical components: a Dump contains SiteInfo and iterators of Pages/LogItems, and each Page contains metadata and an iterator of Revisions.
Key Features:

- Memory-efficient streaming of arbitrarily large (multi-gigabyte) XML dumps
- A simple iterator abstraction: a Dump yields Pages and LogItems, and each Page yields Revisions
- Parallel processing of multiple dump files with map
- Utilities for converting, validating, normalizing, and inflating revision documents
Install from PyPI:

```
pip install mwxml
```

```python
import mwxml
```

Most common imports for working with XML dumps:
```python
from mwxml import Dump, Page, Revision, SiteInfo, Namespace, LogItem, map
```

For utilities and processing functions:
```python
from mwxml.utilities import dump2revdocs, validate, normalize, inflate
```

A quick-start example:

```python
import mwxml
# Load and process a MediaWiki XML dump
dump = mwxml.Dump.from_file(open("dump.xml"))
# Access site information
print(dump.site_info.name, dump.site_info.dbname)
# Iterate through pages and revisions
for page in dump:
    print(f"Page: {page.title} (ID: {page.id})")
    for revision in page:
        print(f" Revision {revision.id} by {revision.user.text if revision.user else 'Anonymous'}")
        if revision.slots and revision.slots.main and revision.slots.main.text:
            print(f"  Text length: {len(revision.slots.main.text)}")

# Alternative: Direct page access
for page in dump.pages:
    for revision in page:
        print(f"Page {page.id}, Revision {revision.id}")

# Process log items if present
for log_item in dump.log_items:
    print(f"Log: {log_item.type} - {log_item.action}")
```

The mwxml library implements a streaming XML parser that transforms complex MediaWiki dump structures into simple Python iterators:

- <siteinfo> blocks become SiteInfo objects
- <page> elements become Page objects that iterate over their revisions
- <revision> elements become Revision objects with metadata and content slots
- <logitem> elements become LogItem objects

This design enables processing of multi-gigabyte XML dumps with a minimal memory footprint while providing simple Python iteration patterns.
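Because parsing streams element by element, a dump never has to fit in memory. A minimal sketch of reading a compressed dump directly, assuming the standard-library bz2 module and a hypothetical file name; Dump.from_file only needs a readable file-like object:

```python
import bz2
import mwxml

# Stream a compressed dump without decompressing it to disk first.
# The file name is hypothetical; any readable file-like object works.
with bz2.open("enwiki-latest-pages-meta-history.xml.bz2", mode="rt") as f:
    dump = mwxml.Dump.from_file(f)
    revisions = 0
    for page in dump:
        for revision in page:
            revisions += 1
    print(f"{revisions} revisions processed")
```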
Essential classes for parsing MediaWiki XML dumps into structured Python objects with streaming iteration support.
```python
class Dump:
    @classmethod
    def from_file(cls, f): ...

    @classmethod
    def from_page_xml(cls, page_xml): ...

    def __iter__(self): ...

class Page:
    def __iter__(self): ...

    @classmethod
    def from_element(cls, element, namespace_map=None): ...

class Revision:
    @classmethod
    def from_element(cls, element): ...

class SiteInfo:
    @classmethod
    def from_element(cls, element): ...
```
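Dump.from_page_xml is convenient for tests and small experiments. A minimal sketch, assuming it accepts a string (or file-like object) containing bare <page> XML; the sample content below is made up:

```python
import mwxml

page_xml = """
<page>
  <title>Sandbox</title>
  <ns>0</ns>
  <id>1</id>
  <revision>
    <id>10</id>
    <timestamp>2020-01-01T00:00:00Z</timestamp>
    <contributor><username>Example</username><id>42</id></contributor>
    <model>wikitext</model>
    <format>text/x-wiki</format>
    <text>Hello, world!</text>
  </revision>
</page>
"""

# Wrap the bare <page> element in a synthetic dump and iterate it as usual.
dump = mwxml.Dump.from_page_xml(page_xml)
for page in dump:
    for revision in page:
        print(page.title, revision.id)
```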
Parallel processing functionality for handling multiple XML dump files simultaneously, using multiprocessing to overcome Python's GIL limitations.

```python
def map(process, paths, threads=None):
    """
    Distributed processing strategy for XML files.

    Parameters:
    - process: Function that takes (Dump, path) and yields results
    - paths: Iterable of file paths to process
    - threads: Number of processing threads (optional)

    Yields: Results from process function
    """
```
Command-line utilities and functions for converting XML dumps to various formats and for validating and normalizing revision documents.

```python
def dump2revdocs(dump, verbose=False):
    """
    Convert XML dumps to revision JSON documents.

    Parameters:
    - dump: mwxml.Dump object to process
    - verbose: Print progress information (bool, default: False)

    Yields: JSON strings representing revision documents
    """

def validate(docs, schema, verbose=False):
    """
    Validate revision documents against a schema.

    Parameters:
    - docs: Iterable of revision document objects
    - schema: Schema definition for validation
    - verbose: Print progress information (bool, default: False)

    Yields: Validated revision documents
    """

def normalize(rev_docs, verbose=False):
    """
    Convert old revision documents to the current schema format.

    Parameters:
    - rev_docs: Iterable of revision documents in the old format
    - verbose: Print progress information (bool, default: False)

    Yields: Normalized revision documents
    """

def inflate(flat_jsons, verbose=False):
    """
    Convert flat revision documents to the standard format.

    Parameters:
    - flat_jsons: Iterable of flat/compressed revision documents
    - verbose: Print progress information (bool, default: False)

    Yields: Inflated revision documents with full structure
    """
```
"""Site metadata from <siteinfo> block."""
name: str | None
dbname: str | None
base: str | None
generator: str | None
case: str | None
namespaces: list[Namespace] | None
class Namespace:
"""Namespace information."""
id: int
name: str
case: str | None
class Page:
"""
Page metadata (inherits from mwtypes.Page).
Contains page information and revision iterator.
"""
id: int
title: str
namespace: int
redirect: str | None
restrictions: list[str]
class Revision:
"""
Revision metadata and content (inherits from mwtypes.Revision).
Contains revision information and content slots.
"""
id: int
timestamp: Timestamp
user: User | None
minor: bool
parent_id: int | None
comment: str | None
deleted: Deleted
slots: Slots
class LogItem:
"""Log entry for administrative actions (inherits from mwtypes.LogItem)."""
id: int
timestamp: Timestamp
comment: str | None
user: User | None
page: Page | None
type: str | None
action: str | None
text: str | None
params: str | None
deleted: Deleted
class User:
"""User information (inherits from mwtypes.User)."""
id: int | None
text: str | None
class Content:
"""Content metadata and text for revision slots (inherits from mwtypes.Content)."""
role: str | None
origin: str | None
model: str | None
format: str | None
text: str | None
sha1: str | None
deleted: bool
bytes: int | None
id: str | None
location: str | None
class Slots:
"""Container for revision content slots (inherits from mwtypes.Slots)."""
main: Content | None
contents: dict[str, Content]
sha1: str | None
class Deleted:
"""Deletion status information."""
comment: bool
text: bool
user: bool
class Timestamp:
"""Timestamp type from mwtypes."""
pass
class MalformedXML(Exception):
"""Thrown when XML dump file is not formatted as expected."""