# mwxml

mwxml is a collection of utilities for processing MediaWiki's XML database dumps. It addresses both the performance and the complexity problems of streaming XML parsing with a simple iterator strategy that abstracts the XML structure into logical components: a `Dump` contains a `SiteInfo` and an iterator of `Page` or `LogItem` objects, and each `Page` carries its metadata plus an iterator of `Revision` objects. Items are read lazily, so memory use stays low even for large dumps.

**Key Features:**

- Memory-efficient streaming XML parsing
- Iterator-based API for large dump files
- Multiprocessing support for parallel processing
- Command-line utilities for common tasks
- Complete type definitions and error handling
- Support for both page dumps and log dumps

## Package Information

- **Package Name**: mwxml
- **Language**: Python
- **Installation**: `pip install mwxml`
- **Documentation**: https://pythonhosted.org/mwxml

## Core Imports

```python
import mwxml
```

Most common imports for working with XML dumps:

```python
from mwxml import Dump, Page, Revision, SiteInfo, Namespace, LogItem, map
```

For utilities and processing functions:

```python
from mwxml.utilities import dump2revdocs, validate, normalize, inflate
```

## Basic Usage

```python
import mwxml

# Load and process a MediaWiki XML dump
dump = mwxml.Dump.from_file(open("dump.xml"))

# Access site information
print(dump.site_info.name, dump.site_info.dbname)

# Iterate through pages and revisions
for page in dump:
    print(f"Page: {page.title} (ID: {page.id})")
    for revision in page:
        print(f" Revision {revision.id} by {revision.user.text if revision.user else 'Anonymous'}")
        if revision.slots and revision.slots.main and revision.slots.main.text:
            print(f" Text length: {len(revision.slots.main.text)}")

# Alternative to the loop above: direct page access.
# The dump is read as a single-pass stream, so use one loop or the other
# on a given Dump object, not both.
for page in dump.pages:
    for revision in page:
        print(f"Page {page.id}, Revision {revision.id}")

# Process log items if present (log dumps)
for log_item in dump.log_items:
    print(f"Log: {log_item.type} - {log_item.action}")
```

## Architecture

The mwxml library implements a streaming XML parser that transforms complex MediaWiki dump structures into simple Python iterators:

- **Dump**: Top-level container with site metadata and item iterators
- **SiteInfo**: Site configuration, namespaces, and metadata from `<siteinfo>` blocks
- **Page**: Page metadata with revision iterators for efficient memory usage
- **Revision**: Individual revision data with user, timestamp, content, and metadata
- **LogItem**: Log entry data for administrative actions and events
- **Distributed Processing**: Parallel processing across multiple dump files using multiprocessing

This design enables processing of multi-gigabyte XML dumps with minimal memory footprint while providing simple Python iteration patterns.
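
The streaming idea can be illustrated with the standard library alone. The sketch below is not mwxml's implementation: it uses `xml.etree.ElementTree.iterparse` on a tiny made-up document, and the names `SAMPLE` and `iter_pages` are hypothetical. Real dumps declare an XML namespace and a much richer schema, and mwxml also streams revisions within a page, which this simplified version does not.

```python
import io
import xml.etree.ElementTree as ET

# A tiny stand-in for a dump (assumption: no XML namespace, minimal fields).
SAMPLE = b"""<mediawiki>
  <page>
    <title>Foo</title>
    <id>1</id>
    <revision><id>10</id></revision>
    <revision><id>11</id></revision>
  </page>
  <page>
    <title>Bar</title>
    <id>2</id>
    <revision><id>20</id></revision>
  </page>
</mediawiki>"""


def iter_pages(f):
    """Yield (title, page_id, revision_ids) for one page at a time."""
    for _event, elem in ET.iterparse(f, events=("end",)):
        if elem.tag == "page":
            title = elem.findtext("title")
            page_id = int(elem.findtext("id"))  # direct child <id> only
            rev_ids = [int(r.findtext("id")) for r in elem.iter("revision")]
            yield title, page_id, rev_ids
            # Drop the finished page's children so memory does not grow
            # with the number of pages already processed.
            elem.clear()


pages = list(iter_pages(io.BytesIO(SAMPLE)))
```

Because `iter_pages` is a generator, a caller that loops over it holds only one page's data at a time; collecting into a list here is only for inspection.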

## Capabilities

### Core XML Processing

Essential classes for parsing MediaWiki XML dumps into structured Python objects with streaming iteration support.

```python { .api }
class Dump:
    @classmethod
    def from_file(cls, f): ...
    @classmethod
    def from_page_xml(cls, page_xml): ...
    def __iter__(self): ...

class Page:
    def __iter__(self): ...
    @classmethod
    def from_element(cls, element, namespace_map=None): ...

class Revision:
    @classmethod
    def from_element(cls, element): ...

class SiteInfo:
    @classmethod
    def from_element(cls, element): ...
```

[Core Processing](./core-processing.md)

### Distributed Processing

Parallel processing of multiple XML dump files. Work is spread across worker processes with multiprocessing, which sidesteps the limits of Python's GIL.

```python { .api }
def map(process, paths, threads=None):
    """
    Distributed processing strategy for XML files.

    Parameters:
    - process: Function that takes (Dump, path) and yields results
    - paths: Iterable of file paths to process
    - threads: Number of parallel workers to use (optional)

    Yields: Results from process function
    """
```

[Distributed Processing](./distributed-processing.md)
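
A process function for `map` might look like the sketch below. It relies only on the documented contract (the function receives a `Dump` and its path and yields results); the names `count_revisions`, `total_revisions`, and `FakePage` are hypothetical, and the stand-in objects exist only so the generator can be tried without a real dump file.

```python
from collections import Counter


def count_revisions(dump, path):
    """Process function in the (dump, path) shape that map() expects.

    Yields one (path, page_title, revision_count) tuple per page.
    """
    for page in dump:
        yield path, page.title, sum(1 for _revision in page)


def total_revisions(paths, threads=None):
    """Sum revision counts per file by fanning out with mwxml.map."""
    import mwxml  # imported lazily so count_revisions can be tested without it

    totals = Counter()
    for path, _title, n in mwxml.map(count_revisions, paths, threads=threads):
        totals[path] += n
    return totals


class FakePage(list):
    """Stand-in for a page: iterable of revisions with a title (not an mwxml type)."""

    def __init__(self, title, revisions):
        super().__init__(revisions)
        self.title = title


# Quick local check with stand-in data instead of a parsed dump.
fake_dump = [FakePage("Foo", [1, 2, 3]), FakePage("Bar", [4])]
results = list(count_revisions(fake_dump, "fake.xml"))
```

Since results travel between processes, yielding small, picklable values such as tuples of strings and integers is a reasonable precaution.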

### Utilities and CLI Tools

Command-line utilities and functions for converting XML dumps to various formats and validating/normalizing revision documents.

```python { .api }
def dump2revdocs(dump, verbose=False):
    """
    Convert XML dumps to revision JSON documents.

    Parameters:
    - dump: mwxml.Dump object to process
    - verbose: Print progress information (bool, default: False)

    Yields: JSON strings representing revision documents
    """

def validate(docs, schema, verbose=False):
    """
    Validate revision documents against schema.

    Parameters:
    - docs: Iterable of revision document objects
    - schema: Schema definition for validation
    - verbose: Print progress information (bool, default: False)

    Yields: Validated revision documents
    """

def normalize(rev_docs, verbose=False):
    """
    Convert old revision documents to current schema format.

    Parameters:
    - rev_docs: Iterable of revision documents in old format
    - verbose: Print progress information (bool, default: False)

    Yields: Normalized revision documents
    """

def inflate(flat_jsons, verbose=False):
    """
    Convert flat revision documents to standard format.

    Parameters:
    - flat_jsons: Iterable of flat/compressed revision documents
    - verbose: Print progress information (bool, default: False)

    Yields: Inflated revision documents with full structure
    """
```

[Utilities](./utilities.md)
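
The functions above consume an iterable and yield results, and `dump2revdocs` is documented as yielding JSON strings, so a common pattern is to store one document per line and parse lazily when reading back. The sketch below shows that glue with the standard library only. The helper names `write_json_lines` and `read_json_lines` are hypothetical, and the sample documents use made-up fields rather than the real revision document schema.

```python
import io
import json


def write_json_lines(json_docs, out):
    """Write one JSON string per line to `out`; return how many were written."""
    count = 0
    for doc in json_docs:
        out.write(doc.rstrip("\n") + "\n")
        count += 1
    return count


def read_json_lines(lines):
    """Lazily parse one JSON document per non-empty line."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


# Hypothetical documents; real revision documents carry many more fields.
sample = (json.dumps({"id": i, "page": {"title": "Foo"}}) for i in (10, 11))

buffer = io.StringIO()
written = write_json_lines(sample, buffer)
buffer.seek(0)
docs = list(read_json_lines(buffer))
```

In real use, the generator passed to `write_json_lines` would be the output of `dump2revdocs(dump)`, and the parsed documents from `read_json_lines` could be handed to `validate` or `normalize`, assuming those accept parsed objects as their parameter descriptions suggest.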

## Types

```python { .api }
class SiteInfo:
    """Site metadata from <siteinfo> block."""
    name: str | None
    dbname: str | None
    base: str | None
    generator: str | None
    case: str | None
    namespaces: list[Namespace] | None

class Namespace:
    """Namespace information."""
    id: int
    name: str
    case: str | None

class Page:
    """
    Page metadata (inherits from mwtypes.Page).
    Contains page information and revision iterator.
    """
    id: int
    title: str
    namespace: int
    redirect: str | None
    restrictions: list[str]

class Revision:
    """
    Revision metadata and content (inherits from mwtypes.Revision).
    Contains revision information and content slots.
    """
    id: int
    timestamp: Timestamp
    user: User | None
    minor: bool
    parent_id: int | None
    comment: str | None
    deleted: Deleted
    slots: Slots

class LogItem:
    """Log entry for administrative actions (inherits from mwtypes.LogItem)."""
    id: int
    timestamp: Timestamp
    comment: str | None
    user: User | None
    page: Page | None
    type: str | None
    action: str | None
    text: str | None
    params: str | None
    deleted: Deleted

class User:
    """User information (inherits from mwtypes.User)."""
    id: int | None
    text: str | None

class Content:
    """Content metadata and text for revision slots (inherits from mwtypes.Content)."""
    role: str | None
    origin: str | None
    model: str | None
    format: str | None
    text: str | None
    sha1: str | None
    deleted: bool
    bytes: int | None
    id: str | None
    location: str | None

class Slots:
    """Container for revision content slots (inherits from mwtypes.Slots)."""
    main: Content | None
    contents: dict[str, Content]
    sha1: str | None

class Deleted:
    """Deletion status information."""
    comment: bool
    text: bool
    user: bool

class Timestamp:
    """Timestamp type from mwtypes."""
    pass

class MalformedXML(Exception):
    """Raised when XML dump file is not formatted as expected."""
```
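
Several of these fields are optional (`user`, `slots.main`, `text`), so code that reads them has to tolerate `None` at each step. The helpers below are a minimal sketch of one way to do that. `main_text` and `user_label` are hypothetical names, not part of mwxml, and the `SimpleNamespace` objects merely imitate the attribute layout listed above.

```python
from types import SimpleNamespace


def main_text(revision):
    """Return the main slot's text, or None if any link in the chain is missing."""
    slots = getattr(revision, "slots", None)
    main = getattr(slots, "main", None)
    return getattr(main, "text", None)


def user_label(revision, fallback="(unknown user)"):
    """Return the revision's user text, or `fallback` when it is unavailable."""
    user = getattr(revision, "user", None)
    text = getattr(user, "text", None)
    return text if text is not None else fallback


# Stand-in revisions that mimic the documented attributes (not mwxml objects).
full = SimpleNamespace(
    user=SimpleNamespace(id=1, text="Alice"),
    slots=SimpleNamespace(main=SimpleNamespace(text="Hello")),
)
hidden = SimpleNamespace(user=None, slots=SimpleNamespace(main=None))

full_text = main_text(full)
hidden_text = main_text(hidden)
full_user = user_label(full)
hidden_user = user_label(hidden)
```

Using `getattr` with a default keeps each helper to a few lines, at the cost of also hiding attribute-name typos; explicit `is None` checks are the stricter alternative.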