Tessl Tile for pypi/mwxml@0.3.0

# mwxml

A collection of utilities for processing MediaWiki's XML database dumps efficiently, addressing both the performance and the complexity concerns of streaming XML parsing. It enables memory-efficient stream processing through a simple iterator strategy that abstracts the XML structure into logical components: a Dump contains SiteInfo and iterators of Pages and LogItems, and each Page contains metadata and an iterator of Revisions.

**Key Features:**
- Memory-efficient streaming XML parsing
- Iterator-based API for large dump files
- Multiprocessing support for parallel processing
- Command-line utilities for common tasks
- Complete type definitions and error handling
- Support for both page dumps and log dumps

## Package Information

- **Package Name**: mwxml
- **Language**: Python
- **Installation**: `pip install mwxml`
- **Documentation**: https://pythonhosted.org/mwxml

## Core Imports

```python
import mwxml
```

Most common imports for working with XML dumps:

```python
# Note: importing `map` this way shadows Python's built-in map() in the importing module
from mwxml import Dump, Page, Revision, SiteInfo, Namespace, LogItem, map
```

For utilities and processing functions:

```python
from mwxml.utilities import dump2revdocs, validate, normalize, inflate
```

## Basic Usage

```python
import mwxml

# Load a MediaWiki XML dump. The file is parsed as a stream, so keep it open
# while iterating.
dump = mwxml.Dump.from_file(open("dump.xml"))

# Access site information
print(dump.site_info.name, dump.site_info.dbname)

# Iterate through pages and revisions
for page in dump:
    print(f"Page: {page.title} (ID: {page.id})")
    for revision in page:
        # revision.user may be None, so fall back to a placeholder name
        user_text = revision.user.text if revision.user else "Anonymous"
        print(f"  Revision {revision.id} by {user_text}")
        if revision.slots and revision.slots.main and revision.slots.main.text:
            print(f"    Text length: {len(revision.slots.main.text)}")

# Alternative: Direct page access.
# The dump is consumed as it is read, so run this on a freshly loaded dump
# rather than after the loop above.
for page in dump.pages:
    for revision in page:
        print(f"Page {page.id}, Revision {revision.id}")

# Process log items if present (again on a freshly loaded dump)
for log_item in dump.log_items:
    print(f"Log: {log_item.type} - {log_item.action}")
```

## Architecture

The mwxml library implements a streaming XML parser that turns the nested structure of a MediaWiki dump into simple Python iterators:

- **Dump**: Top-level container with site metadata and item iterators
- **SiteInfo**: Site configuration, namespaces, and metadata from `<siteinfo>` blocks
- **Page**: Page metadata with a revision iterator, which keeps memory usage low
- **Revision**: Individual revision data with user, timestamp, content, and metadata
- **LogItem**: Log entry data for administrative actions and events
- **Distributed Processing**: Parallel processing across multiple dump files using multiprocessing

This design allows multi-gigabyte XML dumps to be processed with a small memory footprint through ordinary Python iteration.
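
The streaming idea described above can be illustrated with the standard library alone. The following is a minimal sketch, not mwxml's actual implementation: it assumes a tiny made-up dump without the XML namespace that real MediaWiki dumps carry, and the helper name `iter_pages` is hypothetical. Unlike mwxml, it collects the revision IDs of one page into a list before yielding, rather than streaming revisions individually.

```python
import io
import xml.etree.ElementTree as ET

# A tiny, made-up dump used only for this illustration (no XML namespace).
SAMPLE = b"""<mediawiki>
  <page>
    <title>Foo</title>
    <id>1</id>
    <revision><id>10</id></revision>
    <revision><id>11</id></revision>
  </page>
  <page>
    <title>Bar</title>
    <id>2</id>
    <revision><id>20</id></revision>
  </page>
</mediawiki>"""


def iter_pages(stream):
    """Hypothetical helper: yield (title, page_id, revision_ids) one page at a time."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "page":
            title = elem.findtext("title")
            page_id = int(elem.findtext("id"))
            rev_ids = [int(r.findtext("id")) for r in elem.findall("revision")]
            yield title, page_id, rev_ids
            # Drop the finished subtree so its text and children can be freed.
            elem.clear()


pages = list(iter_pages(io.BytesIO(SAMPLE)))
for title, page_id, rev_ids in pages:
    print(title, page_id, rev_ids)
```

Only one `<page>` subtree is held in full at any time, which is the property that keeps memory use roughly independent of the dump size.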

## Capabilities

### Core XML Processing

Essential classes for parsing MediaWiki XML dumps into structured Python objects with streaming iteration support.

```python { .api }
class Dump:
    @classmethod
    def from_file(cls, f): ...
    @classmethod
    def from_page_xml(cls, page_xml): ...
    def __iter__(self): ...

class Page:
    def __iter__(self): ...
    @classmethod
    def from_element(cls, element, namespace_map=None): ...

class Revision:
    @classmethod
    def from_element(cls, element): ...

class SiteInfo:
    @classmethod
    def from_element(cls, element): ...
```

[Core Processing](./core-processing.md)

### Distributed Processing

Parallel processing of multiple XML dump files at once. Work is spread across processes with multiprocessing, which sidesteps the limits of Python's GIL.

```python { .api }
def map(process, paths, threads=None):
    """
    Distributed processing strategy for XML files.

    Parameters:
    - process: Function that takes (Dump, path) and yields results
    - paths: Iterable of file paths to process
    - threads: Number of processing threads (optional)

    Yields: Results from process function
    """
```

[Distributed Processing](./distributed-processing.md)
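
The shape of the `process` callback is easier to see with a small stand-in. The sketch below does not use mwxml: `sequential_map` is a hypothetical, single-process substitute for `map`, and a plain ElementTree root takes the place of the `Dump` object. It shows only the contract described above, in which the callback receives a parsed dump plus its path and yields any number of results.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def sequential_map(process, paths):
    """Hypothetical stand-in for map(): call process(parsed, path) per file, in order."""
    for path in paths:
        with open(path, "rb") as f:
            parsed = ET.parse(f).getroot()  # stand-in for a Dump object
        yield from process(parsed, path)


def count_revisions(parsed, path):
    """Example process function: yield one (file name, revision count) pair."""
    yield os.path.basename(path), len(parsed.findall(".//revision"))


# Write two small made-up files to exercise the sketch.
tmpdir = tempfile.mkdtemp()
paths = []
for name, n_revs in [("a.xml", 2), ("b.xml", 3)]:
    path = os.path.join(tmpdir, name)
    with open(path, "w") as f:
        f.write("<mediawiki><page>" + "<revision/>" * n_revs + "</page></mediawiki>")
    paths.append(path)

# Sorting makes the output independent of the order in which files finish.
results = sorted(sequential_map(count_revisions, paths))
print(results)
```

With a parallel implementation, results from different files may arrive interleaved, so a callback that includes the path (or file name) in each result keeps the output attributable.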

### Utilities and CLI Tools

Command-line utilities and functions for converting XML dumps to various formats and validating/normalizing revision documents.

```python { .api }
def dump2revdocs(dump, verbose=False):
    """
    Convert XML dumps to revision JSON documents.

    Parameters:
    - dump: mwxml.Dump object to process
    - verbose: Print progress information (bool, default: False)

    Yields: JSON strings representing revision documents
    """

def validate(docs, schema, verbose=False):
    """
    Validate revision documents against schema.

    Parameters:
    - docs: Iterable of revision document objects
    - schema: Schema definition for validation
    - verbose: Print progress information (bool, default: False)

    Yields: Validated revision documents
    """

def normalize(rev_docs, verbose=False):
    """
    Convert old revision documents to current schema format.

    Parameters:
    - rev_docs: Iterable of revision documents in old format
    - verbose: Print progress information (bool, default: False)

    Yields: Normalized revision documents
    """

def inflate(flat_jsons, verbose=False):
    """
    Convert flat revision documents to standard format.

    Parameters:
    - flat_jsons: Iterable of flat/compressed revision documents
    - verbose: Print progress information (bool, default: False)

    Yields: Inflated revision documents with full structure
    """
```

[Utilities](./utilities.md)
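
Because these utilities are generators, they can be chained without holding a whole dump's worth of documents in memory. The sketch below shows that pattern with the standard library only. The JSON strings are made up to stand in for the output of `dump2revdocs`, the field names (`id`, `page`, `minor`) are illustrative assumptions rather than the real revision document schema, and `parse_docs` and `non_minor` are hypothetical helpers.

```python
import json

# Made-up JSON strings standing in for the output of dump2revdocs;
# the field names are illustrative assumptions, not the real schema.
fake_revdocs = [
    '{"id": 10, "page": {"id": 1, "title": "Foo"}, "minor": false}',
    '{"id": 11, "page": {"id": 1, "title": "Foo"}, "minor": true}',
    '{"id": 20, "page": {"id": 2, "title": "Bar"}, "minor": false}',
]


def parse_docs(json_strings):
    """Lazily decode each JSON string into a dict."""
    for line in json_strings:
        yield json.loads(line)


def non_minor(docs):
    """Lazily keep only documents not flagged as minor edits."""
    for doc in docs:
        if not doc.get("minor", False):
            yield doc


# Each document flows through both stages before the next one is read.
kept_ids = [doc["id"] for doc in non_minor(parse_docs(fake_revdocs))]
print(kept_ids)
```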
181

182
## Types
183

184
```python { .api }
185
class SiteInfo:
186
    """Site metadata from <siteinfo> block."""
187
    name: str | None
188
    dbname: str | None  
189
    base: str | None
190
    generator: str | None
191
    case: str | None
192
    namespaces: list[Namespace] | None
193

194
class Namespace:
195
    """Namespace information."""  
196
    id: int
197
    name: str
198
    case: str | None
199

200
class Page:
201
    """
202
    Page metadata (inherits from mwtypes.Page).
203
    Contains page information and revision iterator.
204
    """
205
    id: int
206
    title: str
207
    namespace: int
208
    redirect: str | None
209
    restrictions: list[str]
210

211
class Revision:
212
    """
213
    Revision metadata and content (inherits from mwtypes.Revision).
214
    Contains revision information and content slots.
215
    """
216
    id: int
217
    timestamp: Timestamp
218
    user: User | None
219
    minor: bool
220
    parent_id: int | None
221
    comment: str | None
222
    deleted: Deleted
223
    slots: Slots
224

225
class LogItem:
226
    """Log entry for administrative actions (inherits from mwtypes.LogItem)."""
227
    id: int
228
    timestamp: Timestamp
229
    comment: str | None
230
    user: User | None
231
    page: Page | None
232
    type: str | None
233
    action: str | None
234
    text: str | None
235
    params: str | None
236
    deleted: Deleted
237

238
class User:
239
    """User information (inherits from mwtypes.User)."""
240
    id: int | None
241
    text: str | None
242

243
class Content:
244
    """Content metadata and text for revision slots (inherits from mwtypes.Content)."""
245
    role: str | None
246
    origin: str | None
247
    model: str | None
248
    format: str | None
249
    text: str | None
250
    sha1: str | None
251
    deleted: bool
252
    bytes: int | None
253
    id: str | None
254
    location: str | None
255

256
class Slots:
257
    """Container for revision content slots (inherits from mwtypes.Slots)."""
258
    main: Content | None
259
    contents: dict[str, Content]
260
    sha1: str | None
261

262
class Deleted:
263
    """Deletion status information."""
264
    comment: bool
265
    text: bool
266
    user: bool
267

268
class Timestamp:
269
    """Timestamp type from mwtypes."""
270
    pass
271

272
class MalformedXML(Exception):
273
    """Thrown when XML dump file is not formatted as expected."""
274
```
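
Many of the fields above are optional (`user`, `slots.main`, `text`), so code that reads them benefits from small None-safe accessors. The following is a minimal sketch under stated assumptions: `main_text_length` and `user_label` are hypothetical helpers, and `SimpleNamespace` objects stand in for real `Revision` instances so the example runs without mwxml installed.

```python
from types import SimpleNamespace


def main_text_length(revision):
    """Return the length of the main slot's text, or 0 if any part is missing."""
    slots = getattr(revision, "slots", None)
    main = getattr(slots, "main", None)
    text = getattr(main, "text", None)
    return len(text) if text else 0


def user_label(revision, fallback="Anonymous"):
    """Return the user's text, or a fallback when the user or its text is None."""
    user = getattr(revision, "user", None)
    text = getattr(user, "text", None)
    return text if text else fallback


# Stand-in objects shaped like the Revision type listed above.
full = SimpleNamespace(
    user=SimpleNamespace(id=7, text="Alice"),
    slots=SimpleNamespace(main=SimpleNamespace(text="hello")),
)
sparse = SimpleNamespace(user=None, slots=SimpleNamespace(main=None))

print(user_label(full), main_text_length(full))
print(user_label(sparse), main_text_length(sparse))
```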
