# mwxml

mwxml is a collection of utilities for efficiently processing MediaWiki's XML database dumps. It addresses both the performance and the complexity concerns of streaming XML parsing by exposing the XML structure through a simple iterator strategy: a Dump contains SiteInfo and iterators of Pages or LogItems, and each Page contains metadata and an iterator of Revisions. Because the dump is consumed as a stream, processing stays memory-efficient even for large files.

**Key Features:**

- Memory-efficient streaming XML parsing
- Iterator-based API for large dump files
- Multiprocessing support for parallel processing
- Command-line utilities for common tasks
- Complete type definitions and error handling
- Support for both page dumps and log dumps

## Package Information

- **Package Name**: mwxml
- **Language**: Python
- **Installation**: `pip install mwxml`
- **Documentation**: https://pythonhosted.org/mwxml

## Core Imports

```python
import mwxml
```

Most common imports for working with XML dumps:

```python
from mwxml import Dump, Page, Revision, SiteInfo, Namespace, LogItem, map
```

For utilities and processing functions:

```python
from mwxml.utilities import dump2revdocs, validate, normalize, inflate
```

## Basic Usage

```python
import mwxml

# Load and process a MediaWiki XML dump
dump = mwxml.Dump.from_file(open("dump.xml"))

# Access site information
print(dump.site_info.name, dump.site_info.dbname)

# Iterate through pages and revisions
for page in dump:
    print(f"Page: {page.title} (ID: {page.id})")
    for revision in page:
        print(f"  Revision {revision.id} by {revision.user.text if revision.user else 'Anonymous'}")
        if revision.slots and revision.slots.main and revision.slots.main.text:
            print(f"    Text length: {len(revision.slots.main.text)}")

# Alternative: Direct page access
# (the dump is read as a stream, so items consumed by an earlier loop
# are not replayed; pick one iteration style per Dump object)
for page in dump.pages:
    for revision in page:
        print(f"Page {page.id}, Revision {revision.id}")

# Process log items if present
for log_item in dump.log_items:
    print(f"Log: {log_item.type} - {log_item.action}")
```
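
Dump files are often distributed compressed, while the example above opens a plain `dump.xml`. The sketch below is one way to obtain a text-mode file object for either case using only the standard library. The helper name `open_dump_file` is hypothetical and not part of mwxml, and passing its result to `Dump.from_file` assumes that any text-mode file object is accepted there.

```python
import bz2
import gzip


def open_dump_file(path):
    """Return a text-mode file object for a plain, .gz, or .bz2 file.

    Hypothetical helper (not part of mwxml); it chooses the opener
    from the file extension only.
    """
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", encoding="utf-8")
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, encoding="utf-8")
```

Under those assumptions, `mwxml.Dump.from_file(open_dump_file("dump.xml.bz2"))` would replace the `open("dump.xml")` call.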

## Architecture

The mwxml library implements a streaming XML parser that transforms complex MediaWiki dump structures into simple Python iterators:

- **Dump**: Top-level container with site metadata and item iterators
- **SiteInfo**: Site configuration, namespaces, and metadata from `<siteinfo>` blocks
- **Page**: Page metadata with revision iterators for efficient memory usage
- **Revision**: Individual revision data with user, timestamp, content, and metadata
- **LogItem**: Log entry data for administrative actions and events
- **Distributed Processing**: Parallel processing across multiple dump files using multiprocessing

This design enables processing of multi-gigabyte XML dumps with minimal memory footprint while providing simple Python iteration patterns.
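
To illustrate the streaming idea in isolation, here is a minimal standard-library sketch that yields one page at a time from an XML stream and releases each page once it has been handled. It is not mwxml's implementation: `iter_pages` and `SAMPLE` are hypothetical, the sample omits the XML namespace that real dumps declare, and unlike mwxml's `Page` it collects a page's revision IDs into a list instead of streaming them.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a dump; real dumps declare an XML namespace,
# which would prefix every tag name seen by the parser.
SAMPLE = """<mediawiki>
  <siteinfo><sitename>Example</sitename></siteinfo>
  <page>
    <title>Foo</title><id>1</id>
    <revision><id>10</id><text>first</text></revision>
    <revision><id>11</id><text>second</text></revision>
  </page>
  <page>
    <title>Bar</title><id>2</id>
    <revision><id>20</id><text>only</text></revision>
  </page>
</mediawiki>"""


def iter_pages(f):
    """Yield (title, page_id, revision_ids) for each <page> in the stream."""
    for _event, elem in ET.iterparse(f, events=("end",)):
        if elem.tag == "page":
            title = elem.findtext("title")
            page_id = int(elem.findtext("id"))
            rev_ids = [int(r.findtext("id")) for r in elem.iter("revision")]
            yield title, page_id, rev_ids
            elem.clear()  # drop the finished page's subtree to bound memory


for title, page_id, rev_ids in iter_pages(io.StringIO(SAMPLE)):
    print(title, page_id, rev_ids)
```

The caller sees a plain Python iterator and never holds more than one page's subtree at a time, which is the property the design above relies on.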

## Capabilities

### Core XML Processing

Essential classes for parsing MediaWiki XML dumps into structured Python objects with streaming iteration support.

```python { .api }
class Dump:
    @classmethod
    def from_file(cls, f): ...
    @classmethod
    def from_page_xml(cls, page_xml): ...
    def __iter__(self): ...

class Page:
    def __iter__(self): ...
    @classmethod
    def from_element(cls, element, namespace_map=None): ...

class Revision:
    @classmethod
    def from_element(cls, element): ...

class SiteInfo:
    @classmethod
    def from_element(cls, element): ...
```

[Core Processing](./core-processing.md)

### Distributed Processing

Parallel processing functionality for handling multiple XML dump files simultaneously using multiprocessing to overcome Python's GIL limitations.

```python { .api }
def map(process, paths, threads=None):
    """
    Distributed processing strategy for XML files.

    Parameters:
    - process: Function that takes (Dump, path) and yields results
    - paths: Iterable of file paths to process
    - threads: Number of processing threads (optional)

    Yields: Results from process function
    """
```

[Distributed Processing](./distributed-processing.md)
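
As a rough sketch of how a `process` function for `map` might look, the example below counts revisions per page. It follows the signature documented above, where `process` takes `(Dump, path)` and yields results. The names `count_revisions` and `run` are hypothetical, and the function is defined at module level on the assumption that it has to be handed to worker processes.

```python
def count_revisions(dump, path):
    """Yield (path, page id, revision count) for every page in one dump."""
    for page in dump:
        # Iterating a page walks its revisions; count them without storing them.
        yield path, page.id, sum(1 for _ in page)


def run(paths, threads=2):
    """Fan count_revisions out over several dump files and print the results."""
    import mwxml  # deferred so count_revisions can be tried without mwxml installed

    for path, page_id, n_revisions in mwxml.map(count_revisions, paths, threads=threads):
        print(f"{path}: page {page_id} has {n_revisions} revisions")
```

Calling `run(["dump1.xml", "dump2.xml"])` would then process both files in parallel. The file names are placeholders.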

### Utilities and CLI Tools

Command-line utilities and functions for converting XML dumps to various formats and validating/normalizing revision documents.

```python { .api }
def dump2revdocs(dump, verbose=False):
    """
    Convert XML dumps to revision JSON documents.

    Parameters:
    - dump: mwxml.Dump object to process
    - verbose: Print progress information (bool, default: False)

    Yields: JSON strings representing revision documents
    """

def validate(docs, schema, verbose=False):
    """
    Validate revision documents against schema.

    Parameters:
    - docs: Iterable of revision document objects
    - schema: Schema definition for validation
    - verbose: Print progress information (bool, default: False)

    Yields: Validated revision documents
    """

def normalize(rev_docs, verbose=False):
    """
    Convert old revision documents to current schema format.

    Parameters:
    - rev_docs: Iterable of revision documents in old format
    - verbose: Print progress information (bool, default: False)

    Yields: Normalized revision documents
    """

def inflate(flat_jsons, verbose=False):
    """
    Convert flat revision documents to standard format.

    Parameters:
    - flat_jsons: Iterable of flat/compressed revision documents
    - verbose: Print progress information (bool, default: False)

    Yields: Inflated revision documents with full structure
    """
```

[Utilities](./utilities.md)
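
Since these utilities are generators, their output can be written incrementally instead of being collected in memory. The sketch below writes one document per line to any file-like object. `write_jsonl` is a hypothetical helper, not part of mwxml, and it accepts either JSON strings or dictionaries because the exact type yielded may vary.

```python
import io
import json


def write_jsonl(docs, out):
    """Write each document on its own line and return how many were written."""
    count = 0
    for doc in docs:
        # Pass JSON strings through unchanged; serialize anything else.
        line = doc if isinstance(doc, str) else json.dumps(doc)
        out.write(line + "\n")
        count += 1
    return count


buffer = io.StringIO()
written = write_jsonl(['{"id": 1}', {"id": 2}], buffer)
print(written, buffer.getvalue(), sep="\n")
```

With a real dump, something like `write_jsonl(dump2revdocs(dump), sys.stdout)` would stream revision documents to standard output, assuming the signature shown above.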

## Types

```python { .api }
class SiteInfo:
    """Site metadata from <siteinfo> block."""
    name: str | None
    dbname: str | None
    base: str | None
    generator: str | None
    case: str | None
    namespaces: list[Namespace] | None

class Namespace:
    """Namespace information."""
    id: int
    name: str
    case: str | None

class Page:
    """
    Page metadata (inherits from mwtypes.Page).
    Contains page information and revision iterator.
    """
    id: int
    title: str
    namespace: int
    redirect: str | None
    restrictions: list[str]

class Revision:
    """
    Revision metadata and content (inherits from mwtypes.Revision).
    Contains revision information and content slots.
    """
    id: int
    timestamp: Timestamp
    user: User | None
    minor: bool
    parent_id: int | None
    comment: str | None
    deleted: Deleted
    slots: Slots

class LogItem:
    """Log entry for administrative actions (inherits from mwtypes.LogItem)."""
    id: int
    timestamp: Timestamp
    comment: str | None
    user: User | None
    page: Page | None
    type: str | None
    action: str | None
    text: str | None
    params: str | None
    deleted: Deleted

class User:
    """User information (inherits from mwtypes.User)."""
    id: int | None
    text: str | None

class Content:
    """Content metadata and text for revision slots (inherits from mwtypes.Content)."""
    role: str | None
    origin: str | None
    model: str | None
    format: str | None
    text: str | None
    sha1: str | None
    deleted: bool
    bytes: int | None
    id: str | None
    location: str | None

class Slots:
    """Container for revision content slots (inherits from mwtypes.Slots)."""
    main: Content | None
    contents: dict[str, Content]
    sha1: str | None

class Deleted:
    """Deletion status information."""
    comment: bool
    text: bool
    user: bool

class Timestamp:
    """Timestamp type from mwtypes."""
    pass

class MalformedXML(Exception):
    """Thrown when XML dump file is not formatted as expected."""
```
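
Several of these fields are optional (`user`, `slots.main`, `text`), so code that reads them usually needs a few guards. The helpers below are a hedged sketch of that pattern: `main_text` and `user_label` are hypothetical names, and the stand-in objects are built with `SimpleNamespace` to mirror the attribute names listed above, not real mwxml instances.

```python
from types import SimpleNamespace


def main_text(revision):
    """Return the main slot's text, or None if it is missing or deleted."""
    slots = getattr(revision, "slots", None)
    main = slots.main if slots is not None else None
    if main is None or main.deleted:
        return None
    return main.text


def user_label(revision, default="Anonymous"):
    """Return the user's text, falling back to a default label."""
    user = revision.user
    return user.text if user is not None and user.text else default


# Stand-in revision mirroring the attribute names in the type listing.
rev = SimpleNamespace(
    user=SimpleNamespace(id=5, text="Alice"),
    slots=SimpleNamespace(main=SimpleNamespace(text="hello", deleted=False)),
)
print(user_label(rev), main_text(rev))
```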