# Utilities and CLI Tools

Command-line utilities and functions for converting XML dumps to various formats, validating revision documents, and normalizing data structures. These tools provide additional processing capabilities beyond the core streaming API.

## Capabilities

### Dump to Revision Documents Conversion

Converts MediaWiki XML dumps to page-partitioned sequences of revision JSON documents for easier processing and analysis.

```python { .api }
def dump2revdocs(dump, verbose=False):
    """
    Converts XML dumps to page-partitioned sequences of revision JSON documents.

    This function processes each page in the dump and yields JSON representations
    of all revisions. The JSON documents contain all revision metadata and content
    in a structured format suitable for further processing or storage.

    Parameters:
    - dump: mwxml.Dump object to process
    - verbose: Print progress information to stderr (bool, default: False)
               Shows page titles and revision progress dots when enabled

    Yields: JSON strings representing revision documents (calls revision.to_json())
    """
```

**Usage Example:**

```python
import json

import mwxml
from mwxml.utilities import dump2revdocs

# Convert the dump to revision documents, with progress output on stderr
revision_docs = []
with open("dump.xml") as dump_file:
    dump = mwxml.Dump.from_file(dump_file)
    for json_doc in dump2revdocs(dump, verbose=True):
        revision_doc = json.loads(json_doc)
        revision_docs.append(revision_doc)

        # Process individual revision document
        print(f"Revision {revision_doc['id']} on page {revision_doc['page']['title']}")

# Save to file: re-open the dump, since the first pass consumed it
with open("dump.xml") as dump_file, open("revisions.jsonl", "w") as f:
    dump = mwxml.Dump.from_file(dump_file)
    for json_doc in dump2revdocs(dump):
        f.write(json_doc + "\n")
```

### Document Validation

Compares a stream of revision documents against a schema to ensure data integrity and format compliance.

```python { .api }
def validate(docs, schema, verbose=False):
    """
    Compares a stream of revision documents against a JSON schema.

    Validates revision documents to ensure they conform to expected
    structure and data types using jsonschema validation. Documents
    that fail validation will raise a ValidationError.

    Parameters:
    - docs: Iterable of revision document objects (parsed JSON)
    - schema: JSON schema definition for validation (dict)
    - verbose: Print progress information (bool, default: False)

    Yields: Validated revision documents that pass schema validation
    Raises: jsonschema.ValidationError if document doesn't match schema
    """
```

**Usage Example:**

```python
import json

import mwxml
from mwxml.utilities import validate, dump2revdocs

# Generate revision documents. validate() expects parsed documents,
# so decode the JSON strings produced by dump2revdocs()
with open("dump.xml") as dump_file:
    dump = mwxml.Dump.from_file(dump_file)
    docs = [json.loads(json_doc) for json_doc in dump2revdocs(dump)]

# Define expected schema (example)
schema = {
    "type": "object",
    "required": ["id", "timestamp", "page"],
    "properties": {
        "id": {"type": "integer"},
        "timestamp": {"type": "string"},
        "page": {
            "type": "object",
            "required": ["id", "title"],
            "properties": {
                "id": {"type": "integer"},
                "title": {"type": "string"}
            }
        }
    }
}

# Validate documents. validate() is a generator, so it must be iterated to run;
# a jsonschema.ValidationError is raised at the first document that fails
valid_docs = list(validate(docs, schema))
print(f"Validated {len(valid_docs)} documents")
```

### Document Normalization

Converts a stream of old revision documents to documents that validate against the current schema format.

```python { .api }
def normalize(rev_docs, verbose=False):
    """
    Converts a stream of old revision documents to current schema format.

    Updates revision documents from older formats to ensure compatibility
    with current processing pipelines and schema requirements.

    Parameters:
    - rev_docs: Iterable of revision documents in old format
    - verbose: Print progress information (bool, default: False)

    Yields: Normalized revision documents in current format
    """
```

**Usage Example:**

```python
import json

from mwxml.utilities import normalize

# Load old format documents (one JSON document per line)
with open("old_revisions.jsonl") as f:
    old_docs = [json.loads(line) for line in f if line.strip()]

# Normalize to current format
normalized_docs = list(normalize(old_docs))

# Save normalized documents, serializing each one back to a JSON line
with open("normalized_revisions.jsonl", "w") as f:
    for doc in normalized_docs:
        f.write(json.dumps(doc) + "\n")

print(f"Normalized {len(normalized_docs)} documents")
```

### Document Inflation

Converts a stream of flat revision documents to standard revision documents with full structure.

```python { .api }
def inflate(flat_jsons, verbose=False):
    """
    Converts flat revision documents to standard hierarchical revision documents.

    Expands compressed or flattened revision document formats by converting
    underscore-separated keys (e.g., 'page_title') into nested dictionary
    structures (e.g., {'page': {'title': ...}}).

    Parameters:
    - flat_jsons: Iterable of flat revision document objects (with underscore keys)
    - verbose: Print progress information (bool, default: False)

    Yields: Inflated revision documents with full hierarchical structure
    """
```

**Usage Example:**

```python
import json

from mwxml.utilities import inflate

# Load flat documents (one JSON document per line)
with open("flat_revisions.jsonl") as f:
    flat_docs = [json.loads(line) for line in f if line.strip()]

# Inflate to full structure
inflated_docs = list(inflate(flat_docs))

# Process inflated documents
for doc in inflated_docs:
    print(f"Revision {doc['id']}: {doc['page']['title']}")

    # Access full structure
    if 'slots' in doc and 'main' in doc['slots']:
        text = doc['slots']['main']['text']
        text_length = len(text) if text else 0
        print(f"  Text length: {text_length}")
```
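
The key expansion that `inflate` is described as performing can be pictured with a small standalone sketch. This is not the library's implementation: `inflate_doc` and `inflate_stream` are hypothetical names, and the sketch assumes every underscore in a key marks a nesting level and that no key is both a scalar and a prefix of another key.

```python
def inflate_doc(flat_doc, sep="_"):
    """Expand {'page_title': 'X'} into {'page': {'title': 'X'}} (illustrative only)."""
    nested = {}
    for flat_key, value in flat_doc.items():
        parts = flat_key.split(sep)
        node = nested
        # Walk (or create) an intermediate dict for every part except the last.
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return nested


def inflate_stream(flat_docs):
    """Lazily inflate each flat document, in the same generator style as above."""
    for flat_doc in flat_docs:
        yield inflate_doc(flat_doc)


flat = {"id": 10, "page_id": 3, "page_title": "Example"}
print(inflate_doc(flat))
# {'id': 10, 'page': {'id': 3, 'title': 'Example'}}
```

A real flat format may contain field names that themselves include underscores, so an actual implementation would probably need an explicit list of known prefixes rather than a blind split.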

## Command Line Interface

The mwxml package provides a command-line interface for accessing utilities directly from the shell. The CLI is installed automatically with the package and accessible via the `mwxml` command.

### Main CLI Entry Point

```bash
# Access help
mwxml --help

# Available subcommands:
# - dump2revdocs: XML dumps to revision documents (XML → JSON)
# - validate: Compare revision documents against schema
# - normalize: Convert old revision documents to current schema
# - inflate: Convert flat revision documents to standard format
```

**CLI Architecture:**

The CLI uses a router-based architecture where each utility function has its own subcommand. All subcommands support:
- Input from stdin or file paths
- Multithreaded processing for multiple input files
- Optional output compression (bz2 by default)
- Verbose progress reporting
- Debug logging
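
To make the router idea concrete, here is a minimal sketch of one subcommand per utility sharing a common set of options. It uses the standard library's `argparse` purely for illustration; the actual mwxml CLI may be built on a different argument-parsing library, and `build_parser` and the `mwxml-sketch` program name are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser with one subcommand per utility and shared options."""
    parser = argparse.ArgumentParser(prog="mwxml-sketch")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name in ("dump2revdocs", "validate", "normalize", "inflate"):
        sub = subparsers.add_parser(name)
        # Options common to all subcommands, mirroring the list above.
        sub.add_argument("input_files", nargs="*", help="input paths (stdin if omitted)")
        sub.add_argument("--threads", type=int, default=None)
        sub.add_argument("--output", default=None)
        sub.add_argument("--compress", default="bz2")
        sub.add_argument("--verbose", action="store_true")
        sub.add_argument("--debug", action="store_true")
        if name == "validate":
            # Only the validate subcommand needs a schema path.
            sub.add_argument("--schema", required=True)
    return parser


args = build_parser().parse_args(["normalize", "old1.jsonl", "old2.jsonl", "--threads=2"])
print(args.command, args.input_files, args.threads, args.compress)
```

Each subcommand would then dispatch to the matching utility function, which keeps the shared input, threading, and compression handling in one place.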

### dump2revdocs Command

Converts XML dumps to revision JSON documents with various output options.

```bash
# Basic usage
mwxml dump2revdocs input.xml > output.jsonl

# Multiple files with threading
mwxml dump2revdocs dump1.xml dump2.xml dump3.xml --threads=4

# Output to directory with compression
mwxml dump2revdocs *.xml --output=/path/to/output --compress=bz2

# Verbose progress output
mwxml dump2revdocs large_dump.xml --verbose

# Help for specific command
mwxml dump2revdocs --help
```

**Parameters:**
- `input-file`: Path to MediaWiki XML dump file(s) (default: stdin)
- `--threads=<num>`: Number of processor threads for multiple files (default: CPU count)
- `--output=<path>`: Output directory with one file per input (default: stdout)
- `--compress=<type>`: Compression format for output files (default: bz2)
- `--verbose`: Print progress information to stderr (shows page titles and dots)
- `--debug`: Print debug logs
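
Output written with bz2 compression can be read back in Python with the standard library. The sketch below assumes the output holds one JSON document per line, as the `.jsonl` file names in these examples suggest; `read_jsonl_bz2` and the `sample.jsonl.bz2` file name are hypothetical, and the demo writes its own small file so that it runs on its own.

```python
import bz2
import json
import os
import tempfile


def read_jsonl_bz2(path):
    """Yield one parsed document per line from a bz2-compressed JSON Lines file."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# Round-trip demo using a temporary file in place of real CLI output
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.jsonl.bz2")
    with bz2.open(path, "wt", encoding="utf-8") as f:
        f.write(json.dumps({"id": 1, "page": {"title": "Example"}}) + "\n")
    docs = list(read_jsonl_bz2(path))

print(docs)
```

Because `read_jsonl_bz2` is a generator, a large file can be processed one document at a time without loading it fully into memory.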

### validate Command

Validates a stream of JSON revision documents against a schema to ensure data integrity.

```bash
# Validate revision documents against schema
mwxml validate revisions.jsonl --schema=schema.json

# Pipe from dump2revdocs
mwxml dump2revdocs dump.xml | mwxml validate --schema=schema.json

# Multiple files with threading
mwxml validate doc1.jsonl doc2.jsonl --schema=schema.json --threads=2

# Help
mwxml validate --help
```

**Parameters:**
- `input-file`: Path to file containing JSON revision documents (default: stdin)
- `--schema=<path>`: Path to JSON schema file (required)
- `--threads=<num>`: Number of processor threads for multiple files
- `--output=<path>`: Output directory for validated documents
- `--compress=<type>`: Compression format for output (default: bz2)
- `--verbose`: Print progress information
- `--debug`: Print debug logs

### normalize Command

Converts old revision document formats to current schema-compliant format.

```bash
# Normalize old format documents
mwxml normalize old_revisions.jsonl > normalized.jsonl

# With compression
mwxml normalize old_revisions.jsonl --output=./normalized/ --compress=bz2

# Multiple files
mwxml normalize old1.jsonl old2.jsonl --threads=2

# Help
mwxml normalize --help
```

**Parameters:**
- `input-file`: Path to file containing old format revision documents (default: stdin)
- `--threads=<num>`: Number of processor threads for multiple files
- `--output=<path>`: Output directory for normalized documents
- `--compress=<type>`: Compression format for output (default: bz2)
- `--verbose`: Print progress information (shows ! for changed docs, . for unchanged)
- `--debug`: Print debug logs
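
The `!` and `.` progress markers can be pictured as a thin wrapper around a document stream. This is only a sketch of the reporting pattern, not mwxml's code: `with_progress` is a hypothetical helper, and the sample documents are made-up placeholders rather than real old-format revisions.

```python
import io
import sys


def with_progress(pairs, verbose=False, stream=sys.stderr):
    """Yield the new document from (old, new) pairs, writing '!' or '.' for each."""
    for old_doc, new_doc in pairs:
        if verbose:
            # '!' marks a document that changed, '.' one that did not.
            stream.write("!" if new_doc != old_doc else ".")
            stream.flush()
        yield new_doc
    if verbose:
        stream.write("\n")


# Demo: capture the markers in a buffer instead of writing to stderr
buf = io.StringIO()
pairs = [
    ({"a": 1}, {"a": 1}),                      # unchanged
    ({"user": "x"}, {"user": {"text": "x"}}),  # changed
]
result = list(with_progress(pairs, verbose=True, stream=buf))
print(repr(buf.getvalue()))
# '.!\n'
```

Writing markers to stderr keeps them out of the document stream on stdout, so verbose mode stays safe when output is piped into another command.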

### inflate Command

Converts flat revision documents (with underscore-separated keys) to hierarchical format.

```bash
# Inflate flat documents
mwxml inflate flat_revisions.jsonl > full_revisions.jsonl

# With output directory
mwxml inflate flat_revisions.jsonl --output=./inflated/

# Multiple files with threading
mwxml inflate flat1.jsonl flat2.jsonl --threads=2 --verbose

# Help
mwxml inflate --help
```

**Parameters:**
- `input-file`: Path to file containing flat revision documents (default: stdin)
- `--threads=<num>`: Number of processor threads for multiple files
- `--output=<path>`: Output directory for inflated documents
- `--compress=<type>`: Compression format for output (default: bz2)
- `--verbose`: Print progress information
- `--debug`: Print debug logs

## Integration Examples

### Processing Pipeline

```python
import json

import jsonschema
import mwxml
from mwxml.utilities import dump2revdocs, validate, normalize

# Complete processing pipeline
def process_dump_pipeline(xml_file, schema):
    """Complete dump processing with validation and normalization."""

    # Step 1: Load dump
    with open(xml_file) as dump_file:
        dump = mwxml.Dump.from_file(dump_file)

        # Step 2: Convert to JSON documents and parse them for validation
        print("Converting to JSON documents...")
        docs = [json.loads(json_doc) for json_doc in dump2revdocs(dump, verbose=True)]

    # Step 3: Validate documents (validate() raises on the first invalid document)
    print("Validating documents...")
    try:
        valid_docs = list(validate(docs, schema))
    except jsonschema.ValidationError as error:
        print(f"Validation failed: {error.message}")
        return None
    print("All documents valid!")

    # Step 4: Normalize if needed
    print("Normalizing documents...")
    return list(normalize(valid_docs))

# Usage
schema = {"type": "object", "required": ["id", "timestamp"]}
results = process_dump_pipeline("dump.xml", schema)
```

### Batch Processing with CLI

```bash
#!/bin/bash
# Batch processing script

# Convert all XML dumps to JSON
for dump in *.xml; do
    echo "Processing $dump"
    mwxml dump2revdocs "$dump" --compress=bz2 --output=./json_output/
done

# Validate all generated JSON files
for json_file in json_output/*.jsonl.bz2; do
    echo "Validating $json_file"
    bzcat "$json_file" | mwxml validate --schema=revision_schema.json
done

echo "Batch processing complete"
```