
# lxml


A comprehensive Python library for processing XML and HTML documents. lxml combines the speed and feature completeness of libxml2 and libxslt with the simplicity of Python's ElementTree API, providing fast, standards-compliant XML/HTML processing with extensive validation, transformation, and manipulation capabilities.


## Package Information

- **Package Name**: lxml
- **Language**: Python
- **Installation**: `pip install lxml`
- **Documentation**: https://lxml.de/
- **Requirements**: Python 3.8+


## Core Imports


The library provides multiple APIs optimized for different use cases:

```python
# Core XML/HTML processing (ElementTree-compatible)
from lxml import etree

# Object-oriented XML API with Python data type mapping
from lxml import objectify

# HTML-specific processing with form/link handling
from lxml import html

# Schema validation
from lxml.isoschematron import Schematron

# CSS selector support
from lxml.cssselect import CSSSelector
```


## Basic Usage


### XML Processing

```python
from lxml import etree

# Parse XML from string
xml_data = """
<bookstore>
  <book id="1">
    <title>Python Guide</title>
    <author>Jane Smith</author>
    <price>29.99</price>
  </book>
  <book id="2">
    <title>XML Processing</title>
    <author>John Doe</author>
    <price>34.95</price>
  </book>
</bookstore>
"""

root = etree.fromstring(xml_data)

# Find elements using XPath
books = root.xpath('//book[@id="1"]')
print(books[0].find('title').text)  # "Python Guide"

# Create new elements
new_book = etree.SubElement(root, 'book', id="3")
etree.SubElement(new_book, 'title').text = "Advanced Topics"
etree.SubElement(new_book, 'author').text = "Alice Johnson"
etree.SubElement(new_book, 'price').text = "39.99"

# Serialize back to XML
print(etree.tostring(root, pretty_print=True, encoding='unicode'))
```


### HTML Processing

```python
from lxml import html

# Parse HTML
html_content = """
<html>
  <head><title>Example Page</title></head>
  <body>
    <form action="/submit" method="post">
      <input type="text" name="username" value="john">
      <input type="password" name="password">
      <button type="submit">Login</button>
    </form>
    <div class="content">
      <a href="https://example.com">External Link</a>
      <a href="/internal">Internal Link</a>
    </div>
  </body>
</html>
"""

doc = html.fromstring(html_content)

# Find form elements
form = doc.forms[0]
print(form.fields)  # Dict-like mapping of field names to values

# Process links. The element method rewrites links in place; the
# module-level html.make_links_absolute() returns a modified copy instead.
doc.make_links_absolute('https://mysite.com')
for element, attribute, link, pos in doc.iterlinks():
    print(f"{element.tag}.{attribute}: {link}")
```


### Object-Oriented API

```python
from lxml import objectify

# Parse XML into Python objects
xml_data = """
<data>
  <items>
    <item>
      <name>Widget</name>
      <price>19.99</price>
      <available>true</available>
    </item>
  </items>
</data>
"""

root = objectify.fromstring(xml_data)

# Access children as Python attributes. "items" clashes with the
# Element.items() method, so that child is looked up by name instead.
item = root["items"].item
print(item.name)       # "Widget"
print(item.price)      # 19.99 (automatically converted to float)
print(item.available)  # True (automatically converted to bool)

# Add new data
item.category = "Electronics"
print(objectify.dump(root))
```


## Architecture


lxml provides multiple complementary APIs built on a common foundation:

- **etree**: Low-level ElementTree-compatible API for precise XML/HTML control
- **objectify**: High-level Pythonic API with automatic type conversion
- **html**: Specialized HTML processing with web-specific features
- **Validation**: Multiple schema languages (DTD, RelaxNG, XML Schema, Schematron)
- **Processing**: XPath queries, XSLT transformations, canonicalization


The library's Cython implementation provides C-level performance while maintaining Python's ease of use, making it suitable for both simple scripts and high-performance applications processing large XML documents.


## Capabilities


### Core XML/HTML Processing


Low-level ElementTree-compatible API providing comprehensive XML and HTML parsing, manipulation, and serialization with full namespace support, error handling, and memory-efficient processing.

```python { .api }
# Parsing functions
def parse(source, parser=None, base_url=None): ...
def fromstring(text, parser=None, base_url=None): ...
def XML(text, parser=None, base_url=None): ...
def HTML(text, parser=None, base_url=None): ...

# Core classes
class Element: ...
class ElementTree: ...
class XMLParser: ...
class HTMLParser: ...

# Serialization
def tostring(element_or_tree, encoding=None, method='xml', pretty_print=False): ...
```


[Core XML/HTML Processing](./etree-core.md)
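
As a minimal sketch of how these parsing and serialization functions fit together, the example below parses the same bytes three ways and serializes the result. The sample document is made up for illustration.

```python
from io import BytesIO
from lxml import etree

# Hypothetical sample document used only for this sketch.
xml_bytes = b"<root><child>text</child><!-- note --></root>"

# fromstring() returns the root Element directly.
root = etree.fromstring(xml_bytes)

# parse() reads from a filename or file-like object and returns an ElementTree.
tree = etree.parse(BytesIO(xml_bytes))

# A configured parser can be passed to either function.
parser = etree.XMLParser(remove_comments=True)
clean_root = etree.fromstring(xml_bytes, parser)

# tostring() returns bytes by default and str with encoding="unicode".
serialized = etree.tostring(clean_root, encoding="unicode")
print(serialized)  # <root><child>text</child></root>
```

Passing a parser explicitly keeps the configuration local to one call instead of changing how other code parses documents.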


### Object-Oriented XML API


Pythonic XML processing that automatically converts XML data to Python objects with proper data types, providing intuitive attribute-based access and manipulation while maintaining full XML structure.

```python { .api }
# Parsing functions
def parse(source, parser=None, base_url=None): ...
def fromstring(text, parser=None, base_url=None): ...

# Core classes
class ObjectifiedElement: ...
class DataElement: ...
class ElementMaker: ...

# Type annotation functions
def annotate(element_or_tree, **kwargs): ...
def deannotate(element_or_tree, **kwargs): ...
```


[Object-Oriented XML API](./objectify-api.md)
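
The sketch below illustrates typed access and the annotation functions on a small, made-up `<order>` document. It assumes the default type guessing, under which `3` is treated as an integer.

```python
from lxml import etree, objectify

# Hypothetical sample document used only for this sketch.
root = objectify.fromstring(
    "<order><quantity>3</quantity><note>rush</note></order>"
)

# Numeric leaf elements take part in ordinary Python arithmetic.
total = root.quantity + 2

# annotate() records the guessed Python type of each leaf as an attribute.
objectify.annotate(root)
annotated = etree.tostring(root, encoding="unicode")

# deannotate() strips those attributes again; cleanup_namespaces=True also
# drops the namespace declarations that are no longer used.
objectify.deannotate(root, cleanup_namespaces=True)
plain = etree.tostring(root, encoding="unicode")

print(total)
print(annotated)
print(plain)
```

Annotations are mainly useful when a document is serialized and later re-read, since they let the type survive the round trip instead of being guessed again.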


### HTML Processing


Specialized HTML document processing with web-specific features including form handling, link processing, CSS class manipulation, and HTML5 parsing support.

```python { .api }
# HTML parsing
def parse(filename_or_url, parser=None, base_url=None): ...
def fromstring(html, base_url=None, parser=None): ...
def document_fromstring(html, parser=None, ensure_head_body=False): ...

# Link processing
def make_links_absolute(element, base_url=None): ...
def iterlinks(element): ...
def rewrite_links(element, link_repl_func): ...

# Form handling
def submit_form(form, extra_values=None, open_http=None): ...
```


[HTML Processing](./html-processing.md)
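
A short sketch of the link-processing calls, using the element methods, which modify the tree in place. The URLs and the host replacement are hypothetical.

```python
from lxml import html

# Hypothetical fragment used only for this sketch.
doc = html.fromstring(
    '<div><a href="/docs">Docs</a> <a href="https://example.com/">Ext</a></div>'
)

# Resolve relative links against a base URL, in place.
doc.make_links_absolute("https://mysite.example/")

# Rewrite every link through a callback; here one host is swapped for another.
doc.rewrite_links(lambda link: link.replace("example.com", "example.org"))

# iterlinks() yields (element, attribute, link, position) tuples.
links = [link for _element, _attribute, link, _pos in doc.iterlinks()]
print(links)
```

The module-level functions of the same names accept either a string or an element and work on a copy, so the method form is the simpler choice when the parsed tree should be updated directly.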


### Schema Validation


Comprehensive XML validation support including DTD, RelaxNG, W3C XML Schema, and ISO Schematron with detailed error reporting and custom validation rules.

```python { .api }
class DTD: ...
class RelaxNG: ...
class XMLSchema: ...

# Schematron validation
class Schematron: ...
def extract_xsd(element): ...
def extract_rng(element): ...
```


[Schema Validation](./validation.md)
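
As an illustration of the validator classes, the sketch below checks two made-up documents against a one-element W3C XML Schema. `DTD` and `RelaxNG` validators expose the same `validate()`, `assertValid()`, and `error_log` interface.

```python
from lxml import etree

# Hypothetical schema: a single <price> element holding a decimal.
schema_doc = etree.fromstring(b"""
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="price" type="xs:decimal"/>
</xs:schema>
""")
schema = etree.XMLSchema(schema_doc)

valid_doc = etree.fromstring("<price>29.99</price>")
invalid_doc = etree.fromstring("<price>cheap</price>")

# validate() returns a boolean and never raises for an invalid document.
is_valid = schema.validate(valid_doc)
is_invalid_ok = schema.validate(invalid_doc)

# error_log holds the details of the most recent validation run.
error_count = len(schema.error_log)

# assertValid() raises DocumentInvalid instead of returning False.
try:
    schema.assertValid(invalid_doc)
    raised = False
except etree.DocumentInvalid as exc:
    raised = True
    print(f"Validation failed: {exc}")
```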


### XPath and XSLT Processing

Advanced XML querying and transformation capabilities with XPath 1.0 evaluation, XSLT 1.0 stylesheets, extension functions, and namespace handling.

```python { .api }
class XPath: ...
class XPathEvaluator: ...
class XSLT: ...
class XSLTAccessControl: ...

# Utility functions
def canonicalize(xml_data, **options): ...
```


[XPath and XSLT Processing](./xpath-xslt.md)
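
The following sketch shows a precompiled XPath expression with a variable and a small XSLT 1.0 transform. The bookstore data and the stylesheet are invented for the example.

```python
from lxml import etree

# Hypothetical sample data used only for this sketch.
root = etree.fromstring(
    "<bookstore>"
    "<book id='1'><title>Python Guide</title></book>"
    "<book id='2'><title>XML Processing</title></book>"
    "</bookstore>"
)

# Compile the expression once; $book_id is supplied as a keyword per call.
find_title = etree.XPath("//book[@id=$book_id]/title/text()")
titles = find_title(root, book_id="2")
print(titles)  # ['XML Processing']

# A text-output stylesheet that emits each title followed by a semicolon.
stylesheet = etree.fromstring(b"""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="bookstore/book">
      <xsl:value-of select="title"/>
      <xsl:text>;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
""")
transform = etree.XSLT(stylesheet)
result = transform(root)
result_text = str(result)
print(result_text)
```

Passing values as XPath variables avoids building expressions by string concatenation, which is both slower and easy to get wrong when values contain quotes.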


### Utility Modules


Additional functionality including SAX interface compatibility, CSS selector support, element builders, XInclude processing, and namespace management.

```python { .api }
# SAX interface
class ElementTreeContentHandler: ...
def saxify(element_or_tree, content_handler): ...

# CSS selectors
class CSSSelector: ...

# Element builders
class ElementMaker: ...

# Development utilities
def get_include(): ...
```


[Utility Modules](./utility-modules.md)
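
A brief sketch combining two of these utilities: an `ElementMaker` from `lxml.builder` builds a small tree, and `lxml.sax` replays it as SAX events into a handler that rebuilds an equivalent tree. The element names are illustrative only.

```python
from lxml import etree
from lxml.builder import ElementMaker
from lxml.sax import ElementTreeContentHandler, saxify

# Build a tree with the element factory; keyword arguments become attributes.
E = ElementMaker()
doc = E.bookstore(
    E.book(E.title("Python Guide"), id="1"),
)

# Replay the tree as SAX events into a handler that constructs a new tree.
handler = ElementTreeContentHandler()
saxify(doc, handler)
rebuilt_root = handler.etree.getroot()

print(etree.tostring(rebuilt_root, encoding="unicode"))
```

The same handler can receive events from any SAX producer, which makes it a convenient bridge from SAX-based code to lxml trees.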


## Error Handling


lxml provides a comprehensive exception hierarchy for precise error handling:

```python { .api }
class LxmlError(Exception): ...
class XMLSyntaxError(LxmlError): ...
class DTDError(LxmlError): ...
class RelaxNGError(LxmlError): ...
class XMLSchemaError(LxmlError): ...
class XPathError(LxmlError): ...
class XSLTError(LxmlError): ...
```

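Parsing and processing functions raise specific exceptions with detailed error messages and, when available, line number information. Validators report an invalid document by returning `False` from `validate()` or by raising `DocumentInvalid` from `assertValid()`, with details recorded in their `error_log`.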
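
A minimal sketch of catching these exceptions. The helper names `safe_parse` and `safe_xpath` are hypothetical and not part of lxml.

```python
from lxml import etree

def safe_parse(text):
    """Return the root element, or None if the text is not well-formed XML."""
    try:
        return etree.fromstring(text)
    except etree.XMLSyntaxError as exc:
        print(f"Parse failed (line {exc.lineno}): {exc}")
        return None

def safe_xpath(element, expression):
    """Return the XPath result, or an empty list if the expression is invalid."""
    try:
        return element.xpath(expression)
    except etree.XPathError as exc:
        print(f"XPath failed: {exc}")
        return []

root = safe_parse("<root><child/></root>")
broken = safe_parse("<root><child></root>")  # mismatched tags
matches = safe_xpath(root, "//child")
bad = safe_xpath(root, "//child[")           # unterminated predicate
```

Catching the specific subclass keeps unrelated failures visible. Catching `etree.LxmlError` instead covers the library's error classes in one handler.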


## Types


### Core Types

```python { .api }
class Element:
    """XML element with tag, attributes, text content, and children."""
    tag: str
    text: str | None
    tail: str | None
    attrib: dict[str, str]

    def find(self, path: str, namespaces: dict[str, str] = None) -> Element | None: ...
    def findall(self, path: str, namespaces: dict[str, str] = None) -> list[Element]: ...
    def xpath(self, path: str, **kwargs) -> list: ...
    def get(self, key: str, default: str = None) -> str | None: ...
    def set(self, key: str, value: str) -> None: ...

class ElementTree:
    """Document tree with root element and document-level operations."""
    def getroot(self) -> Element: ...
    def write(self, file, encoding: str = None, xml_declaration: bool = None): ...
    def xpath(self, path: str, **kwargs) -> list: ...

class QName:
    """Qualified name with namespace URI and local name."""
    def __init__(self, text_or_uri_or_element, tag: str = None): ...
    localname: str
    namespace: str | None
    text: str
```
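
To illustrate how these types interact, the sketch below builds a namespaced element with `QName` and reads it back through `find()`, `get()`, and the `QName` attributes. The namespace URI and the `c` prefix are made up.

```python
from lxml import etree

# Hypothetical namespace used only for this sketch.
ns = "http://example.com/ns"

# QName objects can be used wherever a tag name is expected.
root = etree.Element(etree.QName(ns, "catalog"), nsmap={"c": ns})
item = etree.SubElement(root, etree.QName(ns, "item"))
item.set("sku", "A1")
item.text = "Widget"

# A QName built from an element splits its tag into namespace and local name.
qname = etree.QName(item)
print(qname.localname)  # item
print(qname.namespace)  # http://example.com/ns
print(qname.text)       # {http://example.com/ns}item

# find() resolves prefixes through the namespaces mapping.
found = root.find("c:item", namespaces={"c": ns})
print(found.get("sku"), found.text)
```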


### Parser Types

```python { .api }
class XMLParser:
    """Configurable XML parser with validation and error handling options."""
    def __init__(self, encoding: str = None, remove_blank_text: bool = False,
                 remove_comments: bool = False, remove_pis: bool = False,
                 strip_cdata: bool = True, recover: bool = False, **kwargs): ...

class HTMLParser:
    """Lenient HTML parser with automatic error recovery."""
    def __init__(self, encoding: str = None, remove_blank_text: bool = False,
                 remove_comments: bool = False, **kwargs): ...

ParserType = XMLParser | HTMLParser
```
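
A short sketch of how a few of these parser options change the result, using small made-up inputs.

```python
from lxml import etree

# remove_blank_text drops ignorable whitespace between elements.
indented = "<root>\n  <a>1</a>\n  <b>2</b>\n</root>"
compact_parser = etree.XMLParser(remove_blank_text=True)
compact_root = etree.fromstring(indented, compact_parser)
compact = etree.tostring(compact_root)
print(compact)  # b'<root><a>1</a><b>2</b></root>'

# recover=True makes the parser try to continue past well-formedness errors
# instead of raising XMLSyntaxError.
lenient_parser = etree.XMLParser(recover=True)
recovered_root = etree.fromstring("<root><a>1</a><b>2</root>", lenient_parser)

# HTMLParser is lenient by design and wraps fragments in a full document.
html_root = etree.fromstring("<p>Hello<br>world", etree.HTMLParser())
paragraph = html_root.find(".//p")
```

Recovery is best reserved for input that cannot be fixed at the source, since the parser has to guess at the intended structure.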