# Parsers and Emitters

Low-level classes that give direct control over parsing and emission, for custom conversion workflows and specialized markup processing. All conversion functions are built on these classes, which expose each step of parsing and output generation.

## Capabilities

### Creole Parser

Parse Creole markup into document tree structure for processing.

```python { .api }
class CreoleParser:
    def __init__(self, markup_string: str, block_rules: tuple = None,
                 blog_line_breaks: bool = True, debug: bool = False): ...
    def parse(self) -> DocNode: ...
```

**Parameters:**
- `markup_string`: Creole markup text to parse
- `block_rules`: Custom block-level parsing rules
- `blog_line_breaks`: Use blog-style (`True`) or wiki-style (`False`) line breaks
- `debug`: Enable debug output

**Usage Examples:**

```python
from creole.parser.creol2html_parser import CreoleParser
from creole.parser.creol2html_rules import BlockRules

# Basic parsing
parser = CreoleParser("This is **bold** text")
document = parser.parse()

# Sample markup reused by the examples below
markup = "= Heading =\n\nThis is **bold** text"

# Custom block rules
custom_rules = BlockRules()
parser = CreoleParser(markup, block_rules=custom_rules)
document = parser.parse()

# Debug mode
parser = CreoleParser(markup, debug=True)
document = parser.parse()
document.debug()  # Print document tree structure
```

### HTML Parser

Parse HTML markup into document tree structure for conversion to other formats.

```python { .api }
class HtmlParser:
    def __init__(self, debug: bool = False): ...
    def feed(self, html_string: str) -> DocNode: ...
    def debug(self): ...
```

**Parameters:**
- `debug`: Enable debug output and tree visualization

**Usage Examples:**

```python
from creole.parser.html_parser import HtmlParser

# Basic HTML parsing
parser = HtmlParser()
document = parser.feed('<p>Hello <strong>world</strong></p>')

# Debug mode
html_content = '<h1>Title</h1><p>Some <em>text</em></p>'  # sample input
parser = HtmlParser(debug=True)
document = parser.feed(html_content)
parser.debug()  # Print parsing debug information
```
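
To make the "HTML in, tree out" step more concrete, here is a minimal sketch built only on the standard library's `html.parser.HTMLParser`. It is not the library's `HtmlParser`: `SketchNode`, `TreeBuilder`, and `outline` are hypothetical names introduced for illustration, and the sketch assumes well-nested input without void elements such as `<br>`.

```python
from html.parser import HTMLParser


class SketchNode:
    """Hypothetical stand-in for a document tree node (not the library's DocNode)."""

    def __init__(self, kind, parent=None, content=None):
        self.kind = kind
        self.parent = parent
        self.content = content
        self.children = []
        if parent is not None:
            parent.children.append(self)


class TreeBuilder(HTMLParser):
    """Builds a SketchNode tree from start tags, end tags, and text."""

    def __init__(self):
        super().__init__()
        self.root = SketchNode("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # Descend: the new node becomes the parent of whatever follows
        self.current = SketchNode(tag, parent=self.current)

    def handle_endtag(self, tag):
        # Ascend, but never above the root
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        SketchNode("data", parent=self.current, content=data)


def outline(node, depth=0):
    """Return one indented line per node, depth first."""
    label = node.kind if node.content is None else f"{node.kind}: {node.content!r}"
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


builder = TreeBuilder()
builder.feed("<p>Hello <strong>world</strong></p>")
builder.close()
print("\n".join(outline(builder.root)))
# document
#   p
#     data: 'Hello '
#     strong
#       data: 'world'
```

The tree keeps text in leaf nodes and structure in the parent/child links, which is the shape an emitter needs in order to walk the document and produce another markup.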

### HTML Emitter

Convert document tree to HTML output with macro support and formatting options.

```python { .api }
class HtmlEmitter:
    def __init__(self, document: DocNode, macros: dict = None,
                 verbose: int = None, stderr=None, strict: bool = False): ...
    def emit(self) -> str: ...
```

**Parameters:**
- `document`: Document tree to convert
- `macros`: Dictionary of macro functions
- `verbose`: Verbosity level for output
- `stderr`: Error output stream
- `strict`: Enable strict Creole 1.0 compliance

**Usage Examples:**

```python
from creole.emitter.creol2html_emitter import HtmlEmitter
from creole.parser.creol2html_parser import CreoleParser

# Parse and emit HTML
parser = CreoleParser("**bold** text")
document = parser.parse()
emitter = HtmlEmitter(document)
html = emitter.emit()

# With macros
def code_macro(ext, text):
    return f'<pre><code class="{ext}">{text}</code></pre>'

macros = {'code': code_macro}
emitter = HtmlEmitter(document, macros=macros)
html = emitter.emit()

# Strict mode
emitter = HtmlEmitter(document, strict=True)
html = emitter.emit()
```
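
The idea behind a macro dictionary can be sketched without the library: a macro name is looked up in the dictionary and the matching function is called with the macro's arguments and body. The snippet below is an illustration under assumptions, not the emitter's actual implementation. `expand_macros`, `MACRO_RE`, and `ARG_RE` are hypothetical, and the `<<name key="value">>body<</name>>` call syntax handled here is a simplified form.

```python
import re
from html import escape

# Matches <<name key="value" ...>>body<</name>> (simplified, assumed syntax)
MACRO_RE = re.compile(r"<<(\w+)([^>]*)>>(.*?)<</\1>>", re.DOTALL)
ARG_RE = re.compile(r'(\w+)="([^"]*)"')


def code_macro(ext, text):
    """Same shape as the code_macro above, with the values HTML-escaped."""
    return f'<pre><code class="{escape(ext)}">{escape(text)}</code></pre>'


def expand_macros(markup, macros):
    """Replace each known macro call with the return value of its function."""

    def replace(match):
        name, raw_args, body = match.groups()
        func = macros.get(name)
        if func is None:
            return match.group(0)  # leave unknown macros untouched
        kwargs = dict(ARG_RE.findall(raw_args))
        return func(text=body, **kwargs)

    return MACRO_RE.sub(replace, markup)


macros = {"code": code_macro}
source = '<<code ext="py">>print(1 < 2)<</code>>'
print(expand_macros(source, macros))
# <pre><code class="py">print(1 &lt; 2)</code></pre>
```

Escaping the macro body is a design choice of this sketch: a macro returns raw HTML, so any text it passes through should be escaped by the macro itself.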

### Creole Emitter

Convert document tree to Creole markup output with unknown tag handling.

```python { .api }
class CreoleEmitter:
    def __init__(self, document: DocNode, debug: bool = False,
                 unknown_emit=None, strict: bool = False): ...
    def emit(self) -> str: ...
```

**Parameters:**
- `document`: Document tree to convert
- `debug`: Enable debug output
- `unknown_emit`: Handler function for unknown HTML tags
- `strict`: Enable strict Creole output mode

**Usage Examples:**

```python
from creole.emitter.html2creole_emitter import CreoleEmitter
from creole.parser.html_parser import HtmlParser
from creole.shared.unknown_tags import transparent_unknown_nodes

# Parse HTML and emit Creole
parser = HtmlParser()
document = parser.feed('<p><strong>bold</strong> text</p>')
emitter = CreoleEmitter(document)
creole = emitter.emit()

# Handle unknown tags
emitter = CreoleEmitter(document, unknown_emit=transparent_unknown_nodes)
creole = emitter.emit()

# Debug mode
emitter = CreoleEmitter(document, debug=True)
creole = emitter.emit()
```

### ReStructuredText Emitter

Convert document tree to ReStructuredText markup with reference link handling.

```python { .api }
class ReStructuredTextEmitter:
    def __init__(self, document: DocNode, debug: bool = False,
                 unknown_emit=None): ...
    def emit(self) -> str: ...
```

**Parameters:**
- `document`: Document tree to convert
- `debug`: Enable debug output
- `unknown_emit`: Handler function for unknown HTML tags

**Usage Examples:**

```python
from creole.emitter.html2rest_emitter import ReStructuredTextEmitter
from creole.parser.html_parser import HtmlParser

# Parse HTML and emit ReStructuredText
parser = HtmlParser()
document = parser.feed('<h1>Title</h1><p>Content with <a href="http://example.com">link</a></p>')
emitter = ReStructuredTextEmitter(document)
rest = emitter.emit()
# Returns ReStructuredText with proper heading underlines and reference links
```
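
For readers unfamiliar with the two ReStructuredText constructs mentioned in the comment above, the following sketch shows what a heading underline and a reference link look like. It is only an illustration: `rest_heading`, `rest_reference_link`, and the level-to-character mapping are assumptions of this sketch, and the emitter may choose different underline characters or link layout.

```python
# Assumed mapping from heading level to underline character
UNDERLINE_CHARS = {1: "=", 2: "-", 3: "~"}


def rest_heading(title, level=1):
    """Return the title followed by an underline of the same length."""
    char = UNDERLINE_CHARS[level]
    return f"{title}\n{char * len(title)}"


def rest_reference_link(text, url):
    """Return the inline reference and the target line it points to."""
    inline = f"`{text}`_"
    target = f".. _{text}: {url}"
    return inline, target


print(rest_heading("Title"))
# Title
# =====

inline, target = rest_reference_link("link", "http://example.com")
print(f"Content with {inline}")
print(target)
# Content with `link`_
# .. _link: http://example.com
```

Reference links separate the link text from its target, which is why an emitter has to collect the targets while walking a paragraph and write them out afterwards.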

### Textile Emitter

Convert document tree to Textile markup format.

```python { .api }
class TextileEmitter:
    def __init__(self, document: DocNode, debug: bool = False,
                 unknown_emit=None): ...
    def emit(self) -> str: ...
```

**Parameters:**
- `document`: Document tree to convert
- `debug`: Enable debug output
- `unknown_emit`: Handler function for unknown HTML tags

**Usage Examples:**

```python
from creole.emitter.html2textile_emitter import TextileEmitter
from creole.parser.html_parser import HtmlParser

# Parse HTML and emit Textile
parser = HtmlParser()
document = parser.feed('<p><strong>bold</strong> and <em>italic</em></p>')
emitter = TextileEmitter(document)
textile = emitter.emit()
# Returns: '*bold* and __italic__'
```

## Document Tree Structure

### DocNode Class

`DocNode` is the node type of the document tree. Each node represents one markup element, and the parent and child links between nodes encode the document hierarchy.

```python { .api }
class DocNode:
    def __init__(self, kind: str = None, parent = None): ...
    def debug(self): ...
    def append(self, child): ...
    def get_text(self) -> str: ...
```

**Properties:**
- `kind`: Node type (e.g., 'document', 'paragraph', 'strong', 'link')
- `parent`: Parent node reference
- `children`: List of child nodes
- `content`: Text content for leaf nodes
- `attrs`: Dictionary of node attributes

**Usage Examples:**

```python
from creole.shared.document_tree import DocNode

# Create document structure
doc = DocNode('document')
para = DocNode('paragraph', parent=doc)
doc.append(para)

bold = DocNode('strong', parent=para)
bold.content = 'bold text'
para.append(bold)

# Debug tree structure
doc.debug()
```
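
Because the tree is defined by the `kind`, `children`, and `content` properties listed above, generic helpers only need those three attributes. The sketch below uses a hypothetical `SketchNode` dataclass as a stand-in so that it runs without the library. `walk` and `collect_text` are illustrative helpers and not part of the documented API.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class SketchNode:
    """Hypothetical stand-in exposing kind, content, and children."""
    kind: str
    content: Optional[str] = None
    children: List["SketchNode"] = field(default_factory=list)


def walk(node) -> Iterator:
    """Yield a node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def collect_text(node) -> str:
    """Concatenate the content of every node that carries text."""
    return "".join(n.content for n in walk(node) if n.content is not None)


doc = SketchNode("document", children=[
    SketchNode("paragraph", children=[
        SketchNode("text", content="This is "),
        SketchNode("strong", children=[SketchNode("text", content="bold")]),
        SketchNode("text", content=" text"),
    ]),
])

print([n.kind for n in walk(doc)])
# ['document', 'paragraph', 'text', 'strong', 'text', 'text']
print(collect_text(doc))
# This is bold text
```

Since `walk` relies only on a `children` attribute, the same helper should work on any node object that exposes the properties listed above.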

## Advanced Usage Patterns

### Custom Parser-Emitter Workflow

```python
from creole.parser.creol2html_parser import CreoleParser
from creole.emitter.html2rest_emitter import ReStructuredTextEmitter

# Parse Creole and emit ReStructuredText directly
# (this assumes the emitter understands the node kinds produced by the Creole parser)
parser = CreoleParser("= Heading =\n\nThis is **bold** text")
document = parser.parse()
emitter = ReStructuredTextEmitter(document)
rest_output = emitter.emit()
```

### Document Tree Manipulation

```python
from creole.parser.creol2html_parser import CreoleParser
from creole.emitter.creol2html_emitter import HtmlEmitter

# Parse, modify, and emit
parser = CreoleParser("Some **bold** text")
document = parser.parse()

def walk(node):
    """Yield a node and all of its descendants."""
    yield node
    for child in node.children:
        yield from walk(child)

# Modify document tree; inline nodes are nested inside block nodes,
# so walk the whole tree instead of only the top-level children
for node in walk(document):
    if node.kind == 'strong':
        node.kind = 'emphasis'  # Change bold to italic

emitter = HtmlEmitter(document)
modified_html = emitter.emit()
```

## HTML Processing Utilities

### HTML Entity Decoder

Utility class for converting HTML entities to Unicode characters.

```python { .api }
class Deentity:
    def __init__(self): ...
    def replace_all(self, content: str) -> str: ...
    def replace_number(self, text: str) -> str: ...
    def replace_hex(self, text: str) -> str: ...
    def replace_named(self, text: str) -> str: ...
```

**Usage Examples:**

```python
from creole.html_tools.deentity import Deentity

# Create decoder instance
decoder = Deentity()

# Convert all types of HTML entities
html_text = "&lt;p&gt;Hello &amp; welcome &#8212; &#x2014; &nbsp;"
clean_text = decoder.replace_all(html_text)
# Returns: '<p>Hello & welcome — — \xa0'

# Convert specific entity types
decoder.replace_number("62")  # Returns: '>'
decoder.replace_hex("3E")     # Returns: '>'
decoder.replace_named("amp")  # Returns: '&'
```
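
For comparison, the standard library's `html.unescape` decodes the same three entity forms (named, decimal, and hexadecimal). The snippet is a stdlib-only point of reference and does not use `Deentity`. The sample string is an assumption chosen to cover all three forms.

```python
from html import unescape

# Named (&lt; &gt; &amp; &nbsp;), decimal (&#8212;) and hexadecimal (&#x2014;) entities
text = "&lt;p&gt;Hello &amp; welcome &#8212; &#x2014; &nbsp;"
decoded = unescape(text)
print(repr(decoded))
# '<p>Hello & welcome — — \xa0'
```

Both numeric forms decode to the same character here, and the non-breaking space shows up as `\xa0` in the `repr` output.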

### HTML Whitespace Stripper

Remove unnecessary whitespace from HTML while preserving structure.

```python { .api }
def strip_html(html_code: str) -> str: ...
```

**Usage Examples:**

```python
from creole.html_tools.strip_html import strip_html

# Clean up HTML whitespace
messy_html = ' <p> one \n two </p>'
clean_html = strip_html(messy_html)
# Returns: '<p>one two</p>'

# Preserves important spacing around inline elements
html = 'one <i>two \n <strong> \n three \n </strong></i>'
clean = strip_html(html)
# Returns: 'one <i>two <strong>three</strong> </i>'
```

## Unknown Tag Handlers

Functions for handling unknown HTML tags during conversion.

```python { .api }
def raise_unknown_node(emitter, node): ...
def use_html_macro(emitter, node): ...
def preformat_unknown_nodes(emitter, node): ...
def escape_unknown_nodes(emitter, node): ...
def transparent_unknown_nodes(emitter, node): ...
```

**Usage Examples:**

```python
from creole.shared.unknown_tags import (
    transparent_unknown_nodes, escape_unknown_nodes,
    raise_unknown_node, use_html_macro
)
from creole import html2creole

# Different ways to handle unknown tags
html = '<p>Text with <unknown>content</unknown></p>'

# Remove tags, keep content (default)
creole = html2creole(html, unknown_emit=transparent_unknown_nodes)
# Returns: 'Text with content'

# Escape unknown tags as text
creole = html2creole(html, unknown_emit=escape_unknown_nodes)
# Returns: 'Text with &lt;unknown&gt;content&lt;/unknown&gt;'

# Raise error on unknown tags
try:
    creole = html2creole(html, unknown_emit=raise_unknown_node)
except NotImplementedError:
    print("Unknown tag encountered")

# Wrap in HTML macro
creole = html2creole(html, unknown_emit=use_html_macro)
# Returns: 'Text with <<html>><unknown>content</unknown><</html>>'
```
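
### Error Handling and Debugging

The parser classes and the Creole, ReStructuredText, and Textile emitters accept a `debug` flag for troubleshooting; `HtmlEmitter` takes a `verbose` level instead:

```python
from creole.parser.creol2html_parser import CreoleParser
from creole.emitter.creol2html_emitter import HtmlEmitter

markup = "This is **bold** text"  # sample input

# Enable debugging
parser = CreoleParser(markup, debug=True)
document = parser.parse()
document.debug()  # Print tree structure

emitter = HtmlEmitter(document, verbose=2)
html = emitter.emit()  # Verbose output during emission
```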