# Text Processing

Utilities for converting between markup formats, in particular Markdown to reStructuredText (RST), and for processing notebook content. These functions provide the text transformations needed to convert notebook markup into Sphinx-compatible formats.

## Capabilities

### Markdown to RST Conversion

Core function for converting Markdown text to reStructuredText, with LaTeX math support and custom filters.

```python { .api }
def markdown2rst(text):
    """
    Convert a Markdown string to reST via pandoc.

    This is very similar to nbconvert.filters.markdown.markdown2rst(),
    except that it uses a pandoc filter to convert raw LaTeX blocks to
    "math" directives (instead of "raw:: latex" directives).

    Parameters:
    - text: str, Markdown text to convert

    Returns:
    str: Converted reStructuredText with proper math directive formatting,
         image definitions, and citation processing
    """
```

Usage example:

```python
from nbsphinx import markdown2rst

# Convert Markdown with math to RST (requires the pandoc executable)
markdown_text = """
# My Title

This is some text with inline math $x = y + z$ and display math:

$$
\\int_0^\\infty e^{-x} dx = 1
$$
"""

rst_text = markdown2rst(markdown_text)
print(rst_text)
# The display math is emitted as an RST "math" directive
```

### Pandoc Wrapper

Direct interface to pandoc for format conversion, with an optional filter function applied to pandoc's JSON representation of the document.

```python { .api }
def pandoc(source, fmt, to, filter_func=None):
    """
    Convert a string in format `fmt` to format `to` via pandoc.

    This is based on nbconvert.utils.pandoc.pandoc() and extended to
    allow passing a filter function.

    Parameters:
    - source: str, source text to convert
    - fmt: str, input format ('markdown', 'html', etc.)
    - to: str, output format ('rst', 'latex', etc.)
    - filter_func: callable, optional filter function for JSON processing

    Returns:
    str: Converted text in target format
    """
```

Usage example:

```python
from nbsphinx import pandoc

# Basic conversion
html_text = "<p>Hello <strong>world</strong></p>"
rst_text = pandoc(html_text, 'html', 'rst')

# With custom filter
def my_filter(json_text):
    # Custom processing of pandoc JSON AST
    return json_text

rst_text = pandoc(html_text, 'html', 'rst', filter_func=my_filter)
```
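
To make the `filter_func` hook more concrete, here is a minimal sketch of a filter that takes pandoc JSON text and returns modified JSON text, the same shape as `my_filter` above. It rewrites raw LaTeX blocks as display math, which is the kind of transformation `markdown2rst` is described as performing. The function name `latex_blocks_to_math` and the hand-written sample document are assumptions made for illustration; this is not nbsphinx's actual filter, and the exact node layout (for example whether `DisplayMath` carries a `c` field) depends on the pandoc version.

```python
import json


def latex_blocks_to_math(json_text):
    """Rewrite raw LaTeX blocks in a pandoc JSON document as display math.

    Illustrative sketch only; the real nbsphinx filter also handles
    images and citations.
    """
    def object_hook(obj):
        # json.loads calls this hook for every decoded JSON object (dict).
        if obj.get('t') == 'RawBlock' and obj['c'][0] == 'latex':
            latex_source = obj['c'][1]
            obj['t'] = 'Para'
            obj['c'] = [{
                't': 'Math',
                'c': [{'t': 'DisplayMath'}, latex_source],
            }]
        return obj

    return json.dumps(json.loads(json_text, object_hook=object_hook))


# Hand-written AST fragment standing in for pandoc's JSON output (assumed shape).
sample = json.dumps({
    'pandoc-api-version': [1, 23],
    'meta': {},
    'blocks': [
        {'t': 'RawBlock', 'c': ['latex', r'\begin{equation}x\end{equation}']},
    ],
})

filtered = json.loads(latex_blocks_to_math(sample))
print(filtered['blocks'][0]['t'])  # Para
```

A function like this could be passed as `filter_func=latex_blocks_to_math` to `pandoc()`. Using `object_hook` avoids writing a recursive tree walk, because the JSON decoder already visits every node.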

### Legacy Compatibility

Compatibility wrapper that exposes `markdown2rst` under the `convert_pandoc` filter name, which the nbconvert RST template has used since version 5.0.

```python { .api }
def convert_pandoc(text, from_format, to_format):
    """
    Simple wrapper for markdown2rst.

    In nbconvert version 5.0, the use of markdown2rst in the RST
    template was replaced by the new filter function convert_pandoc.

    Parameters:
    - text: str, text to convert
    - from_format: str, input format (must be 'markdown')
    - to_format: str, output format (must be 'rst')

    Returns:
    str: Converted reStructuredText

    Raises:
    ValueError: If formats other than markdown->rst are requested
    """
```
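
The contract documented above (accept only Markdown to RST, otherwise raise `ValueError`) can be sketched as follows. This is an illustration of the documented behaviour, not the library's code, and the library's actual format check may differ in detail. `convert_pandoc_sketch`, its extra `converter` parameter, and `fake_markdown2rst` are hypothetical names introduced so the example runs without pandoc installed.

```python
def convert_pandoc_sketch(text, from_format, to_format, converter):
    """Validate the requested formats, then delegate to `converter`.

    `converter` is a hypothetical stand-in for markdown2rst.
    """
    if (from_format, to_format) != ('markdown', 'rst'):
        raise ValueError(
            'Unsupported conversion: %s -> %s' % (from_format, to_format))
    return converter(text)


def fake_markdown2rst(text):
    # Placeholder conversion used only for this demonstration.
    return text.upper()


print(convert_pandoc_sketch('hello', 'markdown', 'rst', fake_markdown2rst))

try:
    convert_pandoc_sketch('hello', 'html', 'latex', fake_markdown2rst)
except ValueError as exc:
    print('rejected:', exc)
```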

### HTML Parsing

Specialized HTML parsers for handling citations and images in notebook content.

```python { .api }
import html.parser


class CitationParser(html.parser.HTMLParser):
    """
    HTML parser for citation elements.

    Processes HTML elements with citation data attributes
    and converts them to Sphinx citation references.

    Methods:
    - handle_starttag(tag, attrs): Process opening tags
    - handle_endtag(tag): Process closing tags
    - handle_startendtag(tag, attrs): Process self-closing tags
    - reset(): Reset parser state

    Attributes:
    - starttag: str, current opening tag
    - endtag: str, current closing tag
    - cite: str, formatted citation reference
    """

class ImgParser(html.parser.HTMLParser):
    """
    Turn HTML <img> tags into raw RST blocks.

    Converts HTML image elements to reStructuredText image directives
    with proper attribute handling and data URI support.

    Methods:
    - handle_starttag(tag, attrs): Process opening img tags
    - handle_startendtag(tag, attrs): Process self-closing img tags
    - reset(): Reset parser state

    Attributes:
    - obj: dict, pandoc AST object for the image
    - definition: str, RST image directive definition
    """
```

Usage example:

```python
from nbsphinx import CitationParser, ImgParser

# Parse citations
citation_html = '<span data-cite="author2023">Citation text</span>'
parser = CitationParser()
parser.feed(citation_html)
print(parser.cite)  # :cite:`author2023`

# Parse images
img_html = '<img src="plot.png" alt="My Plot" width="500">'
img_parser = ImgParser()
img_parser.feed(img_html)
print(img_parser.definition)  # RST image directive
```
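
For readers unfamiliar with `html.parser`, the following self-contained sketch shows how a parser of this kind can turn a `data-cite` attribute into a Sphinx `:cite:` role. `DataCiteSketch` is a hypothetical class written for illustration using only the standard library; it is an assumption about the general approach, not a copy of `CitationParser`.

```python
import html.parser


class DataCiteSketch(html.parser.HTMLParser):
    """Record a :cite: role for the first data-cite attribute encountered."""

    def reset(self):
        # HTMLParser.__init__ calls reset(), so parser state is set up here.
        super().reset()
        self.cite = ''

    def handle_starttag(self, tag, attrs):
        self._check_cite(attrs)

    def handle_startendtag(self, tag, attrs):
        self._check_cite(attrs)

    def _check_cite(self, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name == 'data-cite' and value and not self.cite:
                self.cite = ':cite:`' + value + '`'


parser = DataCiteSketch()
parser.feed('<cite data-cite="author2023">(Author, 2023)</cite>')
print(parser.cite)  # :cite:`author2023`
```

Initialising state in `reset()` rather than `__init__()` means the same instance can be reused for several HTML fragments by calling `reset()` between them.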

### Utility Functions

Helper functions for text processing and content extraction. Their leading underscore marks them as internal helpers rather than public API.

```python { .api }
def _extract_gallery_or_toctree(cell):
    """
    Extract links from Markdown cell and create gallery/toctree.

    Parameters:
    - cell: NotebookNode, notebook cell with gallery metadata

    Returns:
    str: RST directive for gallery or toctree
    """

def _get_empty_lines(text):
    """
    Get number of empty lines before and after code.

    Parameters:
    - text: str, code text to analyze

    Returns:
    tuple: (before, after) - number of empty lines
    """

def _get_output_type(output):
    """
    Choose appropriate output data types for HTML and LaTeX.

    Parameters:
    - output: NotebookNode, notebook output cell

    Returns:
    tuple: (html_datatype, latex_datatype) - appropriate MIME types
    """

def _local_file_from_reference(node, document):
    """
    Get local file path from document reference node.

    Parameters:
    - node: docutils node with reference
    - document: docutils document containing the node

    Returns:
    str: Local file path or None if not a local file reference
    """
```
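
As an illustration of what `_get_empty_lines` is documented to return, the sketch below counts the newline characters at the start and end of a code string. `count_empty_lines` is a hypothetical stand-in, and it assumes that "empty" means a line containing nothing at all (lines holding only spaces are not counted).

```python
def count_empty_lines(text):
    """Return (before, after): empty lines at the start and end of `text`."""
    before = len(text) - len(text.lstrip('\n'))
    stripped = text.strip('\n')
    # Subtract the leading newlines so that text consisting only of
    # newlines is not counted twice.
    after = len(text) - len(stripped) - before
    return before, after


print(count_empty_lines('\n\nx = 1\n'))  # (2, 1)
```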

## Format Constants

Pre-defined MIME type priorities for different output formats.

```python { .api }
# Display data priority for HTML output
DISPLAY_DATA_PRIORITY_HTML = (
    'application/vnd.jupyter.widget-state+json',
    'application/vnd.jupyter.widget-view+json',
    'application/javascript',
    'text/html',
    'text/markdown',
    'image/svg+xml',
    'text/latex',
    'image/png',
    'image/jpeg',
    'text/plain',
)

# Display data priority for LaTeX output
DISPLAY_DATA_PRIORITY_LATEX = (
    'text/latex',
    'application/pdf',
    'image/png',
    'image/jpeg',
    'image/svg+xml',
    'text/markdown',
    'text/plain',
)

# Thumbnail MIME type mappings
THUMBNAIL_MIME_TYPES = {
    'image/svg+xml': '.svg',
    'image/png': '.png',
    'image/jpeg': '.jpg',
}
```

The two priority tuples list MIME types from most to least preferred when a notebook output offers several representations, one ordering for HTML and one for LaTeX. `THUMBNAIL_MIME_TYPES` maps the image MIME types usable as thumbnails to their file extensions.
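
A priority tuple of this kind is typically consumed by picking the first listed MIME type that an output actually provides. The sketch below shows that selection under stated assumptions: `select_mime_type`, the abbreviated `HTML_PRIORITY` and `LATEX_PRIORITY` tuples, and the sample `output_data` are all hypothetical and only mirror the relative ordering of the constants above.

```python
# Abbreviated priority lists for illustration (same relative order as above).
HTML_PRIORITY = ('text/html', 'image/svg+xml', 'image/png', 'text/plain')
LATEX_PRIORITY = ('text/latex', 'application/pdf', 'image/png', 'text/plain')


def select_mime_type(data, priority):
    """Return the first MIME type in `priority` that `data` provides."""
    for mime_type in priority:
        if mime_type in data:
            return mime_type
    return None  # No supported representation available.


# Hypothetical output data from a cell that produced a plot.
output_data = {
    'text/plain': '<Figure size 640x480>',
    'image/png': 'iVBORw0KGgo...',
    'text/html': '<img src="...">',
}

print(select_mime_type(output_data, HTML_PRIORITY))   # text/html
print(select_mime_type(output_data, LATEX_PRIORITY))  # image/png
```

The same output can therefore be rendered from different representations depending on the builder: here the HTML build would use the HTML fragment, while the LaTeX build would fall back to the PNG image.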