# Text Processing

Utilities for converting between markup formats, in particular Markdown to reStructuredText (RST), and for processing notebook content. They provide the text transformations needed to turn notebook markup into Sphinx-compatible formats.

## Capabilities

### Markdown to RST Conversion

Core function for converting Markdown text to reStructuredText with LaTeX math support and custom filters.

```python { .api }
def markdown2rst(text):
    """
    Convert a Markdown string to reST via pandoc.

    This is very similar to nbconvert.filters.markdown.markdown2rst(),
    except that it uses a pandoc filter to convert raw LaTeX blocks to
    "math" directives (instead of "raw:: latex" directives).

    Parameters:
    - text: str, Markdown text to convert

    Returns:
    str: Converted reStructuredText with proper math directive formatting,
        image definitions, and citation processing
    """
```

Usage example:

```python
from nbsphinx import markdown2rst

# Convert Markdown with math to RST
markdown_text = """
# My Title

This is some text with inline math $x = y + z$ and display math:

$$
\\int_0^\\infty e^{-x} dx = 1
$$

![Image](image.png)
"""

rst_text = markdown2rst(markdown_text)
print(rst_text)
# Output includes proper RST math directives and image handling
```
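
The `markdown2rst` docstring mentions a pandoc filter that turns raw LaTeX blocks into math instead of `raw:: latex` directives. As a rough illustration of that idea only (not nbsphinx's actual filter), the sketch below rewrites `RawBlock` nodes in a pandoc JSON AST. The names `rawlatex_to_math` and `latex_filter` are hypothetical, the node layout assumes pandoc's JSON format from version 1.18 onward, and the input is a hand-written AST so the snippet runs without pandoc.

```python
import json


def rawlatex_to_math(obj):
    """Hypothetical hook: turn a raw LaTeX block into a display-math paragraph."""
    if obj.get('t') == 'RawBlock' and obj['c'][0] == 'latex':
        # Replace the block with a paragraph holding a single Math element.
        return {
            't': 'Para',
            'c': [{'t': 'Math', 'c': [{'t': 'DisplayMath'}, obj['c'][1]]}],
        }
    return obj


def latex_filter(json_text):
    """Apply the hook to every JSON object in the serialized AST."""
    return json.dumps(json.loads(json_text, object_hook=rawlatex_to_math))


# Hand-written AST standing in for pandoc's JSON output (an assumption).
ast = json.dumps({
    'pandoc-api-version': [1, 22],
    'meta': {},
    'blocks': [
        {'t': 'RawBlock', 'c': ['latex', r'\begin{equation}x\end{equation}']},
    ],
})
converted = json.loads(latex_filter(ast))
print(converted['blocks'][0]['t'])
```

Using `object_hook` keeps the traversal short because `json.loads` calls the hook for every decoded object, innermost first.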

### Pandoc Wrapper

Direct interface to pandoc for format conversion with optional filter functions.

```python { .api }
def pandoc(source, fmt, to, filter_func=None):
    """
    Convert a string in format `fmt` to format `to` via pandoc.

    This is based on nbconvert.utils.pandoc.pandoc() and extended to
    allow passing a filter function.

    Parameters:
    - source: str, source text to convert
    - fmt: str, input format ('markdown', 'html', etc.)
    - to: str, output format ('rst', 'latex', etc.)
    - filter_func: callable, optional filter function for JSON processing

    Returns:
    str: Converted text in target format
    """
```

Usage example:

```python
from nbsphinx import pandoc

# Basic conversion
html_text = "<p>Hello <strong>world</strong></p>"
rst_text = pandoc(html_text, 'html', 'rst')

# With custom filter
def my_filter(json_text):
    # Custom processing of pandoc JSON AST
    return json_text

rst_text = pandoc(html_text, 'html', 'rst', filter_func=my_filter)
```
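
The `my_filter` example above takes the pandoc JSON AST as text and returns JSON text unchanged. A minimal sketch of a filter that actually modifies the AST might look like the following. It assumes pandoc's JSON layout, in which `Str` nodes carry their text in the `c` field; `walk` and `upper_filter` are hypothetical names, and the document is hand-written so the snippet runs without pandoc installed.

```python
import json


def walk(node, action):
    """Recursively apply `action` to every dict in a JSON-like structure."""
    if isinstance(node, list):
        return [walk(item, action) for item in node]
    if isinstance(node, dict):
        node = {key: walk(value, action) for key, value in node.items()}
        return action(node)
    return node


def upper_filter(json_text):
    """Hypothetical filter: upper-case the text of every Str node."""
    def action(obj):
        if obj.get('t') == 'Str':
            obj['c'] = obj['c'].upper()
        return obj
    return json.dumps(walk(json.loads(json_text), action))


# Hand-written AST standing in for pandoc's JSON output (an assumption).
doc = json.dumps({
    'pandoc-api-version': [1, 22],
    'meta': {},
    'blocks': [{'t': 'Para', 'c': [
        {'t': 'Str', 'c': 'Hello'}, {'t': 'Space'}, {'t': 'Str', 'c': 'world'},
    ]}],
})
result = json.loads(upper_filter(doc))
print([node.get('c') for node in result['blocks'][0]['c']])
```

A function of this shape could then be passed as `filter_func=upper_filter` in the `pandoc()` call shown earlier.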

### Legacy Compatibility

Compatibility wrapper for nbconvert 5.0 and later, whose RST template calls `convert_pandoc` instead of `markdown2rst`.

```python { .api }
def convert_pandoc(text, from_format, to_format):
    """
    Simple wrapper for markdown2rst.

    In nbconvert version 5.0, the use of markdown2rst in the RST
    template was replaced by the new filter function convert_pandoc.

    Parameters:
    - text: str, text to convert
    - from_format: str, input format (must be 'markdown')
    - to_format: str, output format (must be 'rst')

    Returns:
    str: Converted reStructuredText

    Raises:
    ValueError: If formats other than markdown->rst are requested
    """
```
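
The contract described above (accept only Markdown to RST, otherwise raise `ValueError`) can be sketched as follows. This is an illustration under assumptions, not the library's code: `convert_pandoc_sketch` is a hypothetical name, and the converter is passed in as a parameter so that `str.strip` can stand in for `markdown2rst` and the snippet runs without pandoc.

```python
def convert_pandoc_sketch(text, from_format, to_format, converter):
    """Hypothetical stand-in: accept only markdown -> rst, then delegate."""
    if (from_format, to_format) != ('markdown', 'rst'):
        raise ValueError(
            'Unsupported conversion: {} -> {}'.format(from_format, to_format))
    return converter(text)


# `str.strip` is a placeholder for markdown2rst (an assumption for the demo).
print(convert_pandoc_sketch('  *text*  ', 'markdown', 'rst', str.strip))

try:
    convert_pandoc_sketch('<p>text</p>', 'html', 'rst', str.strip)
except ValueError as exc:
    print(exc)
```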

### HTML Parsing

Specialized HTML parsers for handling citations and images in notebook content.

```python { .api }
class CitationParser(html.parser.HTMLParser):
    """
    HTML parser for citation elements.

    Processes HTML elements with citation data attributes
    and converts them to Sphinx citation references.

    Methods:
    - handle_starttag(tag, attrs): Process opening tags
    - handle_endtag(tag): Process closing tags
    - handle_startendtag(tag, attrs): Process self-closing tags
    - reset(): Reset parser state

    Attributes:
    - starttag: str, current opening tag
    - endtag: str, current closing tag
    - cite: str, formatted citation reference
    """


class ImgParser(html.parser.HTMLParser):
    """
    Turn HTML <img> tags into raw RST blocks.

    Converts HTML image elements to reStructuredText image directives
    with proper attribute handling and data URI support.

    Methods:
    - handle_starttag(tag, attrs): Process opening img tags
    - handle_startendtag(tag, attrs): Process self-closing img tags
    - reset(): Reset parser state

    Attributes:
    - obj: dict, pandoc AST object for the image
    - definition: str, RST image directive definition
    """
```

Usage example:

```python
from nbsphinx import CitationParser, ImgParser

# Parse citations
citation_html = '<span data-cite="author2023">Citation text</span>'
parser = CitationParser()
parser.feed(citation_html)
print(parser.cite)  # :cite:`author2023`

# Parse images
img_html = '<img src="plot.png" alt="My Plot" width="500">'
img_parser = ImgParser()
img_parser.feed(img_html)
print(img_parser.definition)  # RST image directive
```
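
To show how an `html.parser.HTMLParser` subclass can turn an `<img>` tag into an RST image directive, here is a simplified, standard-library-only sketch. `ImgDirectiveSketch` is a hypothetical class, not nbsphinx's `ImgParser`; it ignores data URIs and the pandoc AST object and only assumes that `alt`, `width`, and `height` map onto directive options of the same name.

```python
from html.parser import HTMLParser


class ImgDirectiveSketch(HTMLParser):
    """Hypothetical, simplified parser: build an RST image directive."""

    def reset(self):
        # HTMLParser.__init__ calls reset(), so state is initialized here.
        super().reset()
        self.definition = ''

    def handle_starttag(self, tag, attrs):
        self._check_img(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._check_img(tag, attrs)

    def _check_img(self, tag, attrs):
        attrs = dict(attrs)  # attrs arrives as a list of (name, value) pairs
        if tag != 'img' or not attrs.get('src'):
            return
        lines = ['.. image:: ' + attrs['src']]
        for option in ('alt', 'width', 'height'):
            if attrs.get(option):
                lines.append('    :{}: {}'.format(option, attrs[option]))
        self.definition = '\n'.join(lines)


parser = ImgDirectiveSketch()
parser.feed('<img src="plot.png" alt="My Plot" width="500">')
print(parser.definition)
```

Overriding both `handle_starttag` and `handle_startendtag` means `<img ...>` and `<img ... />` are treated the same way.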

### Utility Functions

Helper functions for text processing and content extraction.

```python { .api }
def _extract_gallery_or_toctree(cell):
    """
    Extract links from Markdown cell and create gallery/toctree.

    Parameters:
    - cell: NotebookNode, notebook cell with gallery metadata

    Returns:
    str: RST directive for gallery or toctree
    """

def _get_empty_lines(text):
    """
    Get number of empty lines before and after code.

    Parameters:
    - text: str, code text to analyze

    Returns:
    tuple: (before, after) - number of empty lines
    """

def _get_output_type(output):
    """
    Choose appropriate output data types for HTML and LaTeX.

    Parameters:
    - output: NotebookNode, notebook output cell

    Returns:
    tuple: (html_datatype, latex_datatype) - appropriate MIME types
    """

def _local_file_from_reference(node, document):
    """
    Get local file path from document reference node.

    Parameters:
    - node: docutils node with reference
    - document: docutils document containing the node

    Returns:
    str: Local file path or None if not a local file reference
    """
```
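
As an illustration of what counting empty lines before and after code can mean, the sketch below counts leading and trailing newline characters in a string. `count_empty_lines` is a hypothetical helper written for this example, and treating each leading or trailing `\n` as one empty line is an assumption about the intended behavior.

```python
def count_empty_lines(text):
    """Hypothetical helper: count newline padding before and after code."""
    before = len(text) - len(text.lstrip('\n'))
    # Subtract `before` so a string made only of newlines is not counted twice.
    after = len(text) - len(text.strip('\n')) - before
    return before, after


print(count_empty_lines('\n\nx = 1\n'))
```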

## Format Constants

Pre-defined MIME type priorities for different output formats.

```python { .api }
# Display data priority for HTML output
DISPLAY_DATA_PRIORITY_HTML = (
    'application/vnd.jupyter.widget-state+json',
    'application/vnd.jupyter.widget-view+json',
    'application/javascript',
    'text/html',
    'text/markdown',
    'image/svg+xml',
    'text/latex',
    'image/png',
    'image/jpeg',
    'text/plain',
)

# Display data priority for LaTeX output
DISPLAY_DATA_PRIORITY_LATEX = (
    'text/latex',
    'application/pdf',
    'image/png',
    'image/jpeg',
    'image/svg+xml',
    'text/markdown',
    'text/plain',
)

# Thumbnail MIME type mappings
THUMBNAIL_MIME_TYPES = {
    'image/svg+xml': '.svg',
    'image/png': '.png',
    'image/jpeg': '.jpg',
}
```

These constants control how different types of notebook output are prioritized and processed for display in HTML and LaTeX formats.
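
One way to read such a priority tuple is "use the first MIME type that the output actually provides". The sketch below shows that selection under stated assumptions: `pick_mime_type` is a hypothetical helper, the two tuples are shortened copies that keep the relative order of the constants above, and the output data is made up for the demonstration.

```python
# Shortened priority tuples, keeping the relative order of the constants above.
HTML_PRIORITY = ('text/html', 'image/svg+xml', 'image/png', 'text/plain')
LATEX_PRIORITY = ('text/latex', 'image/png', 'image/svg+xml', 'text/plain')


def pick_mime_type(data, priority):
    """Hypothetical helper: first MIME type in `priority` present in `data`."""
    for mime_type in priority:
        if mime_type in data:
            return mime_type
    return None


# Made-up output data offering three representations of the same result.
output_data = {
    'text/plain': '<Figure>',
    'image/png': '...',
    'text/html': '<img src="...">',
}
print(pick_mime_type(output_data, HTML_PRIORITY))
print(pick_mime_type(output_data, LATEX_PRIORITY))
```

With this data the HTML ordering selects `text/html`, while the LaTeX ordering falls through to `image/png` because no `text/latex` representation is present.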