# NH3

High-performance HTML sanitization library providing Python bindings to the Rust ammonia crate. NH3 offers fast, configurable HTML cleaning and is approximately 20x faster than alternatives such as bleach.

## Package Information

- **Package Name**: nh3
- **Language**: Python
- **Backend**: Rust (via PyO3/maturin)
- **Installation**: `pip install nh3`

## Core Imports

```python
import nh3
```

All functionality is available at the module level. For type hints:

```python
from typing import Callable, Dict, Optional, Set
```

## Basic Usage

```python
import nh3

# Basic HTML sanitization
html = '<script>alert("xss")</script><p>Safe <b>content</b></p>'
clean_html = nh3.clean(html)
print(clean_html)  # Output: '<p>Safe <b>content</b></p>'

# Text escaping
text = 'User input with <dangerous> characters & symbols'
escaped = nh3.clean_text(text)
print(escaped)  # '<', '>' and '&' come back as entities (&lt;, &gt;, &amp;)

# Check if string contains HTML
has_html = nh3.is_html('<p>HTML content</p>')  # True
has_no_html = nh3.is_html('Plain text')  # False

# Using a reusable cleaner with custom configuration
cleaner = nh3.Cleaner(
    tags={'p', 'b', 'i', 'strong', 'em'},
    attributes={'*': {'class', 'id'}},
    strip_comments=True
)
result = cleaner.clean('<p class="text">Safe <script>evil()</script> content</p>')
# The spaces on both sides of the removed script element remain.
print(result)  # Output: '<p class="text">Safe  content</p>'
```

## Capabilities

### HTML Sanitization

Primary function for cleaning HTML content with extensive configuration options for allowed tags, attributes, URL schemes, and content filtering.

```python { .api }
def clean(
    html: str,
    tags: Optional[Set[str]] = None,
    clean_content_tags: Optional[Set[str]] = None,
    attributes: Optional[Dict[str, Set[str]]] = None,
    attribute_filter: Optional[Callable[[str, str, str], Optional[str]]] = None,
    strip_comments: bool = True,
    link_rel: Optional[str] = "noopener noreferrer",
    generic_attribute_prefixes: Optional[Set[str]] = None,
    tag_attribute_values: Optional[Dict[str, Dict[str, Set[str]]]] = None,
    set_tag_attribute_values: Optional[Dict[str, Dict[str, str]]] = None,
    url_schemes: Optional[Set[str]] = None,
    allowed_classes: Optional[Dict[str, Set[str]]] = None,
    filter_style_properties: Optional[Set[str]] = None
) -> str:
    """
    Sanitize an HTML fragment according to the given options.

    Parameters:
    - html: Input HTML fragment to sanitize
    - tags: Set of allowed HTML tags (defaults to ALLOWED_TAGS)
    - clean_content_tags: Tags whose contents are completely removed
    - attributes: Allowed attributes per tag ('*' key for any tag)
    - attribute_filter: Callback for custom attribute processing
    - strip_comments: Whether to remove HTML comments
    - link_rel: Rel attribute value added to links
    - generic_attribute_prefixes: Attribute prefixes allowed on any tag
    - tag_attribute_values: Allowed attribute values per tag
    - set_tag_attribute_values: Required attribute values per tag
    - url_schemes: Permitted URL schemes for href/src attributes
    - allowed_classes: Allowed CSS classes per tag
    - filter_style_properties: Allowed CSS properties in style attributes

    Returns:
    Sanitized HTML fragment as string
    """
```

**Usage Examples:**

```python
# Allow only specific tags
nh3.clean('<div><p>Text</p><script>evil()</script></div>', tags={'p'})
# Result: '<p>Text</p>'

# Remove the content of <style> elements completely
nh3.clean('<style>body{}</style><p>Text</p>', clean_content_tags={'style'})
# Result: '<p>Text</p>'

# Custom attribute filtering
def filter_classes(tag, attr, value):
    if tag == 'div' and attr == 'class':
        allowed = {'container', 'wrapper'}
        classes = set(value.split())
        filtered = classes.intersection(allowed)
        # sorted() keeps the output order deterministic
        return ' '.join(sorted(filtered)) if filtered else None
    return value

nh3.clean('<div class="container evil">text</div>',
          attributes={'div': {'class'}},
          attribute_filter=filter_classes)
# Result: '<div class="container">text</div>'

# Allow data attributes with prefixes
nh3.clean('<div data-id="123" onclick="evil()">text</div>',
          generic_attribute_prefixes={'data-'})
# Result: '<div data-id="123">text</div>'

# Control URL schemes (the default link_rel still adds rel to the link)
nh3.clean('<a href="javascript:alert()">link</a>', url_schemes={'https', 'http'})
# Result: '<a rel="noopener noreferrer">link</a>'

# Filter CSS properties
nh3.clean('<p style="color:red;display:none">text</p>',
          attributes={'p': {'style'}},
          filter_style_properties={'color'})
# Result: '<p style="color:red">text</p>'
```

### Text Escaping

Converts an arbitrary string to HTML-safe text by escaping special characters. It serves the same purpose as `html.escape()` but escapes more aggressively.

```python { .api }
def clean_text(html: str) -> str:
    """
    Turn an arbitrary string into unformatted HTML by escaping special characters.

    Parameters:
    - html: Input string to escape

    Returns:
    HTML-escaped string safe for display in HTML context
    """
```

**Usage Examples:**

```python
# Basic text escaping
nh3.clean_text('Price: $5 & up')
# Result: the '&' is escaped as '&amp;'

# JavaScript injection prevention
nh3.clean_text('"); alert("xss");//')
# Result: the double quotes are escaped as '&quot;', so the text cannot close an attribute value

# HTML tag neutralization
nh3.clean_text('<script>alert("hello")</script>')
# Result: '<' and '>' are escaped as '&lt;' and '&gt;', so no tag is created
```
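
For comparison, the sketch below shows what the standard library's `html.escape()` does on its own. It uses only the standard library and does not call nh3; the sample string is purely illustrative.

```python
import html

sample = '<b title="x">Tom & Jerry\'s</b>'

# html.escape rewrites only &, <, > and (with quote=True, the default) " and '.
baseline = html.escape(sample)
print(baseline)

# Spaces, slashes and equals signs pass through html.escape unchanged.
untouched = [ch for ch in " /=" if ch in baseline]
print(untouched)
```

Because those characters are left alone, the `html.escape()` result is only safe in element content or quoted attribute values. The more aggressive escaping attributed to `clean_text` above is aimed at contexts where that guarantee is not enough.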

### HTML Detection

Determines whether a string contains HTML syntax through full parsing, useful for conditional processing of user input.

```python { .api }
def is_html(html: str) -> bool:
    """
    Determine if a given string contains HTML syntax.

    Parameters:
    - html: Input string to analyze

    Returns:
    True if string contains HTML syntax (including invalid HTML), False otherwise
    """
```

**Usage Examples:**

```python
# Valid HTML detection
nh3.is_html('<p>Hello world</p>')  # True
nh3.is_html('<br>')  # True

# Invalid HTML still detected
nh3.is_html('<invalid-tag>')  # True
nh3.is_html('Vec::<u8>::new()')  # True (angle brackets detected)

# Plain text
nh3.is_html('Hello world')  # False
nh3.is_html('Price: $5 & up')  # False
```
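
One way to apply the "conditional processing" idea is to sanitize input that looks like markup and escape everything else. The sketch below is an assumption about how such routing might look, not something nh3 provides: `render_user_input` is a hypothetical helper, and the `fake` object is a stand-in so the example runs without nh3 installed.

```python
from types import SimpleNamespace


def render_user_input(text, sanitizer):
    """Sanitize markup-looking input; escape everything else.

    `sanitizer` is expected to expose is_html(), clean() and clean_text(),
    as the nh3 module does. The function name is hypothetical.
    """
    if sanitizer.is_html(text):
        return sanitizer.clean(text)
    return sanitizer.clean_text(text)


# Stand-in object so the sketch runs without nh3 installed (illustration only).
fake = SimpleNamespace(
    is_html=lambda s: "<" in s,
    clean=lambda s: "cleaned:" + s,
    clean_text=lambda s: "escaped:" + s,
)

print(render_user_input("<p>hi</p>", fake))   # cleaned:<p>hi</p>
print(render_user_input("plain text", fake))  # escaped:plain text
```

With nh3 installed, `render_user_input(text, nh3)` would route through the real functions.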

### Reusable Cleaner

Class-based interface for creating configured sanitizers that can be reused multiple times, providing better performance for repeated sanitization with the same settings.

```python { .api }
class Cleaner:
    def __init__(
        self,
        tags: Optional[Set[str]] = None,
        clean_content_tags: Optional[Set[str]] = None,
        attributes: Optional[Dict[str, Set[str]]] = None,
        attribute_filter: Optional[Callable[[str, str, str], Optional[str]]] = None,
        strip_comments: bool = True,
        link_rel: Optional[str] = "noopener noreferrer",
        generic_attribute_prefixes: Optional[Set[str]] = None,
        tag_attribute_values: Optional[Dict[str, Dict[str, Set[str]]]] = None,
        set_tag_attribute_values: Optional[Dict[str, Dict[str, str]]] = None,
        url_schemes: Optional[Set[str]] = None,
        allowed_classes: Optional[Dict[str, Set[str]]] = None,
        filter_style_properties: Optional[Set[str]] = None
    ) -> None:
        """
        Create a reusable sanitizer with the given configuration.

        Parameters: Same as clean() function parameters
        """

    def clean(self, html: str) -> str:
        """
        Sanitize HTML using the configured options.

        Parameters:
        - html: Input HTML fragment to sanitize

        Returns:
        Sanitized HTML fragment as string
        """
```

**Usage Examples:**

```python
# Create a cleaner for blog content
blog_cleaner = nh3.Cleaner(
    tags={'p', 'br', 'strong', 'em', 'a', 'ul', 'ol', 'li'},
    attributes={
        'a': {'href', 'title'},
        '*': {'class'}
    },
    allowed_classes={
        'p': {'highlight', 'quote'},
        'a': {'external-link'}
    },
    url_schemes={'http', 'https', 'mailto'}
)

# Reuse the cleaner for multiple inputs
user_content1 = blog_cleaner.clean('<p class="highlight">Safe content</p>')
user_content2 = blog_cleaner.clean('<script>evil()</script><p>More content</p>')

# Create a strict cleaner for user comments
comment_cleaner = nh3.Cleaner(
    tags={'p', 'br'},
    attributes={},
    strip_comments=True,
    link_rel=None
)

safe_comment = comment_cleaner.clean('<p>User comment with <a>no links</a></p>')
# Result: '<p>User comment with no links</p>'
```

## Default Constants

Pre-configured sets of allowed tags, attributes, and URL schemes based on secure defaults from the ammonia library.

```python { .api }
ALLOWED_TAGS: Set[str]
# Default set of allowed HTML tags. The list below is indicative; print
# nh3.ALLOWED_TAGS for the exact set in the installed version.
# a, abbr, acronym, area, article, aside,
# b, bdi, bdo, blockquote, br, button, caption, center, cite, code, col, colgroup,
# data, datalist, dd, del, details, dfn, div, dl, dt, em, fieldset, figcaption,
# figure, footer, form, h1, h2, h3, h4, h5, h6, header, hgroup, hr, i, img, input,
# ins, kbd, keygen, label, legend, li, main, map, mark, meter, nav, ol, optgroup,
# option, output, p, pre, progress, q, rp, rt, ruby, s, samp, section, select,
# small, span, strong, sub, summary, sup, table, tbody, td, textarea, tfoot, th,
# thead, time, tr, u, ul, var, wbr

ALLOWED_ATTRIBUTES: Dict[str, Set[str]]
# Default mapping of allowed attributes per tag, includes common safe attributes
# like href for links, src for images, type for inputs, etc.

ALLOWED_URL_SCHEMES: Set[str]
# Default set of allowed URL schemes, including http, https, mailto, and tel
```

**Usage Examples:**

```python
# Inspect default allowed tags
print('p' in nh3.ALLOWED_TAGS)  # True
print('script' in nh3.ALLOWED_TAGS)  # False

# Extend default attributes (deepcopy keeps the module-level defaults untouched)
from copy import deepcopy
custom_attributes = deepcopy(nh3.ALLOWED_ATTRIBUTES)
# setdefault avoids a KeyError if the defaults have no entry for 'div'
custom_attributes.setdefault('div', set()).add('data-id')
custom_attributes['*'] = {'class', 'id'}

# Use extended configuration
result = nh3.clean('<div class="box" data-id="123">content</div>',
                   attributes=custom_attributes)

# Remove tags using set operations
restricted_tags = nh3.ALLOWED_TAGS - {'b', 'i'}
nh3.clean('<b><i>text</i></b><p>paragraph</p>', tags=restricted_tags)
# Result: 'text<p>paragraph</p>'

# Remove URL schemes using set operations
safe_schemes = nh3.ALLOWED_URL_SCHEMES - {'tel'}
nh3.clean('<a href="tel:+1">Call</a> or <a href="mailto:me">email</a>',
          url_schemes=safe_schemes)
# Result: '<a rel="noopener noreferrer">Call</a> or <a href="mailto:me" rel="noopener noreferrer">email</a>'

# Check default URL schemes
print('https' in nh3.ALLOWED_URL_SCHEMES)  # True
print('javascript' in nh3.ALLOWED_URL_SCHEMES)  # False
```

## Advanced Configuration

### Attribute Filtering

The `attribute_filter` parameter accepts a callable that receives three string parameters (tag, attribute, value) and can return a modified value or None to remove the attribute entirely.

```python
def smart_class_filter(tag, attr, value):
    """Example: Only allow specific CSS classes"""
    if attr == 'class':
        allowed_classes = {
            'p': {'intro', 'highlight', 'quote'},
            'div': {'container', 'wrapper', 'sidebar'},
            'a': {'external', 'internal'}
        }
        if tag in allowed_classes:
            classes = set(value.split())
            filtered = classes.intersection(allowed_classes[tag])
            return ' '.join(sorted(filtered)) if filtered else None
    return value

# Apply the filter
result = nh3.clean(
    '<div class="container evil"><p class="intro spam">Text</p></div>',
    attributes={'div': {'class'}, 'p': {'class'}},
    attribute_filter=smart_class_filter
)
# Result: '<div class="container"><p class="intro">Text</p></div>'
```

### Tag Attribute Values

Control which specific values are allowed for attributes on specific tags.

```python
# Only allow specific form input types
result = nh3.clean(
    '<input type="text"><input type="password"><input type="file">',
    tags={'input'},
    tag_attribute_values={
        'input': {
            'type': {'text', 'email', 'password', 'number'}
        }
    }
)
# Result: '<input type="text"><input type="password"><input>'
```

### Set Tag Attribute Values

Automatically add or override attribute values on specific tags.

```python
# Always add target="_blank" to external links
result = nh3.clean(
    '<a href="https://example.com">Link</a>',
    tags={'a'},
    attributes={'a': {'href', 'target'}},
    set_tag_attribute_values={
        'a': {'target': '_blank'}
    },
    link_rel='noopener noreferrer'
)
# Result: '<a href="https://example.com" target="_blank" rel="noopener noreferrer">Link</a>'
```

## Error Handling

NH3 follows Python conventions for error handling:

- **Invalid attribute_filter**: Raises `TypeError` if the provided callback is not callable. Exceptions raised inside the callback are reported as unraisable exceptions and logged, and processing continues.
- **Malformed HTML**: Parsed on a best-effort basis; invalid elements are removed.
- **Invalid CSS**: When style filtering is enabled, invalid declarations and @rules are removed. The remaining syntactically valid declarations are normalized (e.g., whitespace is standardized).
- **Thread Safety**: All operations are thread-safe and release the GIL during processing.
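
Since an exception inside the callback does not stop processing, it may be preferable to decide explicitly what happens to the attribute when a filter fails. The sketch below wraps a filter so that any exception drops the attribute. `drop_on_error` and `fragile_filter` are hypothetical names introduced for illustration and are not part of nh3.

```python
import functools


def drop_on_error(filter_fn):
    """Wrap an attribute filter so that any exception removes the attribute.

    Hypothetical helper, not part of nh3: returning None tells the sanitizer
    to drop the attribute, which is the conservative choice on failure.
    """
    @functools.wraps(filter_fn)
    def wrapper(tag, attr, value):
        try:
            return filter_fn(tag, attr, value)
        except Exception:
            return None
    return wrapper


def fragile_filter(tag, attr, value):
    # Fails for attribute values that are not integers.
    return str(int(value))


safe_filter = drop_on_error(fragile_filter)
print(safe_filter("td", "colspan", "2"))     # 2
print(safe_filter("td", "colspan", "evil"))  # None
```

The wrapped callable would then be passed as `attribute_filter=drop_on_error(my_filter)`.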

## Performance Characteristics

- **Speed**: Approximately 20x faster than bleach for typical HTML sanitization
- **Memory**: Efficient streaming processing with minimal memory overhead
- **Threading**: Thread-safe operations with GIL release during Rust processing
- **Scalability**: Suitable for high-throughput applications and large HTML documents
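
If the GIL is released during sanitization as described above, a thread pool is one way to process many fragments concurrently. The following is a minimal sketch under that assumption. `sanitize_many` is a hypothetical helper, and `str.upper` is only a placeholder so the example runs without nh3 installed.

```python
from concurrent.futures import ThreadPoolExecutor


def sanitize_many(fragments, clean, max_workers=4):
    """Apply `clean` to every fragment on a thread pool, preserving order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(clean, fragments))


# Placeholder callable so the sketch runs without nh3 installed; in practice
# `clean` would be nh3.clean or the clean method of a configured nh3.Cleaner.
results = sanitize_many(["<p>a</p>", "<p>b</p>"], clean=str.upper)
print(results)  # ['<P>A</P>', '<P>B</P>']
```

Whether threads actually help depends on the workload, so it is worth measuring with representative input before relying on this pattern.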

## Module Attributes

```python { .api }
__version__: str
# Package version string (e.g., "0.3.0")
```
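
A version string like the one above can be turned into a tuple for simple comparisons, for example to guard code that depends on a newer release. The sketch below is a hypothetical helper, not part of nh3, and the `(0, 2)` threshold is an arbitrary illustration.

```python
import re


def version_tuple(version):
    """Return the leading numeric components of a version string as ints."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


print(version_tuple("0.3.0"))            # (0, 3, 0)
print(version_tuple("0.3.0") >= (0, 2))  # True
```

With nh3 installed, the input would be `nh3.__version__`.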