# Bleach

Bleach is an allowlist-based HTML sanitizing library that escapes or strips markup and attributes from untrusted HTML. Only explicitly allowed tags, attributes, and protocols pass through as markup; everything else is escaped by default or stripped on request. It can also linkify text safely, applying more comprehensive filters than Django's `urlize` filter.

## Package Information

- **Package Name**: bleach
- **Language**: Python
- **Installation**: `pip install bleach`
- **Optional Dependencies**: `pip install bleach[css]` (for CSS sanitization with tinycss2)

## Core Imports

```python
import bleach
```

For main functions:

```python
from bleach import clean, linkify
```

For classes:

```python
from bleach.sanitizer import Cleaner, BleachSanitizerFilter, attribute_filter_factory
from bleach.linkifier import Linker, LinkifyFilter
from bleach.css_sanitizer import CSSSanitizer
```

For callbacks:

```python
from bleach.callbacks import nofollow, target_blank
```

For constants and utilities:

```python
from bleach.sanitizer import ALLOWED_TAGS, ALLOWED_ATTRIBUTES, ALLOWED_PROTOCOLS
from bleach.sanitizer import INVISIBLE_CHARACTERS, INVISIBLE_CHARACTERS_RE, INVISIBLE_REPLACEMENT_CHAR
from bleach.linkifier import DEFAULT_CALLBACKS, build_url_re, build_email_re, TLDS, URL_RE, EMAIL_RE, PROTO_RE
from bleach.css_sanitizer import ALLOWED_CSS_PROPERTIES, ALLOWED_SVG_PROPERTIES
from bleach import html5lib_shim  # For HTML_TAGS constant
from bleach import __version__, __releasedate__
```

## Basic Usage

```python
import bleach

# Basic HTML sanitization: by default, disallowed tags are escaped, not removed
unsafe_html = '<script>alert("XSS")</script><p onclick="evil()">Hello <b>world</b></p>'
safe_html = bleach.clean(unsafe_html)
# <script> and <p> are not in the default ALLOWED_TAGS, so both are escaped
# (for example, <script> becomes &lt;script&gt;); the allowed <b> stays as markup.
# Pass strip=True to drop disallowed tags instead of escaping them.

# Linkification - converts URLs to clickable links
text_with_urls = 'Visit https://example.com for more info!'
linked_text = bleach.linkify(text_with_urls)
# Result: 'Visit <a href="https://example.com" rel="nofollow">https://example.com</a> for more info!'

# Combined cleaning and linkifying: clean first, then linkify the cleaned text
unsafe_text = 'Check out http://evil.com<script>alert("bad")</script>'
safe_linked = bleach.linkify(bleach.clean(unsafe_text))
```

## Capabilities

### HTML Sanitization

Cleans HTML fragments by removing or escaping malicious content using an allowlist-based approach.

```python { .api }
def clean(
    text: str,
    tags: frozenset = ALLOWED_TAGS,
    attributes: dict = ALLOWED_ATTRIBUTES,
    protocols: frozenset = ALLOWED_PROTOCOLS,
    strip: bool = False,
    strip_comments: bool = True,
    css_sanitizer: CSSSanitizer = None
) -> str:
    """
    Clean an HTML fragment of malicious content and return it.

    Parameters:
    - text: the HTML text to clean
    - tags: set of allowed tags; defaults to ALLOWED_TAGS
    - attributes: allowed attributes; can be callable, list or dict; defaults to ALLOWED_ATTRIBUTES
    - protocols: set of allowed protocols for links; defaults to ALLOWED_PROTOCOLS
    - strip: whether to strip disallowed elements instead of escaping
    - strip_comments: whether to strip HTML comments
    - css_sanitizer: instance with sanitize_css method for style attributes

    Returns:
    Cleaned text as a string
    """
```

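The effect of `strip` and of a narrower allowlist is easiest to see side by side. The following is a minimal sketch, assuming bleach 6.x is installed; the input strings are made up for illustration.

```python
import bleach

untrusted = 'an <script>evil()</script> <b>example</b>'

# Default behaviour: disallowed tags are escaped, allowed tags pass through.
escaped = bleach.clean(untrusted)

# strip=True drops the disallowed tags themselves but keeps their text content.
stripped = bleach.clean('<span>is not allowed</span>', strip=True)

# A narrower allowlist: only <b>, and no attributes on any tag.
bold_only = bleach.clean(
    '<b title="x">bold</b> <i>italic</i>',
    tags={'b'},
    attributes={},
    strip=True,
)

print(escaped)
print(stripped)
print(bold_only)
```

Escaping is the safer default when the markup itself should stay visible to the reader; stripping suits plain-text contexts such as previews.
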
### URL Linkification

Converts URL-like strings in HTML fragments to clickable links while preserving existing links and structure.

```python { .api }
def linkify(
    text: str,
    callbacks: list = DEFAULT_CALLBACKS,
    skip_tags: set = None,
    parse_email: bool = False
) -> str:
    """
    Convert URL-like strings in an HTML fragment to links.

    Parameters:
    - text: the text to linkify
    - callbacks: list of callbacks to run when adjusting tag attributes
    - skip_tags: set of tags to skip linkifying contents of
    - parse_email: whether to linkify email addresses

    Returns:
    Linkified text as a string
    """
```

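Each `linkify` parameter changes one aspect of the output. A short sketch of the three options, assuming bleach 6.x; the addresses are placeholders.

```python
import bleach

# Email addresses are only linkified when parse_email=True.
with_email = bleach.linkify('write to user@example.com', parse_email=True)

# Text inside skipped tags is left alone.
skipped = bleach.linkify('<pre>http://example.com</pre>', skip_tags={'pre'})

# An empty callbacks list drops the default nofollow callback.
plain = bleach.linkify('see http://example.com', callbacks=[])

print(with_email)
print(skipped)
print(plain)
```
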
### Advanced HTML Cleaning

Configurable HTML cleaner for repeated use with consistent settings.

```python { .api }
class Cleaner:
    """
    Cleaner for cleaning HTML fragments of malicious content.
    Not thread-safe - create separate instances per thread.
    """

    def __init__(
        self,
        tags: frozenset = ALLOWED_TAGS,
        attributes: dict = ALLOWED_ATTRIBUTES,
        protocols: frozenset = ALLOWED_PROTOCOLS,
        strip: bool = False,
        strip_comments: bool = True,
        filters: list = None,
        css_sanitizer: CSSSanitizer = None
    ):
        """
        Initialize a Cleaner instance.

        Parameters:
        - tags: set of allowed tags
        - attributes: allowed attributes configuration
        - protocols: allowed protocols for links
        - strip: whether to strip disallowed elements
        - strip_comments: whether to strip HTML comments
        - filters: list of additional html5lib filters
        - css_sanitizer: CSS sanitizer instance
        """

    def clean(self, text: str) -> str:
        """
        Clean the specified HTML text.

        Parameters:
        - text: HTML text to clean

        Returns:
        Cleaned HTML text
        """
```

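Because a `Cleaner` is not thread-safe, one way to reuse instances in a threaded application is to keep one per thread. The sketch below uses `threading.local` for that; `get_cleaner` and the allowlist are hypothetical names chosen for this example, not part of bleach.

```python
import threading

from bleach.sanitizer import Cleaner

_local = threading.local()


def get_cleaner() -> Cleaner:
    """Return this thread's Cleaner, creating it on first use (hypothetical helper)."""
    cleaner = getattr(_local, "cleaner", None)
    if cleaner is None:
        cleaner = Cleaner(tags={"a", "b", "i"}, strip=True)
        _local.cleaner = cleaner
    return cleaner


comments = ['<b>bold</b><span> text</span>', '<i>ok</i>']
cleaned = [get_cleaner().clean(comment) for comment in comments]
print(cleaned)
```

Within a single thread every call returns the same instance, so the parser setup cost is paid once per thread rather than once per string.
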
### Advanced URL Linkification

Configurable URL linkifier for repeated use with consistent settings.

```python { .api }
class Linker:
    """
    Convert URL-like strings in HTML fragments to links with configuration.
    """

    def __init__(
        self,
        callbacks: list = DEFAULT_CALLBACKS,
        skip_tags: set = None,
        parse_email: bool = False,
        url_re = URL_RE,
        email_re = EMAIL_RE,
        recognized_tags = html5lib_shim.HTML_TAGS
    ):
        """
        Create a Linker instance.

        Parameters:
        - callbacks: list of callbacks for adjusting tag attributes
        - skip_tags: set of tags to skip linkifying contents of
        - parse_email: whether to linkify email addresses
        - url_re: custom URL matching regex
        - email_re: custom email matching regex
        - recognized_tags: set of recognized HTML tags
        """

    def linkify(self, text: str) -> str:
        """
        Linkify the specified text.

        Parameters:
        - text: text to linkify

        Returns:
        Linkified text

        Raises:
        TypeError: if text is not a string type
        """
```

### Advanced Linkification Filter

HTML filter for linkifying during html5lib parsing, commonly used with Cleaner filters.

```python { .api }
class LinkifyFilter(html5lib_shim.Filter):
    """
    HTML filter that linkifies text during html5lib parsing.
    Can be used with Cleaner filters for combined cleaning and linkification.
    """

    def __init__(
        self,
        source,
        callbacks: list = DEFAULT_CALLBACKS,
        skip_tags: set = None,
        parse_email: bool = False,
        url_re = URL_RE,
        email_re = EMAIL_RE
    ):
        """
        Create a LinkifyFilter instance.

        Parameters:
        - source: html5lib TreeWalker stream
        - callbacks: list of callbacks for adjusting tag attributes
        - skip_tags: set of tags to skip linkifying contents of
        - parse_email: whether to linkify email addresses
        - url_re: custom URL matching regex
        - email_re: custom email matching regex
        """
```

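`Cleaner` calls each entry in `filters` with the token stream as `source`, so the filter is passed as a class (or any callable), not as an instance. To set the other constructor arguments, `functools.partial` is a convenient wrapper. A minimal sketch, assuming bleach 6.x:

```python
from functools import partial

from bleach.linkifier import LinkifyFilter
from bleach.sanitizer import Cleaner

html = '<pre>http://example.com</pre>'

# Pass the class itself; Cleaner supplies `source` when it builds the pipeline.
linkifying = Cleaner(tags={'pre'}, filters=[LinkifyFilter])
linked = linkifying.clean(html)

# partial() pre-binds keyword arguments such as skip_tags.
skipping = Cleaner(tags={'pre'}, filters=[partial(LinkifyFilter, skip_tags={'pre'})])
untouched = skipping.clean(html)

print(linked)
print(untouched)
```
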
### HTML Sanitization Filter

HTML filter for sanitizing content during html5lib parsing, commonly used with other filters.

```python { .api }
class BleachSanitizerFilter(html5lib_shim.SanitizerFilter):
    """
    HTML filter that sanitizes HTML during html5lib parsing.
    Can be used with other html5lib filters for custom processing.
    """

    def __init__(
        self,
        source,
        allowed_tags: frozenset = ALLOWED_TAGS,
        attributes = ALLOWED_ATTRIBUTES,
        allowed_protocols: frozenset = ALLOWED_PROTOCOLS,
        attr_val_is_uri = html5lib_shim.attr_val_is_uri,
        svg_attr_val_allows_ref = html5lib_shim.svg_attr_val_allows_ref,
        svg_allow_local_href = html5lib_shim.svg_allow_local_href,
        strip_disallowed_tags: bool = False,
        strip_html_comments: bool = True,
        css_sanitizer: CSSSanitizer = None
    ):
        """
        Create a BleachSanitizerFilter instance.

        Parameters:
        - source: html5lib TreeWalker stream
        - allowed_tags: set of allowed tags
        - attributes: allowed attributes configuration
        - allowed_protocols: allowed protocols for links
        - attr_val_is_uri: set of attributes that have URI values
        - svg_attr_val_allows_ref: set of SVG attributes that can have references
        - svg_allow_local_href: set of SVG elements that can have local hrefs
        - strip_disallowed_tags: whether to strip disallowed tags
        - strip_html_comments: whether to strip HTML comments
        - css_sanitizer: CSS sanitizer instance
        """
```

### CSS Sanitization

Sanitizes CSS declarations in `style` attribute values. Requires the optional tinycss2 dependency (`pip install bleach[css]`).

```python { .api }
class CSSSanitizer:
    """
    CSS sanitizer for cleaning style attributes and style text.
    """

    def __init__(
        self,
        allowed_css_properties: frozenset = ALLOWED_CSS_PROPERTIES,
        allowed_svg_properties: frozenset = ALLOWED_SVG_PROPERTIES
    ):
        """
        Initialize CSS sanitizer.

        Parameters:
        - allowed_css_properties: set of allowed CSS properties
        - allowed_svg_properties: set of allowed SVG properties
        """

    def sanitize_css(self, style: str) -> str:
        """
        Sanitize CSS declarations.

        Parameters:
        - style: CSS declarations string

        Returns:
        Sanitized CSS string
        """
```

### Linkification Callbacks

Callback functions for customizing link attributes during linkification.

```python { .api }
def nofollow(attrs: dict, new: bool = False) -> dict:
    """
    Add rel="nofollow" to links (except mailto links).

    Parameters:
    - attrs: link attributes dictionary
    - new: whether this is a new link

    Returns:
    Modified attributes dictionary
    """

def target_blank(attrs: dict, new: bool = False) -> dict:
    """
    Add target="_blank" to links (except mailto links).

    Parameters:
    - attrs: link attributes dictionary
    - new: whether this is a new link

    Returns:
    Modified attributes dictionary
    """
```

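Custom callbacks follow the same shape as the built-in ones: they receive the attributes dictionary, whose keys are `(namespace, name)` tuples, and return it after modification. The sketch below adds a `title` to every link; `set_title` is a hypothetical callback written for this example.

```python
from bleach.callbacks import nofollow
from bleach.linkifier import Linker


def set_title(attrs, new=False):
    """Hypothetical callback: attach a fixed title attribute to each link."""
    attrs[(None, "title")] = "link in user text"
    return attrs


# Callbacks run in order, each receiving the attrs returned by the previous one.
linker = Linker(callbacks=[nofollow, set_title])
linked = linker.linkify("abc http://example.com def")
print(linked)
```
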
### Attribute Filter Factory

Utility function for creating attribute filter functions from various attribute configurations.

```python { .api }
def attribute_filter_factory(attributes) -> callable:
    """
    Generate attribute filter function for the given attributes configuration.

    The attributes value can be a callable, dict, or list. This returns a filter
    function appropriate to the attributes value.

    Parameters:
    - attributes: attribute configuration (callable, dict, or list)

    Returns:
    Filter function that takes (tag, attr, value) and returns bool

    Raises:
    ValueError: if attributes is not a callable, list, or dict
    """
```

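The three accepted configuration shapes behave differently: a list applies to every tag, a dict is keyed by tag name (with `"*"` as a wildcard), and a callable decides for itself. A small sketch, assuming bleach 6.x; `allow_h` is a hypothetical filter that keeps attributes whose names start with "h".

```python
import bleach
from bleach.sanitizer import attribute_filter_factory

# Dict form: per-tag lists, plus "*" for attributes allowed on any tag.
by_tag = attribute_filter_factory({"a": ["href"], "*": ["class"]})

# List form: the same attribute names are allowed on every tag.
any_tag = attribute_filter_factory(["title"])


def allow_h(tag, name, value):
    """Hypothetical callable filter: keep attributes starting with 'h'."""
    return name.startswith("h")


cleaned = bleach.clean(
    '<a href="http://example.com" title="link">link</a>',
    tags={"a"},
    attributes=allow_h,
)

print(by_tag("a", "href", "x"), by_tag("p", "href", "x"))
print(any_tag("p", "title", "x"))
print(cleaned)
```
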
### URL and Email Pattern Building

Functions for creating custom URL and email matching patterns.

```python { .api }
def build_url_re(
    tlds: list = TLDS,
    protocols = html5lib_shim.allowed_protocols
) -> re.Pattern:
    """
    Build URL regex with custom TLDs and protocols.

    Parameters:
    - tlds: list of top-level domains
    - protocols: set of allowed protocols

    Returns:
    Compiled regex pattern for URL matching
    """

def build_email_re(tlds: list = TLDS) -> re.Pattern:
    """
    Build email regex with custom TLDs.

    Parameters:
    - tlds: list of top-level domains

    Returns:
    Compiled regex pattern for email matching
    """
```

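A custom pattern is passed to `Linker` through `url_re`. The sketch below restricts linkification to a single TLD; the `fish` TLD and the example hosts are placeholders.

```python
from bleach.linkifier import Linker, build_url_re

# Only URLs whose host ends in .fish are treated as links.
only_fish_url_re = build_url_re(tlds=["fish"])
linker = Linker(url_re=only_fish_url_re)

not_linked = linker.linkify("com TLD does not link https://example.com")
linked = linker.linkify("fish TLD links https://example.fish")

print(not_linked)
print(linked)
```
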
## Constants

### Default Sanitization Settings

```python { .api }
# Default allowed HTML tags
ALLOWED_TAGS: frozenset = frozenset((
    "a", "abbr", "acronym", "b", "blockquote", "code",
    "em", "i", "li", "ol", "strong", "ul"
))

# Default allowed attributes by tag
ALLOWED_ATTRIBUTES: dict = {
    "a": ["href", "title"],
    "abbr": ["title"],
    "acronym": ["title"]
}

# Default allowed protocols for links
ALLOWED_PROTOCOLS: frozenset = frozenset(("http", "https", "mailto"))

# Invisible character handling (requires: from itertools import chain)
INVISIBLE_CHARACTERS: str = "".join([chr(c) for c in chain(range(0, 9), range(11, 13), range(14, 32))])
INVISIBLE_CHARACTERS_RE: re.Pattern = re.compile("[" + INVISIBLE_CHARACTERS + "]", re.UNICODE)
INVISIBLE_REPLACEMENT_CHAR: str = "?"
```

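The defaults are usually extended rather than replaced. One possible approach, sketched here under the assumption of bleach 6.x, is to build new collections from the constants so the module-level defaults stay untouched; the extra tags and the `rel` attribute are arbitrary choices for the example.

```python
import bleach
from bleach.sanitizer import ALLOWED_ATTRIBUTES, ALLOWED_TAGS

# Build new collections instead of mutating the module-level defaults.
tags = set(ALLOWED_TAGS) | {"p", "br"}
attributes = {**ALLOWED_ATTRIBUTES, "a": ["href", "title", "rel"]}

html = '<p>Hi <b>there</b>, <a href="https://example.com" rel="me">me</a></p>'
cleaned = bleach.clean(html, tags=tags, attributes=attributes)
print(cleaned)
```
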
### Default Linkification Settings

```python { .api }
# Default linkification callbacks
DEFAULT_CALLBACKS: list = [nofollow]

# Top-level domains for URL detection
TLDS: list = [
"ac", "ad", "ae", "aero", "af", "ag", "ai", "al", "am", "an", "ao", "aq", "ar", "arpa", "as", "asia", "at", "au", "aw", "ax", "az",
"ba", "bb", "bd", "be", "bf", "bg", "bh", "bi", "biz", "bj", "bm", "bn", "bo", "br", "bs", "bt", "bv", "bw", "by", "bz",
"ca", "cat", "cc", "cd", "cf", "cg", "ch", "ci", "ck", "cl", "cm", "cn", "co", "com", "coop", "cr", "cu", "cv", "cx", "cy", "cz",
"de", "dj", "dk", "dm", "do", "dz", "ec", "edu", "ee", "eg", "er", "es", "et", "eu", "fi", "fj", "fk", "fm", "fo", "fr",
"ga", "gb", "gd", "ge", "gf", "gg", "gh", "gi", "gl", "gm", "gn", "gov", "gp", "gq", "gr", "gs", "gt", "gu", "gw", "gy",
"hk", "hm", "hn", "hr", "ht", "hu", "id", "ie", "il", "im", "in", "info", "int", "io", "iq", "ir", "is", "it",
"je", "jm", "jo", "jobs", "jp", "ke", "kg", "kh", "ki", "km", "kn", "kp", "kr", "kw", "ky", "kz",
"la", "lb", "lc", "li", "lk", "lr", "ls", "lt", "lu", "lv", "ly", "ma", "mc", "md", "me", "mg", "mh", "mil", "mk", "ml", "mm", "mn", "mo", "mobi", "mp", "mq", "mr", "ms", "mt", "mu", "museum", "mv", "mw", "mx", "my", "mz",
"na", "name", "nc", "ne", "net", "nf", "ng", "ni", "nl", "no", "np", "nr", "nu", "nz", "om", "org",
"pa", "pe", "pf", "pg", "ph", "pk", "pl", "pm", "pn", "post", "pr", "pro", "ps", "pt", "pw", "py",
"qa", "re", "ro", "rs", "ru", "rw", "sa", "sb", "sc", "sd", "se", "sg", "sh", "si", "sj", "sk", "sl", "sm", "sn", "so", "sr", "ss", "st", "su", "sv", "sx", "sy", "sz",
"tc", "td", "tel", "tf", "tg", "th", "tj", "tk", "tl", "tm", "tn", "to", "tp", "tr", "travel", "tt", "tv", "tw", "tz",
"ua", "ug", "uk", "us", "uy", "uz", "va", "vc", "ve", "vg", "vi", "vn", "vu", "wf", "ws", "xn", "xxx", "ye", "yt", "yu", "za", "zm", "zw"
]

# Default URL matching regex
URL_RE: re.Pattern = build_url_re()

# Default email matching regex
EMAIL_RE: re.Pattern = build_email_re()

# Protocol matching regex for URL detection
PROTO_RE: re.Pattern = re.compile(r"^[\w-]+:/{0,3}", re.IGNORECASE)
```

### CSS Sanitization Settings

```python { .api }
# Allowed CSS properties
ALLOWED_CSS_PROPERTIES: frozenset = frozenset((
    "azimuth", "background-color", "border-bottom-color", "border-collapse",
    "border-color", "border-left-color", "border-right-color", "border-top-color",
    "clear", "color", "cursor", "direction", "display", "elevation", "float",
    "font", "font-family", "font-size", "font-style", "font-variant", "font-weight",
    "height", "letter-spacing", "line-height", "overflow", "pause", "pause-after",
    "pause-before", "pitch", "pitch-range", "richness", "speak", "speak-header",
    "speak-numeral", "speak-punctuation", "speech-rate", "stress", "text-align",
    "text-decoration", "text-indent", "unicode-bidi", "vertical-align",
    "voice-family", "volume", "white-space", "width"
))

# Allowed SVG properties
ALLOWED_SVG_PROPERTIES: frozenset = frozenset((
    "fill", "fill-opacity", "fill-rule", "stroke", "stroke-width",
    "stroke-linecap", "stroke-linejoin", "stroke-opacity"
))
```

### Package Version Information

```python { .api }
# Package version string
__version__: str = "6.2.0"

# Release date in YYYYMMDD format
__releasedate__: str = "20241029"
```

## Warning Classes

```python { .api }
class NoCssSanitizerWarning(UserWarning):
    """
    Warning issued when the style attribute is allowed but no CSS sanitizer is configured.
    """
```

## Usage Examples

### Custom Sanitization Rules

```python
from bleach.sanitizer import Cleaner

# Custom allowed tags and attributes
custom_tags = {'p', 'strong', 'em', 'a', 'img'}
custom_attributes = {
    'a': ['href', 'title'],
    'img': ['src', 'alt', 'width', 'height']
}

# Create a reusable cleaner
cleaner = Cleaner(
    tags=custom_tags,
    attributes=custom_attributes,
    strip=True  # Drop disallowed tags (their text is kept) instead of escaping them
)

# Placeholder inputs; in practice these come from users
untrusted_html1 = '<p>Hello <span>world</span></p>'
untrusted_html2 = '<img src="https://example.com/cat.png" onerror="evil()" alt="cat">'

# Clean multiple texts with the same rules
safe_text1 = cleaner.clean(untrusted_html1)
safe_text2 = cleaner.clean(untrusted_html2)
```

### CSS Sanitization

```python
import bleach
from bleach.css_sanitizer import ALLOWED_CSS_PROPERTIES, CSSSanitizer

# Create CSS sanitizer (requires tinycss2: pip install bleach[css])
css_sanitizer = CSSSanitizer(
    allowed_css_properties=ALLOWED_CSS_PROPERTIES
)

# Clean HTML with CSS sanitization
html_with_styles = '<p style="color: red; background: javascript:alert();">Text</p>'
safe_html = bleach.clean(
    html_with_styles,
    tags=['p'],
    attributes={'p': ['style']},
    css_sanitizer=css_sanitizer
)
# "background" is not in ALLOWED_CSS_PROPERTIES, so that declaration is dropped
# Result: '<p style="color: red;">Text</p>'
```

### Custom Linkification

```python
from bleach.linkifier import Linker
from bleach.callbacks import target_blank, nofollow

# Custom linkifier with multiple callbacks
linker = Linker(
    callbacks=[nofollow, target_blank],
    skip_tags={'pre', 'code'},  # Don't linkify in code blocks
    parse_email=True
)

text = 'Email me at user@example.com or visit https://example.org'
linked = linker.linkify(text)
# The https link gets both rel="nofollow" and target="_blank";
# both callbacks leave the mailto: link unchanged
```

### Combined Operations

```python
from bleach.sanitizer import Cleaner
from bleach.linkifier import LinkifyFilter

# Clean and linkify in a single pass using LinkifyFilter
cleaner = Cleaner(
    tags=['p', 'a', 'strong'],
    attributes={'a': ['href', 'rel', 'target']},
    filters=[LinkifyFilter]  # Pass the class; Cleaner instantiates it with the token stream
)

unsafe_text = '<script>alert("xss")</script><p>Visit https://example.com</p>'
result = cleaner.clean(unsafe_text)
```