Python Bidi layout wrapping the Rust crate unicode-bidi
# Python Bidi

Python BiDi provides bi-directional (BiDi) text layout support for Python applications, enabling correct display of mixed left-to-right and right-to-left text (such as Arabic or Hebrew mixed with English). The library offers two implementations: a high-performance Rust-based implementation (default) and a pure Python implementation for compatibility.

## Architecture

Python-bidi uses a dual-implementation approach to provide both performance and compatibility:

- **Rust Implementation (Default)**: High-performance implementation using the `unicode-bidi` Rust crate, compiled as a Python extension module (`.bidi`). Implements a more recent version of the Unicode BiDi algorithm.
- **Pure Python Implementation**: Compatible fallback implementation in pure Python, implementing Unicode BiDi algorithm version 5. Provides additional debugging features and internal API access.
- **Unified API**: Both implementations expose the same primary functions (`get_display`, `get_base_level`) with identical behavior for standard use cases.
- **Selection by Import Path**: The default import (`from bidi import`) uses the Rust implementation, while the Python implementation must be requested explicitly via `from bidi.algorithm import`.

## Package Information

- **Package Name**: python-bidi
- **Language**: Python
- **Installation**: `pip install python-bidi`

## Core Imports

Main API (Rust-based implementation):

```python
from bidi import get_display, get_base_level
```

Pure Python implementation:

```python
from bidi.algorithm import get_display, get_base_level
```

## Basic Usage

```python
from bidi import get_display

# Hebrew text example
hebrew_text = "שלום"
display_text = get_display(hebrew_text)
print(display_text)  # Outputs correctly ordered text for display

# Mixed text with numbers
mixed_text = "1 2 3 ניסיון"
display_text = get_display(mixed_text)
print(display_text)  # "ןויסינ 3 2 1"

# Working with bytes and encoding
hebrew_bytes = "שלם".encode('utf-8')
display_bytes = get_display(hebrew_bytes, encoding='utf-8')
print(display_bytes.decode('utf-8'))

# Override base direction
text = "hello world"
rtl_display = get_display(text, base_dir='R')
print(rtl_display)

# Debug mode: the return value is still the display text;
# the Rust implementation also prints a debug representation
# of its internal BidiInfo structure
display_text = get_display("hello שלום", debug=True)
```

## Capabilities

### Text Layout Processing

Converts logical text order to visual display order according to the Unicode BiDi algorithm.

```python { .api }
def get_display(
    str_or_bytes: StrOrBytes,
    encoding: str = "utf-8",
    base_dir: Optional[str] = None,
    debug: bool = False
) -> StrOrBytes:
    """
    Convert text from logical order to visual display order.

    Args:
        str_or_bytes: Input text as string or bytes
        encoding: Encoding to use if input is bytes (default: "utf-8")
        base_dir: Override base direction ('L' for LTR, 'R' for RTL)
        debug: Enable debug output to stderr (default: False)

    Returns:
        Processed text in same type as input (str or bytes)
    """
```

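The "same type as input" contract above can be illustrated with a small, standard-library-only sketch. The helper name `apply_preserving_type` and the stand-in `reverse_text` transform are hypothetical and exist only for this illustration; this is not the library's code, and `reverse_text` is not a BiDi reordering.

```python
from typing import Callable, Union

StrOrBytes = Union[str, bytes]


def apply_preserving_type(
    transform: Callable[[str], str],
    str_or_bytes: StrOrBytes,
    encoding: str = "utf-8",
) -> StrOrBytes:
    """Run a str -> str transform and return the caller's input type."""
    if isinstance(str_or_bytes, bytes):
        # Decode with the caller's encoding, transform, then re-encode.
        text = str_or_bytes.decode(encoding)
        return transform(text).encode(encoding)
    return transform(str_or_bytes)


def reverse_text(text: str) -> str:
    """Stand-in transform for demonstration only (not BiDi reordering)."""
    return text[::-1]


print(apply_preserving_type(reverse_text, "abc"))   # str in, str out
print(apply_preserving_type(reverse_text, b"abc"))  # bytes in, bytes out
```

A wrapper shaped like this keeps the reordering logic working on `str` only, while callers holding encoded bytes do not need to decode and re-encode themselves.
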
### Base Direction Detection

Determines the base paragraph direction of text.

```python { .api }
def get_base_level(text: str) -> int:
    """
    Get the base embedding level of the first paragraph in text.

    Args:
        text: Input text string

    Returns:
        Base level (0 for LTR, 1 for RTL)
    """
```

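To make the 0/1 result more concrete, here is a simplified sketch of the "first strong character" idea behind base level detection, using only the standard `unicodedata` module. The function name `first_strong_level` is hypothetical. The sketch ignores directional isolates and defaults to 0 when no strong character is found, so it is an approximation and not the implementation used by either backend.

```python
import unicodedata


def first_strong_level(text: str) -> int:
    """Return 0 (LTR) or 1 (RTL) from the first strong character.

    Simplified: isolates are not skipped, and text without any strong
    character falls back to 0.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "B":
            # Paragraph separator: only the first paragraph is inspected.
            break
        if bidi_class == "L":
            return 0
        if bidi_class in ("R", "AL"):
            return 1
    return 0


print(first_strong_level("hello שלום"))  # first strong character is Latin
print(first_strong_level("123 שלום"))    # digits are not strong; Hebrew decides
```
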
### Pure Python Implementation

For compatibility or when Rust implementation is not available, use the pure Python implementation.

```python { .api }
# From bidi.algorithm module
def get_display(
    str_or_bytes: StrOrBytes,
    encoding: str = "utf-8",
    upper_is_rtl: bool = False,
    base_dir: Optional[str] = None,
    debug: bool = False
) -> StrOrBytes:
    """
    Pure Python implementation of BiDi text layout.

    Args:
        str_or_bytes: Input text as string or bytes
        encoding: Encoding to use if input is bytes (default: "utf-8")
        upper_is_rtl: Treat uppercase chars as strong RTL for debugging (default: False)
        base_dir: Override base direction ('L' for LTR, 'R' for RTL)
        debug: Enable debug output to stderr (default: False)

    Returns:
        Processed text in same type as input (str or bytes)
    """

def get_base_level(text, upper_is_rtl: bool = False) -> int:
    """
    Get base embedding level using Python implementation.

    Args:
        text: Input text string
        upper_is_rtl: Treat uppercase chars as strong RTL for debugging (default: False)

    Returns:
        Base level (0 for LTR, 1 for RTL)
    """
```

### Internal Algorithm Functions

For advanced usage, the Python implementation exposes internal algorithm functions.

```python { .api }
def get_empty_storage() -> dict:
    """
    Return empty storage skeleton for testing and advanced usage.

    Returns:
        Dictionary with keys: base_level, base_dir, chars, runs
    """

def get_embedding_levels(text, storage, upper_is_rtl: bool = False, debug: bool = False):
    """
    Get paragraph embedding levels and populate storage with character data.

    Args:
        text: Input text string
        storage: Storage dictionary from get_empty_storage()
        upper_is_rtl: Treat uppercase chars as strong RTL (default: False)
        debug: Enable debug output (default: False)
    """

def debug_storage(storage, base_info: bool = False, chars: bool = True, runs: bool = False):
    """
    Display debug information for storage object.

    Args:
        storage: Storage dictionary
        base_info: Show base level and direction info (default: False)
        chars: Show character data (default: True)
        runs: Show level runs (default: False)
    """
```

### Mirror Character Mappings

Access to Unicode character mirroring data.

```python { .api }
from bidi.mirror import MIRRORED

# MIRRORED is a dictionary mapping characters to their mirrored versions
# Example: MIRRORED['('] == ')'
```

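As a rough illustration of what a mirroring table is for, the sketch below swaps a few bracket characters using a small hand-written mapping and checks each key against the standard library's `unicodedata.mirrored` flag. `BRACKET_PAIRS` and `mirror_text` are hypothetical names for this example; the mapping is a tiny subset written by hand, not the contents of `MIRRORED`.

```python
import unicodedata

# Hypothetical hand-written subset, for illustration only.
BRACKET_PAIRS = {
    "(": ")", ")": "(",
    "[": "]", "]": "[",
    "{": "}", "}": "{",
    "<": ">", ">": "<",
}


def mirror_text(text: str) -> str:
    """Replace each character that has a mirrored counterpart in the table."""
    return "".join(BRACKET_PAIRS.get(ch, ch) for ch in text)


# Every key in the table carries the Unicode Bidi_Mirrored property.
assert all(unicodedata.mirrored(ch) == 1 for ch in BRACKET_PAIRS)

print(mirror_text("(a[b]c)"))  # prints ")a]b[c("
```

In a BiDi layout pass, this kind of substitution is applied to characters that end up in a right-to-left run, so that an opening bracket still visually opens toward the text it encloses.
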
### Command Line Interface

Use the `pybidi` command for text processing from the command line.

```bash
# Basic usage
pybidi "your text here"

# Read from stdin
echo "your text here" | pybidi

# Use the Rust implementation (the CLI defaults to the Python one)
pybidi -r "your text here"

# Override base direction
pybidi -b R "your text here"

# Enable debug output
pybidi -d "your text here"

# Specify encoding
pybidi -e utf-8 "your text here"

# For Python implementation, treat uppercase as RTL (debugging)
pybidi -u "Your Text HERE"
```

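The option set above can be summarized as an `argparse` sketch. Only the short flags come from the examples; the parser name, the `dest` names, and the defaults are assumptions made for illustration, and this is not the code behind the real `pybidi` script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the short flags shown above."""
    parser = argparse.ArgumentParser(prog="pybidi-sketch")
    parser.add_argument("-e", dest="encoding", default="utf-8",
                        help="text encoding")
    parser.add_argument("-u", dest="upper_is_rtl", action="store_true",
                        help="treat uppercase as RTL (Python implementation)")
    parser.add_argument("-d", dest="debug", action="store_true",
                        help="enable debug output")
    parser.add_argument("-b", dest="base_dir", choices=["L", "R"], default=None,
                        help="override base direction")
    parser.add_argument("-r", dest="use_rust", action="store_true",
                        help="use the Rust implementation")
    parser.add_argument("text", nargs="*",
                        help="text to process; empty means read stdin")
    return parser


args = build_parser().parse_args(["-b", "R", "-r", "your text here"])
print(args.base_dir, args.use_rust, args.text)
```
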
## Version Information

Access version information for the package:

```python { .api }
from bidi import VERSION, VERSION_TUPLE

# VERSION is a string like "0.6.0"
# VERSION_TUPLE is a tuple like (0, 6, 0)
```

## Main Function API

The package provides a main function for command-line usage:

```python { .api }
from bidi import main

def main():
    """
    Command-line interface function for pybidi.

    Processes command line arguments and applies BiDi algorithm to input text.
    Used by the pybidi console script. Reads from arguments or stdin,
    supports all CLI options (encoding, base direction, debug, etc.).

    Returns:
        None (outputs processed text to stdout)
    """
```

## Types

```python { .api }
from typing import Union, Optional, List, Dict, Any
from collections import deque

# Type aliases used in the API
StrOrBytes = Union[str, bytes]

# Storage structure (Python implementation)
Storage = Dict[str, Any]  # Contains:
# {
#     "base_level": int,      # Base embedding level (0 for LTR, 1 for RTL)
#     "base_dir": str,        # Base direction ('L' or 'R')
#     "chars": List[Dict],    # Character data with level, type, original type
#     "runs": deque           # Level runs for processing
# }

# Character object structure (within Storage["chars"])
Character = Dict[str, Union[str, int]]  # Contains:
# {
#     "ch": str,      # The character
#     "level": int,   # Embedding level
#     "type": str,    # BiDi character type
#     "orig": str     # Original BiDi character type
# }
```

## Implementation Differences

### Rust Implementation (Default)
- Higher performance
- Implements more recent Unicode BiDi algorithm
- Access via `from bidi import get_display, get_base_level` (uses compiled `.bidi` module)
- Does NOT support `upper_is_rtl` parameter
- Debug output: Formatted debug representation of internal BidiInfo structure
- Limited to main API functions

### Python Implementation
- Pure Python compatibility
- Implements Unicode BiDi algorithm v5
- Access via `from bidi.algorithm import get_display, get_base_level`
- Supports `upper_is_rtl` parameter for debugging
- Exposes internal algorithm functions for advanced usage
- Debug output: Detailed step-by-step algorithm information to stderr
- Suitable for educational purposes or when Rust implementation unavailable

## Error Handling

Both implementations handle common error cases gracefully:

### Common Error Conditions:
- **Invalid encodings**: Raise standard Python `UnicodeDecodeError` or `UnicodeEncodeError`
- **Empty or None text inputs**: Handled safely; the call either returns an empty string or raises `ValueError`
- **Invalid `base_dir` values**: Rust implementation raises `ValueError` for values other than 'L', 'R', or None
- **Malformed Unicode text**: Processed according to Unicode BiDi algorithm specifications

### Rust Implementation Specific:
- **Empty paragraphs**: `get_base_level_inner()` raises `ValueError` for text with no paragraphs
- **Invalid base_dir**: Raises `ValueError` with message "base_dir can be 'L', 'R' or None"

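Callers that want to fail early, before any text is processed, could mirror the `base_dir` rule on their own side. The helper `check_base_dir` below is hypothetical and not part of the package; it simply reuses the accepted values and the error message quoted above.

```python
from typing import Optional


def check_base_dir(base_dir: Optional[str]) -> Optional[str]:
    """Return base_dir unchanged if it is 'L', 'R' or None, else raise."""
    if base_dir not in ("L", "R", None):
        raise ValueError("base_dir can be 'L', 'R' or None")
    return base_dir


print(check_base_dir("R"))   # accepted
print(check_base_dir(None))  # accepted: let the algorithm detect direction

try:
    check_base_dir("rtl")
except ValueError as exc:
    print(f"rejected: {exc}")
```
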
### Python Implementation Specific:
- **Assertion errors**: Internal algorithm functions may raise `AssertionError` for invalid character types
- **Debug mode**: Outputs debugging information to `sys.stderr`, does not raise exceptions

### Encoding Support:
Supports any encoding that Python's `str.encode()` and `bytes.decode()` support, including:
- UTF-8 (default)
- UTF-16, UTF-32
- ASCII, Latin-1
- Windows code pages (cp1252, cp1255 for Hebrew)
- ISO encodings (iso-8859-1, iso-8859-8 for Hebrew)

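Since encoding support rests on Python's own codecs, the behavior can be checked with the standard library alone. The sketch below round-trips a Hebrew string through several of the encodings listed above and shows the two standard exceptions mentioned under error handling; it does not call the package itself.

```python
text = "שלום"

# Encodings that can represent Hebrew letters round-trip cleanly.
for encoding in ("utf-8", "utf-16", "utf-32", "cp1255", "iso-8859-8"):
    assert text.encode(encoding).decode(encoding) == text

# Single-byte Hebrew code pages use one byte per letter; UTF-8 uses two.
print(len(text.encode("cp1255")), len(text.encode("utf-8")))

# Encodings without Hebrew letters fail with the standard exceptions.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    print(f"encode failed: {exc.reason}")

try:
    text.encode("utf-8").decode("ascii")
except UnicodeDecodeError as exc:
    print(f"decode failed: {exc.reason}")
```

When passing bytes to `get_display`, the `encoding` argument has to name the encoding the bytes were actually produced with, otherwise the decode step is where the error will surface.
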
## Usage Examples

### Processing Mixed Language Text

```python
from bidi import get_display

# English with Hebrew
text = "Hello שלום World"
display = get_display(text)
print(display)  # Correctly ordered for display

# Numbers with RTL text
text = "הספר עולה 25 שקל"
display = get_display(text)
print(display)  # Numbers maintain LTR order within RTL text
```

### Working with Different Encodings

```python
from bidi import get_display

# Hebrew text in different encoding
hebrew_cp1255 = "שלום".encode('cp1255')
display = get_display(hebrew_cp1255, encoding='cp1255')
print(display.decode('cp1255'))
```

### Debugging Text Processing

```python
from bidi.algorithm import get_display, get_base_level, debug_storage, get_empty_storage, get_embedding_levels

# Enable debug output
text = "Hello שלום"
display = get_display(text, debug=True)
# Outputs detailed algorithm steps to stderr

# Manual debugging with storage
storage = get_empty_storage()
# Assumption: the empty skeleton carries no base level or direction yet,
# so seed both before computing embedding levels.
storage["base_level"] = get_base_level(text)
storage["base_dir"] = "LR"[storage["base_level"]]
get_embedding_levels(text, storage)
debug_storage(storage, base_info=True, chars=True, runs=True)
```