Cython hash table that trusts the keys are pre-hashed
npx @tessl/cli install tessl/pypi-preshed@4.0.00
# Preshed
1
2
A high-performance Cython-based hash table library for Python that assumes keys come pre-hashed. Inspired by Jeff Preshing's hash table design, preshed provides efficient data structures optimized for use cases where keys are already randomized, such as natural language processing pipelines, caching systems, and other performance-critical applications.
3
4
## Package Information
5
6
- **Package Name**: preshed
7
- **Language**: Python (with Cython extensions for performance)
8
- **Dependencies**: `cymem>=2.0.2,<2.1.0`, `murmurhash>=0.28.0,<1.1.0`
9
- **Installation**: `pip install preshed`
10
11
## Core Imports
12
13
```python
14
from preshed.maps import PreshMap
15
from preshed.counter import PreshCounter
16
from preshed.bloom import BloomFilter, calculate_size_and_hash_count
17
```
18
19
Package metadata:
20
21
```python
22
from preshed.about import __version__, __author__, __license__
23
print(__version__) # "4.0.0"
24
print(__author__) # "Explosion"
25
print(__license__) # "MIT"
26
27
# Or access via main module
28
import preshed
29
print(preshed.__version__) # "4.0.0"
30
```
31
32
## Basic Usage
33
34
```python
35
from preshed.maps import PreshMap
36
from preshed.counter import PreshCounter
37
from preshed.bloom import BloomFilter
38
39
# Hash map for pre-hashed keys
40
hash_map = PreshMap()
41
hash_map[12345] = 67890
42
print(hash_map[12345]) # 67890
43
44
# Counter with frequency tracking
45
counter = PreshCounter()
46
counter.inc(42, 5) # Increment key 42 by 5
47
print(counter[42]) # 5
48
print(counter.total) # 5
49
50
# Bloom filter for membership testing
51
bloom = BloomFilter(size=1024, hash_funcs=3)
52
bloom.add(12345)
53
print(12345 in bloom) # True
54
print(54321 in bloom) # False (probably)
55
```
56
57
## Architecture
58
59
Preshed provides three main high-performance data structures:
60
61
- **PreshMap**: Hash table with open addressing and linear probing, optimized for pre-hashed 64-bit keys
62
- **PreshCounter**: Frequency counter with optional Good-Turing smoothing for probability estimation
63
- **BloomFilter**: Space-efficient probabilistic membership testing with configurable error rates
64
65
All data structures are implemented in Cython for maximum performance and designed to work with pre-hashed keys to minimize hash computation overhead. The library compiles to native C extensions, providing near-C performance while maintaining a Python interface.
66
67
## Capabilities
68
69
### Hash Maps
70
71
High-performance hash table that maps pre-hashed uint64_t keys to uint64_t values using open addressing with linear probing. Provides dictionary-like interface with O(1) average-case performance.
72
73
```python { .api }
74
class PreshMap:
75
def __init__(self, initial_size: int = 8): ...
76
def __getitem__(self, key: int) -> int | None: ... # Returns None if key not found
77
def __setitem__(self, key: int, value: int) -> None: ...
78
def __delitem__(self, key: int) -> None: ... # Raises KeyError if key not found
79
def __len__(self) -> int: ...
80
def __contains__(self, key: int) -> bool: ...
81
def pop(self, key: int, default=None) -> int | None: ...
82
def items(self) -> Iterator[tuple[int, int]]: ...
83
def keys(self) -> Iterator[int]: ...
84
def values(self) -> Iterator[int]: ...
85
@property
86
def capacity(self) -> int: ...
87
```
88
89
[Hash Maps](./hash-maps.md)
90
91
### Frequency Counters
92
93
Efficient frequency counting for uint64-valued keys with optional Good-Turing smoothing for probability estimation. Supports statistical operations and probability calculations.
94
95
```python { .api }
96
class PreshCounter:
97
def __init__(self, initial_size: int = 8): ...
98
def __getitem__(self, key: int) -> int: ...
99
def __len__(self) -> int: ...
100
def __iter__(self) -> Iterator[tuple[int, int]]: ... # Yields (key, count) pairs
101
def inc(self, key: int, inc: int) -> int: ...
102
def prob(self, key: int) -> float: ...
103
def smooth(self) -> None: ...
104
@property
105
def total(self) -> int: ...
106
@property
107
def length(self) -> int: ...
108
```
109
110
[Frequency Counters](./frequency-counters.md)
111
112
### Bloom Filters
113
114
Space-efficient probabilistic data structure for membership testing. Provides configurable error rates and serialization capabilities with no false negatives but possible false positives.
115
116
```python { .api }
117
class BloomFilter:
118
def __init__(self, size: int = 1024, hash_funcs: int = 23, seed: int = 0): ...
119
@classmethod
120
def from_error_rate(cls, members: int, error_rate: float = 1e-4): ...
121
def add(self, item: int) -> None: ...
122
def __contains__(self, item: int) -> bool: ...
123
def to_bytes(self) -> bytes: ...
124
def from_bytes(self, byte_string: bytes): ...
125
126
def calculate_size_and_hash_count(members: int, error_rate: float) -> tuple[int, int]: ...
127
```
128
129
[Bloom Filters](./bloom-filters.md)
130
131
## Types
132
133
```python { .api }
134
# All keys are expected to be 64-bit unsigned integers (pre-hashed)
135
Key = int # uint64_t
136
Value = int # uint64_t for maps, int64_t for counters
137
Count = int # int64_t
138
```