# JSON Canonicalization

RFC 8785-compliant JSON canonicalization for consistent serialization, hashing, and digital signatures. The c14n module produces a deterministic string representation of JSON data.

## Capabilities

### JSON Canonicalization

Produces the canonical JSON string representation defined by RFC 8785 (JSON Canonicalization Scheme, JCS).

```python { .api }
def canonicalize(obj, utf8=True):
    """
    Canonicalizes a JSON object according to RFC 8785.

    Produces a deterministic string representation of JSON data by:
    - Sorting object keys lexicographically
    - Using minimal whitespace (no extra spaces)
    - Consistent number formatting
    - Proper Unicode escape sequences

    Args:
        obj: The JSON-serializable object to canonicalize (dict, list, str,
            int, float, bool, None)
        utf8 (bool): If True, return bytes encoded as UTF-8; if False,
            return Unicode string (default: True)

    Returns:
        bytes or str: Canonical JSON representation (bytes if utf8=True,
            str if utf8=False)

    Raises:
        TypeError: If obj contains non-JSON-serializable types
        ValueError: If obj contains circular references
    """
```

#### Examples

```python
from c14n import canonicalize

# Basic canonicalization
data = {"name": "Alice", "age": 30, "city": "New York"}
canonical_bytes = canonicalize(data)
print(canonical_bytes)  # b'{"age":30,"city":"New York","name":"Alice"}'

# Get string instead of bytes
canonical_str = canonicalize(data, utf8=False)
print(canonical_str)  # '{"age":30,"city":"New York","name":"Alice"}'

# Complex nested structure
complex_data = {
    "users": [
        {"id": 2, "name": "Bob"},
        {"id": 1, "name": "Alice"}
    ],
    "metadata": {
        "version": "1.0",
        "created": "2023-01-01"
    }
}

canonical = canonicalize(complex_data, utf8=False)
print(canonical)
# Output: {"metadata":{"created":"2023-01-01","version":"1.0"},"users":[{"id":2,"name":"Bob"},{"id":1,"name":"Alice"}]}
```
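
For readers who want to see what the basic example does without the c14n module at hand, the behavior can be approximated with the standard library. This is only a sketch: `approx_canonicalize` is a hypothetical helper, not part of c14n, and `json.dumps` sorts keys by code point and formats floats its own way, so it should be expected to agree with RFC 8785 only for simple data like this.

```python
import json

def approx_canonicalize(obj, utf8=True):
    """Hypothetical stand-in: sorted keys, compact separators, raw non-ASCII."""
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8") if utf8 else text

data = {"name": "Alice", "age": 30, "city": "New York"}
print(approx_canonicalize(data))  # b'{"age":30,"city":"New York","name":"Alice"}'
```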

### JSON Serialization

Alternative serialization function without key sorting (non-canonical).

```python { .api }
def serialize(obj, utf8=True):
    """
    Serializes JSON object without canonicalization (preserves key order).

    Args:
        obj: The JSON-serializable object to serialize
        utf8 (bool): If True, return bytes encoded as UTF-8; if False,
            return Unicode string (default: True)

    Returns:
        bytes or str: JSON representation without key reordering

    Raises:
        TypeError: If obj contains non-JSON-serializable types
        ValueError: If obj contains circular references
    """
```

#### Example

```python
from c14n import canonicalize, serialize

data = {"name": "Alice", "age": 30, "city": "New York"}

# Serialize preserving original key order
serialized = serialize(data, utf8=False)
print(serialized)  # '{"name":"Alice","age":30,"city":"New York"}'

# Compare with canonicalization (keys sorted)
canonical = canonicalize(data, utf8=False)
print(canonical)  # '{"age":30,"city":"New York","name":"Alice"}'
```

## Canonicalization Rules

### Key Ordering

Object keys are sorted lexicographically by UTF-16 code units, as RFC 8785 specifies. This matches Unicode code point order unless keys contain characters outside the Basic Multilingual Plane:

```python
from c14n import canonicalize

data = {
    "zebra": 1,
    "apple": 2,
    "banana": 3,
    "Apple": 4  # Capital A comes before lowercase a
}

canonical = canonicalize(data, utf8=False)
# Result: {"Apple":4,"apple":2,"banana":3,"zebra":1}
```
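
RFC 8785 compares keys as sequences of UTF-16 code units. For most keys this is the same as code point order, but the two can diverge once a key contains a character outside the Basic Multilingual Plane. The standard-library sketch below illustrates the difference; the sort key shown is an illustration of the rule, not a statement about how c14n implements it.

```python
# One BMP character (U+FB33) and one supplementary-plane character (U+1F600)
keys = ["\ufb33", "\U0001f600"]

# Python compares str values by code point
by_code_point = sorted(keys)

# Big-endian UTF-16 bytes compare in the same order as UTF-16 code units
by_utf16 = sorted(keys, key=lambda k: k.encode("utf-16-be"))

print([hex(ord(k)) for k in by_code_point])  # ['0xfb33', '0x1f600']
print([hex(ord(k)) for k in by_utf16])       # ['0x1f600', '0xfb33']
```

U+1F600 is stored as the surrogate pair D83D DE00, and 0xD83D is smaller than 0xFB33, which is why it sorts first under the UTF-16 rule.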

### Number Formatting

Numbers are serialized in their shortest form. RFC 8785 follows ECMAScript number serialization, so values are written without trailing zeros and exponent notation appears only for very large or very small magnitudes:

```python
from c14n import canonicalize

numbers = {
    "integer": 42,
    "float": 3.14159,
    "zero": 0,
    "negative": -123,
    "scientific": 1.23e-4
}

canonical = canonicalize(numbers, utf8=False)
# Per RFC 8785, "scientific" is written in plain decimal form: 0.000123
```
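
Number formatting is one reason plain `json.dumps` is not a substitute for a canonicalizer. The sketch below compares what the standard library emits with what ECMAScript number serialization (the rule RFC 8785 adopts) prescribes for a few floats. The expected RFC strings are written out by hand here; nothing in this snippet computes them.

```python
import json

# value -> (text emitted by json.dumps, text prescribed by RFC 8785)
cases = {
    10.0: ("10.0", "10"),
    1e-7: ("1e-07", "1e-7"),
    1.23e-4: ("0.000123", "0.000123"),
}

for value, (stdlib_text, rfc_text) in cases.items():
    # Confirm the stdlib column really is what json.dumps produces
    assert json.dumps(value) == stdlib_text
    status = "same" if stdlib_text == rfc_text else "differs"
    print(f"{stdlib_text:>10}  RFC 8785: {rfc_text:<10} {status}")
```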

### String Handling

Strings are escaped with the minimal set of escape sequences:

```python
from c14n import canonicalize

strings = {
    "quote": 'He said "Hello"',
    "newline": "Line 1\nLine 2",
    "unicode": "café",
    "control": "tab\there"
}

canonical = canonicalize(strings, utf8=False)
# Per RFC 8785, quotes, backslashes, and control characters are escaped
# (\", \n, \t); non-ASCII characters such as é are emitted as-is
```
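
The difference between escaping everything non-ASCII and escaping only what JSON requires can be seen with the standard library alone. This is a sketch using `json.dumps`, not c14n; `ensure_ascii=False` is the setting that comes closest to the RFC 8785 rule of leaving non-ASCII characters unescaped.

```python
import json

text = "café\ttab"

escaped = json.dumps(text)                      # ensure_ascii=True is the default
literal = json.dumps(text, ensure_ascii=False)  # non-ASCII left as-is

print(escaped)  # "caf\u00e9\ttab"
print(literal)  # "café\ttab"
```

In both cases the tab is escaped as `\t`, because control characters must always be escaped inside a JSON string.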

### Array Preservation

Array element order is preserved (not sorted):

```python
from c14n import canonicalize

data = {
    "numbers": [3, 1, 4, 1, 5],
    "mixed": ["zebra", "apple", "banana"]
}

canonical = canonicalize(data, utf8=False)
# Array order maintained: {"mixed":["zebra","apple","banana"],"numbers":[3,1,4,1,5]}
```

## Use Cases

### Digital Signatures

```python
from c14n import canonicalize
import hashlib
import hmac

def sign_json(data, secret_key):
    """Create an HMAC-SHA256 signature over the canonical form of JSON data."""
    canonical_bytes = canonicalize(data)
    signature = hmac.new(secret_key, canonical_bytes, hashlib.sha256).hexdigest()
    return signature

def verify_json(data, signature, secret_key):
    """Verify an HMAC-SHA256 signature of JSON data in constant time."""
    canonical_bytes = canonicalize(data)
    expected_signature = hmac.new(secret_key, canonical_bytes, hashlib.sha256).hexdigest()
    return hmac.compare_digest(signature, expected_signature)

# Example usage
document = {"user": "alice", "action": "login", "timestamp": "2023-01-01T12:00:00Z"}
secret = b"my-secret-key"

signature = sign_json(document, secret)
is_valid = verify_json(document, signature, secret)
```

### Content Hashing

```python
import hashlib
from c14n import canonicalize

def hash_json(data):
    """Create deterministic hash of JSON data."""
    canonical_bytes = canonicalize(data)
    return hashlib.sha256(canonical_bytes).hexdigest()

# Same data in different orders produces same hash
data1 = {"name": "Alice", "age": 30}
data2 = {"age": 30, "name": "Alice"}

hash1 = hash_json(data1)
hash2 = hash_json(data2)
print(hash1 == hash2)  # True - same canonical representation
```

### Data Deduplication

```python
from c14n import canonicalize

def deduplicate_json(json_objects):
    """Remove duplicate JSON objects based on canonical form."""
    seen = set()
    unique_objects = []

    for obj in json_objects:
        canonical = canonicalize(obj)
        if canonical not in seen:
            seen.add(canonical)
            unique_objects.append(obj)

    return unique_objects

# Example with duplicate data in different order
objects = [
    {"name": "Alice", "age": 30},
    {"age": 30, "name": "Alice"},  # Duplicate in different order
    {"name": "Bob", "age": 25}
]

unique = deduplicate_json(objects)
# Returns: [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]
```

### JSON-LD Integration

```python
import hashlib

from pyld import jsonld
from c14n import canonicalize

def canonical_json_ld_hash(doc):
    """Create hash of JSON-LD document after normalization and canonicalization."""
    # First normalize with JSON-LD
    normalized = jsonld.normalize(doc, {
        'algorithm': 'URDNA2015',
        'format': 'application/n-quads'
    })

    # Then canonicalize the normalized form (an N-Quads string)
    canonical = canonicalize(normalized)
    return hashlib.sha256(canonical).hexdigest()
```

## RFC 8785 Compliance

Canonicalization follows the rules of RFC 8785:

1. **Object Key Ordering**: Keys sorted by UTF-16 code unit values
2. **Whitespace**: No whitespace between tokens (compact representation)
3. **String Escaping**: Minimal required escape sequences
4. **Number Representation**: Shortest round-trip form, following ECMAScript number serialization
5. **Array Ordering**: Original array element order preserved
6. **Unicode Handling**: Output encoded as UTF-8; non-ASCII characters emitted without escaping

## Performance Considerations

### Memory Usage

```python
from c14n import canonicalize

# For large objects, canonicalization builds the entire output in memory
large_data = {"items": list(range(100000))}
canonical = canonicalize(large_data)  # Returns one large bytes object
```
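
When only a hash of a large document is needed, one way to avoid holding the full serialized text is to feed the encoder's output to the hash in chunks. The sketch below uses `json.JSONEncoder.iterencode` from the standard library rather than c14n, so it is not an RFC 8785 canonicalizer; `hash_json_streaming` is a hypothetical helper name. The input object itself still has to fit in memory.

```python
import hashlib
import json

def hash_json_streaming(obj):
    """Hash compact, key-sorted JSON chunk by chunk instead of as one string."""
    encoder = json.JSONEncoder(sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    digest = hashlib.sha256()
    for chunk in encoder.iterencode(obj):
        digest.update(chunk.encode("utf-8"))
    return digest.hexdigest()

large_data = {"items": list(range(100000))}

# Reference value computed the non-streaming way, for comparison only
whole = json.dumps(large_data, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
reference = hashlib.sha256(whole.encode("utf-8")).hexdigest()

print(hash_json_streaming(large_data) == reference)  # True
```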

### Caching Canonical Forms

```python
import json
from functools import lru_cache

from c14n import canonicalize

@lru_cache(maxsize=1000)
def cached_canonicalize(data_str):
    """Cache canonical forms for frequently used data."""
    data = json.loads(data_str)
    return canonicalize(data, utf8=False)

# Use with JSON string input for caching
data_json = '{"name": "Alice", "age": 30}'
canonical = cached_canonicalize(data_json)
```
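
One consequence of keying the cache on the input string is worth seeing concretely: two strings that encode the same data but differ in key order or whitespace occupy separate cache entries. The sketch below shows this with `functools.lru_cache`; `cached_form` is a hypothetical function that uses `json.dumps` as a stand-in for `canonicalize`.

```python
import json
from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_form(data_str):
    # Stand-in for canonicalize(json.loads(data_str), utf8=False)
    return json.dumps(json.loads(data_str), sort_keys=True, separators=(",", ":"))

a = cached_form('{"name": "Alice", "age": 30}')
b = cached_form('{"age": 30, "name": "Alice"}')  # same data, different text: cache miss
c = cached_form('{"name": "Alice", "age": 30}')  # identical text: cache hit

info = cached_form.cache_info()
print(a == b == c, info.hits, info.misses)  # True 1 2
```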

## Error Handling

Canonicalization functions may raise standard Python JSON errors:

- **TypeError**: Non-serializable objects (functions, custom classes)
- **ValueError**: Circular references in nested structures
- **UnicodeEncodeError**: Invalid Unicode characters

```python
from c14n import canonicalize

try:
    # This will fail - functions aren't JSON serializable
    invalid_data = {"func": lambda x: x}
    canonical = canonicalize(invalid_data)
except TypeError as e:
    print(f"Serialization error: {e}")

try:
    # This will fail - circular reference
    circular = {}
    circular["self"] = circular
    canonical = canonicalize(circular)
except ValueError as e:
    print(f"Circular reference error: {e}")
```
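
A common way to avoid the `TypeError` case is to convert unsupported values to JSON types before serializing. The sketch below is one possible approach, not part of c14n: `to_jsonable` is a hypothetical helper, the choice to render dates and decimals as strings is an assumption, and `json.dumps` stands in for `canonicalize`.

```python
import json
from datetime import date
from decimal import Decimal

def to_jsonable(value):
    """Hypothetical helper: recursively replace unsupported types with strings."""
    if isinstance(value, dict):
        return {str(k): to_jsonable(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_jsonable(v) for v in value]
    if isinstance(value, (date, Decimal)):
        return str(value)
    return value

record = {"created": date(2023, 1, 1), "price": Decimal("9.99"), "tags": ("a", "b")}
result = json.dumps(to_jsonable(record), sort_keys=True, separators=(",", ":"))
print(result)  # {"created":"2023-01-01","price":"9.99","tags":["a","b"]}
```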

## Integration with PyLD

The c14n module is used internally by PyLD for JSON-LD processing:

```python
# PyLD uses canonicalization in normalization algorithms
from pyld import jsonld

doc = {"@context": {...}, "@id": "example:1", "name": "Test"}
normalized = jsonld.normalize(doc, {
    'algorithm': 'URDNA2015',
    'format': 'application/n-quads'
})
# Internally uses canonicalization for consistent RDF representation
```