# Serialization and Formats

Serialization support for the PROV formats PROV-JSON, PROV-XML, PROV-O (RDF), and PROV-N, with automatic format detection when reading and a pluggable serializer architecture.

## Capabilities

### Serializer Base Class

Abstract base class for all format-specific serializers.

```python { .api }
class Serializer:
    def __init__(self, document=None):
        """
        Create a serializer for PROV documents.

        Args:
            document (ProvDocument, optional): Document to serialize
        """

    def serialize(self, stream, **args):
        """
        Abstract method for serializing a document.

        Args:
            stream (file-like): Stream object to serialize the document into
            **args: Format-specific serialization arguments
        """

    def deserialize(self, stream, **args):
        """
        Abstract method for deserializing a document.

        Args:
            stream (file-like): Stream object to deserialize the document from
            **args: Format-specific deserialization arguments

        Returns:
            ProvDocument: Deserialized document
        """
```
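
To make the subclassing contract concrete, here is a minimal sketch that uses only the standard library. The `Serializer` class below is a stand-in with the same method names as the base class above, not the class shipped by `prov`, and `LineSerializer` with its one-identifier-per-line format is purely hypothetical.

```python
import io


class Serializer:
    """Stand-in for the abstract base class described above (not the prov class)."""

    def __init__(self, document=None):
        self.document = document

    def serialize(self, stream, **args):
        raise NotImplementedError

    def deserialize(self, stream, **args):
        raise NotImplementedError


class LineSerializer(Serializer):
    """Hypothetical format: one record identifier per line."""

    def serialize(self, stream, **args):
        # Here the "document" is just a list of identifier strings.
        for identifier in self.document:
            stream.write(identifier + "\n")

    def deserialize(self, stream, **args):
        # Rebuild the list, skipping blank lines.
        return [line.strip() for line in stream if line.strip()]


buffer = io.StringIO()
LineSerializer(["ex:entity1", "ex:activity1"]).serialize(buffer)
buffer.seek(0)
restored = LineSerializer().deserialize(buffer)
```

A real serializer would take a `ProvDocument` in the constructor and return one from `deserialize`; the list of strings only keeps the sketch self-contained.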

### Serializer Registry

Registry that maps format names to serializer classes, with a module-level `get` function for lookup.

```python { .api }
class Registry:
    serializers: dict[str, type[Serializer]] = None
    """Dictionary mapping format names to serializer classes."""

    @staticmethod
    def load_serializers():
        """
        Load all available serializers into the registry.

        Registers serializers for:
        - 'json': PROV-JSON format
        - 'xml': PROV-XML format
        - 'rdf': PROV-O (RDF) format
        - 'provn': PROV-N format
        """

def get(format_name: str) -> type[Serializer]:
    """
    Get the serializer class for the specified format.

    Args:
        format_name (str): Format name ('json', 'xml', 'rdf', 'provn')

    Returns:
        type[Serializer]: Serializer class for the format

    Raises:
        DoNotExist: If no serializer is available for the format
    """

class DoNotExist(Exception):
    """Exception raised when a serializer is not available for a format."""
```
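
The registry pattern described above can be sketched in a few lines. This is an illustration under stated assumptions, not the library's source: `JSONSerializer` and `XMLSerializer` are hypothetical placeholder classes, and loading the table lazily on first lookup is a design choice of this sketch.

```python
class DoNotExist(Exception):
    """Raised when no serializer is registered for a format name."""


class JSONSerializer:  # hypothetical placeholder class
    pass


class XMLSerializer:  # hypothetical placeholder class
    pass


class Registry:
    serializers = None  # populated on first use

    @staticmethod
    def load_serializers():
        # A real implementation would import each serializer module here.
        Registry.serializers = {"json": JSONSerializer, "xml": XMLSerializer}


def get(format_name):
    # Load lazily so that importing the module stays cheap.
    if Registry.serializers is None:
        Registry.load_serializers()
    try:
        return Registry.serializers[format_name]
    except KeyError:
        raise DoNotExist(f"No serializer available for the format '{format_name}'")
```

Deferring the load keeps optional dependencies out of the import path until a format that needs them is actually requested.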

### Document Serialization Methods

ProvDocument provides high-level serialization methods.

```python { .api }
class ProvDocument:
    def serialize(self, destination=None, format='json', **args):
        """
        Serialize this document to various formats.

        Args:
            destination (str or file-like, optional): Output destination
            format (str): Output format ('json', 'xml', 'rdf', 'provn')
            **args: Format-specific arguments

        Returns:
            str: Serialized content if no destination specified
        """

    @staticmethod
    def deserialize(source=None, content=None, format='json', **args):
        """
        Deserialize a document from various formats.

        Args:
            source (str or file-like, optional): Input source (path or stream)
            content (str or bytes, optional): Serialized content to parse instead of a source
            format (str): Input format, defaults to 'json' (use prov.read() for automatic detection)
            **args: Format-specific arguments

        Returns:
            ProvDocument: Deserialized document
        """
```

### Format-Specific Serializers

Individual serializer classes for each supported format.

```python { .api }
class ProvJSONSerializer(Serializer):
    """
    Serializer for PROV-JSON format.

    PROV-JSON represents provenance as JSON objects with arrays for
    different record types and attributes.
    """

class ProvXMLSerializer(Serializer):
    """
    Serializer for PROV-XML format.

    PROV-XML represents provenance as XML documents following the
    W3C PROV-XML schema.

    Requirements:
        lxml>=3.3.5 (install with: pip install prov[xml])
    """

class ProvRDFSerializer(Serializer):
    """
    Serializer for PROV-O (RDF) format.

    PROV-O represents provenance as RDF triples using the W3C PROV
    ontology vocabulary.

    Requirements:
        rdflib>=4.2.1,<7 (install with: pip install prov[rdf])
    """

class ProvNSerializer(Serializer):
    """
    Serializer for PROV-N format.

    PROV-N is the human-readable textual notation for PROV defined
    by the W3C specification.

    Note:
        Write-only: deserialize() is not implemented for PROV-N.
    """
```

### Convenience Functions

High-level functions for easy serialization/deserialization.

```python { .api }
def read(source, format=None):
    """
    Convenience function for reading PROV documents with automatic format detection.

    Args:
        source (str or PathLike or file-like): Source to read from
        format (str, optional): Format hint for parsing

    Returns:
        ProvDocument: Loaded document or None

    Raises:
        TypeError: If format cannot be detected and parsing fails
    """
```
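
One simple way to implement this kind of detection is to try each known parser in turn and keep the first one that succeeds. The sketch below shows that idea with standard-library parsers only; `PARSERS` and `read_text` are hypothetical names, and the code makes no claim about how `read` is implemented internally.

```python
import json
import xml.etree.ElementTree as ET


def parse_json(text):
    return json.loads(text)


def parse_xml(text):
    return ET.fromstring(text)


# Hypothetical parser table; the order decides which format is tried first.
PARSERS = {"json": parse_json, "xml": parse_xml}


def read_text(text, format=None):
    """Parse text with the named format, or try every known format in turn."""
    if format is not None:
        return PARSERS[format.lower()](text)
    for name, parser in PARSERS.items():
        try:
            return parser(text)
        except Exception:
            continue  # this format did not match; try the next one
    raise TypeError("Could not detect the format; pass format= explicitly")


as_json = read_text('{"entity": {}}')
as_xml = read_text("<document/>")
```

The trade-off of this approach is that parse errors from the individual formats are swallowed, so passing an explicit format usually gives a more useful error message.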

## Supported Formats

### PROV-JSON

JSON representation of PROV documents with structured objects for each record type.

```python
# Serialize to PROV-JSON
doc.serialize('output.json', format='json')
doc.serialize('output.json', format='json', indent=2)  # Pretty-printed

# Deserialize from PROV-JSON
doc = ProvDocument.deserialize('input.json', format='json')
```
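
For orientation, the following sketch builds a small PROV-JSON structure by hand with the standard `json` module: one top-level object per record type, keyed by record identifier. The identifiers and the `_:gen1` relation id are illustrative, and the exact output of the library (for example, how attribute values are typed) may differ in detail.

```python
import json

# Hand-written PROV-JSON for one entity generated by one activity.
# The identifiers and the "_:gen1" relation id are illustrative.
prov_json = {
    "prefix": {"ex": "http://example.org/"},
    "entity": {"ex:entity1": {"prov:label": "Example Entity"}},
    "activity": {"ex:activity1": {}},
    "wasGeneratedBy": {
        "_:gen1": {"prov:entity": "ex:entity1", "prov:activity": "ex:activity1"}
    },
}

text = json.dumps(prov_json, indent=2, sort_keys=True)
restored = json.loads(text)
```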
194
195
### PROV-XML
196
197
XML representation following the W3C PROV-XML schema.
198
199
```python
200
# Serialize to PROV-XML (requires lxml)
201
doc.serialize('output.xml', format='xml')
202
203
# Deserialize from PROV-XML
204
doc = ProvDocument.deserialize('input.xml', format='xml')
205
```
206
207
### PROV-O (RDF)
208
209
RDF representation using the W3C PROV ontology.
210
211
```python
212
# Serialize to RDF (requires rdflib)
213
doc.serialize('output.rdf', format='rdf')
214
doc.serialize('output.ttl', format='rdf', rdf_format='turtle')
215
doc.serialize('output.n3', format='rdf', rdf_format='n3')
216
217
# Deserialize from RDF
218
doc = ProvDocument.deserialize('input.rdf', format='rdf')
219
doc = ProvDocument.deserialize('input.ttl', format='rdf', rdf_format='turtle')
220
```
221
222
### PROV-N
223
224
Human-readable textual notation defined by W3C.
225
226
```python
227
# Serialize to PROV-N
228
doc.serialize('output.provn', format='provn')
229
230
# Get PROV-N as string
231
provn_string = doc.get_provn()
232
233
# Deserialize from PROV-N
234
doc = ProvDocument.deserialize('input.provn', format='provn')
235
```
236
237
## Usage Examples
238
239
### Basic Serialization
240
241
```python
242
from prov.model import ProvDocument
243
import prov
244
245
# Create a document with some content
246
doc = ProvDocument()
247
doc.add_namespace('ex', 'http://example.org/')
248
249
entity = doc.entity('ex:entity1', {'prov:label': 'Example Entity'})
250
activity = doc.activity('ex:activity1')
251
doc.generation(entity, activity)
252
253
# Serialize to different formats
254
doc.serialize('output.json', format='json')
255
doc.serialize('output.xml', format='xml')
256
doc.serialize('output.rdf', format='rdf')
257
doc.serialize('output.provn', format='provn')
258
259
# Serialize to string
260
json_string = doc.serialize(format='json')
261
xml_string = doc.serialize(format='xml')
262
```
263
264
### Reading Documents
265
266
```python
267
# Read with automatic format detection
268
doc1 = prov.read('document.json') # Auto-detects JSON
269
doc2 = prov.read('document.xml') # Auto-detects XML
270
doc3 = prov.read('document.rdf') # Auto-detects RDF
271
272
# Read with explicit format
273
doc4 = prov.read('document.txt', format='provn')
274
275
# Read from file-like objects
276
with open('document.json', 'r') as f:
277
doc5 = prov.read(f, format='json')
278
```
279
280
### Advanced Serialization Options
281
282
```python
283
# PROV-JSON with pretty printing
284
doc.serialize('pretty.json', format='json', indent=4)
285
286
# RDF with specific format
287
doc.serialize('output.ttl', format='rdf', rdf_format='turtle')
288
doc.serialize('output.nt', format='rdf', rdf_format='nt')
289
290
# Using serializer classes directly
291
from prov.serializers import get
292
293
json_serializer = get('json')(doc)
294
with open('output.json', 'w') as f:
295
json_serializer.serialize(f, indent=2)
296
```
297
298
### Format Detection and Error Handling
299
300
```python
301
from prov.serializers import DoNotExist
302
303
try:
304
# Attempt to read with format detection
305
doc = prov.read('unknown_format.dat')
306
except TypeError as e:
307
print(f"Format detection failed: {e}")
308
# Try with explicit format
309
doc = prov.read('unknown_format.dat', format='json')
310
311
try:
312
# Attempt to get unavailable serializer
313
serializer = get('unsupported_format')
314
except DoNotExist as e:
315
print(f"Serializer not available: {e}")
316
```
317
318
### Working with Streams
319
320
```python
321
import io
322
323
# Serialize to string buffer
324
buffer = io.StringIO()
325
doc.serialize(buffer, format='json')
326
json_content = buffer.getvalue()
327
328
# Deserialize from string buffer
329
input_buffer = io.StringIO(json_content)
330
loaded_doc = ProvDocument.deserialize(input_buffer, format='json')
331
332
# Binary formats (for some RDF serializations)
333
binary_buffer = io.BytesIO()
334
doc.serialize(binary_buffer, format='rdf', rdf_format='xml')
335
```
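
A serializer that accepts both text and binary streams has to decide whether to write `str` or `bytes`. The helper below is a minimal sketch of that check using only the standard library; `write_text` is a hypothetical name, and UTF-8 as the encoding for binary streams is an assumption of this sketch.

```python
import io


def write_text(stream, content, encoding="utf-8"):
    """Write str content to either a text stream or a binary stream."""
    if isinstance(stream, io.TextIOBase):
        stream.write(content)
    else:
        # Binary streams only accept bytes, so encode first.
        stream.write(content.encode(encoding))


text_buffer = io.StringIO()
binary_buffer = io.BytesIO()
write_text(text_buffer, '{"entity": {}}')
write_text(binary_buffer, '{"entity": {}}')
```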
336
337
### Batch Processing
338
339
```python
340
import os
341
342
# Serialize document to multiple formats
343
formats = ['json', 'xml', 'rdf', 'provn']
344
base_name = 'provenance'
345
346
for fmt in formats:
347
filename = f"{base_name}.{fmt}"
348
try:
349
doc.serialize(filename, format=fmt)
350
print(f"Saved {filename}")
351
except Exception as e:
352
print(f"Failed to save {filename}: {e}")
353
354
# Read and convert between formats
355
def convert_format(input_file, output_file, output_format):
356
"""Convert PROV document between formats."""
357
doc = prov.read(input_file)
358
doc.serialize(output_file, format=output_format)
359
360
# Convert JSON to XML
361
convert_format('input.json', 'output.xml', 'xml')
362
```
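
When converting many files, it can help to derive the output format from the target file name instead of passing it by hand. The helper below is a hypothetical sketch: `EXTENSION_TO_FORMAT` and `format_for` are not part of `prov`, and the mapping reflects only the extensions used in this document.

```python
from pathlib import Path

# Hypothetical mapping; extend it to match the file names you use.
EXTENSION_TO_FORMAT = {
    ".json": "json",
    ".xml": "xml",
    ".rdf": "rdf",
    ".ttl": "rdf",
    ".provn": "provn",
}


def format_for(path):
    """Return the serializer format name implied by a file extension."""
    suffix = Path(path).suffix.lower()
    try:
        return EXTENSION_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(f"No known PROV format for extension {suffix!r}")
```

With this in place, a call such as `convert_format('input.json', 'output.xml', format_for('output.xml'))` keeps the file name and the format in sync. Note that a `.ttl` target would still need `rdf_format='turtle'` passed to `serialize`.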
363
364
### Handling Large Documents
365
366
```python
367
# For large documents, serialize directly to file
368
with open('large_document.json', 'w') as f:
369
doc.serialize(f, format='json')
370
371
# Stream processing for large RDF documents
372
def process_large_rdf(filename):
373
"""Process large RDF document efficiently."""
374
doc = ProvDocument.deserialize(filename, format='rdf')
375
376
# Process in chunks or specific record types
377
entities = doc.get_records(prov.model.ProvEntity)
378
activities = doc.get_records(prov.model.ProvActivity)
379
380
print(f"Found {len(entities)} entities and {len(activities)} activities")
381
```
382
383
### Custom Serialization Parameters
384
385
```python
386
# JSON serialization options
387
doc.serialize('compact.json', format='json', separators=(',', ':'))
388
doc.serialize('readable.json', format='json', indent=4, sort_keys=True)
389
390
# RDF serialization with base URI
391
doc.serialize('output.rdf', format='rdf',
392
rdf_format='turtle',
393
base='http://example.org/')
394
395
# XML serialization with encoding
396
doc.serialize('output.xml', format='xml', encoding='utf-8')
397
```