# Document Loading

Configurable document loaders for fetching remote JSON-LD contexts and documents via HTTP. PyLD supports both synchronous and asynchronous loading with pluggable HTTP client implementations.

## Capabilities

### Document Loader Management

Global document loader configuration for all JSON-LD processing operations.

```python { .api }
def set_document_loader(load_document_):
    """
    Sets the global default JSON-LD document loader.

    Args:
        load_document_: Document loader function that takes (url, options)
            and returns RemoteDocument
    """

def get_document_loader():
    """
    Gets the current global document loader.

    Returns:
        function: Current document loader function
    """

def load_document(url, options, base=None, profile=None, requestProfile=None):
    """
    Loads a document from a URL using the current document loader.

    Args:
        url (str): The URL (relative or absolute) of the remote document
        options (dict): Loading options including documentLoader
        base (str): The absolute URL to use for making url absolute
        profile (str): Profile for selecting JSON-LD script elements from HTML
        requestProfile (str): One or more IRIs for request profile parameter

    Returns:
        RemoteDocument: Loaded document with content and metadata

    Raises:
        JsonLdError: If document loading fails
    """
```

### Requests-based Document Loader

Synchronous HTTP document loader using the popular Requests library.

```python { .api }
def requests_document_loader(secure=False, **kwargs):
    """
    Creates a document loader using the Requests library.

    Args:
        secure (bool): Require all requests to use HTTPS (default: False)
        **kwargs: Additional keyword arguments passed to requests.get()

    Common kwargs:
        timeout (float or tuple): Request timeout in seconds
        verify (bool or str): SSL certificate verification
        cert (str or tuple): Client certificate for authentication
        proxies (dict): Proxy configuration
        allow_redirects (bool): Follow redirects (default: True)
        stream (bool): Stream download (default: False)

    Note:
        The loader passes its own headers= argument to requests.get()
        (taken from options['headers'], or a default Accept header), so
        headers must not be passed in **kwargs.

    Returns:
        function: Document loader function compatible with PyLD

    Raises:
        ImportError: If requests library is not available
    """
```

#### Example

```python
from pyld import jsonld

# Basic requests loader with timeout
loader = jsonld.requests_document_loader(timeout=10)
jsonld.set_document_loader(loader)

# Advanced requests loader with SSL and authentication
# (headers are not accepted here: the loader supplies its own headers=
# argument to requests.get(), taken from options['headers'])
secure_loader = jsonld.requests_document_loader(
    secure=True,  # Force HTTPS
    timeout=(5, 30),  # 5s connect, 30s read timeout
    verify='/path/to/cacert.pem',  # Custom CA bundle
    cert=('/path/to/client.crt', '/path/to/client.key'),  # Client cert
    proxies={'https': 'https://proxy.example.com:8080'}
)
jsonld.set_document_loader(secure_loader)

# Use in JSON-LD processing
doc = jsonld.expand('https://example.org/context.jsonld')
```

### Aiohttp-based Document Loader

HTTP document loader that performs its requests with aiohttp. PyLD still calls the returned loader synchronously: each request is run to completion on the given event loop.

```python { .api }
def aiohttp_document_loader(loop=None, secure=False, **kwargs):
    """
    Creates an asynchronous document loader using aiohttp.

    Args:
        loop: Event loop for async operations (default: current loop)
        secure (bool): Require all requests to use HTTPS (default: False)
        **kwargs: Additional keyword arguments passed to the aiohttp
            session's get() call

    Common kwargs:
        timeout (aiohttp.ClientTimeout): Request timeout configuration
        cookies (dict): Cookies to send with the request
        auth (aiohttp.BasicAuth): Authentication credentials

    Note:
        Session-level options such as connector or trust_env are not
        accepted by get(), and headers must not be passed in **kwargs
        because the loader supplies its own headers= argument.

    Returns:
        function: Document loader function compatible with PyLD

    Raises:
        ImportError: If aiohttp library is not available
    """
```

#### Example

```python
from pyld import jsonld
import aiohttp

# Basic aiohttp loader
loader = jsonld.aiohttp_document_loader()
jsonld.set_document_loader(loader)

# Advanced aiohttp loader; kwargs are forwarded to the session's get() call,
# so session-level options (connector) and headers are not accepted here
timeout = aiohttp.ClientTimeout(total=30, connect=5)

advanced_loader = jsonld.aiohttp_document_loader(
    secure=True,
    timeout=timeout,
    auth=aiohttp.BasicAuth('user', 'pass')
)
jsonld.set_document_loader(advanced_loader)

# Process documents from regular (non-async) code: the loader drives the
# event loop itself and cannot be called from inside a running loop
def process_documents():
    doc1 = jsonld.expand('https://example.org/doc1.jsonld')
    doc2 = jsonld.expand('https://example.org/doc2.jsonld')
    return doc1, doc2

# Note: aiohttp loader only provides async loading;
# JSON-LD processing itself remains synchronous
```

### Dummy Document Loader

Fallback loader that raises exceptions for all requests, used when no HTTP libraries are available.

```python { .api }
def dummy_document_loader(**kwargs):
    """
    Creates a dummy document loader that raises exceptions on use.

    Args:
        **kwargs: Extra keyword arguments (ignored)

    Returns:
        function: Document loader that always fails

    Raises:
        JsonLdError: Always raises with 'loading document failed' error
    """
```

## RemoteDocument Structure

Document loaders return RemoteDocument objects with this structure:

```python { .api }
# RemoteDocument format
{
    "document": {...},        # The loaded JSON-LD document
    "documentUrl": "string",  # Final URL after redirects
    "contextUrl": "string"    # Context URL if Link header present
}
```
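
To make the structure concrete, here is a minimal sketch of a loader that serves pre-registered documents from memory and returns them in the format above. `make_static_loader` and `CONTEXT_URL` are hypothetical names, and `LookupError` is only a stand-in: a loader registered with PyLD would normally raise `JsonLdError` instead.

```python
def make_static_loader(documents):
    """Return a loader that serves documents from an in-memory dict keyed by URL."""
    def loader(url, options=None):
        if url not in documents:
            # Stand-in error type; a real PyLD loader would raise JsonLdError
            raise LookupError(f"no document registered for {url}")
        return {
            "document": documents[url],  # already-parsed JSON-LD
            "documentUrl": url,          # no redirects, so the URL is unchanged
            "contextUrl": None,          # no Link header in this scenario
        }
    return loader


CONTEXT_URL = "https://example.org/context.jsonld"
loader = make_static_loader(
    {CONTEXT_URL: {"@context": {"name": "http://schema.org/name"}}}
)
remote_doc = loader(CONTEXT_URL)
```

A loader like this keeps processing fully offline, which is often convenient for tests and for pinning well-known contexts.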

### RemoteDocument Fields

- **document**: The parsed JSON-LD document content
- **documentUrl**: The final URL after following redirects
- **contextUrl**: Context URL extracted from HTTP Link header (optional)

## HTTP Link Header Processing

PyLD automatically processes HTTP Link headers to discover JSON-LD contexts:

```python { .api }
def parse_link_header(header):
    """
    Parses HTTP Link header for JSON-LD context discovery.

    Args:
        header (str): HTTP Link header value

    Returns:
        dict: Parsed links keyed by their rel value; each entry holds the
            target URL and the link's other attributes
    """
```

#### Example

```python
from pyld import jsonld

# Link header parsing
header = '<https://example.org/context.jsonld>; rel="http://www.w3.org/ns/json-ld#context"'
links = jsonld.parse_link_header(header)
# Result is keyed by rel:
# {"http://www.w3.org/ns/json-ld#context":
#     {"target": "https://example.org/context.jsonld",
#      "rel": "http://www.w3.org/ns/json-ld#context"}}
```
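
A loader typically needs just the context URL from the parsed header. The helper below is a sketch under stated assumptions: `context_url_from_links` is a hypothetical name, and the input shape is assumed to be either a dict keyed by `rel` (with a list when a `rel` repeats) or a flat list of link dicts, so it tolerates both.

```python
LINK_HEADER_REL = "http://www.w3.org/ns/json-ld#context"


def context_url_from_links(links):
    """Return the single JSON-LD context URL from parsed links, or None."""
    if isinstance(links, dict):
        # Dict keyed by rel: the value is one link dict or a list of them
        entry = links.get(LINK_HEADER_REL, [])
        candidates = entry if isinstance(entry, list) else [entry]
    else:
        # Flat list of link dicts: filter by rel
        candidates = [link for link in links if link.get("rel") == LINK_HEADER_REL]
    if len(candidates) > 1:
        raise ValueError("more than one JSON-LD context link found")
    return candidates[0]["target"] if candidates else None


parsed = {
    LINK_HEADER_REL: {
        "target": "https://example.org/context.jsonld",
        "rel": LINK_HEADER_REL,
    }
}
context_url = context_url_from_links(parsed)
```

Rejecting multiple context links mirrors the usual rule that a response may advertise only one JSON-LD context via the Link header.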

## Custom Document Loaders

Create custom document loaders for specialized requirements:

```python
import json

from pyld import jsonld
from pyld.jsonld import JsonLdError

# Loader used for ordinary http(s) URLs
default_http_loader = jsonld.requests_document_loader()

# Placeholder in-memory store for cache:// URLs; replace with a real cache
_cache = {}

def get_from_cache(url):
    return _cache[url]

def custom_document_loader(url, options=None):
    """
    Custom document loader implementation.

    Args:
        url (str): Document URL to load
        options (dict): Loading options

    Returns:
        dict: RemoteDocument with document, documentUrl, contextUrl
    """
    options = options or {}
    try:
        # Custom loading logic
        if url.startswith('file://'):
            # Handle file:// URLs
            with open(url[len('file://'):], 'r') as f:
                document = json.load(f)
            return {
                'document': document,
                'documentUrl': url,
                'contextUrl': None
            }
        elif url.startswith('cache://'):
            # Handle cached documents
            document = get_from_cache(url)
            return {
                'document': document,
                'documentUrl': url,
                'contextUrl': None
            }
        else:
            # Fallback to default HTTP loading
            return default_http_loader(url, options)

    except JsonLdError:
        # Already a JSON-LD error (e.g. from the HTTP loader); pass it on
        raise
    except Exception as e:
        # JsonLdError(message, type_, details, code=..., cause=...)
        raise JsonLdError(
            f'Could not load document: {url}',
            'jsonld.LoadDocumentError',
            {'url': url},
            code='loading document failed',
            cause=e
        )

# Register custom loader
jsonld.set_document_loader(custom_document_loader)
```
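
Another common specialization is caching, since the same context URL is often requested many times. The wrapper below is a minimal sketch, not a PyLD feature: `caching_loader` and `fake_http_loader` are hypothetical names, and the fake loader only stands in for a real HTTP loader so the example runs offline.

```python
def caching_loader(inner_loader):
    """Wrap a document loader so each URL is fetched at most once."""
    cache = {}

    def loader(url, options=None):
        if url not in cache:
            # First request for this URL: delegate to the wrapped loader
            cache[url] = inner_loader(url, options or {})
        return cache[url]

    return loader


calls = []


def fake_http_loader(url, options):
    """Hypothetical stand-in for an HTTP loader; records each fetch."""
    calls.append(url)
    return {"document": {"@context": {}}, "documentUrl": url, "contextUrl": None}


cached = caching_loader(fake_http_loader)
first = cached("https://example.org/context.jsonld")
second = cached("https://example.org/context.jsonld")  # served from the cache
```

Note that the cache hands back the same dict object on every hit and never expires entries; copying the result or adding eviction may be needed depending on how the documents are used.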

## Security Considerations

### HTTPS Enforcement

```python
# Force HTTPS for all requests
loader = jsonld.requests_document_loader(secure=True)
jsonld.set_document_loader(loader)
```

### Certificate Verification

```python
# Custom CA bundle
loader = jsonld.requests_document_loader(
    verify='/path/to/custom-cacert.pem'
)

# Disable verification (not recommended for production)
loader = jsonld.requests_document_loader(verify=False)
```

### Request Timeouts

```python
# Requests timeouts
loader = jsonld.requests_document_loader(
    timeout=(5, 30)  # 5s connect, 30s read
)

# Aiohttp timeouts
import aiohttp
timeout = aiohttp.ClientTimeout(total=30, connect=5)
loader = jsonld.aiohttp_document_loader(timeout=timeout)
```

### URL Filtering

```python
from urllib.parse import urlparse

from pyld import jsonld
from pyld.jsonld import JsonLdError

# Loader that performs the actual HTTP request for allowed URLs
standard_loader = jsonld.requests_document_loader()

ALLOWED_DOMAINS = {'example.org', 'w3.org', 'schema.org'}

def filtered_document_loader(url, options=None):
    """Document loader with URL filtering."""
    # hostname drops any port or userinfo and is lower-cased
    host = urlparse(url).hostname or ''

    # Block private networks (simple prefix check, for any URL scheme)
    if host.startswith('192.168.') or host.startswith('10.'):
        raise JsonLdError('Private network access denied', 'jsonld.LoadDocumentError',
                          {'url': url}, code='loading document failed')

    # Allow only specific domains
    if host not in ALLOWED_DOMAINS:
        raise JsonLdError('Domain not allowed', 'jsonld.LoadDocumentError',
                          {'url': url}, code='loading document failed')

    # Use standard loader for allowed URLs
    return standard_loader(url, options or {})

jsonld.set_document_loader(filtered_document_loader)
```
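
String-prefix checks miss ranges such as `172.16.0.0/12`, loopback, and IPv6 addresses. As a hedged sketch of a stricter test, the hypothetical helper `is_private_host` below uses the standard-library `ipaddress` module to classify literal IP hosts. It does not resolve DNS names, so a hostname that points at a private address is not caught.

```python
import ipaddress
from urllib.parse import urlparse


def is_private_host(url):
    """Return True if the URL's host is a literal IP in a non-public range."""
    host = urlparse(url).hostname
    if host is None:
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal (a DNS name); this sketch does not resolve it
        return False
    return address.is_private or address.is_loopback or address.is_link_local


blocked = is_private_host("http://192.168.1.5/context.jsonld")
allowed = is_private_host("https://example.org/context.jsonld")
```

Such a check could be called at the top of a filtering loader before the domain allow-list is consulted.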

## Default Loader Selection

PyLD automatically selects document loaders in this priority order:

1. **Requests** - If requests library is available (default)
2. **Aiohttp** - If aiohttp is available and requests is not
3. **Dummy** - Fallback that always fails

Override with explicit loader selection:

```python
# Force aiohttp even if requests is available
jsonld.set_document_loader(jsonld.aiohttp_document_loader())

# Or force requests
jsonld.set_document_loader(jsonld.requests_document_loader())
```
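
The priority order can be illustrated with a small availability probe. This is only a sketch of the selection logic described above, not PyLD's own code: `pick_loader_name` is a hypothetical helper, and the injectable `is_available` callback exists purely so the behavior can be exercised without installing or removing packages.

```python
import importlib.util


def pick_loader_name(is_available=None):
    """Return 'requests', 'aiohttp', or 'dummy' following the priority order."""
    if is_available is None:
        # Default probe: can the top-level module be found on this system?
        def is_available(name):
            return importlib.util.find_spec(name) is not None

    for name in ("requests", "aiohttp"):
        if is_available(name):
            return name
    return "dummy"  # nothing usable is installed


only_aiohttp = pick_loader_name(lambda name: name == "aiohttp")
nothing_installed = pick_loader_name(lambda name: False)
```

Calling `pick_loader_name()` with no argument reports which loader family would be preferred in the current environment.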

## Error Handling

Document loaders may raise `JsonLdError` with these error codes (available as `e.code`):

- **loading document failed**: Network errors, timeouts, HTTP errors
- **invalid remote context**: Invalid JSON-LD context documents
- **recursive context inclusion**: Context import loops

Handle loading errors gracefully:

```python
from pyld import jsonld
from pyld.jsonld import JsonLdError

try:
    result = jsonld.expand('https://example.org/doc.jsonld')
except JsonLdError as e:
    if e.code == 'loading document failed':
        print(f"Could not load document: {e.details}")
        # Handle network error
    else:
        # Handle other JSON-LD errors
        raise
```
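
One way to handle loading failures is to fall back to a second source, such as a local copy. The wrapper below is a minimal sketch under stated assumptions: `fallback_loader`, `unreachable`, and `local_copy` are hypothetical names, and the broad `except Exception` is used only so the example runs without PyLD; with PyLD it would usually be narrowed to `JsonLdError`.

```python
def fallback_loader(*loaders):
    """Build a loader that tries each given loader in order."""
    if not loaders:
        raise ValueError("at least one loader is required")

    def loader(url, options=None):
        last_error = None
        for candidate in loaders:
            try:
                return candidate(url, options or {})
            except Exception as error:  # with PyLD, narrow this to JsonLdError
                last_error = error
        # Every loader failed: surface the most recent error
        raise last_error

    return loader


def unreachable(url, options):
    """Hypothetical loader that simulates a network failure."""
    raise ConnectionError(f"cannot reach {url}")


def local_copy(url, options):
    """Hypothetical loader that serves a bundled copy of the document."""
    return {"document": {"@context": {}}, "documentUrl": url, "contextUrl": None}


loader = fallback_loader(unreachable, local_copy)
remote_doc = loader("https://example.org/context.jsonld")
```

Only the last error is re-raised here; collecting all errors into the final exception may be preferable when diagnosing why every source failed.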