# URL and IRI Utilities

Utility functions for URL parsing, IRI manipulation, and base URL resolution following RFC 3986 standards. These functions support JSON-LD's IRI processing requirements and URL normalization.

## Capabilities

### Base IRI Resolution

Functions for resolving relative IRIs against base IRIs and converting absolute IRIs back to relative form.

```python { .api }
def prepend_base(base, iri):
    """
    Prepends a base IRI to a relative IRI to create an absolute IRI.

    Args:
        base (str): The base IRI to resolve against
        iri (str): The relative IRI to resolve

    Returns:
        str: The absolute IRI

    Raises:
        JsonLdError: If base IRI is invalid or resolution fails
    """

def remove_base(base, iri):
    """
    Removes a base IRI from an absolute IRI to create a relative IRI.

    Args:
        base (str): The base IRI to remove
        iri (str): The absolute IRI to make relative

    Returns:
        str: The relative IRI if the IRI starts with base, otherwise the
            original absolute IRI

    Raises:
        JsonLdError: If base IRI is invalid
    """
```

#### Examples

```python
from pyld import jsonld

# Resolve relative IRI against base
base = "https://example.org/data/"
relative = "document.jsonld"
absolute = jsonld.prepend_base(base, relative)
print(absolute)  # "https://example.org/data/document.jsonld"

# Resolve with path traversal
relative_path = "../other/doc.jsonld"
resolved = jsonld.prepend_base(base, relative_path)
print(resolved)  # "https://example.org/other/doc.jsonld"

# Make absolute IRI relative to base
absolute_iri = "https://example.org/data/context.jsonld"
relative_result = jsonld.remove_base(base, absolute_iri)
print(relative_result)  # "context.jsonld"

# IRI not relative to base remains absolute
other_iri = "https://other.org/context.jsonld"
unchanged = jsonld.remove_base(base, other_iri)
print(unchanged)  # "https://other.org/context.jsonld"
```
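
The prefix rule stated in the `remove_base` docstring can be sketched without PyLD. The `relativize` helper below is hypothetical and covers only the simple case where the base ends in `/` and the IRI starts with it; PyLD's own function may handle more cases. The standard library's `urljoin` is used only to check the round trip.

```python
from urllib.parse import urljoin


def relativize(base: str, iri: str) -> str:
    """Hypothetical helper: strip `base` from `iri` when it is a directory prefix."""
    if base.endswith("/") and iri.startswith(base):
        return iri[len(base):]
    # Not under the base: leave the IRI absolute.
    return iri


base = "https://example.org/data/"
inside = relativize(base, "https://example.org/data/context.jsonld")
outside = relativize(base, "https://other.org/context.jsonld")

print(inside)   # context.jsonld
print(outside)  # https://other.org/context.jsonld

# Resolving the relative form against the same base restores the original.
print(urljoin(base, inside))  # https://example.org/data/context.jsonld
```

A plain string-prefix check is the simplest design, but it never produces `../` results, so it is only a starting point.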

### URL Parsing and Construction

RFC 3986 compliant URL parsing and reconstruction utilities.

```python { .api }
def parse_url(url):
    """
    Parses a URL into its component parts following RFC 3986.

    Args:
        url (str): The URL to parse

    Returns:
        ParsedUrl: Named tuple with components (scheme, authority, path, query, fragment)

    Components:
        scheme (str): URL scheme (http, https, etc.)
        authority (str): Authority component (host:port)
        path (str): Path component
        query (str): Query string component
        fragment (str): Fragment identifier component
    """

def unparse_url(parsed):
    """
    Reconstructs a URL from its parsed components.

    Args:
        parsed (ParsedUrl, dict, list, or tuple): URL components

    Returns:
        str: The reconstructed URL

    Raises:
        TypeError: If parsed components are in invalid format
    """
```

#### Examples

```python
from pyld import jsonld

# Parse URL into components
url = "https://example.org:8080/path/to/doc.jsonld?param=value#section"
parsed = jsonld.parse_url(url)

print(parsed.scheme)     # "https"
print(parsed.authority)  # "example.org:8080" (non-default port preserved)
print(parsed.path)       # "/path/to/doc.jsonld"
print(parsed.query)      # "param=value"
print(parsed.fragment)   # "section"

# Reconstruct URL from components
reconstructed = jsonld.unparse_url(parsed)
print(reconstructed)  # "https://example.org:8080/path/to/doc.jsonld?param=value#section"

# Modify components and reconstruct
modified_parsed = parsed._replace(path="/new/path.jsonld", query="new=param")
new_url = jsonld.unparse_url(modified_parsed)
print(new_url)  # "https://example.org:8080/new/path.jsonld?new=param#section"

# Parse URLs with missing components
simple_url = "https://example.org/doc"
simple_parsed = jsonld.parse_url(simple_url)
print(simple_parsed.query)     # None
print(simple_parsed.fragment)  # None
```
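
To make the five-component split concrete, here is a minimal standalone sketch built on the regular expression from RFC 3986, Appendix B, with recomposition following section 5.3. The names `Parts`, `split_uri`, and `join_uri` are hypothetical. This is not presented as PyLD's implementation and it omits default-port removal.

```python
import re
from collections import namedtuple

# Hypothetical names; the five fields mirror the layout described above.
Parts = namedtuple("Parts", ["scheme", "authority", "path", "query", "fragment"])

# Regular expression from RFC 3986, Appendix B, with non-capturing groups
# around the delimiters so that only the five components are captured.
URI_RE = re.compile(
    r"^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?"
)


def split_uri(uri: str) -> Parts:
    """Split a URI reference into its five components (None when absent)."""
    return Parts(*URI_RE.match(uri).groups())


def join_uri(parts: Parts) -> str:
    """Recompose the components following RFC 3986, section 5.3."""
    out = ""
    if parts.scheme is not None:
        out += parts.scheme + ":"
    if parts.authority is not None:
        out += "//" + parts.authority
    out += parts.path
    if parts.query is not None:
        out += "?" + parts.query
    if parts.fragment is not None:
        out += "#" + parts.fragment
    return out


url = "https://example.org:8080/path/to/doc.jsonld?param=value#section"
parts = split_uri(url)
print(parts.authority)          # example.org:8080
print(join_uri(parts) == url)   # True
print(split_uri("https://example.org/doc").query)  # None
```

Every group in the pattern is optional, so the match always succeeds; missing components come back as `None` while the path is always a string.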

### Path Normalization

Utility for normalizing URL paths by removing dot segments according to RFC 3986.

```python { .api }
def remove_dot_segments(path):
    """
    Removes dot segments from a URL path according to RFC 3986.

    Resolves '.' and '..' segments in URL paths to create normalized paths.

    Args:
        path (str): The path to normalize

    Returns:
        str: The normalized path with dot segments removed
    """
```

#### Examples

```python
from pyld import jsonld

# Remove current directory references
path1 = "/a/b/./c"
normalized1 = jsonld.remove_dot_segments(path1)
print(normalized1)  # "/a/b/c"

# Remove parent directory references
path2 = "/a/b/../c"
normalized2 = jsonld.remove_dot_segments(path2)
print(normalized2)  # "/a/c"

# Complex path with multiple dot segments
path3 = "/a/b/c/./../../g"
normalized3 = jsonld.remove_dot_segments(path3)
print(normalized3)  # "/a/g"

# Leading dot segments
path4 = "../../../g"
normalized4 = jsonld.remove_dot_segments(path4)
print(normalized4)  # "g"
```
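
The dot-segment rules can be sketched with a simple segment stack. The function name `strip_dot_segments` is hypothetical, and this is a sketch in the spirit of RFC 3986, section 5.2.4, rather than a copy of PyLD's code; results for unusual inputs may differ from the library.

```python
def strip_dot_segments(path: str) -> str:
    """Hypothetical sketch: resolve '.' and '..' segments with a stack."""
    absolute = path.startswith("/")
    segments = path.split("/")
    if absolute:
        # Drop the empty string produced by the leading slash.
        segments = segments[1:]

    stack = []
    for index, segment in enumerate(segments):
        is_last = index == len(segments) - 1
        if segment in (".", ".."):
            if segment == ".." and stack:
                stack.pop()  # step up one level when possible
            if is_last:
                stack.append("")  # keep the trailing slash
        else:
            stack.append(segment)

    return ("/" if absolute else "") + "/".join(stack)


print(strip_dot_segments("/a/b/./c"))          # /a/b/c
print(strip_dot_segments("/a/b/../c"))         # /a/c
print(strip_dot_segments("/a/b/c/./../../g"))  # /a/g
print(strip_dot_segments("../../../g"))        # g
print(strip_dot_segments("/a/b/.."))           # /a/
```

A `..` that has nothing left to pop is discarded, which is why a path cannot climb above its root.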

## ParsedUrl Structure

The `parse_url()` function returns a `ParsedUrl` named tuple with these fields:

```python { .api }
from collections import namedtuple

# ParsedUrl named tuple structure
ParsedUrl = namedtuple('ParsedUrl', ['scheme', 'authority', 'path', 'query', 'fragment'])

# Example ParsedUrl instance
ParsedUrl(
    scheme='https',
    authority='example.org:8080',
    path='/path/to/resource',
    query='param=value',
    fragment='section'
)
```

### ParsedUrl Fields

- **scheme**: Protocol scheme (http, https, ftp, etc.) or None
- **authority**: Host and optional port (example.org:8080) or None
- **path**: Path component (always present, may be empty string)
- **query**: Query string without leading '?' or None
- **fragment**: Fragment identifier without leading '#' or None

### Default Port Handling

PyLD automatically removes default ports from the authority component:

```python
from pyld import jsonld

# Default ports are removed
url1 = "https://example.org:443/path"
parsed1 = jsonld.parse_url(url1)
print(parsed1.authority)  # "example.org" (443 removed)

url2 = "http://example.org:80/path"
parsed2 = jsonld.parse_url(url2)
print(parsed2.authority)  # "example.org" (80 removed)

# Non-default ports are preserved
url3 = "https://example.org:8080/path"
parsed3 = jsonld.parse_url(url3)
print(parsed3.authority)  # "example.org:8080" (8080 preserved)
```
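
The port rule itself is small enough to sketch on its own. The helper `strip_default_port` and the `DEFAULT_PORTS` table are hypothetical, and only the two schemes shown above (http and https) are assumed to have defaults.

```python
# Assumption: only the two schemes shown above have a default port here.
DEFAULT_PORTS = {"http": "80", "https": "443"}


def strip_default_port(scheme: str, authority: str) -> str:
    """Hypothetical helper: drop the port when it is the scheme's default."""
    default = DEFAULT_PORTS.get(scheme)
    host, separator, port = authority.rpartition(":")
    if separator and default is not None and port == default:
        return host
    return authority


print(strip_default_port("https", "example.org:443"))   # example.org
print(strip_default_port("http", "example.org:80"))     # example.org
print(strip_default_port("https", "example.org:8080"))  # example.org:8080
print(strip_default_port("http", "example.org:443"))    # example.org:443
```

Comparing the port against the scheme's own default, rather than against any well-known port, keeps `http://example.org:443` distinct from `http://example.org`.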

## IRI vs URL Handling

These utilities work with both URLs and IRIs (Internationalized Resource Identifiers):

```python
from pyld import jsonld

# ASCII URLs
ascii_url = "https://example.org/path"
parsed_ascii = jsonld.parse_url(ascii_url)

# International IRIs
iri = "https://例え.テスト/パス"
parsed_iri = jsonld.parse_url(iri)

# Both work with the same parsing logic
```
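
One way to see why the same logic can serve both: the component delimiters (`:`, `/`, `?`, `#`) are ASCII, so non-ASCII characters pass through as component content. The sketch below uses the standard library's `urlsplit` rather than PyLD, and the percent-encoding step at the end is an extra illustration of mapping an IRI path to its URI form, not something the utilities above are stated to do.

```python
from urllib.parse import quote, urlsplit

iri = "https://例え.テスト/パス"
parts = urlsplit(iri)

print(parts.scheme)  # https
print(parts.netloc)  # 例え.テスト
print(parts.path)    # /パス

# Percent-encode the path as UTF-8 to obtain its URI form.
print(quote(parts.path))  # /%E3%83%91%E3%82%B9
```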

## Common Use Cases

### Base Context Resolution

```python
from pyld import jsonld

# Resolve context relative to document base
document_url = "https://example.org/data/document.jsonld"
context_ref = "../contexts/main.jsonld"

# Extract the base directory from the document URL
base = document_url.rsplit('/', 1)[0] + "/"
context_url = jsonld.prepend_base(base, context_ref)
print(context_url)  # "https://example.org/contexts/main.jsonld"
```

### URL Canonicalization

```python
from pyld import jsonld

def canonicalize_url(url):
    """Canonicalize URL by parsing and reconstructing."""
    parsed = jsonld.parse_url(url)
    # Normalize path
    normalized_path = jsonld.remove_dot_segments(parsed.path)
    canonical_parsed = parsed._replace(path=normalized_path)
    return jsonld.unparse_url(canonical_parsed)

# Canonicalize URLs for comparison
url1 = "https://example.org/a/b/../c"
url2 = "https://example.org/a/c"
canonical1 = canonicalize_url(url1)
canonical2 = canonicalize_url(url2)
print(canonical1 == canonical2)  # True
```

### Relative Link Resolution

```python
from pyld import jsonld

def resolve_links(base_url, links):
    """Resolve a list of relative links against a base URL."""
    return [jsonld.prepend_base(base_url, link) for link in links]

base = "https://example.org/docs/"
relative_links = ["intro.html", "../images/logo.png", "section/details.html"]
absolute_links = resolve_links(base, relative_links)
# Result: ["https://example.org/docs/intro.html",
#          "https://example.org/images/logo.png",
#          "https://example.org/docs/section/details.html"]
```
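
As a cross-check that does not require PyLD, the standard library's `urljoin` also performs RFC 3986 reference resolution, so it is a convenient way to sanity-check expected results for inputs like these. The two extra links (`/root.html` and `#top`) are added here for illustration.

```python
from urllib.parse import urljoin

base = "https://example.org/docs/"
links = [
    "intro.html",
    "../images/logo.png",
    "section/details.html",
    "/root.html",  # absolute path: replaces the base path entirely
    "#top",        # fragment only: keeps the base path
]

resolved = [urljoin(base, link) for link in links]
for url in resolved:
    print(url)
# https://example.org/docs/intro.html
# https://example.org/images/logo.png
# https://example.org/docs/section/details.html
# https://example.org/root.html
# https://example.org/docs/#top
```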

## RFC 3986 Compliance

These utilities implement RFC 3986 URL processing standards:

- **Scheme**: Case-insensitive protocol identifier
- **Authority**: Host and optional port with default port removal
- **Path**: Hierarchical path with dot segment normalization
- **Query**: Optional query parameters
- **Fragment**: Optional fragment identifier

The implementation handles edge cases like:

- Empty path components
- Percent-encoded characters
- Unicode in IRIs
- Relative reference resolution
- Path traversal with '../' segments

## Error Handling

URL utility functions may raise `JsonLdError` for:

- **Invalid base IRI**: Malformed base IRI in resolution functions
- **Invalid URL format**: URLs that don't conform to RFC 3986
- **Resolution errors**: Failed relative IRI resolution

Wrap resolution calls in a `try`/`except` block to handle these errors:

```python
from pyld import jsonld
from pyld.jsonld import JsonLdError

try:
    result = jsonld.prepend_base("invalid-base", "relative")
except JsonLdError as e:
    print(f"URL resolution failed: {e}")
    # Handle invalid base IRI
```
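
If a malformed base should be rejected before any resolution is attempted, a small guard can check it up front. The helper `require_absolute_base` is hypothetical and uses the standard library's `urlsplit`; treating "no scheme" as invalid is an assumption of this sketch, not a rule taken from PyLD.

```python
from urllib.parse import urlsplit


def require_absolute_base(base: str) -> str:
    """Hypothetical guard: reject a base IRI that has no scheme."""
    if not urlsplit(base).scheme:
        raise ValueError(f"base IRI must be absolute, got {base!r}")
    return base


print(require_absolute_base("https://example.org/data/"))  # https://example.org/data/

try:
    require_absolute_base("invalid-base")
except ValueError as error:
    print(f"rejected: {error}")
```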