# CoNLL-U Parser

CoNLL-U Parser parses a CoNLL-U formatted string into a nested Python dictionary structure. CoNLL-U is often the output of natural language processing tasks. This library provides comprehensive parsing, tree conversion, filtering, and serialization capabilities for CoNLL-U data, with zero dependencies and full typing support.

## Package Information

- **Package Name**: conllu
- **Package Type**: pypi
- **Language**: Python
- **Installation**: `pip install conllu`
- **Requirements**: Python 3.8+
- **Dependencies**: None (zero dependencies)
## Core Imports

```python
import conllu
```

Common patterns for parsing:

```python
from conllu import parse, parse_tree, parse_incr, parse_tree_incr
```

Import data models:

```python
from conllu import Token, TokenList, TokenTree, SentenceList, Metadata
```
## Basic Usage

```python
import conllu

# CoNLL-U columns must be tab-separated; explicit "\t" keeps the example
# copy-paste safe.
data = (
    "# text = The quick brown fox\n"
    "1\tThe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t4\tdet\t_\t_\n"
    "2\tquick\tquick\tADJ\tJJ\tDegree=Pos\t4\tamod\t_\t_\n"
    "3\tbrown\tbrown\tADJ\tJJ\tDegree=Pos\t4\tamod\t_\t_\n"
    "4\tfox\tfox\tNOUN\tNN\tNumber=Sing\t0\troot\t_\t_\n"
)

# Parse into flat list structure
sentences = conllu.parse(data)
print(f"Parsed {len(sentences)} sentences")
print(f"First sentence has {len(sentences[0])} tokens")

# Parse into tree structure
trees = conllu.parse_tree(data)
print(f"First tree root: {trees[0].token['form']}")

# Incremental parsing from file
with open('data.conllu', 'r') as f:
    for sentence in conllu.parse_incr(f):
        print(f"Sentence: {sentence.metadata.get('text', 'No text')}")

# Filter and serialize
filtered = sentences[0].filter(upos='NOUN')
conllu_output = filtered.serialize()
```
## Capabilities

### Core Parsing Functions

Primary parsing functions that convert CoNLL-U formatted strings into Python data structures. These functions support custom field definitions and custom parsing logic.

```python { .api }
def parse(
    data: str,
    fields: Optional[Sequence[str]] = None,
    field_parsers: Optional[Dict[str, Callable[[List[str], int], Any]]] = None,
    metadata_parsers: Optional[Dict[str, Callable[[str, Optional[str]], Any]]] = None
) -> SentenceList:
    """
    Parse CoNLL-U formatted string into a SentenceList (flat list parsing).

    Args:
        data: CoNLL-U formatted string
        fields: Field names to use (defaults to DEFAULT_FIELDS)
        field_parsers: Custom parsers for specific fields
        metadata_parsers: Custom parsers for metadata lines

    Returns:
        SentenceList containing parsed sentences
    """

def parse_incr(
    in_file: TextIO,
    fields: Optional[Sequence[str]] = None,
    field_parsers: Optional[Dict[str, Callable[[List[str], int], Any]]] = None,
    metadata_parsers: Optional[Dict[str, Callable[[str, Optional[str]], Any]]] = None
) -> SentenceGenerator:
    """
    Incremental parsing from file/stream into a SentenceGenerator for memory efficiency.

    Args:
        in_file: File-like object to read from
        fields: Field names to use (defaults to DEFAULT_FIELDS)
        field_parsers: Custom parsers for specific fields
        metadata_parsers: Custom parsers for metadata lines

    Returns:
        SentenceGenerator for iterating over parsed sentences
    """

def parse_tree(data: str) -> List[TokenTree]:
    """
    Parse CoNLL-U formatted string into tree structure.

    Args:
        data: CoNLL-U formatted string

    Returns:
        List of TokenTree objects representing dependency trees
    """

def parse_tree_incr(in_file: TextIO) -> Iterator[TokenTree]:
    """
    Incremental tree parsing from file/stream.

    Args:
        in_file: File-like object to read from

    Returns:
        Iterator of TokenTree objects
    """
```
### Data Models

Core data structures for representing CoNLL-U data with built-in methods for manipulation, filtering, and conversion.

```python { .api }
class SentenceList(List[TokenList]):
    """
    List of sentences (TokenList objects) with metadata support.
    """
    def __init__(
        self,
        sentences: Optional[Iterable[TokenList]] = None,
        metadata: Optional[Metadata] = None
    ): ...

    metadata: Metadata

class TokenList(List[Token]):
    """
    List of tokens representing a sentence with metadata and filtering capabilities.
    """
    def __init__(
        self,
        tokens: Optional[Iterable[Token]] = None,
        metadata: Optional[Metadata] = None,
        default_fields: Optional[Iterable[str]] = None
    ): ...

    metadata: Metadata
    default_fields: Optional[Iterable[str]]

    def to_tree(self) -> TokenTree:
        """Convert token list to tree structure based on head dependencies."""

    def filter(self, **kwargs: Any) -> TokenList:
        """Filter tokens based on field conditions using exact match or callable."""

    def serialize(self) -> str:
        """Serialize TokenList back to CoNLL-U format."""

    @staticmethod
    def head_to_token(sentence: TokenList) -> Dict[int, List[Token]]:
        """Create head-to-children mapping for tree construction."""

class TokenTree:
    """
    Tree representation of tokens with parent-child relationships.
    """
    def __init__(
        self,
        token: Token,
        children: List[TokenTree],
        metadata: Optional[Metadata] = None
    ): ...

    token: Token
    children: List[TokenTree]
    metadata: Optional[Metadata]

    def to_list(self) -> TokenList:
        """Flatten tree back to token list."""

    def serialize(self) -> str:
        """Serialize tree to CoNLL-U format."""

    def print_tree(
        self,
        depth: int = 0,
        indent: int = 4,
        exclude_fields: Sequence[str] = DEFAULT_EXCLUDE_FIELDS
    ) -> None:
        """Print tree structure to console."""

    def set_metadata(self, metadata: Optional[Metadata]) -> None:
        """Set metadata for the tree."""

class Token(dict):
    """
    Dictionary representing a single token with field mappings and aliases.
    """
    MAPPING: Dict[str, str]  # Field name aliases (upos<->upostag, xpos<->xpostag)

    def get(self, key: str, default: Optional[Any] = None) -> Any:
        """Get field value with automatic alias resolution."""

class Metadata(dict):
    """
    Dictionary for storing sentence/document metadata from comment lines.
    """

class SentenceGenerator(Iterable[TokenList]):
    """
    Iterator for incremental sentence processing to handle large files efficiently.
    """
    def __init__(
        self,
        sentences: Iterator[TokenList],
        metadata: Optional[Metadata] = None
    ): ...

    sentences: Iterator[TokenList]
    metadata: Metadata
```
### Parsing and Serialization Utilities

Low-level parsing functions and serialization utilities for custom parsing scenarios and advanced usage.

```python { .api }
def parse_sentences(in_file: TextIO) -> Iterator[str]:
    """
    Split input stream into individual sentence strings.

    Args:
        in_file: File-like object to read from

    Returns:
        Iterator of sentence strings (raw CoNLL-U blocks)
    """

def parse_token_and_metadata(
    data: str,
    fields: Optional[Sequence[str]] = None,
    field_parsers: Optional[Dict[str, Callable[[List[str], int], Any]]] = None,
    metadata_parsers: Optional[Dict[str, Callable[[str, Optional[str]], Any]]] = None
) -> TokenList:
    """
    Parse single sentence data into TokenList with metadata.

    Args:
        data: Single sentence CoNLL-U data
        fields: Field names to use
        field_parsers: Custom field parsers
        metadata_parsers: Custom metadata parsers

    Returns:
        TokenList representing the sentence
    """

def serialize(tokenlist: TokenList) -> str:
    """
    Serialize TokenList to CoNLL-U format string.

    Args:
        tokenlist: TokenList to serialize

    Returns:
        CoNLL-U formatted string
    """

def serialize_field(field: Any) -> str:
    """
    Serialize individual field value to string representation.

    Args:
        field: Field value to serialize

    Returns:
        String representation suitable for CoNLL-U format
    """
```
### Field Parsing Functions

Specialized functions for parsing individual CoNLL-U field types with proper validation and type conversion.

```python { .api }
def parse_line(
    line: str,
    fields: Sequence[str],
    field_parsers: Optional[Dict[str, Callable[[List[str], int], Any]]] = None
) -> Token:
    """
    Parse single token line into Token object.

    Args:
        line: Single token line from CoNLL-U data
        fields: Field names for the columns
        field_parsers: Custom parsers for specific fields

    Returns:
        Token object representing the parsed line
    """

def parse_comment_line(
    line: str,
    metadata_parsers: Optional[Dict[str, Callable[[str, Optional[str]], Any]]] = None
) -> List[Tuple[str, Optional[str]]]:
    """
    Parse metadata comment line into key-value pairs.

    Args:
        line: Comment line starting with '#'
        metadata_parsers: Custom metadata parsers

    Returns:
        List of (key, value) tuples from the comment
    """

def parse_int_value(value: str) -> Optional[int]:
    """
    Parse integer field values, handling '_' as None.

    Args:
        value: String value to parse

    Returns:
        Parsed integer or None for '_'
    """

def parse_id_value(value: str) -> Optional[Union[int, Tuple[int, str, int]]]:
    """
    Parse ID field supporting single IDs, ranges, and decimal IDs.

    Args:
        value: ID field value

    Returns:
        Parsed ID as int, tuple for ranges/decimals, or None
    """

def parse_dict_value(value: str) -> Optional[Dict[str, Optional[str]]]:
    """
    Parse feature dictionaries from pipe-separated key=value pairs.

    Args:
        value: Feature string (e.g., "Case=Nom|Number=Sing")

    Returns:
        Dictionary of features or None for '_'
    """

def parse_nullable_value(value: str) -> Optional[str]:
    """
    Parse nullable string values, converting '_' to None.

    Args:
        value: String value to parse

    Returns:
        String value or None for empty/'_' values
    """

def parse_paired_list_value(value: str) -> Union[Optional[str], List[Tuple[str, Optional[Union[int, Tuple[int, str, int]]]]]]:
    """
    Parse dependency relations from dependency field values.

    Args:
        value: Dependency field value (e.g., "4:nsubj|5:conj")

    Returns:
        List of (relation, head_id) tuples or None for '_'
    """

def parse_pair_value(value: str) -> Tuple[str, Optional[str]]:
    """
    Parse key=value pairs, splitting on the first '=' character.

    Args:
        value: String potentially containing key=value pair

    Returns:
        Tuple of (key, value) where value is None if no '=' found
    """
```
### Utility Functions

Helper functions for advanced data manipulation and tree traversal.

```python { .api }
def traverse_dict(obj: Mapping[str, T], query: str) -> Optional[T]:
    """
    Navigate nested dictionaries using '__' separated query strings.

    Args:
        obj: Dictionary-like object to traverse
        query: Query string with '__' separators (e.g., 'feats__Case')

    Returns:
        Value at query path or None if path doesn't exist
    """
```
## Types

```python { .api }
# Type aliases for function signatures
FieldParserType = Callable[[List[str], int], Any]
MetadataParserType = Callable[[str, Optional[str]], Any]
IdType = Union[int, Tuple[int, str, int]]

# Default field configuration
DEFAULT_FIELDS: Tuple[str, ...] = (
    'id', 'form', 'lemma', 'upos', 'xpos', 'feats',
    'head', 'deprel', 'deps', 'misc'
)

DEFAULT_FIELD_PARSERS: Dict[str, FieldParserType] = {
    "id": parse_id_value,
    "xpos": parse_nullable_value,
    "feats": parse_dict_value,
    "head": parse_int_value,
    "deps": parse_paired_list_value,
    "misc": parse_dict_value,
}

DEFAULT_METADATA_PARSERS: Dict[str, MetadataParserType] = {
    "newpar": lambda key, value: (key, value),
    "newdoc": lambda key, value: (key, value),
}

DEFAULT_EXCLUDE_FIELDS: Tuple[str, ...] = (
    'id', 'deprel', 'xpos', 'feats', 'head', 'deps', 'misc'
)
```
## Exceptions

```python { .api }
class ParseException(Exception):
    """
    Exception raised for parsing errors in CoNLL-U data.

    Raised when:
    - Invalid line format (missing tabs/spaces)
    - Invalid field values
    - Tree construction failures
    - Invalid comment format
    """
```
## Advanced Usage Examples

### Custom Field Parsing

```python
import conllu

# Define custom parser for a non-standard field.
# Field parsers receive the list of column values and the index of
# the field to parse (see FieldParserType above).
def parse_custom_field(line_parts, field_index):
    value = line_parts[field_index]
    if value == '_':
        return None
    return value.upper()  # Custom transformation

# Use custom parser
custom_parsers = {'misc': parse_custom_field}
sentences = conllu.parse(data, field_parsers=custom_parsers)
```
### Filtering and Analysis

```python
# `sentence` is a single TokenList, e.g. sentences[0] from above

# Filter tokens by part-of-speech
nouns = sentence.filter(upos='NOUN')

# Filter using a callable for complex conditions
def is_long_word(form):
    return len(form) > 5

long_words = sentence.filter(form=is_long_word)

# Navigate nested features
adjectives = sentence.filter(feats__Degree='Pos')
```
### Tree Operations

```python
# Convert to tree and traverse
tree = sentence.to_tree()
print(f"Root: {tree.token['form']}")

# Print tree structure
tree.print_tree(indent=2)

# Convert back to flat list
flat_sentence = tree.to_list()
```
### Incremental Processing

```python
# Process large files efficiently
with open('large_corpus.conllu', 'r') as f:
    for sentence in conllu.parse_incr(f):
        # Process each sentence individually
        words = [token['form'] for token in sentence]
        print(' '.join(words))
```