tessl/pypi-conllu

CoNLL-U Parser parses a CoNLL-U formatted string into a nested Python dictionary

Describes
pypipkg:pypi/conllu@6.0.x

To install, run

npx @tessl/cli install tessl/pypi-conllu@6.0.0

# CoNLL-U Parser

CoNLL-U Parser parses a CoNLL-U formatted string into a nested Python dictionary. CoNLL-U is a common output format for natural language processing tasks. This library provides comprehensive parsing, tree conversion, filtering, and serialization capabilities for CoNLL-U data, with zero dependencies and full typing support.

## Package Information

- **Package Name**: conllu
- **Package Type**: pypi
- **Language**: Python
- **Installation**: `pip install conllu`
- **Requirements**: Python 3.8+
- **Dependencies**: None (zero dependencies)

## Core Imports

```python
import conllu
```

Common patterns for parsing:

```python
from conllu import parse, parse_tree, parse_incr, parse_tree_incr
```

Import data models:

```python
from conllu import Token, TokenList, TokenTree, SentenceList, Metadata
```

## Basic Usage

```python
import conllu

# Parse CoNLL-U data into a flat sentence list
data = """# text = The quick brown fox
1	The	the	DET	DT	Definite=Def|PronType=Art	4	det	_	_
2	quick	quick	ADJ	JJ	Degree=Pos	4	amod	_	_
3	brown	brown	ADJ	JJ	Degree=Pos	4	amod	_	_
4	fox	fox	NOUN	NN	Number=Sing	0	root	_	_
"""

# Parse into flat list structure
sentences = conllu.parse(data)
print(f"Parsed {len(sentences)} sentences")
print(f"First sentence has {len(sentences[0])} tokens")

# Parse into tree structure
trees = conllu.parse_tree(data)
print(f"First tree root: {trees[0].token['form']}")

# Incremental parsing from file
with open('data.conllu', 'r') as f:
    for sentence in conllu.parse_incr(f):
        print(f"Sentence: {sentence.metadata.get('text', 'No text')}")

# Filter and serialize
filtered = sentences[0].filter(upos='NOUN')
conllu_output = filtered.serialize()
```

## Capabilities

### Core Parsing Functions

Primary parsing functions that convert CoNLL-U formatted strings into Python data structures. These functions support custom field definitions and custom parsing logic.

```python { .api }
def parse(
    data: str,
    fields: Optional[Sequence[str]] = None,
    field_parsers: Optional[Dict[str, Callable[[List[str], int], Any]]] = None,
    metadata_parsers: Optional[Dict[str, Callable[[str, Optional[str]], Any]]] = None
) -> SentenceList:
    """
    Parse a CoNLL-U formatted string into a SentenceList (flat list parsing).

    Args:
        data: CoNLL-U formatted string
        fields: Field names to use (defaults to DEFAULT_FIELDS)
        field_parsers: Custom parsers for specific fields
        metadata_parsers: Custom parsers for metadata lines

    Returns:
        SentenceList containing parsed sentences
    """

def parse_incr(
    in_file: TextIO,
    fields: Optional[Sequence[str]] = None,
    field_parsers: Optional[Dict[str, Callable[[List[str], int], Any]]] = None,
    metadata_parsers: Optional[Dict[str, Callable[[str, Optional[str]], Any]]] = None
) -> SentenceGenerator:
    """
    Parse incrementally from a file/stream into a SentenceGenerator for memory efficiency.

    Args:
        in_file: File-like object to read from
        fields: Field names to use (defaults to DEFAULT_FIELDS)
        field_parsers: Custom parsers for specific fields
        metadata_parsers: Custom parsers for metadata lines

    Returns:
        SentenceGenerator for iterating over parsed sentences
    """

def parse_tree(data: str) -> List[TokenTree]:
    """
    Parse a CoNLL-U formatted string into a tree structure.

    Args:
        data: CoNLL-U formatted string

    Returns:
        List of TokenTree objects representing dependency trees
    """

def parse_tree_incr(in_file: TextIO) -> Iterator[TokenTree]:
    """
    Parse trees incrementally from a file/stream.

    Args:
        in_file: File-like object to read from

    Returns:
        Iterator of TokenTree objects
    """
```

### Data Models

Core data structures for representing CoNLL-U data with built-in methods for manipulation, filtering, and conversion.

```python { .api }
class SentenceList(List[TokenList]):
    """
    List of sentences (TokenList objects) with metadata support.
    """
    def __init__(
        self,
        sentences: Optional[Iterable[TokenList]] = None,
        metadata: Optional[Metadata] = None
    ): ...

    metadata: Metadata

class TokenList(List[Token]):
    """
    List of tokens representing a sentence with metadata and filtering capabilities.
    """
    def __init__(
        self,
        tokens: Optional[Iterable[Token]] = None,
        metadata: Optional[Metadata] = None,
        default_fields: Optional[Iterable[str]] = None
    ): ...

    metadata: Metadata
    default_fields: Optional[Iterable[str]]

    def to_tree(self) -> TokenTree:
        """Convert the token list to a tree structure based on head dependencies."""

    def filter(self, **kwargs: Any) -> TokenList:
        """Filter tokens based on field conditions using exact match or a callable."""

    def serialize(self) -> str:
        """Serialize the TokenList back to CoNLL-U format."""

    @staticmethod
    def head_to_token(sentence: TokenList) -> Dict[int, List[Token]]:
        """Create a head-to-children mapping for tree construction."""

class TokenTree:
    """
    Tree representation of tokens with parent-child relationships.
    """
    def __init__(
        self,
        token: Token,
        children: List[TokenTree],
        metadata: Optional[Metadata] = None
    ): ...

    token: Token
    children: List[TokenTree]
    metadata: Optional[Metadata]

    def to_list(self) -> TokenList:
        """Flatten the tree back to a token list."""

    def serialize(self) -> str:
        """Serialize the tree to CoNLL-U format."""

    def print_tree(
        self,
        depth: int = 0,
        indent: int = 4,
        exclude_fields: Sequence[str] = DEFAULT_EXCLUDE_FIELDS
    ) -> None:
        """Print the tree structure to the console."""

    def set_metadata(self, metadata: Optional[Metadata]) -> None:
        """Set metadata for the tree."""

class Token(dict):
    """
    Dictionary representing a single token with field mappings and aliases.
    """
    MAPPING: Dict[str, str]  # Field name aliases (upos<->upostag, xpos<->xpostag)

    def get(self, key: str, default: Optional[Any] = None) -> Any:
        """Get a field value with automatic alias resolution."""

class Metadata(dict):
    """
    Dictionary for storing sentence/document metadata from comment lines.
    """

class SentenceGenerator(Iterable[TokenList]):
    """
    Iterator for incremental sentence processing to handle large files efficiently.
    """
    def __init__(
        self,
        sentences: Iterator[TokenList],
        metadata: Optional[Metadata] = None
    ): ...

    sentences: Iterator[TokenList]
    metadata: Metadata
```

### Parsing and Serialization Utilities

Low-level parsing functions and serialization utilities for custom parsing scenarios and advanced usage.

```python { .api }
def parse_sentences(in_file: TextIO) -> Iterator[str]:
    """
    Split an input stream into individual sentence strings.

    Args:
        in_file: File-like object to read from

    Returns:
        Iterator of sentence strings (raw CoNLL-U blocks)
    """

def parse_token_and_metadata(
    data: str,
    fields: Optional[Sequence[str]] = None,
    field_parsers: Optional[Dict[str, Callable[[List[str], int], Any]]] = None,
    metadata_parsers: Optional[Dict[str, Callable[[str, Optional[str]], Any]]] = None
) -> TokenList:
    """
    Parse single-sentence data into a TokenList with metadata.

    Args:
        data: Single-sentence CoNLL-U data
        fields: Field names to use
        field_parsers: Custom field parsers
        metadata_parsers: Custom metadata parsers

    Returns:
        TokenList representing the sentence
    """

def serialize(tokenlist: TokenList) -> str:
    """
    Serialize a TokenList to a CoNLL-U format string.

    Args:
        tokenlist: TokenList to serialize

    Returns:
        CoNLL-U formatted string
    """

def serialize_field(field: Any) -> str:
    """
    Serialize an individual field value to its string representation.

    Args:
        field: Field value to serialize

    Returns:
        String representation suitable for CoNLL-U format
    """
```

### Field Parsing Functions

294

295

Specialized functions for parsing individual CoNLL-U field types with proper validation and type conversion.

296

297

```python { .api }

298

def parse_line(

299

line: str,

300

fields: Sequence[str],

301

field_parsers: Optional[Dict[str, Callable[[List[str], int], Any]]] = None

302

) -> Token:

303

"""

304

Parse single token line into Token object.

305

306

Args:

307

line: Single token line from CoNLL-U data

308

fields: Field names for the columns

309

field_parsers: Custom parsers for specific fields

310

311

Returns:

312

Token object representing the parsed line

313

"""

314

315

def parse_comment_line(

316

line: str,

317

metadata_parsers: Optional[Dict[str, Callable[[str, Optional[str]], Any]]] = None

318

) -> List[Tuple[str, Optional[str]]]:

319

"""

320

Parse metadata comment line into key-value pairs.

321

322

Args:

323

line: Comment line starting with '#'

324

metadata_parsers: Custom metadata parsers

325

326

Returns:

327

List of (key, value) tuples from the comment

328

"""

329

330

def parse_int_value(value: str) -> Optional[int]:

331

"""

332

Parse integer field values, handling '_' as None.

333

334

Args:

335

value: String value to parse

336

337

Returns:

338

Parsed integer or None for '_'

339

"""

340

341

def parse_id_value(value: str) -> Optional[Union[int, Tuple[int, str, int]]]:

342

"""

343

Parse ID field supporting single IDs, ranges, and decimal IDs.

344

345

Args:

346

value: ID field value

347

348

Returns:

349

Parsed ID as int, tuple for ranges/decimals, or None

350

"""

351

352

def parse_dict_value(value: str) -> Optional[Dict[str, Optional[str]]]:

353

"""

354

Parse feature dictionaries from pipe-separated key=value pairs.

355

356

Args:

357

value: Feature string (e.g., "Case=Nom|Number=Sing")

358

359

Returns:

360

Dictionary of features or None for '_'

361

"""

362

363

def parse_nullable_value(value: str) -> Optional[str]:

364

"""

365

Parse nullable string values, converting '_' to None.

366

367

Args:

368

value: String value to parse

369

370

Returns:

371

String value or None for empty/'_' values

372

"""

373

374

def parse_paired_list_value(value: str) -> Union[Optional[str], List[Tuple[str, Optional[Union[int, Tuple[int, str, int]]]]]]:

375

"""

376

Parse dependency relations from dependency field values.

377

378

Args:

379

value: Dependency field value (e.g., "4:nsubj|5:conj")

380

381

Returns:

382

List of (relation, head_id) tuples or None for '_'

383

"""

384

385

def parse_pair_value(value: str) -> Tuple[str, Optional[str]]:

386

"""

387

Parse key=value pairs, splitting on the first '=' character.

388

389

Args:

390

value: String potentially containing key=value pair

391

392

Returns:

393

Tuple of (key, value) where value is None if no '=' found

394

"""

395

```

396

397

### Utility Functions

398

399

Helper functions for advanced data manipulation and tree traversal.

400

401

```python { .api }

402

def traverse_dict(obj: Mapping[str, T], query: str) -> Optional[T]:

403

"""

404

Navigate nested dictionaries using '__' separated query strings.

405

406

Args:

407

obj: Dictionary-like object to traverse

408

query: Query string with '__' separators (e.g., 'feats__Case')

409

410

Returns:

411

Value at query path or None if path doesn't exist

412

"""

413

```

414

415

## Types

```python { .api }
# Type aliases for function signatures
FieldParserType = Callable[[List[str], int], Any]
MetadataParserType = Callable[[str, Optional[str]], Any]
IdType = Union[int, Tuple[int, str, int]]

# Default field configuration
DEFAULT_FIELDS: Tuple[str, ...] = (
    'id', 'form', 'lemma', 'upos', 'xpos', 'feats',
    'head', 'deprel', 'deps', 'misc'
)

DEFAULT_FIELD_PARSERS: Dict[str, FieldParserType] = {
    "id": parse_id_value,
    "xpos": parse_nullable_value,
    "feats": parse_dict_value,
    "head": parse_int_value,
    "deps": parse_paired_list_value,
    "misc": parse_dict_value,
}

DEFAULT_METADATA_PARSERS: Dict[str, MetadataParserType] = {
    "newpar": lambda key, value: (key, value),
    "newdoc": lambda key, value: (key, value),
}

DEFAULT_EXCLUDE_FIELDS: Tuple[str, ...] = (
    'id', 'deprel', 'xpos', 'feats', 'head', 'deps', 'misc'
)
```

## Exceptions

```python { .api }
class ParseException(Exception):
    """
    Exception raised for parsing errors in CoNLL-U data.

    Raised when:
    - Invalid line format (missing tabs/spaces)
    - Invalid field values
    - Tree construction failures
    - Invalid comment format
    """
```

## Advanced Usage Examples

### Custom Field Parsing

```python
import conllu

# Minimal sample with a non-default 'misc' value (illustrative data)
data = "1\tfox\tfox\tNOUN\tNN\t_\t0\troot\t_\tSpaceAfter=No\n"

# Define custom parser for a non-standard field
def parse_custom_field(line_parts, field_index):
    value = line_parts[field_index]
    if value == '_':
        return None
    return value.upper()  # Custom transformation

# Use custom parser
custom_parsers = {'misc': parse_custom_field}
sentences = conllu.parse(data, field_parsers=custom_parsers)
```

### Filtering and Analysis

```python
# 'sentence' is a TokenList produced by conllu.parse(...)

# Filter tokens by part-of-speech
nouns = sentence.filter(upos='NOUN')

# Filter using callable for complex conditions
def is_long_word(form):
    return len(form) > 5

long_words = sentence.filter(form=is_long_word)

# Navigate nested features
adjectives = sentence.filter(feats__Degree='Pos')
```

### Tree Operations

```python
# 'sentence' is a TokenList produced by conllu.parse(...)

# Convert to tree and traverse
tree = sentence.to_tree()
print(f"Root: {tree.token['form']}")

# Print tree structure
tree.print_tree(indent=2)

# Convert back to flat list
flat_sentence = tree.to_list()
```

### Incremental Processing

```python
import conllu

# Process large files efficiently
with open('large_corpus.conllu', 'r') as f:
    for sentence in conllu.parse_incr(f):
        # Process each sentence individually
        words = [token['form'] for token in sentence]
        print(' '.join(words))
```