Tessl Tile for pypi/gfftk@25.6.0

# GFF3 and GTF Processing

Comprehensive parsing and manipulation of GFF3 and GTF format files with support for multiple annotation sources, robust validation, and flexible output options. Handles complex gene models with alternative splicing and provides the foundation for all format conversion operations.

## Capabilities

### GFF3 Parsing

Parse GFF3 files into the central annotation dictionary format with support for multiple annotation sources and validation.

```python { .api }
def gff2dict(gff, fasta, annotation=False, table=1, debug=False, gap_filter=False, gff_format="auto", logger=sys.stderr.write):
    """
    Parse GFF3 file to annotation dictionary.

    Parameters:
    - gff (str): Path to input GFF3 file
    - fasta (str): Path to genome FASTA file for sequence validation
    - annotation (dict|bool): Pre-existing annotation dictionary to extend, or False
    - table (int): Genetic code table for translation (1 or 11)
    - debug (bool): Enable debug output for parsing errors
    - gap_filter (bool): Filter out models with sequence gaps
    - gff_format (str): GFF format variant ("auto", "default", "miniprot", etc.)
    - logger (function): Logging function for error messages

    Returns:
    dict: Annotation dictionary with gene_id as keys and gene data as values
    """

def dict2gff3(infile, output=False, debug=False, source=False, newline=False):
    """
    Write annotation dictionary to GFF3 format.

    Parameters:
    - infile (dict): Annotation dictionary to write
    - output (str|bool): Output file path, or False for stdout
    - debug (bool): Enable debug output
    - source (str|bool): Override source field in output
    - newline (bool): Add newlines between gene records

    Returns:
    None
    """

def dict2gff3alignments(infile, output=False, debug=False, alignments=False, source=False, newline=False):
    """
    Write annotation dictionary to GFF3 alignments format for EVM evidence.

    Parameters:
    - infile (dict): Annotation dictionary to write
    - output (str|bool): Output file path, or False for stdout
    - debug (bool): Enable debug output
    - alignments (dict|bool): Alignment data structure for evidence formatting
    - source (str|bool): Override source field in output
    - newline (bool): Add newlines between records

    Returns:
    None
    """
```
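
To make the parsing step concrete, the following is a minimal sketch of how a single GFF3 feature line splits into its nine tab-separated columns and how the attributes column becomes a dictionary. It uses only the standard library and is not gfftk's implementation; `parse_gff3_line` is a hypothetical helper named for illustration.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Split one GFF3 feature line into a dict (hypothetical helper, not part of gfftk)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    contig, source, feature, start, end, score, strand, phase, attr_text = cols

    attributes = {}
    if attr_text.strip() != ".":
        for pair in attr_text.split(";"):
            pair = pair.strip()
            if not pair:
                continue
            key, _, value = pair.partition("=")
            # GFF3 percent-encodes reserved characters; commas separate multiple values
            attributes[key] = [unquote(v) for v in value.split(",")]

    return {
        "contig": contig,
        "source": source,
        "type": feature,
        "start": int(start),  # GFF3 coordinates are 1-based and inclusive
        "end": int(end),
        "score": score,
        "strand": strand,
        "phase": phase,
        "attributes": attributes,
    }


record = parse_gff3_line(
    "chr1\tAugustus\tmRNA\t100\t900\t.\t+\t.\t"
    "ID=gene1-T1;Parent=gene1;product=hypothetical%20protein"
)
print(record["type"], record["start"], record["end"], record["attributes"]["Parent"])
```

A full parser additionally has to link child features to their parents through `ID` and `Parent`, which is the part that differs most between annotation sources.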

### GTF Parsing

Parse GTF files, with support for the dialects produced by different annotation sources (for example GeneMark and JGI).

```python { .api }
def gtf2dict(gtf, fasta, annotation=False, table=1, debug=False, gap_filter=False, gtf_format="auto", logger=sys.stderr.write):
    """
    Parse GTF file to annotation dictionary.

    Parameters:
    - gtf (str): Path to input GTF file
    - fasta (str): Path to genome FASTA file for sequence validation
    - annotation (dict|bool): Pre-existing annotation dictionary to extend, or False
    - table (int): Genetic code table for translation (1 or 11)
    - debug (bool): Enable debug output for parsing errors
    - gap_filter (bool): Filter out models with sequence gaps
    - gtf_format (str): GTF format variant ("auto", "default", "genemark", "jgi")
    - logger (function): Logging function for error messages

    Returns:
    dict: Annotation dictionary with gene_id as keys and gene data as values
    """

def dict2gtf(infile, output=False, source=False):
    """
    Write annotation dictionary to GTF format.

    Parameters:
    - infile (dict): Annotation dictionary to write
    - output (str|bool): Output file path, or False for stdout
    - source (str|bool): Override source field in output

    Returns:
    None
    """
```
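
GTF differs from GFF3 mainly in its ninth column, which holds space-separated `key "value";` pairs instead of `key=value` pairs. The sketch below shows one way to read that column with a regular expression; it is an illustration under that assumption, not gfftk's parser, and `parse_gtf_attributes` is a hypothetical name.

```python
import re

# key, then either a double-quoted value or a bare token, then an optional semicolon
_GTF_ATTR = re.compile(r'(\S+)\s+(?:"([^"]*)"|([^\s;]+))\s*;?')


def parse_gtf_attributes(text):
    """Return {key: [values]} for a GTF attribute column (hypothetical helper)."""
    attrs = {}
    for key, quoted, bare in _GTF_ATTR.findall(text):
        # Some tools leave numeric values unquoted, so accept both forms
        attrs.setdefault(key, []).append(quoted or bare)
    return attrs


gtf_attrs = parse_gtf_attributes('gene_id "g1"; transcript_id "g1.t1"; exon_number 2;')
print(gtf_attrs)
```

Values are collected into lists because a key may legitimately repeat within one record.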

### Validation and Translation

Validate gene models and generate protein translations with comprehensive error checking.

```python { .api }
def validate_models(annotation, fadict, logger=sys.stderr.write, table=1, gap_filter=False):
    """
    Validate gene model structure and sequences.

    Parameters:
    - annotation (dict): Annotation dictionary to validate
    - fadict (dict): Genome sequences dictionary
    - logger (function): Logging function for error messages
    - table (int): Genetic code table for validation
    - gap_filter (bool): Filter out models with sequence gaps

    Returns:
    dict: Validated annotation dictionary
    """

def validate_and_translate_models(annotation, fadict, logger=sys.stderr.write, table=1):
    """
    Validate gene models and generate protein translations.

    Parameters:
    - annotation (dict): Annotation dictionary to process
    - fadict (dict): Genome sequences dictionary
    - logger (function): Logging function for error messages
    - table (int): Genetic code table for translation

    Returns:
    dict: Annotation dictionary with validated translations
    """
```
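
The exact checks these functions perform are not listed here. As a rough illustration of what validating a coding sequence usually involves, the sketch below inspects an already spliced, strand-corrected CDS string. `check_cds` is a hypothetical helper, and requiring an `ATG` start is a simplification (table 11 also permits alternative start codons).

```python
# Stop codons shared by NCBI genetic code tables 1 and 11
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_cds(cds):
    """Return a list of problems found in a CDS string (hypothetical helper)."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length is not a multiple of 3")
    # Only complete codons are examined; a trailing partial codon is ignored here
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if not codons or codons[0] != "ATG":
        problems.append("does not start with ATG")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("does not end with a stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal stop codon")
    if "N" in cds:
        problems.append("contains gap characters (N)")
    return problems


print(check_cds("ATGAAATAA"))
print(check_cds("ATGTAAAAATAA"))
```

An empty list means none of these particular checks failed; it does not imply the model would pass gfftk's own validation.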

### Specialized Parsers

Internal parsers for the GFF3 and GTF dialects produced by different annotation sources. The leading underscore marks them as private by Python convention.

```python { .api }
def _gff_default_parser(gff, fasta, Genes):
    """
    Default GFF3 parser implementation.

    Parameters:
    - gff (str): Path to GFF3 file
    - fasta (str): Path to genome FASTA file
    - Genes (dict): Annotation dictionary to populate

    Returns:
    dict: Updated annotation dictionary
    """

def _gff_miniprot_parser(gff, fasta, Genes):
    """
    Miniprot-specific GFF3 parser for protein alignments.

    Parameters:
    - gff (str): Path to miniprot GFF3 file
    - fasta (str): Path to genome FASTA file
    - Genes (dict): Annotation dictionary to populate

    Returns:
    dict: Updated annotation dictionary
    """

def _gff_alignment_parser(gff, fasta, Genes):
    """
    Alignment GFF3 parser for transcript/protein alignments.

    Parameters:
    - gff (str): Path to alignment GFF3 file
    - fasta (str): Path to genome FASTA file
    - Genes (dict): Annotation dictionary to populate

    Returns:
    dict: Updated annotation dictionary
    """

def _gff_ncbi_parser(gff, fasta, Genes):
    """
    NCBI GFF3 parser for NCBI-formatted annotations.

    Parameters:
    - gff (str): Path to NCBI GFF3 file
    - fasta (str): Path to genome FASTA file
    - Genes (dict): Annotation dictionary to populate

    Returns:
    dict: Updated annotation dictionary
    """

def _gtf_default_parser(gtf, fasta, Genes, gtf_format="default"):
    """
    Default GTF parser implementation.

    Parameters:
    - gtf (str): Path to GTF file
    - fasta (str): Path to genome FASTA file
    - Genes (dict): Annotation dictionary to populate
    - gtf_format (str): GTF format variant

    Returns:
    dict: Updated annotation dictionary
    """

def _gtf_genemark_parser(gtf, fasta, Genes, gtf_format="genemark"):
    """
    GeneMark GTF parser for GeneMark-specific format.

    Parameters:
    - gtf (str): Path to GeneMark GTF file
    - fasta (str): Path to genome FASTA file
    - Genes (dict): Annotation dictionary to populate
    - gtf_format (str): GTF format variant

    Returns:
    dict: Updated annotation dictionary
    """

def _gtf_jgi_parser(gtf, fasta, Genes, gtf_format="jgi"):
    """
    JGI GTF parser for JGI-specific format.

    Parameters:
    - gtf (str): Path to JGI GTF file
    - fasta (str): Path to genome FASTA file
    - Genes (dict): Annotation dictionary to populate
    - gtf_format (str): GTF format variant

    Returns:
    dict: Updated annotation dictionary
    """
```
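
Because the parsers above share a `(path, fasta, Genes)` shape, a format name can be mapped to a parser through a lookup table. The sketch below shows that general pattern with stand-in functions; how gfftk actually selects a parser, including what `"auto"` does, is not documented here, and every name in the block is hypothetical.

```python
def parse_default(gff, fasta, genes):
    """Stand-in for a default-dialect parser (hypothetical stub)."""
    return genes


def parse_miniprot(gff, fasta, genes):
    """Stand-in for a miniprot-dialect parser (hypothetical stub)."""
    return genes


# Format name -> parser function; all entries share the same call signature
PARSERS = {
    "default": parse_default,
    "miniprot": parse_miniprot,
}


def select_parser(fmt):
    """Look up the parser for a format name, failing loudly on unknown names."""
    try:
        return PARSERS[fmt]
    except KeyError:
        raise ValueError(
            f"unknown format {fmt!r}; expected one of {sorted(PARSERS)}"
        ) from None


parser = select_parser("miniprot")
genes = parser("alignments.gff3", "genome.fasta", {})
print(parser.__name__, genes)
```

Keeping one signature across parsers is what makes the table approach work: supporting a new dialect means adding one function and one dictionary entry.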

### GO Term Processing

Process and simplify Gene Ontology term lists for cleaner annotation output.

```python { .api }
def simplifyGO(inputList):
    """
    Simplify Gene Ontology term list format.

    Parameters:
    - inputList (list): List of GO terms in various formats

    Returns:
    list: Simplified GO term list
    """
```
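
The input formats that `simplifyGO` accepts are not spelled out here. As an illustration of the general idea, the sketch below reduces free-form GO annotations to bare `GO:0000000` identifiers, which is the `GOTerm` format listed under Types. `extract_go_ids` is a hypothetical helper and may not match `simplifyGO`'s actual behavior.

```python
import re

# A GO identifier is the prefix "GO:" followed by seven digits
GO_ID = re.compile(r"GO:\d{7}")


def extract_go_ids(terms):
    """Return unique GO identifiers in first-seen order (hypothetical helper)."""
    seen = []
    for term in terms:
        for go_id in GO_ID.findall(term):
            if go_id not in seen:
                seen.append(go_id)
    return seen


go_ids = extract_go_ids([
    "GO:0005524",
    "GO_function: GO:0003677 - DNA binding",
    "GO:0005524",
])
print(go_ids)
```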

### Sequence Gap Handling

Handle start and end gaps in genomic sequences during parsing and validation.

```python { .api }
def start_end_gap(seq, coords):
    """
    Handle start/end gaps in genomic sequences.

    Parameters:
    - seq (str): Genomic sequence
    - coords (list): List of coordinate tuples

    Returns:
    tuple: Adjusted coordinates and gap information
    """
```
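
One plausible reading of start/end gap handling is trimming runs of `N` from either end of a feature's sequence and shifting the outer coordinates to match. The sketch below implements that reading for a plus-strand feature with ascending, 1-based inclusive coordinates. It is an assumption for illustration: `trim_terminal_gaps` is hypothetical, and the return shape of `start_end_gap` may differ.

```python
def trim_terminal_gaps(seq, coords):
    """Strip leading/trailing N runs and adjust outer coordinates (hypothetical helper).

    seq is the concatenated plus-strand sequence of the segments in coords;
    coords is an ascending list of (start, end) tuples, 1-based and inclusive.
    """
    trimmed = seq.strip("Nn")
    if not trimmed:
        return "", []  # the whole feature is gap
    left = len(seq) - len(seq.lstrip("Nn"))
    right = len(seq) - len(seq.rstrip("Nn"))

    new_coords = list(coords)
    first_start, first_end = new_coords[0]
    last_start, last_end = new_coords[-1]
    if left >= first_end - first_start + 1 or right >= last_end - last_start + 1:
        raise ValueError("gap run covers a whole segment; not handled in this sketch")

    new_coords[0] = (first_start + left, first_end)
    # Re-read the last segment so a single-segment feature gets both adjustments
    last_start, last_end = new_coords[-1]
    new_coords[-1] = (last_start, last_end - right)
    return trimmed, new_coords


trimmed_seq, trimmed_coords = trim_terminal_gaps("NNATGAAATAAN", [(1, 6), (11, 16)])
print(trimmed_seq, trimmed_coords)
```

For a minus-strand feature the sequence is reverse-complemented, so the left end of the string maps to the highest coordinate; the sketch does not cover that case.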

## Usage Examples

### Basic GFF3 Parsing

```python
from gfftk.gff import gff2dict, dict2gff3

# Parse GFF3 file to annotation dictionary
annotation = gff2dict("input.gff3", "genome.fasta")

# Access gene information
for gene_id, gene_data in annotation.items():
    print(f"Gene: {gene_id}")
    print(f"Location: {gene_data['location']}")
    print(f"Strand: {gene_data['strand']}")
    print(f"Products: {gene_data['product']}")

# Write back to GFF3 format
dict2gff3(annotation, output="output.gff3")
```

### GTF Processing

```python
from gfftk.gff import gtf2dict, dict2gtf

# Parse GTF file
annotation = gtf2dict("input.gtf", "genome.fasta", debug=True)

# Write to GTF format with custom source
dict2gtf(annotation, output="output.gtf", source="custom_pipeline")
```

### Validation and Translation

```python
from gfftk.gff import gff2dict, validate_and_translate_models
from gfftk.fasta import fasta2dict

# Load data
annotation = gff2dict("annotation.gff3", "genome.fasta")
genome = fasta2dict("genome.fasta")

# Validate and generate translations
validated = validate_and_translate_models(annotation, genome, table=1)

# Access protein translations (one entry per transcript, parallel to 'ids')
for gene_id, gene_data in validated.items():
    for i, protein in enumerate(gene_data['protein']):
        transcript_id = gene_data['ids'][i]
        print(f"{transcript_id}: {protein}")
```
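
Once translations are available, a common next step is writing them out as FASTA. The sketch below does this for any dictionary that has the parallel `ids` and `protein` lists used in the loop above; `proteins_to_fasta` is a hypothetical helper, and the toy dictionary stands in for real gfftk output.

```python
def proteins_to_fasta(annotation, width=60):
    """Render protein translations as FASTA text (hypothetical helper).

    Assumes each gene entry has parallel 'ids' and 'protein' lists,
    as in the usage example above.
    """
    lines = []
    for gene_id, gene in annotation.items():
        for transcript_id, protein in zip(gene["ids"], gene["protein"]):
            if not protein:
                continue  # skip transcripts without a translation
            lines.append(f">{transcript_id} {gene_id}")
            # Wrap the sequence at a fixed width, as is conventional for FASTA
            lines.extend(protein[i:i + width] for i in range(0, len(protein), width))
    return "".join(line + "\n" for line in lines)


demo = {"gene1": {"ids": ["gene1-T1"], "protein": ["MKV" * 25]}}
fasta_text = proteins_to_fasta(demo)
print(fasta_text, end="")
```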

### Working with Different Sources

```python
from gfftk.gff import gff2dict

# Parse different annotation sources with debug output
augustus_annotation = gff2dict("augustus.gff3", "genome.fasta", debug=True)
ncbi_annotation = gff2dict("ncbi.gff3", "genome.fasta", debug=True)
miniprot_annotation = gff2dict("miniprot.gff3", "genome.fasta", debug=True)

# Combine annotations (example workflow)
# Note: dict.update() overwrites entries that share a gene ID, so later sources win
combined = {}
combined.update(augustus_annotation)
combined.update(ncbi_annotation)
combined.update(miniprot_annotation)
```
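
Combining dictionaries with `update()` silently replaces any gene whose ID appears in more than one source. When that matters, the clashes can be recorded instead. The sketch below keeps the first occurrence and reports the rest; `merge_annotations` and the source labels are hypothetical, and the toy dictionaries stand in for `gff2dict` output.

```python
def merge_annotations(*sources):
    """Merge (label, annotation) pairs, keeping the first gene seen per ID.

    Returns the merged dictionary and a list of (gene_id, label) tuples
    for entries that were skipped because the ID was already present.
    """
    merged = {}
    collisions = []
    for label, annotation in sources:
        for gene_id, gene in annotation.items():
            if gene_id in merged:
                collisions.append((gene_id, label))
                continue
            merged[gene_id] = gene
    return merged, collisions


augustus = {"g1": {"strand": "+"}}
ncbi = {"g1": {"strand": "-"}, "g2": {"strand": "+"}}
merged, collisions = merge_annotations(("augustus", augustus), ("ncbi", ncbi))
print(sorted(merged), collisions)
```

The `annotation` parameter of `gff2dict` is documented above as extending an existing dictionary, which may be the more direct route; how it treats clashing IDs is not stated here.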

## Types

```python { .api }
from typing import Callable

# Annotation dictionary structure (GeneAnnotation is detailed in the main index)
AnnotationDict = dict[str, GeneAnnotation]

# Parser function type: (file path, FASTA path, annotation dict) -> annotation dict
ParserFunction = Callable[[str, str, dict], dict]

# Logger function type
LoggerFunction = Callable[[str], None]

# Coordinate tuple format
CoordinateTuple = tuple[int, int]

# Feature coordinate list
FeatureCoordinates = list[CoordinateTuple]

# Gene Ontology term format
GOTerm = str  # Format: "GO:0000000"

# Database cross-reference format
DbXref = str  # Format: "database:identifier"
```
