# GFF3 and GTF Processing

Parse and manipulate GFF3 and GTF files from multiple annotation sources, with gene model validation and output to GFF3, GFF3 alignment, and GTF formats. Handles gene models with alternative splicing and provides the foundation for all format conversion operations.

## Capabilities

### GFF3 Parsing

Parse GFF3 files into the central annotation dictionary format with support for multiple annotation sources and validation.

```python { .api }
import sys

def gff2dict(gff, fasta, annotation=False, table=1, debug=False, gap_filter=False, gff_format="auto", logger=sys.stderr.write):
    """
    Parse GFF3 file to annotation dictionary.

    Parameters:
    - gff (str): Path to input GFF3 file
    - fasta (str): Path to genome FASTA file for sequence validation
    - annotation (dict|bool): Pre-existing annotation dictionary to extend, or False
    - table (int): Genetic code table for translation (1 or 11)
    - debug (bool): Enable debug output for parsing errors
    - gap_filter (bool): Filter out models with sequence gaps
    - gff_format (str): GFF format variant ("auto", "default", "miniprot", etc.)
    - logger (function): Logging function for error messages

    Returns:
    dict: Annotation dictionary with gene_id as keys and gene data as values
    """

def dict2gff3(infile, output=False, debug=False, source=False, newline=False):
    """
    Write annotation dictionary to GFF3 format.

    Parameters:
    - infile (dict): Annotation dictionary to write
    - output (str|bool): Output file path, or False for stdout
    - debug (bool): Enable debug output
    - source (str|bool): Override source field in output
    - newline (bool): Add newlines between gene records

    Returns:
    None
    """

def dict2gff3alignments(infile, output=False, debug=False, alignments=False, source=False, newline=False):
    """
    Write annotation dictionary to GFF3 alignments format for EVM evidence.

    Parameters:
    - infile (dict): Annotation dictionary to write
    - output (str|bool): Output file path, or False for stdout
    - debug (bool): Enable debug output
    - alignments (dict|bool): Alignment data structure for evidence formatting
    - source (str|bool): Override source field in output
    - newline (bool): Add newlines between records

    Returns:
    None
    """
```
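
For orientation, the input that a GFF3 parser consumes is a nine-column, tab-separated line whose last column holds `key=value` attributes. The sketch below shows how one such line can be split. It is not gfftk's implementation: `parse_gff3_line` and the keys of the returned dictionary are hypothetical names chosen for illustration only.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Split one GFF3 feature line into its nine columns (illustrative only)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols

    attributes = {}
    for pair in attrs.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        key, _, value = pair.partition("=")
        # A value may be a comma-separated list with percent-encoded characters.
        attributes[key.strip()] = [unquote(v) for v in value.split(",")]

    return {
        "contig": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),  # GFF3 coordinates are 1-based and inclusive
        "end": int(end),
        "strand": strand,
        "phase": phase,
        "attributes": attributes,
    }


example = (
    "chr1\tAugustus\tmRNA\t100\t900\t.\t+\t.\t"
    "ID=gene1-T1;Parent=gene1;product=hypothetical%20protein"
)
record = parse_gff3_line(example)
print(record["attributes"]["Parent"])
print(record["attributes"]["product"])
```

The `Parent` attribute is what links an mRNA to its gene and an exon or CDS to its mRNA, which is how a parser can group alternative transcripts under a single gene entry.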

### GTF Parsing

Parse GTF files with support for different GTF formats and dialects from various annotation sources.

```python { .api }
import sys

def gtf2dict(gtf, fasta, annotation=False, table=1, debug=False, gap_filter=False, gtf_format="auto", logger=sys.stderr.write):
    """
    Parse GTF file to annotation dictionary.

    Parameters:
    - gtf (str): Path to input GTF file
    - fasta (str): Path to genome FASTA file for sequence validation
    - annotation (dict|bool): Pre-existing annotation dictionary to extend, or False
    - table (int): Genetic code table for translation (1 or 11)
    - debug (bool): Enable debug output for parsing errors
    - gap_filter (bool): Filter out models with sequence gaps
    - gtf_format (str): GTF format variant ("auto", "default", "genemark", "jgi")
    - logger (function): Logging function for error messages

    Returns:
    dict: Annotation dictionary with gene_id as keys and gene data as values
    """

def dict2gtf(infile, output=False, source=False):
    """
    Write annotation dictionary to GTF format.

    Parameters:
    - infile (dict): Annotation dictionary to write
    - output (str|bool): Output file path, or False for stdout
    - source (str|bool): Override source field in output

    Returns:
    None
    """
```
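
GTF differs from GFF3 mainly in its ninth column, which uses `key "value";` pairs rather than `key=value`. The following sketch shows one way to read that column. It is an assumption-laden illustration, not gfftk's parser: `parse_gtf_attributes` is a hypothetical helper, and it only captures quoted values.

```python
import re

# GTF attributes look like: gene_id "g1"; transcript_id "g1.t1";
# Unquoted values (some dialects write exon_number 1) are not captured here.
_GTF_ATTR = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_gtf_attributes(field):
    """Return a dict mapping each attribute key to the list of its values."""
    attrs = {}
    for key, value in _GTF_ATTR.findall(field):
        # Keys such as "tag" may legitimately repeat, so collect values in lists.
        attrs.setdefault(key, []).append(value)
    return attrs


field = 'gene_id "g1"; transcript_id "g1.t1"; tag "basic"; tag "CCDS";'
attrs = parse_gtf_attributes(field)
print(attrs["gene_id"], attrs["transcript_id"], attrs["tag"])
```

Dialects differ mostly in which keys they emit and how consistently they quote values, which is presumably why a `gtf_format` switch exists.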

### Validation and Translation

Validate gene models and generate protein translations, reporting problems through the supplied logger.

```python { .api }
import sys

def validate_models(annotation, fadict, logger=sys.stderr.write, table=1, gap_filter=False):
    """
    Validate gene model structure and sequences.

    Parameters:
    - annotation (dict): Annotation dictionary to validate
    - fadict (dict): Genome sequences dictionary
    - logger (function): Logging function for error messages
    - table (int): Genetic code table for validation
    - gap_filter (bool): Filter out models with sequence gaps

    Returns:
    dict: Validated annotation dictionary
    """

def validate_and_translate_models(annotation, fadict, logger=sys.stderr.write, table=1):
    """
    Validate gene models and generate protein translations.

    Parameters:
    - annotation (dict): Annotation dictionary to process
    - fadict (dict): Genome sequences dictionary
    - logger (function): Logging function for error messages
    - table (int): Genetic code table for translation

    Returns:
    dict: Annotation dictionary with validated translations
    """
```
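
To make "validation" concrete, the sketch below runs the kind of structural checks commonly applied to a coding sequence before translation. It is a generic illustration and makes no claim about which checks gfftk performs: `check_cds` is a hypothetical function, and the input is assumed to be an already spliced, sense-strand CDS string.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard code (table 1)


def check_cds(cds):
    """Return a list of problems found in a spliced, sense-strand CDS string."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of 3")

    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]

    if not codons or codons[0] != "ATG":
        problems.append("missing ATG start codon")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("missing terminal stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal stop codon")
    return problems


print(check_cds("ATGAAATAG"))     # complete model, no problems
print(check_cds("ATGTAGAAATAA"))  # premature stop in the second codon
print(check_cds("AAATTTGG"))      # partial model with several problems
```

Table 11 (bacterial, archaeal, and plant plastid) shares these stop codons but permits additional start codons, so a strict `ATG` check like this one would be too conservative for it.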

### Specialized Parsers

Internal parsers for handling different GFF3 and GTF formats from various annotation sources.

```python { .api }
def _gff_default_parser(gff, fasta, Genes):
    """
    Default GFF3 parser implementation.

    Parameters:
    - gff (str): Path to GFF3 file
    - fasta (str): Path to genome FASTA file
    - Genes (dict): Annotation dictionary to populate

    Returns:
    dict: Updated annotation dictionary
    """

def _gff_miniprot_parser(gff, fasta, Genes):
    """
    Miniprot-specific GFF3 parser for protein alignments.

    Parameters:
    - gff (str): Path to miniprot GFF3 file
    - fasta (str): Path to genome FASTA file
    - Genes (dict): Annotation dictionary to populate

    Returns:
    dict: Updated annotation dictionary
    """

def _gff_alignment_parser(gff, fasta, Genes):
    """
    Alignment GFF3 parser for transcript/protein alignments.

    Parameters:
    - gff (str): Path to alignment GFF3 file
    - fasta (str): Path to genome FASTA file
    - Genes (dict): Annotation dictionary to populate

    Returns:
    dict: Updated annotation dictionary
    """

def _gff_ncbi_parser(gff, fasta, Genes):
    """
    NCBI GFF3 parser for NCBI-formatted annotations.

    Parameters:
    - gff (str): Path to NCBI GFF3 file
    - fasta (str): Path to genome FASTA file
    - Genes (dict): Annotation dictionary to populate

    Returns:
    dict: Updated annotation dictionary
    """

def _gtf_default_parser(gtf, fasta, Genes, gtf_format="default"):
    """
    Default GTF parser implementation.

    Parameters:
    - gtf (str): Path to GTF file
    - fasta (str): Path to genome FASTA file
    - Genes (dict): Annotation dictionary to populate
    - gtf_format (str): GTF format variant

    Returns:
    dict: Updated annotation dictionary
    """

def _gtf_genemark_parser(gtf, fasta, Genes, gtf_format="genemark"):
    """
    GeneMark GTF parser for GeneMark-specific format.

    Parameters:
    - gtf (str): Path to GeneMark GTF file
    - fasta (str): Path to genome FASTA file
    - Genes (dict): Annotation dictionary to populate
    - gtf_format (str): GTF format variant

    Returns:
    dict: Updated annotation dictionary
    """

def _gtf_jgi_parser(gtf, fasta, Genes, gtf_format="jgi"):
    """
    JGI GTF parser for JGI-specific format.

    Parameters:
    - gtf (str): Path to JGI GTF file
    - fasta (str): Path to genome FASTA file
    - Genes (dict): Annotation dictionary to populate
    - gtf_format (str): GTF format variant

    Returns:
    dict: Updated annotation dictionary
    """
```

### GO Term Processing

Process and simplify Gene Ontology term lists for cleaner annotation output.

```python { .api }
def simplifyGO(inputList):
    """
    Simplify Gene Ontology term list format.

    Parameters:
    - inputList (list): List of GO terms in various formats

    Returns:
    list: Simplified GO term list
    """
```
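
The exact rules `simplifyGO` applies are not described here. As a rough illustration of what "simplifying" a mixed-format GO list can mean, the sketch below reduces each entry to its bare identifiers (`GO:` followed by seven digits) and drops duplicates. `extract_go_ids` and the sample strings are hypothetical.

```python
import re

GO_ID = re.compile(r"GO:\d{7}")  # GO identifiers are "GO:" plus seven digits


def extract_go_ids(terms):
    """Return the unique GO identifiers found in terms, in first-seen order."""
    seen = []
    for term in terms:
        for go_id in GO_ID.findall(term):
            if go_id not in seen:
                seen.append(go_id)
    return seen


raw = [
    "GO_function: GO:0003677 - DNA binding",
    "GO:0005634",
    "go_process: GO:0006355|GO:0003677",
]
print(extract_go_ids(raw))
```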

### Sequence Gap Handling

Handle start and end gaps in genomic sequences during parsing and validation.

```python { .api }
def start_end_gap(seq, coords):
    """
    Handle start/end gaps in genomic sequences.

    Parameters:
    - seq (str): Genomic sequence
    - coords (list): List of coordinate tuples

    Returns:
    tuple: Adjusted coordinates and gap information
    """
```
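
The idea behind terminal gap handling can be sketched as follows: count the gap characters (`N`) at each end of a feature's sequence and pull the coordinates inward by that amount. This is a conceptual sketch only and is not claimed to match what `start_end_gap` returns. `terminal_gaps` and `trim_interval` are hypothetical helpers, and the sequence is assumed to be given in forward-strand order with 1-based inclusive coordinates.

```python
def terminal_gaps(seq, gap_chars="Nn"):
    """Count gap characters at the start and at the end of a sequence."""
    leading = len(seq) - len(seq.lstrip(gap_chars))
    trailing = len(seq) - len(seq.rstrip(gap_chars))
    if leading == len(seq):
        # The whole sequence is gap; avoid counting the same bases twice.
        trailing = 0
    return leading, trailing


def trim_interval(start, end, seq):
    """Shrink a 1-based inclusive interval so it excludes terminal gaps."""
    leading, trailing = terminal_gaps(seq)
    return start + leading, end - trailing


seq = "NNNATGCCCNN"  # 11 bases covering positions 101..111
print(terminal_gaps(seq))
print(trim_interval(101, 111, seq))
```

For an all-gap sequence the trimmed start ends up greater than the end, which a caller can treat as an empty feature.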

## Usage Examples

### Basic GFF3 Parsing

```python
from gfftk.gff import gff2dict, dict2gff3

# Parse GFF3 file to annotation dictionary
annotation = gff2dict("input.gff3", "genome.fasta")

# Access gene information
for gene_id, gene_data in annotation.items():
    print(f"Gene: {gene_id}")
    print(f"Location: {gene_data['location']}")
    print(f"Strand: {gene_data['strand']}")
    print(f"Products: {gene_data['product']}")

# Write back to GFF3 format
dict2gff3(annotation, output="output.gff3")
```

### GTF Processing

```python
from gfftk.gff import gtf2dict, dict2gtf

# Parse GTF file
annotation = gtf2dict("input.gtf", "genome.fasta", debug=True)

# Write to GTF format with custom source
dict2gtf(annotation, output="output.gtf", source="custom_pipeline")
```

### Validation and Translation

```python
from gfftk.gff import gff2dict, validate_and_translate_models
from gfftk.fasta import fasta2dict

# Load data
annotation = gff2dict("annotation.gff3", "genome.fasta")
genome = fasta2dict("genome.fasta")

# Validate and generate translations
validated = validate_and_translate_models(annotation, genome, table=1)

# Access protein translations (one per transcript, in the same order as 'ids')
for gene_id, gene_data in validated.items():
    for i, protein in enumerate(gene_data['protein']):
        transcript_id = gene_data['ids'][i]
        print(f"{transcript_id}: {protein}")
```

### Working with Different Sources

```python
from gfftk.gff import gff2dict

# Parse different annotation sources with debug output
augustus_annotation = gff2dict("augustus.gff3", "genome.fasta", debug=True)
ncbi_annotation = gff2dict("ncbi.gff3", "genome.fasta", debug=True)
miniprot_annotation = gff2dict("miniprot.gff3", "genome.fasta", debug=True)

# Combine annotations (example workflow)
# Note: dict.update() overwrites any existing entry that shares a gene ID
combined = {}
combined.update(augustus_annotation)
combined.update(ncbi_annotation)
combined.update(miniprot_annotation)
```
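
Because `dict.update()` silently replaces entries whose keys already exist, gene IDs shared between sources are lost without warning in a merge like the one above. A collision-aware merge might look like the sketch below. `merge_annotations` is a hypothetical helper, and the toy dictionaries stand in for real annotation dictionaries, whose structure is richer.

```python
def merge_annotations(sources):
    """Merge (label, annotation_dict) pairs, keeping the first model per gene ID."""
    merged = {}
    collisions = []
    for label, annotation in sources:
        for gene_id, gene_data in annotation.items():
            if gene_id in merged:
                # Record the clash instead of overwriting the earlier model.
                collisions.append((gene_id, label))
                continue
            merged[gene_id] = gene_data
    return merged, collisions


# Toy stand-ins for parsed annotations (not the real gfftk structure).
augustus = {"g1": {"source": "augustus"}}
ncbi = {"g1": {"source": "ncbi"}, "g2": {"source": "ncbi"}}

merged, collisions = merge_annotations([("augustus", augustus), ("ncbi", ncbi)])
print(sorted(merged))
print(collisions)
```

Keeping the first model is only one policy. Renaming the clashing ID or preferring a trusted source are equally reasonable, depending on the pipeline.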
337
338
## Types
339
340
```python { .api }
341
# Annotation dictionary structure (detailed in main index)
342
AnnotationDict = dict[str, GeneAnnotation]
343
344
# Parser function type
345
ParserFunction = callable[[str, str, dict], dict]
346
347
# Logger function type
348
LoggerFunction = callable[[str], None]
349
350
# Coordinate tuple format
351
CoordinateTuple = tuple[int, int]
352
353
# Feature coordinate list
354
FeatureCoordinates = list[CoordinateTuple]
355
356
# Gene Ontology term format
357
GOTerm = str # Format: "GO:0000000"
358
359
# Database cross-reference format
360
DbXref = str # Format: "database:identifier"
361
```