# GFF3 and GTF Processing

Comprehensive parsing and manipulation of GFF3 and GTF format files with support for multiple annotation sources, robust validation, and flexible output options. Handles complex gene models with alternative splicing and provides the foundation for all format conversion operations.

## Capabilities

### GFF3 Parsing

Parse GFF3 files into the central annotation dictionary format with support for multiple annotation sources and validation.

```python { .api }
def gff2dict(gff, fasta, annotation=False, table=1, debug=False, gap_filter=False, gff_format="auto", logger=sys.stderr.write):
    """
    Parse GFF3 file to annotation dictionary.

    Parameters:
    - gff (str): Path to input GFF3 file
    - fasta (str): Path to genome FASTA file for sequence validation
    - annotation (dict|bool): Pre-existing annotation dictionary to extend, or False
    - table (int): Genetic code table for translation (1 or 11)
    - debug (bool): Enable debug output for parsing errors
    - gap_filter (bool): Filter out models with sequence gaps
    - gff_format (str): GFF format variant ("auto", "default", "miniprot", etc.)
    - logger (function): Logging function for error messages

    Returns:
    dict: Annotation dictionary with gene_id as keys and gene data as values
    """

def dict2gff3(infile, output=False, debug=False, source=False, newline=False):
    """
    Write annotation dictionary to GFF3 format.

    Parameters:
    - infile (dict): Annotation dictionary to write
    - output (str|bool): Output file path, or False for stdout
    - debug (bool): Enable debug output
    - source (str|bool): Override source field in output
    - newline (bool): Add newlines between gene records

    Returns:
    None
    """

def dict2gff3alignments(infile, output=False, debug=False, alignments=False, source=False, newline=False):
    """
    Write annotation dictionary to GFF3 alignments format for EVM evidence.

    Parameters:
    - infile (dict): Annotation dictionary to write
    - output (str|bool): Output file path, or False for stdout
    - debug (bool): Enable debug output
    - alignments (dict|bool): Alignment data structure for evidence formatting
    - source (str|bool): Override source field in output
    - newline (bool): Add newlines between records

    Returns:
    None
    """
```
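
For readers unfamiliar with the file layout these functions consume, the following is a minimal sketch of how a single GFF3 feature line breaks down into its nine tab-separated columns. It is an illustration only and not gfftk's parser; `parse_gff3_line` and `GFF3_COLUMNS` are hypothetical names introduced here, and the example line is made up.

```python
from urllib.parse import unquote

# The first eight GFF3 columns; the ninth holds the attributes.
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end", "score", "strand", "phase")


def parse_gff3_line(line):
    """Split one GFF3 feature line into a dict; return None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Column 9 holds semicolon-separated key=value pairs; values may be
    # comma-separated lists and may contain percent-escaped characters.
    attributes = {}
    for pair in fields[8].split(";"):
        if not pair.strip():
            continue
        key, _, value = pair.strip().partition("=")
        attributes[key] = [unquote(v) for v in value.split(",")]
    record["attributes"] = attributes
    return record


# Made-up example line (hypothetical IDs).
example = "chr1\tAugustus\tmRNA\t100\t900\t.\t+\t.\tID=gene1-T1;Parent=gene1;product=hypothetical%20protein"
feature = parse_gff3_line(example)
print(feature["type"], feature["start"], feature["end"], feature["attributes"]["Parent"])
```

The `Parent` attribute is what links transcripts to genes and exons or CDS segments to transcripts, which is presumably how per-gene entries in the annotation dictionary get assembled.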

### GTF Parsing

Parse GTF files with support for different GTF formats and dialects from various annotation sources.

```python { .api }
def gtf2dict(gtf, fasta, annotation=False, table=1, debug=False, gap_filter=False, gtf_format="auto", logger=sys.stderr.write):
    """
    Parse GTF file to annotation dictionary.

    Parameters:
    - gtf (str): Path to input GTF file
    - fasta (str): Path to genome FASTA file for sequence validation
    - annotation (dict|bool): Pre-existing annotation dictionary to extend, or False
    - table (int): Genetic code table for translation (1 or 11)
    - debug (bool): Enable debug output for parsing errors
    - gap_filter (bool): Filter out models with sequence gaps
    - gtf_format (str): GTF format variant ("auto", "default", "genemark", "jgi")
    - logger (function): Logging function for error messages

    Returns:
    dict: Annotation dictionary with gene_id as keys and gene data as values
    """

def dict2gtf(infile, output=False, source=False):
    """
    Write annotation dictionary to GTF format.

    Parameters:
    - infile (dict): Annotation dictionary to write
    - output (str|bool): Output file path, or False for stdout
    - source (str|bool): Override source field in output

    Returns:
    None
    """
```
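
GTF differs from GFF3 mainly in its ninth column, which uses `key "value";` pairs instead of `key=value`. The sketch below shows one way to read that column; it is an illustration rather than gfftk's implementation, `parse_gtf_attributes` is a hypothetical helper, and it only handles quoted values (some dialects also emit unquoted numbers).

```python
import re

# GTF attributes look like: gene_id "g1"; transcript_id "g1.t1";
GTF_ATTRIBUTE = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_gtf_attributes(column9):
    """Return the quoted key/value pairs from the ninth column of a GTF line."""
    # A repeated key keeps only its last value in this simplified version.
    return {key: value for key, value in GTF_ATTRIBUTE.findall(column9)}


attrs = parse_gtf_attributes('gene_id "g1"; transcript_id "g1.t1";')
print(attrs["gene_id"], attrs["transcript_id"])
```

Because GTF groups features by `gene_id` and `transcript_id` rather than by `ID`/`Parent` links, dialect differences tend to show up in how these two attributes are written.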

### Validation and Translation

Validate gene models and generate protein translations with comprehensive error checking.

```python { .api }
def validate_models(annotation, fadict, logger=sys.stderr.write, table=1, gap_filter=False):
    """
    Validate gene model structure and sequences.

    Parameters:
    - annotation (dict): Annotation dictionary to validate
    - fadict (dict): Genome sequences dictionary
    - logger (function): Logging function for error messages
    - table (int): Genetic code table for validation
    - gap_filter (bool): Filter out models with sequence gaps

    Returns:
    dict: Validated annotation dictionary
    """

def validate_and_translate_models(annotation, fadict, logger=sys.stderr.write, table=1):
    """
    Validate gene models and generate protein translations.

    Parameters:
    - annotation (dict): Annotation dictionary to process
    - fadict (dict): Genome sequences dictionary
    - logger (function): Logging function for error messages
    - table (int): Genetic code table for translation
    """
```
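
To make "validation" more concrete, here is a minimal sketch of the kind of checks a coding sequence is commonly subjected to. The specific checks and the name `check_cds` are assumptions for illustration; the source does not state which criteria gfftk applies. The stop codons shown are those of the standard code (table 1); table 11 uses the same three stops but permits additional start codons, which this sketch does not model.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard code (table 1)


def check_cds(cds):
    """Return a list of problems found in a coding sequence (empty list if none)."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length is not a multiple of 3")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if not codons or codons[0] != "ATG":
        problems.append("missing ATG start codon")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("missing terminal stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal stop codon")
    return problems


print(check_cds("ATGGCCTAA"))  # a clean three-codon CDS
print(check_cds("ATGTAAGCC"))  # premature stop and no terminal stop
```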

### Specialized Parsers

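Internal parsers for the GFF3 and GTF dialects produced by different annotation sources. The leading underscore marks them as private helpers; format selection is exposed through the `gff_format` and `gtf_format` parameters of `gff2dict` and `gtf2dict`.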

```python { .api }
def _gff_default_parser(gff, fasta, Genes):
    """
    Default GFF3 parser implementation.

    Parameters:
    - gff (str): Path to GFF3 file
    - fasta (str): Path to genome FASTA file
    - Genes (dict): Annotation dictionary to populate

    Returns:
    dict: Updated annotation dictionary
    """

def _gff_miniprot_parser(gff, fasta, Genes):
    """
    Miniprot-specific GFF3 parser for protein alignments.

    Parameters:
    - gff (str): Path to miniprot GFF3 file
    - fasta (str): Path to genome FASTA file
    - Genes (dict): Annotation dictionary to populate

    Returns:
    dict: Updated annotation dictionary
    """

def _gff_alignment_parser(gff, fasta, Genes):
    """
    Alignment GFF3 parser for transcript/protein alignments.

    Parameters:
    - gff (str): Path to alignment GFF3 file
    - fasta (str): Path to genome FASTA file
    - Genes (dict): Annotation dictionary to populate

    Returns:
    dict: Updated annotation dictionary
    """

def _gff_ncbi_parser(gff, fasta, Genes):
    """
    NCBI GFF3 parser for NCBI-formatted annotations.

    Parameters:
    - gff (str): Path to NCBI GFF3 file
    - fasta (str): Path to genome FASTA file
    - Genes (dict): Annotation dictionary to populate

    Returns:
    dict: Updated annotation dictionary
    """

def _gtf_default_parser(gtf, fasta, Genes, gtf_format="default"):
    """
    Default GTF parser implementation.

    Parameters:
    - gtf (str): Path to GTF file
    - fasta (str): Path to genome FASTA file
    - Genes (dict): Annotation dictionary to populate
    - gtf_format (str): GTF format variant

    Returns:
    dict: Updated annotation dictionary
    """

def _gtf_genemark_parser(gtf, fasta, Genes, gtf_format="genemark"):
    """
    GeneMark GTF parser for GeneMark-specific format.

    Parameters:
    - gtf (str): Path to GeneMark GTF file
    - fasta (str): Path to genome FASTA file
    - Genes (dict): Annotation dictionary to populate
    - gtf_format (str): GTF format variant

    Returns:
    dict: Updated annotation dictionary
    """

def _gtf_jgi_parser(gtf, fasta, Genes, gtf_format="jgi"):
    """
    JGI GTF parser for JGI-specific format.

    Parameters:
    - gtf (str): Path to JGI GTF file
    - fasta (str): Path to genome FASTA file
    - Genes (dict): Annotation dictionary to populate
    - gtf_format (str): GTF format variant

    Returns:
    dict: Updated annotation dictionary
    """
```
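
Since every parser above shares the shape "path in, populated dictionary out", a format name can be mapped to a parser with a plain lookup table. The sketch below illustrates that general pattern only; the registry, the stub parsers, and `parse_with_format` are all hypothetical, and the source does not describe how gfftk dispatches internally or how `"auto"` detection works.

```python
def parse_default(path, genes):
    # Placeholder: a real parser would read `path` and add gene models.
    genes[f"default:{path}"] = {}
    return genes


def parse_miniprot(path, genes):
    # Placeholder for a dialect-specific parser.
    genes[f"miniprot:{path}"] = {}
    return genes


# Hypothetical registry mapping format names to parser callables.
PARSERS = {"default": parse_default, "miniprot": parse_miniprot}


def parse_with_format(path, fmt="default", genes=None):
    """Look up the parser registered for `fmt` and run it on `path`."""
    genes = {} if genes is None else genes
    if fmt not in PARSERS:
        raise ValueError(f"unknown format {fmt!r}; expected one of {sorted(PARSERS)}")
    return PARSERS[fmt](path, genes)


result = parse_with_format("proteins.gff3", fmt="miniprot")
print(list(result))
```

Passing an existing dictionary as `genes` mirrors the documented `Genes` parameter, which each parser populates and returns.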

### GO Term Processing

Process and simplify Gene Ontology term lists for cleaner annotation output.

```python { .api }
def simplifyGO(inputList):
    """
    Simplify Gene Ontology term list format.

    Parameters:
    - inputList (list): List of GO terms in various formats

    Returns:
    list: Simplified GO term list
    """
```
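
The source does not spell out what "simplified" means here. One plausible reading, sketched below under that assumption, is reducing verbose GO strings to bare `GO:0000000` identifiers without duplicates. `extract_go_ids` is a hypothetical stand-in and may not match what `simplifyGO` actually returns.

```python
import re

# A GO identifier is "GO:" followed by seven digits.
GO_ID = re.compile(r"GO:\d{7}")


def extract_go_ids(terms):
    """Return the unique GO identifiers found in `terms`, in first-seen order."""
    seen = []
    for term in terms:
        for go_id in GO_ID.findall(term):
            if go_id not in seen:
                seen.append(go_id)
    return seen


print(extract_go_ids(["GO_function: GO:0005524 - ATP binding", "GO:0005524", "GO:0016020"]))
```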

### Sequence Gap Handling

Handle start and end gaps in genomic sequences during parsing and validation.

```python { .api }
def start_end_gap(seq, coords):
    """
    Handle start/end gaps in genomic sequences.

    Parameters:
    - seq (str): Genomic sequence
    - coords (list): List of coordinate tuples

    Returns:
    tuple: Adjusted coordinates and gap information
    """
```
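
As a rough illustration of what handling terminal gaps can involve, the sketch below strips runs of `N` from both ends of a sequence and shrinks the outermost coordinates by the same amount. It is not gfftk's implementation: `trim_terminal_gaps` is a hypothetical name, and the sketch assumes forward-strand coordinates sorted in ascending order, a sequence that is the concatenation of those segments, and gaps shorter than the first and last segments.

```python
def trim_terminal_gaps(seq, coords):
    """Strip leading/trailing N runs and shrink the outer coordinates to match."""
    coords = list(coords)  # copy so the caller's list is not modified
    stripped_left = len(seq) - len(seq.lstrip("Nn"))
    seq = seq.lstrip("Nn")
    stripped_right = len(seq) - len(seq.rstrip("Nn"))
    seq = seq.rstrip("Nn")
    if coords and stripped_left:
        start, end = coords[0]
        coords[0] = (start + stripped_left, end)
    if coords and stripped_right:
        start, end = coords[-1]
        coords[-1] = (start, end - stripped_right)
    return seq, coords


# Two 6-bp segments whose concatenated sequence starts with NN and ends with N.
trimmed_seq, trimmed_coords = trim_terminal_gaps("NNATGCCCTAAN", [(1, 6), (11, 16)])
print(trimmed_seq, trimmed_coords)
```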

## Usage Examples

### Basic GFF3 Parsing

```python
from gfftk.gff import gff2dict, dict2gff3

# Parse GFF3 file to annotation dictionary
annotation = gff2dict("input.gff3", "genome.fasta")

# Access gene information
for gene_id, gene_data in annotation.items():
    print(f"Gene: {gene_id}")
    print(f"Location: {gene_data['location']}")
    print(f"Strand: {gene_data['strand']}")
    print(f"Products: {gene_data['product']}")

# Write back to GFF3 format
dict2gff3(annotation, output="output.gff3")
```

### GTF Processing

```python
from gfftk.gff import gtf2dict, dict2gtf

# Parse GTF file
annotation = gtf2dict("input.gtf", "genome.fasta", debug=True)

# Write to GTF format with custom source
dict2gtf(annotation, output="output.gtf", source="custom_pipeline")
```

### Validation and Translation

```python
from gfftk.gff import gff2dict, validate_and_translate_models
from gfftk.fasta import fasta2dict

# Load data
annotation = gff2dict("annotation.gff3", "genome.fasta")
genome = fasta2dict("genome.fasta")

# Validate and generate translations
validated = validate_and_translate_models(annotation, genome, table=1)

# Access protein translations
for gene_id, gene_data in validated.items():
    for i, protein in enumerate(gene_data['protein']):
        transcript_id = gene_data['ids'][i]
        print(f"{transcript_id}: {protein}")
```

### Working with Different Sources

```python
from gfftk.gff import gff2dict

# Parse different annotation sources with debug output
augustus_annotation = gff2dict("augustus.gff3", "genome.fasta", debug=True)
ncbi_annotation = gff2dict("ncbi.gff3", "genome.fasta", debug=True)
miniprot_annotation = gff2dict("miniprot.gff3", "genome.fasta", debug=True)

# Combine annotations (example workflow)
# Note: dict.update() overwrites entries that share a gene ID,
# so later sources win when IDs collide.
combined = {}
combined.update(augustus_annotation)
combined.update(ncbi_annotation)
combined.update(miniprot_annotation)
```
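
When gene IDs may repeat across sources, a plain `dict.update()` silently drops the earlier entry. The sketch below shows one way to keep both by prefixing colliding IDs with a source label. `merge_annotations` is a hypothetical helper, not part of gfftk, and the toy dictionaries stand in for real `gff2dict` output.

```python
def merge_annotations(*sources):
    """Merge (label, annotation_dict) pairs, renaming gene IDs that collide."""
    merged = {}
    for label, annotation in sources:
        for gene_id, gene_data in annotation.items():
            # Keep the original ID when it is free, otherwise prefix the label.
            key = gene_id if gene_id not in merged else f"{label}:{gene_id}"
            if key in merged:
                raise KeyError(f"duplicate gene ID even after prefixing: {key}")
            merged[key] = gene_data
    return merged


# Toy stand-ins for parsed annotations (hypothetical gene IDs).
augustus = {"gene1": {"strand": "+"}, "gene2": {"strand": "-"}}
ncbi = {"gene1": {"strand": "+"}}
combined = merge_annotations(("augustus", augustus), ("ncbi", ncbi))
print(sorted(combined))
```

Renaming a key this way does not touch any identifiers stored inside the gene data, so it suits quick inspection better than round-tripping to a file. The documented `annotation` parameter of `gff2dict` is the library's own route for extending an existing dictionary.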

## Types

```python { .api }
from typing import Callable

# Annotation dictionary structure (detailed in main index);
# GeneAnnotation is written as a forward reference because it is defined there
AnnotationDict = dict[str, "GeneAnnotation"]

# Parser function type
ParserFunction = Callable[[str, str, dict], dict]

# Logger function type
LoggerFunction = Callable[[str], None]

# Coordinate tuple format
CoordinateTuple = tuple[int, int]

# Feature coordinate list
FeatureCoordinates = list[CoordinateTuple]

# Gene Ontology term format
GOTerm = str  # Format: "GO:0000000"

# Database cross-reference format
DbXref = str  # Format: "database:identifier"
```