tessl install github:brunoasm/my_claude_skills --skill busco-phylogeny
Generate phylogenies from genome assemblies using BUSCO/compleasm-based single-copy orthologs with scheduler-aware workflow generation
This skill provides phylogenomics expertise for generating comprehensive, scheduler-aware workflows for phylogenetic inference from genome assemblies using single-copy orthologs.
This skill helps users generate phylogenies from genome assemblies by:
The skill provides access to these bundled resources:
scripts/
- query_ncbi_assemblies.py - Query NCBI for available genome assemblies by taxon name (new!)
- download_ncbi_genomes.py - Download genomes from NCBI using BioProjects or Assembly accessions
- rename_genomes.py - Rename genome files with meaningful sample names (important!)
- generate_qc_report.sh - Generate quality control reports from compleasm results
- extract_orthologs.sh - Extract and reorganize single-copy orthologs
- run_aliscore.sh - Wrapper for Aliscore to identify randomly similar sequences (RSS)
- run_alicut.sh - Wrapper for ALICUT to remove RSS positions from alignments
- run_aliscore_alicut_batch.sh - Batch process all alignments through Aliscore + ALICUT
- convert_fasconcat_to_partition.py - Convert FASconCAT output to IQ-TREE partition format
- predownloaded_aliscore_alicut/ - Pre-tested Aliscore and ALICUT Perl scripts
templates/
- slurm/ - SLURM job scheduler templates
- pbs/ - PBS/Torque job scheduler templates
- local/ - Local machine templates (with GNU parallel)
- README.md - Complete template documentation
references/
- REFERENCE.md - Detailed technical reference including:
The complete phylogenomics pipeline follows this sequence:
Input Preparation → Ortholog Identification → Quality Control → Ortholog Extraction → Alignment → Trimming → Concatenation → Phylogenetic Inference
When a user requests phylogeny generation, gather the following information systematically:
Before asking questions, attempt to detect the local computing environment:
# Check for job schedulers
command -v sbatch >/dev/null 2>&1 # SLURM
command -v qsub >/dev/null 2>&1 # PBS/Torque
command -v parallel >/dev/null 2>&1 # GNU parallel
Report findings to the user, then confirm: "I detected [X] on this machine. Will you be running the scripts here or on a different system?"
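The same detection can be sketched in Python, assuming the scripts are generated on the machine being probed. The function name `detect_environment` and the label strings are illustrative, not part of the skill; only the three executable names come from the checks above.

```python
import shutil

# Executable name -> environment label; mirrors the three shell checks above.
SCHEDULER_COMMANDS = {
    "sbatch": "SLURM",
    "qsub": "PBS/Torque",
    "parallel": "GNU parallel",
}


def detect_environment(which=shutil.which):
    """Return labels for every scheduler/tool whose executable is on PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return [label for cmd, label in SCHEDULER_COMMANDS.items() if which(cmd)]


if __name__ == "__main__":
    found = detect_environment()
    print("Detected:", ", ".join(found) if found else "no scheduler or GNU parallel")
```

An empty list suggests falling back to serial local execution, but the user should still confirm where the scripts will actually run.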
Ask these questions to gather essential workflow parameters:
Computing Environment
Input Data
query_ncbi_assemblies.py (see "STEP 0A: Query NCBI for Assemblies" below)
Taxonomic Scope & Dataset Details
references/REFERENCE.md for complete lineage list
Environment Management
Resource Constraints
references/REFERENCE.md for resource recommendations
Parallelization Strategy
Ask the user how they want to handle parallel processing:
For job schedulers (SLURM/PBS):
For local machines:
parallel installed)
For all systems:
Scheduler-Specific Configuration (if using SLURM or PBS)
logs/)
Alignment Trimming Preference
Substitution Model Selection (for IQ-TREE phylogenetic inference)
Context needed: Taxonomic breadth, number of taxa, evolutionary rates
Action: Fetch IQ-TREE model documentation and suggest appropriate amino acid substitution models based on dataset characteristics.
Use the substitution model recommendation system (see "Substitution Model Recommendation" section below).
Educational Goals
Organize analyses with dedicated folders for each pipeline step:
project_name/
├── logs/ # All log files
├── 00_genomes/ # Input genome assemblies
├── 01_busco_results/ # BUSCO/compleasm outputs
├── 02_qc/ # Quality control reports
├── 03_extracted_orthologs/ # Extracted single-copy orthologs
├── 04_alignments/ # Multiple sequence alignments
├── 05_trimmed/ # Trimmed alignments
├── 06_concatenation/ # Supermatrix and partition files
├── 07_partition_search/ # Partition model selection
├── 08_concatenated_tree/ # Concatenated ML tree
├── 09_gene_trees/ # Individual gene trees
├── 10_species_tree/ # ASTRAL species tree
└── scripts/ # All analysis scripts
Benefits: Easy debugging, clear workflow progression, reproducibility, prevents root directory clutter.
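A minimal sketch of creating this layout up front, so that every generated script can rely on its output folder existing. The helper name `create_project_layout` is hypothetical; the folder names are copied from the tree above.

```python
from pathlib import Path

# Folder names taken from the project tree above.
STEP_DIRS = [
    "logs", "00_genomes", "01_busco_results", "02_qc",
    "03_extracted_orthologs", "04_alignments", "05_trimmed",
    "06_concatenation", "07_partition_search", "08_concatenated_tree",
    "09_gene_trees", "10_species_tree", "scripts",
]


def create_project_layout(root):
    """Create one folder per pipeline step under `root`; safe to re-run."""
    root = Path(root)
    for name in STEP_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    # Return the folders that now exist, sorted by name.
    return sorted(p.name for p in root.iterdir() if p.is_dir())
```

Calling `create_project_layout("project_name")` twice is harmless because existing folders are left untouched.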
This skill uses a template-based system to reduce token usage and improve maintainability. Script templates are stored in the templates/ directory and organized by computing environment.
When generating scripts for users:
Read the appropriate template for their computing environment:
Read("templates/slurm/02_compleasm_first.job")
Replace placeholders with user-specific values:
- TOTAL_THREADS → e.g., 64
- THREADS_PER_JOB → e.g., 16
- NUM_GENOMES → e.g., 20
- NUM_LOCI → e.g., 2795
- LINEAGE → e.g., insecta_odb10
- MODEL_SET → e.g., LG,WAG,JTT,Q.pfam
Present the customized script to the user with setup instructions
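Placeholder substitution can be sketched as a whole-word replacement. This assumes the templates use the bare uppercase tokens listed above; the helper `fill_template` and the two-line example template are hypothetical and not copied from the bundled templates.

```python
import re


def fill_template(template_text, values):
    """Replace whole-word placeholder tokens (e.g. TOTAL_THREADS) with values.

    Matching on word boundaries keeps a token such as LINEAGE from being
    replaced inside a longer identifier such as LINEAGE_NAME.
    """
    if not values:
        return template_text
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in values) + r")\b")
    return pattern.sub(lambda m: str(values[m.group(1)]), template_text)


# Hypothetical template fragment, for illustration only.
example = (
    "#SBATCH --cpus-per-task=THREADS_PER_JOB\n"
    "compleasm run -l LINEAGE -t THREADS_PER_JOB"
)
print(fill_template(example, {"THREADS_PER_JOB": 16, "LINEAGE": "insecta_odb10"}))
```

A single regular-expression pass avoids the ordering problems that chained `str.replace` calls can introduce when one value happens to contain another placeholder name.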
Key templates by workflow step:
references/REFERENCE.md
02_compleasm_first, 02_compleasm_parallel
08a_partition_search
08c_gene_trees_array, 08c_gene_trees_parallel, 08c_gene_trees_serial
See templates/README.md for complete template documentation.
When asked about substitution model selection (Question 9), use this systematic approach:
Use WebFetch to retrieve current model information:
WebFetch(url="https://iqtree.github.io/doc/Substitution-Models",
prompt="Extract all amino acid substitution models with descriptions and usage guidelines")
Consider these factors from user responses:
Provide 3-5 appropriate models based on dataset characteristics. For detailed model recommendation matrices and taxonomically-targeted models, see references/REFERENCE.md section "Substitution Model Recommendation".
General recommendations:
Format recommendations with justifications and explain how models will be used in IQ-TREE steps 8A and 8C.
Store the final comma-separated model list (e.g., "LG,WAG,JTT,Q.pfam") for use in Step 8 template placeholders.
Once required information is gathered, guide the user through these steps. For each step, use templates where available and refer to references/REFERENCE.md for detailed implementation.
ALWAYS start by generating a setup script for the user's environment.
Use the unified conda environment setup script from references/REFERENCE.md (Section: "Software Installation Guide"). This creates a single conda environment with all necessary tools:
Key points:
conda activate phylo (the unified environment)
See references/REFERENCE.md for the complete setup script template.
Use this step when: User wants to use NCBI data but doesn't have specific assembly accessions yet.
This optional preliminary step helps users discover available genome assemblies by taxon name before proceeding with the main workflow.
Offer this step when:
Ask for focal taxon: Request the taxonomic group of interest
Query NCBI using the script: Use scripts/query_ncbi_assemblies.py to search for assemblies
# Basic query (returns 20 results by default)
python scripts/query_ncbi_assemblies.py --taxon "Coleoptera"
# Query with more results
python scripts/query_ncbi_assemblies.py --taxon "Drosophila" --max-results 50
# Query for RefSeq assemblies only (higher quality, GCF_* accessions)
python scripts/query_ncbi_assemblies.py --taxon "Apis" --refseq-only
# Save accessions to file for later download
python scripts/query_ncbi_assemblies.py --taxon "Coleoptera" --save assembly_accessions.txt
Present results to user: The script displays:
Help user select assemblies: Ask user which assemblies they want to include
Collect selected accessions: Compile the list of chosen assembly accessions
Proceed to STEP 1: Use the selected accessions with download_ncbi_genomes.py
If user provided NCBI accessions, use scripts/download_ncbi_genomes.py:
For BioProjects:
python scripts/download_ncbi_genomes.py --bioprojects PRJNA12345 -o genomes.zip
unzip genomes.zip
For Assembly Accessions:
python scripts/download_ncbi_genomes.py --assemblies GCA_123456789.1 -o genomes.zip
unzip genomes.zip
IMPORTANT: After download, genomes must be renamed with meaningful sample names (format: [ACCESSION]_[SPECIES_NAME]). Sample names appear in final phylogenetic trees.
Generate a script that:
See references/REFERENCE.md section "Sample Naming Best Practices" for detailed guidelines.
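The [ACCESSION]_[SPECIES_NAME] format can be illustrated with a small helper. This is a sketch only: `make_sample_name` is a hypothetical function, and the sanitization rule (anything other than letters, digits, `.` and `-` becomes an underscore) is an assumption rather than the behavior of the bundled rename_genomes.py.

```python
import re


def make_sample_name(accession, species_name):
    """Build '[ACCESSION]_[SPECIES_NAME]' using characters that are safe in tree files."""
    # Assumption: runs of any other character collapse into a single underscore,
    # since spaces and parentheses can break Newick parsing downstream.
    clean_species = re.sub(r"[^A-Za-z0-9.-]+", "_", species_name.strip()).strip("_")
    return f"{accession}_{clean_species}"


print(make_sample_name("GCA_123456789.1", "Apis mellifera"))
# → GCA_123456789.1_Apis_mellifera
```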
Activate the unified environment and run compleasm on all genomes to identify single-copy orthologs.
Key considerations:
Threading guidelines: See references/REFERENCE.md for recommended thread allocation table.
Generate scripts using templates:
- SLURM: 02_compleasm_first.job and 02_compleasm_parallel.job
- PBS: 02_compleasm_first.job and 02_compleasm_parallel.job
- Local: 02_compleasm_first.sh and 02_compleasm_parallel.sh
Replace placeholders: TOTAL_THREADS, THREADS_PER_JOB, NUM_GENOMES, LINEAGE
For detailed implementation examples, see references/REFERENCE.md section "Ortholog Identification Implementation".
After compleasm completes, generate QC report using scripts/generate_qc_report.sh:
bash scripts/generate_qc_report.sh qc_report.csv
Provide interpretation:
See references/REFERENCE.md section "Quality Control Guidelines" for detailed assessment criteria.
Use scripts/extract_orthologs.sh to extract single-copy orthologs:
bash scripts/extract_orthologs.sh LINEAGE_NAME
This generates per-locus unaligned FASTA files in single_copy_orthologs/unaligned_aa/.
Activate the unified environment (conda activate phylo) which contains MAFFT.
Create locus list, then generate alignment scripts:
cd single_copy_orthologs/unaligned_aa
ls *.fas > locus_names.txt
num_loci=$(wc -l < locus_names.txt)
Generate scheduler-specific scripts:
For detailed script templates, see references/REFERENCE.md section "Alignment Implementation".
Based on user's preference, provide appropriate trimming method. All tools are available in the unified conda environment.
Options:
-automated1), recommended for large datasets
-t AA)
For Aliscore/ALICUT:
scripts/run_aliscore_alicut_batch.sh for batch processing
scripts/run_aliscore.sh and scripts/run_alicut.sh
-N flag for amino acid sequences
Generate scripts using scheduler-appropriate templates (array jobs for SLURM/PBS, parallel or serial for local).
For detailed implementation of each trimming method, see references/REFERENCE.md section "Alignment Trimming Implementation".
Download FASconCAT-G (Perl script) and run concatenation:
conda activate phylo # Has Perl installed
wget https://raw.githubusercontent.com/PatrickKueck/FASconCAT-G/master/FASconCAT-G_v1.06.1.pl -O FASconCAT-G.pl
chmod +x FASconCAT-G.pl
cd trimmed_aa
perl ../FASconCAT-G.pl -s -i
Convert to IQ-TREE format using scripts/convert_fasconcat_to_partition.py:
python ../scripts/convert_fasconcat_to_partition.py FcC_info.xls partition_def.txt
Outputs: FcC_supermatrix.fas, FcC_info.xls, partition_def.txt
IQ-TREE is already installed in the unified environment. Activate with conda activate phylo.
Use the substitution models selected during initial setup (Question 9).
Generate script using templates:
templates/[slurm|pbs|local]/08a_partition_search.[job|sh]
MODEL_SET placeholder with user's selected models (e.g., "LG,WAG,JTT,Q.pfam")
For detailed implementation, see references/REFERENCE.md section "Partition Model Selection Implementation".
Run IQ-TREE using the best partition scheme from Part 8A:
iqtree -s FcC_supermatrix.fas -spp partition_search.best_scheme.nex \
-nt 18 -safe -pre concatenated_ML_tree -bb 1000 -bnni
Output: concatenated_ML_tree.treefile
Estimate gene trees for coalescent-based species tree inference.
Generate scripts using templates:
08c_gene_trees_array.job template
08c_gene_trees_parallel.sh or 08c_gene_trees_serial.sh template
NUM_LOCI placeholder
For detailed implementation, see references/REFERENCE.md section "Gene Trees Implementation".
ASTRAL is already installed in the unified conda environment.
conda activate phylo
# Concatenate all gene trees
cat trimmed_aa/*.treefile > all_gene_trees.tre
# Run ASTRAL
astral -i all_gene_trees.tre -o astral_species_tree.tre
Output: astral_species_tree.tre
ALWAYS generate a methods paragraph to help users write their publication methods section.
Create METHODS_PARAGRAPH.md file with:
For the complete methods paragraph template, see references/REFERENCE.md section "Methods Paragraph Template".
Pre-fill known values when possible:
Provide users with a summary of outputs:
Phylogenetic Results:
1. concatenated_ML_tree.treefile - ML tree from concatenated supermatrix
2. astral_species_tree.tre - Coalescent species tree
3. *.treefile - Individual gene trees
Data and Quality Control:
4. qc_report.csv - Genome quality statistics
5. FcC_supermatrix.fas - Concatenated alignment
6. partition_search.best_scheme.nex - Selected partitioning scheme
Publication Materials:
7. METHODS_PARAGRAPH.md - Ready-to-use methods section with citations
Visualization tools: FigTree, iTOL, ggtree (R), ete3/toytree (Python)
ALWAYS perform validation checks after generating scripts but before presenting them to the user. This ensures script accuracy, consistency, and proper resource allocation.
For each generated script, perform these validation checks in order:
Purpose: Detect hallucinated or incorrect command-line options that may cause scripts to fail.
Procedure:
compleasm run, iqtree -s, mafft --auto)
templates/ directory
references/REFERENCE.md
compleasm run - Check -a, -o, -l, -t options
iqtree - Verify -s, -p, -m, -bb, -alrt, -nt, -safe options
mafft - Check --auto, --thread, --reorder options
astral - Verify -i, -o options
trimal, clipkit, BMGE.jar) - Validate options
Action on issues:
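One way to sketch the option check is to compare the flags in a generated command with the options named in this checklist. The helper `unverified_options` is hypothetical. The allow-list contains only the options listed above, so a flag it reports is merely unverified and should be looked up in the tool's own help output before it is treated as an error.

```python
import shlex

# Options named in the checklist above, keyed by executable name.
KNOWN_OPTIONS = {
    "compleasm": {"-a", "-o", "-l", "-t"},
    "iqtree": {"-s", "-p", "-m", "-bb", "-alrt", "-nt", "-safe"},
    "mafft": {"--auto", "--thread", "--reorder"},
    "astral": {"-i", "-o"},
}


def unverified_options(command):
    """Return flags in a single-line command that are not in the allow-list."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] not in KNOWN_OPTIONS:
        return []
    known = KNOWN_OPTIONS[tokens[0]]
    flags = [
        t for t in tokens[1:]
        # A flag is '-x...' or '--x...'; this skips negative numbers like '-1'.
        if len(t) > 1 and t[0] == "-" and (t[1].isalpha() or t[1] == "-")
    ]
    return [f for f in flags if f not in known]


print(unverified_options(
    "iqtree -s FcC_supermatrix.fas -spp best_scheme.nex -nt 18 -bb 1000 -bnni"
))
# → ['-spp', '-bnni']
```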
Purpose: Ensure outputs from one step correctly feed into inputs of subsequent steps.
Procedure:
Map input/output relationships:
- Step 2 output (01_busco_results/*_compleasm/) → Step 3 input (QC script)
- Step 4 output (single_copy_orthologs/) → Step 5 input (MAFFT)
- Step 5 output (04_alignments/*.fas) → Step 6 input (trimming)
- Step 6 output (05_trimmed/*.fas) → Step 7 input (FASconCAT-G)
- Step 7 output (FcC_supermatrix.fas, partition file) → Step 8A input (IQ-TREE)
- Step 8C output (*.treefile) → Step 8D input (ASTRAL)
Check for consistency:
Action on issues:
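The compatibility check amounts to confirming that each step reads from the path the previous step wrote to. The sketch below assumes a simple linear pipeline described as a list of dictionaries; the function name and the example paths are illustrative.

```python
def find_path_mismatches(steps):
    """Compare each step's output path with the next step's input path.

    `steps` is a list of dicts with 'name', 'input' and 'output' keys,
    given in pipeline order. Returns one message per mismatch.
    """
    problems = []
    for upstream, downstream in zip(steps, steps[1:]):
        if upstream["output"] != downstream["input"]:
            problems.append(
                f"{upstream['name']} writes {upstream['output']} "
                f"but {downstream['name']} reads {downstream['input']}"
            )
    return problems


# Hypothetical example: the concatenation step points at the wrong folder.
steps = [
    {"name": "Step 5 (alignment)", "input": "03_extracted_orthologs", "output": "04_alignments"},
    {"name": "Step 6 (trimming)", "input": "04_alignments", "output": "05_trimmed"},
    {"name": "Step 7 (concatenation)", "input": "trimmed_aa", "output": "06_concatenation"},
]
for message in find_path_mismatches(steps):
    print(message)
# → Step 6 (trimming) writes 05_trimmed but Step 7 (concatenation) reads trimmed_aa
```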
Purpose: Ensure allocated computational resources are appropriate for the task.
Procedure:
Verify resource allocations against recommendations in references/REFERENCE.md:
Common issues to check:
-nt should match allocated CPUs
Validate against user-specified constraints:
Action on issues:
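As an illustration of the thread check, the sketch below extracts the value passed to `-nt` in an IQ-TREE command and compares it with the CPU count requested from the scheduler. Both function names are hypothetical, and non-numeric values such as `AUTO` are treated as not matching because they cannot be compared directly.

```python
import re


def nt_value(command):
    """Return the argument following -nt in a command string, or None."""
    match = re.search(r"(?:^|\s)-nt\s+(\S+)", command)
    return match.group(1) if match else None


def threads_match(command, allocated_cpus):
    """True only if -nt is present, numeric, and equal to the allocated CPUs."""
    value = nt_value(command)
    return value is not None and value.isdigit() and int(value) == allocated_cpus


cmd = "iqtree -s FcC_supermatrix.fas -nt 18 -safe -pre concatenated_ML_tree"
print(threads_match(cmd, 18))  # → True
print(threads_match(cmd, 16))  # → False
```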
After completing all validation checks:
If all checks pass: Inform user briefly: "Scripts validated successfully - options, pipeline flow, and resources verified."
If issues found: Present a structured report:
**Validation Results**
⚠️ Issues found during validation:
1. [Issue category]: [Description]
- Current: [What was generated]
- Suggested: [Recommended fix]
- Reason: [Why this is an issue]
Would you like me to apply these corrections?
Always ask before correcting: Never silently fix issues - always get user confirmation before applying changes.
Document corrections: If corrections are applied, explain what was changed and why.
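The structured report can be assembled from a list of issue records. This is a sketch: the function name and the dictionary keys (`category`, `description`, `current`, `suggested`, `reason`) are assumptions chosen to mirror the report fields above.

```python
def format_validation_report(issues):
    """Render validation issues in the report layout described above."""
    if not issues:
        return ("Scripts validated successfully - options, pipeline flow, "
                "and resources verified.")
    lines = ["**Validation Results**", "⚠️ Issues found during validation:"]
    for number, issue in enumerate(issues, start=1):
        lines.append(f"{number}. {issue['category']}: {issue['description']}")
        lines.append(f"- Current: {issue['current']}")
        lines.append(f"- Suggested: {issue['suggested']}")
        lines.append(f"- Reason: {issue['reason']}")
    # The report always ends with a question; nothing is fixed silently.
    lines.append("Would you like me to apply these corrections?")
    return "\n".join(lines)
```

Returning a string rather than printing keeps the confirmation step with the caller, which matches the rule that corrections are applied only after the user agrees.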
conda activate phylo
references/REFERENCE.md for details
scripts/ directory
templates/ directory for major scripts
Read("templates/slurm/02_compleasm_first.job")
TOTAL_THREADS, LINEAGE, NUM_GENOMES, MODEL_SET, etc.
templates/README.md for complete list
references/REFERENCE.md for thread allocation recommendations
This skill was created by Bruno de Medeiros (Curator of Pollinating Insects, Field Museum) based on phylogenomics tutorials by Paul Frandsen (Brigham Young University).
When a user requests phylogeny generation:
references/REFERENCE.md
query_ncbi_assemblies.py
references/REFERENCE.md for detailed implementation
conda activate phylo)
references/REFERENCE.md