Phylo Task

Warning

Be sure to use the phylo_task Codespace when starting the VM for this practical. You will also need to download the data for this task:

wget -O - http://genomics.lshtm.ac.uk/phylo_task.tar.gz | tar -xvz

Introduction

Welcome to the Phylo analysis practical, where you will reconstruct evolutionary relationships between Mycobacterium tuberculosis isolates starting from raw sequencing data.

This workflow mirrors real outbreak genomics pipelines used to:

- track transmission
- identify clusters of infection
- understand lineage structure
- compare genomic relationships between isolates

You will move from raw reads to a final phylogenetic tree and visualise it using iTOL.

Goal

By the end of this task you will produce:

- consensus genomes for each isolate
- a core-genome alignment
- a phylogenetic tree
- an annotated iTOL visualisation

Input Data

You are given paired-end sequencing files:

*_1.fastq.gz
*_2.fastq.gz

Each pair represents a TB isolate.

You are also given a reference genome:

Mycobacterium tuberculosis reference (tb.fasta)

You are also given an annotation file to use in iTOL at the end:

itol.txt

Download and unpack all of these files using:

wget -O - http://genomics.lshtm.ac.uk/phylo_task.tar.gz | tar -xvz
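Once the archive is unpacked, it may help to confirm that every forward read file has its partner before you start. The sketch below assumes you run it from the directory that holds the FASTQ files and that they follow the `*_1.fastq.gz` / `*_2.fastq.gz` naming shown above. `list_samples` and `samples.txt` are illustrative names, not part of the course material.

```shell
# List every sample that has both read files, one ID per line.
# Assumes the naming <sample>_1.fastq.gz / <sample>_2.fastq.gz.
list_samples() (
    dir="${1:-.}"
    for r1 in "$dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue    # the glob matched nothing
        sample=$(basename "$r1" _1.fastq.gz)
        if [ -e "$dir/${sample}_2.fastq.gz" ]; then
            echo "$sample"
        else
            echo "warning: no _2.fastq.gz found for $sample" >&2
        fi
    done
)

list_samples . > samples.txt
echo "Complete pairs found: $(wc -l < samples.txt)"
```

The resulting list of sample IDs can be reused in later stages so that every output file is named after the same ID.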

Workflow Overview

You will complete the following stages:

Read alignment
Variant detection
Consensus genome generation
Multi-sample genome collection
Phylogenetic reconstruction
Tree visualisation and annotation

Stage 1 — Read Alignment

Each sample must be aligned to the reference genome.

What you should think about:

ensuring correct pairing of reads
generating a sorted alignment file per sample
indexing the alignment for downstream analysis

Output expected:

one BAM file per sample
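One possible way to carry out this stage is sketched below. It assumes `bwa` and `samtools` are installed in the Codespace, that the reference is `tb.fasta` in the current directory, and that read files follow the `<sample>_1.fastq.gz` / `<sample>_2.fastq.gz` pattern. The choice of `bwa mem`, the thread count, and the helper name `align_sample` are assumptions made for illustration; the practical does not prescribe a particular aligner.

```shell
REF="${REF:-tb.fasta}"
THREADS="${THREADS:-2}"

# Align one read pair, then write a coordinate-sorted BAM and its index.
# Passing both files to bwa mem keeps the mates paired; the read group
# tags the alignments with the sample ID.
align_sample() (
    sample="$1"
    bwa mem -t "$THREADS" -R "@RG\tID:${sample}\tSM:${sample}\tPL:ILLUMINA" \
        "$REF" "${sample}_1.fastq.gz" "${sample}_2.fastq.gz" |
        samtools sort -@ "$THREADS" -o "${sample}.bam" - &&
        samtools index "${sample}.bam"
)

# Build the bwa index once, if the reference is present and not yet indexed.
if [ -e "$REF" ] && [ ! -e "$REF.bwt" ]; then
    bwa index "$REF"
fi

# Align every sample found in the current directory.
for r1 in *_1.fastq.gz; do
    [ -e "$r1" ] || continue
    align_sample "$(basename "$r1" _1.fastq.gz)"
done
```

Running `samtools flagstat <sample>.bam` afterwards gives a quick view of how many reads mapped, which can help spot poor samples early.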

Stage 2 — Variant Detection

From each alignment, identify differences relative to the reference genome.

What you should think about:

detecting SNPs and small variants
filtering low-confidence calls
producing a variant file per sample

Output expected:

VCF file per sample
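A minimal sketch of this stage using `bcftools` is shown below. It assumes one sorted, indexed `<sample>.bam` per sample in the current directory and the reference `tb.fasta`. The mapping-quality, base-quality, QUAL and depth cut-offs are illustrative values chosen for the example, not thresholds recommended by the practical, and `call_variants` is a hypothetical helper name.

```shell
REF="${REF:-tb.fasta}"

# Call haploid variants for one sample and keep the better-supported calls.
# The cut-offs below are example values only; adjust them to your data.
call_variants() (
    sample="$1"
    bcftools mpileup -f "$REF" -q 20 -Q 20 -Ou "${sample}.bam" |
        bcftools call --ploidy 1 -mv -Ou |
        bcftools view -i 'QUAL>=30 && INFO/DP>=10' -Oz -o "${sample}.vcf.gz" &&
        bcftools index -f "${sample}.vcf.gz"
)

# Process every BAM file in the current directory.
for bam in *.bam; do
    [ -e "$bam" ] || continue
    call_variants "$(basename "$bam" .bam)"
done
```

The output is compressed and indexed because `bcftools consensus` in the next stage expects an indexed VCF.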

Stage 3 — Consensus Genome Construction

Use variant information to reconstruct each isolate’s genome.

What you should think about:

applying each sample's VCF to the reference with bcftools consensus
generating a full-length FASTA per sample
ensuring consistent naming across samples

Output expected:

one consensus FASTA per isolate
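The sketch below shows one way to apply each VCF to the reference. It assumes a compressed, indexed `<sample>.vcf.gz` per sample and that `tb.fasta` contains a single sequence, so only the first line of the output is a header. The output directory `consensus` and the helper name `make_consensus` are assumptions for illustration.

```shell
REF="${REF:-tb.fasta}"
OUT_DIR="${OUT_DIR:-consensus}"

# Write <sample>.fasta with the sample ID as the FASTA header.
# Assumes a single-sequence reference: only line 1 is rewritten.
make_consensus() (
    sample="$1"
    mkdir -p "$OUT_DIR" &&
        bcftools consensus -f "$REF" "${sample}.vcf.gz" |
        sed "1s/.*/>${sample}/" > "$OUT_DIR/${sample}.fasta"
)

# Build a consensus for every VCF in the current directory.
for vcf in *.vcf.gz; do
    [ -e "$vcf" ] || continue
    make_consensus "$(basename "$vcf" .vcf.gz)"
done
```

Writing the consensus files into their own directory keeps them apart from `tb.fasta`, which shares the `.fasta` extension.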

Stage 4 — Combine Genomes

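Merge all consensus genomes into a single dataset. Here we will use Parsnp; consult the Parsnp manual for the available options. Parsnp builds a core-genome alignment from the consensus genomes and infers a tree from it, so this stage also covers the phylogenetic reconstruction step of the workflow.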

What you should think about:

moving the consensus FASTA files into a single directory
running Parsnp on that directory

Output expected:

multi-FASTA file containing all isolates
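A minimal sketch of a Parsnp run follows. It assumes the consensus files are already collected in a directory called `consensus`, that `parsnp` and `harvesttools` are installed, and that Parsnp writes `parsnp.ggr` and `parsnp.tree` into the output directory; output file names can differ between Parsnp versions, so check your own output. The directory names and the helper `run_parsnp` are illustrative.

```shell
REF="${REF:-tb.fasta}"
THREADS="${THREADS:-2}"

# Run Parsnp on a directory of genomes, then export the core-genome
# alignment as multi-FASTA with harvesttools.
# -c keeps every genome in the directory rather than letting Parsnp
# exclude those it judges too divergent from the reference.
run_parsnp() (
    genome_dir="$1"
    out_dir="$2"
    parsnp -r "$REF" -d "$genome_dir" -o "$out_dir" -c -p "$THREADS" &&
        harvesttools -i "$out_dir/parsnp.ggr" -M "$out_dir/parsnp.aln.fasta"
)

if [ -d consensus ]; then
    run_parsnp consensus parsnp_out
fi
```

Under these assumptions the tree to upload in the next stage would be `parsnp_out/parsnp.tree`.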

Stage 5 — iTOL Visualisation

Upload your tree to iTOL (Interactive Tree of Life). We have provided an annotation file (itol.txt), along with a script that generates it in case you are interested in how it was made.

What to look for:

clustering patterns
lineage structure
potential transmission clusters
outliers or unusual branches
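Annotations only attach to leaves whose names match the IDs in the annotation file, so a quick name check before uploading can save time. The sketch below pulls leaf names out of a Newick tree and prints those that do not appear as a whole word anywhere in the annotation file. The tree path `parsnp_out/parsnp.tree` and the helper names are assumptions, and the check is deliberately format-agnostic because the layout of `itol.txt` is not described here.

```shell
# Print the leaf names of a Newick tree, one per line.
# A leaf label directly follows "(" or ","; labels after ")" are
# internal node names or support values and are skipped.
tree_leaves() {
    tr -d '\n' < "$1" | grep -oE '[(,][^(),:;]+' | tr -d "(,'"
}

# Print every leaf that never appears as a whole word in the annotation file.
missing_annotations() (
    tree="$1"
    annot="$2"
    tree_leaves "$tree" | while IFS= read -r leaf; do
        grep -qwF -- "$leaf" "$annot" || echo "$leaf"
    done
)

if [ -e parsnp_out/parsnp.tree ] && [ -e itol.txt ]; then
    missing_annotations parsnp_out/parsnp.tree itol.txt
fi
```

If many leaves are reported, compare a few names by eye: leaf labels may carry a suffix such as a file extension that the annotation IDs lack.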

Key Hints

Tips for Mtb Phylo Analysis

Keep sample names consistent across all files
Low-quality samples may break downstream clustering
You may need to remove some samples before building the tree
The tree reflects genetic distance, not just lineage
Lineage labels are metadata overlays, not the result itself
Parsnp builds a core-genome alignment, not just SNPs
Interpretation is more important than execution