Third Generation Sequencing

Introduction

In this session, we will be exploring data generated by third-generation nanopore sequencing technology. Developed by Oxford Nanopore Technologies (ONT), these platforms differ from next-generation 'sequencing-by-synthesis' approaches by using an array of microscopic protein pores set into an electrically resistant membrane. Strands of DNA or RNA are guided through these nanopores. Each nanopore is connected to its own electrode and sensor chip, which measures the electric current flowing through the pore. As a molecule passes through, it disrupts the current to produce a characteristic signal known as a 'squiggle' which is stored in 'pod5 files'. This squiggle is decoded in real time using basecalling algorithms to determine the DNA or RNA sequence in fastq format. ONT’s most popular platform, the MinION, is capable of generating single reads longer than 4 megabases.

The MinION is one of five scalable platforms developed by ONT. High-throughput systems such as the GridION and PromethION use an array of nanopore flow cells to produce between 5 and 48 times more data than the MinION alone — outputting up to 48 TB of data in a single run. More compact solutions such as the Flongle and SmidgION use smaller, single flow cells to generate data. The MinION itself is highly portable, roughly the size of a large USB flash drive. This portability allows researchers to perform sequencing analyses almost anywhere, provided they have the necessary equipment for library preparation and data analysis.

A complete sequencing run on the MinION can generate upwards of 1 TB of raw data. Processing this volume of data requires substantial computing power, including multicore high-performance processors and large amounts of RAM. This presents a logistical challenge for researchers who wish to take full advantage of the platform’s portability. However, the integration of GPUs (graphics processing units) in recent years has significantly improved analysis workflows.

The long reads generated by nanopore sequencing are ideal for de novo genome assembly. They can span repetitive regions and structural variants, making it easier to reconstruct genomes with fewer gaps and higher continuity compared to short-read sequencing. Here, we will apply what we have learned about genome assembly and long-read sequencing to E. coli data.

Activity Briefing: Identifying drug-resistance from Escherichia coli WGS sequenced using Oxford Nanopore

Create new codespace nanopore!

In the data repository you have been provided with Escherichia coli whole genome data that have been sequenced using Oxford Nanopore (MinION). These samples were collected from a long-term health care facility in Japan by a previous study (ENA Project Accession: PRJDB9189). The WGS data will need to be basecalled, quality checked and analysed. Your task is to analyse the Nanopore fastq data to identify drug-resistance genes.

Step 1. Basecalling

We have the pod5 files already in the data directory that are output directly from the nanopore sequencer. Now it is time to basecall the data to get the fastq files. To run dorado correctly, we need a few things:

The pod5 files
The barcoding kit. Important to get right due to the signal interpretation from the model, and the specific barcoding sequences given to each model.
The basecalling model. The machine learning model to decode the nanopore sequencing data. You can learn more about the models here https://dorado-docs.readthedocs.io/en/latest/models/models/. But they essentially come in 3 different versions and correspond to the kits and chemistries used:
fast (the fastest and least accurate)
hac (high accuracy)
sup (super-accurate, the most accurate)
The sample sheet. Information about the samples you're running, must be in the correct format which you can see here https://dorado-docs.readthedocs.io/en/latest/barcoding/sample_sheet/

You will find all of these things in the data directory.

We will now basecall using dorado, take notice of the parameters we have used. ** Note you may wish to add --emit-fastq when running in the future.

conda activate nanopore
cd nanopore/data
python ./taxdump/dorado --output-dir dorado_out  --min-qscore 7 --kit-name SQK-NBD114-24 --sample-sheet sample_sheet.csv --trim all

Ok, great we have basecalled! However, the file has a generic name, let's move it back to our data directory before we do the quality checks.

cp ./dorado_out/exp40.fastq.gz Ecoli_Japan_1.fastq.gz

Step 2. Quality Control and Contamination

The output sequence data is not always ready to be put straight through a pipeline. We must first perform quality control, trimming and filtering for contamination. Nanopore library preparation results in the addition of a sequencing adapter at each end of the fragment. Both the template and complement strands to be sequenced carry the motor protein which means both strands are able to translocate the nanopore. For downstream analysis, it is important to remove these adapters. For this we will use chopper, although sometimes porechop can be used. These programs processes all of the reads in our basecalled fastq file, and removes these adapter sequences. Furthermore, the ligation library prep process can result in conjoined reads, meaning an adapter will be found in the middle of an extra-long read. Trimming tools will identify these, split them and remove the adapters. In addition, if you use a multiplexing kit to maximise sample throughput, this program will split the reads based on the molecular barcode added to each sample. Our dataset only has one sample, so this demultiplexing won't be necessary. Another method of quality control is to check our reads for sequence contamination from other 'off-target' organisms. This is important in order to understand how effective your DNA extraction, enrichment and sequencing was. And secondly, to prevent anomalous reads from being incorporated in to assemblies. Using our basecalled reads we will perform an analysis using Kraken. Kraken is a tool which sifts through each read in a .fastq file and crosschecks it against a database of microorganisms. The output is a taxonomic assignment of each read, enabling to identify if any contamination has occurred.

Let's first check the quality of the nanopore fastq data. To do this we use a 'sequencing_summary.txt' file generated using dorado. We check quality using PycoQC:

mkdir pycoqc
pycoQC -f sequencing_summary.txt -o pycoqc/Ecoli_Japan_1_PycoQC.html
firefox pycoqc/Ecoli_Japan_1_PycoQC.html

Open the files using firefox or view your files on the left-hand side and right click PycoQC.html and download to your own folder and open the html file in your web browser.

Question

What can you see from the output? How do you think the read lengths compare to Illumina? How does read quality vary over time?

Next we should trim our files and filter for high-quality reads. To trim the nanopore reads we have used Chopper (You could also used Porechop_abi, see Tips and Tricks) to identify any adapter sequences and remove them. This is an important step, especially when performing genome assembly.

gunzip -c Ecoli_Japan_1.fastq.gz | chopper -q 10 -l 500 --headcrop 50 --tailcrop 50 | gzip > Ecoli_Japan_1_trim.fastq.gz

We should also check to see if the sequence contains any contaminants. We have used Kraken2 to do this. The Kraken reports have been generated for you as they require large databases. The report generated by Kraken is a tab-delimited file which contains a list of all the reads in the input file, and the taxonomic assignment of each read. The first column is the read ID, the second is the taxonomic ID, the third is the length of the read, and the fourth is the lowest common ancestor (LCA) of the taxonomic assignment. The LCA is the lowest taxonomic rank that all the taxonomic assignments share. The report is useful for identifying the taxonomic assignments of reads, and for identifying any contamination in the dataset. We can visualise the results using recentrifuge (rcf), here there is a dependency clash in our conda environment, so let's quickly install using pip:

mkdir kraken
pip install recentrifuge
rcf -n ./taxdump/ -k Ecoli_Japan_1.koutput.txt -o kraken/Ecoli_Japan_1_kraken.html
firefox kraken/Ecoli_Japan_1_kraken.html

Open the files using firefox or view your files on the left-hand side and right click Ecoli_Japan_1_kraken.html and download to your own folder and open the html file in your web browser.

Question

What can you determine from the Kraken2 outputs? Is the data clean? There are a few other genera included in the output — do you think they might be contaminants?

Just in case we will filter our fastq files to remove any contaminants. We won't be too stringent with the filtering- we will use the taxa id from NCBI for Enterobacteriaceae (543). To do this we will use KrakenTools:

python ./taxdump/extract_kraken_reads.py --include-children --fastq-output -t 543 -k Ecoli_Japan_1.koutput.txt -r Ecoli_Japan_1.kreport.txt -s Ecoli_Japan_1_trim.fastq.gz -o ./kraken/Ecoli_Japan_1.kraken_filtered.fq

Step 3. Mapping and Variant Calling

The mapping and variant calling tools for Nanopore are different than those typically used with Illumina data. Now that we have verified a successful sequencing run, our data are ready to go, we will now map the reads on to a reference genome and perform variant calling. Nanopore fastq data can be mapped to an E.coli reference genome using Minimap2. Notice the '-ax map-ont' signalling we are using Nanopore data. Minimap2 creates a SAM file, a type of alignment file. We need to convert this to a BAM file to make the alignment compatible with other software (see tips and tricks). We do this using samtools, but you are not required to do this as a part of this practical.

minimap2 -ax map-ont Ecoli_reference.fasta ./kraken/Ecoli_Japan_1.kraken_filtered.fq > Ecoli_Japan_1_aln.sam

For variant calling we tend to use variant callers that have been designed specifically for Nanopore data. This includes Clair3 and Freebayes, which can generate outputs that are generally compatible with GATK-based workflows. These tools can be fairly slow so feel free to trial them in your own time!

Step 4. Assembly

While mapping aligns reads to a known reference, assembly builds the genome from scratch. Assembly is essential when no good reference exists or when studying structural variation and novel regions. We often assemble bacterial genomes because they are small, circular, and can be fully reconstructed. Assembly allows us to detect plasmids, structural variants, and novel genes that mapping alone might miss. Let's now perform genome assembly using our high-quality, trimmed and filtered nanopore fastq. There are a few different tools we can use, but for Nanopore data Flye performs well. This may take 2 minutes or so to run.

mkdir flye_output
python ./taxdump/flye --nano-raw ./kraken/Ecoli_Japan_1.kraken_filtered.fq --genome-size 4.6m --out-dir flye_output --threads 4

Step 5. Assess the Quality of your Assembly

Now the assembly is ready we will test its quality using BUSCO. BUSCO assesses the quality of genome assemblies by looking at the percentage of conserved genes. Open the short_summary.specific.enterobacteriaceae_odb12.assembly_QC.txt file.

Question

What do you think, is it a good assembly?

Step 6. Annotate your assembly

Ok, so the assembly could be better, but how about we annotate some genes? Genome annotation identifies genes and other important elements within a DNA sequence. It predicts where genes are located and what they might do. This helps scientists understand the biological function of the genome. Annotation is crucial for studying how organisms work and evolve. We will do this using Prokka. Prokka is a software tool used for rapid genome annotation of prokaryotic (bacterial and archaeal) genomes. It identifies genes, predicts their functions, and produces standard annotation files. Prokka streamlines the process by using multiple databases and tools in one easy step. It is widely used because it is fast, automated, and produces consistent results. There are lots of other tools for annotating Eukaryotes and beyond, however, this tool is well-developed for bacteria species.

prokka --outdir assembly_annotation --prefix Ecoli_Japan_1 flye_output/assembly.fasta

Inside the assembly_annotation directory you will see "Ecoli_Japan_1.gff". This is the most common format for annotations. If you like, you can load this into igv to visualise the annotation (Load Genome: Ecoli_Japan_1.fna; Load files:Ecoli_Japan_1.gff). We can analyse the annotations further downstream (See Step 6).

Step 7. Predicting Drug-resistance

Great! We have an assembly and we have annotated it. How are the assemblies useful for downstream analysis? For bacteria specifically, genome assemblies can be input into specialist software to mass screen the contigs (bits of the assembly) for antimicrobial resistance genes, virulence genes or plasmids. Abricate is a handy suite of tools that can help us do this by scanning databases of these genes. To identify resistance genes:

mkdir abricate_results
abricate --db resfinder --quiet flye_output/assembly.fasta > abricate_results/resistance_results.txt

To identify virulence genes:

abricate --db vfdb --quiet flye_output/assembly.fasta > abricate_results/virulence_results.txt

To identify plasmids:

abricate --db plasmidfinder --quiet flye_output/assembly.fasta > abricate_results/plasmid_results.txt

Question

Take a look at some of the resistance genes and plasmids- why might the blaCTX-M-27 gene be of concern?

Step 8. If you have time: Pan-genome Analysis

Please activate a new codespace pangenomes!

With growth in the size of datasets, there is a need to understand key processes such as selection and evolution taking place in bacteria populations. For example, bacteria can transfer genes to one another, otherwise known as 'horizontal gene transfer' which can spread virulence and resistance genes. One way to identify some of these differences is to look at core or accessory genes within a population: A Pangenome. Roary and Pirate can be used to compare our gff annotation files to one another and construct a pangenome. Roary constructs bacterial pangenomes by clustering genes extracted from annotated genomes provided in GFF3 format. It uses these annotations to identify core and accessory genes, helping researchers analyse genetic diversity, evolution, and functional differences across strains. Accurate GFF files are critical for reliable gene comparison and clustering. We will use a different conda environment and codespace. This may take a moment or two!

conda activate roary
cd pangenomics/all_gffs
cpanm -f Bio::Roary 
roary -e --mafft -p 2 *.gff

View your files, right click and download "accessory_binary_genes.fa.newick" and "gene_presence_absence.csv". Open your favourite web browser and load https://jameshadfield.github.io/phandango/#/ to view the roary output files. Drag and drop the "accessory_binary_genes.fa.newick" and "gene_presence_absence.csv" files into phandango. You should get something that looks like this:

Question

What do you think the blue blocks mean?

Can you do this analysis with Illumina data?

Yes! Anything from steps 4 to 8 can be done with Illumina data. You might just need to switch up your tool box. For bacteria genome assembly with short read data, there are some fantastic tools available such as shovill. The short read assemblies can be annotated using prokka and analysed using abricate as above. Here, we wanted to demonstrate the utility of nanopore data for real-time sequencing and help you gain some experience in the downstream analysis.

Tips and Tricks

You may wish in the future to trim your own reads and make Kraken2 reports by yourself or use porechop to trim reads. These are the commands we have used to generate the data for the practical. These commands take some time or require large databases which is why we have performed these steps in advance.

# Kraken2 
kraken2 Ecoli_Japan_1.fastq.gz --report Ecoli_Japan_1.kreport.txt --output Ecoli_Japan_1.koutput.txt
# Trimming
porechop_abi -abi -i Ecoli_Japan_1.fastq.gz -o Ecoli_Japan_1_trim.fastq.gz
# Flye
flye --nano-raw ./kraken/Ecoli_Japan_1.kraken_filtered.fq --genome-size 4.6m --out-dir flye_output --threads 4
# BUSCO
busco -i flye_output/assembly.fasta -l enterobacteriaceae_odb12 -c 4 -m genome -o assembly_QC
# Converting sam to bam
samtools sort -o Ecoli_Japan_1_aln.bam Ecoli_Japan_1_aln.sam