README.md 12.7 KB
Newer Older
ttmgr's avatar
ttmgr committed
1
2
3
# Workflow Metagenomics
# Tim Reska 

ttreska57's avatar
ttreska57 committed
4
5
6
7




ttreska57's avatar
ttreska57 committed
8
![Image](https://ascgitlab.helmholtz-muenchen.de/ttreska57/PhD_Notes/-/blob/main/Copy%20of%20computational_workflow_overviewv0.2.png)
ttmgr's avatar
ttmgr committed
9

ttmgr's avatar
ttmgr committed
10
# Applied software and the corresponding console commands
ttmgr's avatar
ttmgr committed
11
12
13
## basecalling 
-guppy basecaller

ttmgr's avatar
ttmgr committed
14
15
The first step for each sequencing run is to basecall your data, i.e., produce fastq sequencing files from the initial fast5 files. We will use the standard basecaller is guppy developed by Oxford Nanopore.

ttmgr's avatar
ttmgr committed
16
```shell
ttreska57's avatar
ttreska57 committed
17
$ guppy_basecaller -i /input_directory_fast5 -s /guppy_output -c <configuration_filge.cfg> [options]
ttmgr's avatar
ttmgr committed
18
```
ttmgr's avatar
ttmgr committed
19
The command above will call guppy on the input fast5 directory option (-i), write the output to the directory given with option -s in fastq format (option -o). The configuration file comprises the used nanopore device and the corresponding kit
ttmgr's avatar
ttmgr committed
20
21


ttreska57's avatar
ttreska57 committed
22
23
24
25
26
27
28
29
30
31
32
33
34
35
## Data Analysis and Quality Control
-Nanoplot

After the basecalling step we want an overview of the data: what is the number of reads, average length, quality etc.

```shell
$ NanoPlot --fastq /guppy_output -o nanoplot/ --title Passed_reads --loglength
```

The options --fastq and -o set the input and output for the command. We also want the length of the sequences shown log-transformed (--loglength) and also set a title for the plots (--title). The above command will create several plots and summary files including a *.html file that can be opened in your browser.

-fastqc 


ttmgr's avatar
ttmgr committed
36
37
38
## Adapter Removal
-Porechop

ttmgr's avatar
ttmgr committed
39
40
Porechop is a tool for finding and removing adapters from Oxford Nanopore reads. Adapters on the ends of reads are trimmed off, and when a read has an adapter in its middle, it is treated as chimeric and chopped into separate reads

ttmgr's avatar
ttmgr committed
41
42
43
44
45
```shell
$ porechop -i <fastq_output_guppy.fastq> -o <porechopped.fastq> 
```
The above command will use the default values of porechop to search for adapters in the input fastq file, trim the reads and write them to file porechopped.fastq

ttmgr's avatar
ttmgr committed
46
47
48
## Filtering
-NanoFilt

ttmgr's avatar
ttmgr committed
49
50
NanoFilt is a python script that can be used to filter reads based on length and quality as well as trimming parts of the sequence

ttmgr's avatar
ttmgr committed
51
```shell
ttmgr's avatar
ttmgr committed
52
$ NanoFilt -l 200 -q 7 < <porechopped.fastq> > <nanofiltered.fastq>
ttmgr's avatar
ttmgr committed
53
```
ttmgr's avatar
ttmgr committed
54
55
56
NanoFilt does not provide options for input or output files. Therefore we will use the two operators > and <

With the option -l 200 we remove all sequences shorter than 200 nucleotides and with -q we filter all reads with a phred quality score lower than 7
ttmgr's avatar
ttmgr committed
57

ttmgr's avatar
ttmgr committed
58
59
60
## Assembly
-Flye

ttmgr's avatar
ttmgr committed
61
62
Flye is a de novo assembler for long-reads. Flye also produces a polished consensus sequence for the assembly which significantly reduces the error rate.

ttmgr's avatar
ttmgr committed
63
```shell
ttreska57's avatar
ttreska57 committed
64
$ flye --meta --nano-hq <nanofiltered.fastq> -o /output_dir --genome-size 1m -i 1
ttmgr's avatar
ttmgr committed
65
```
ttreska57's avatar
ttreska57 committed
66
The above command uses the high-quality polished nanopore reads and assembles the genome. The estimated --genome-size is an obligatory input. The --meta option is for metagenomic approaches. The -i option is for polishing iterations.
ttmgr's avatar
ttmgr committed
67

ttmgr's avatar
ttmgr committed
68
69
## Polishing
-minimap2
ttmgr's avatar
ttmgr committed
70

ttmgr's avatar
ttmgr committed
71
72
Minimap2 is a versatile sequence alignment program that aligns DNA against a reference or draft genome. 

ttmgr's avatar
ttmgr committed
73
```shell
ttreska57's avatar
ttreska57 committed
74
$ minimap2 -ax map-ont -t 8 <flye_assembly.fa> <nanofiltered.fastq> > <aligned_reads.sam>
ttmgr's avatar
ttmgr committed
75
```
ttmgr's avatar
ttmgr committed
76
77
78

The above command uses the raw nanopore reads and maps them to the draft genome generated by flye. The option -ax map-ont sets the default parameters for nanopore reads.

ttreska57's avatar
ttreska57 committed
79
Minimap2 is also used to find overlaps between sequences.
ttmgr's avatar
ttmgr committed
80

ttreska57's avatar
ttreska57 committed
81
```shell
ttreska57's avatar
ttreska57 committed
82
$ minimap2 -ax map-ont -t 8 <raw_ont_reads.fa> <raw_ont_reads.fasta> > <overlaps.paf>
ttreska57's avatar
ttreska57 committed
83
```
ttmgr's avatar
ttmgr committed
84
-Racon
ttmgr's avatar
ttmgr committed
85
86
87
88

Consensus assemblies try to reduce error rates by choosing the most likely sequence of a given assembly and a set of raw reads. Although this does not incorporate the raw signal information of the flow cell to correct individual reads it can significantly improve the quality of an assembly. To improve a draft assembly with racon map the reads that should be used for error correction against the assembly. Use minimap to map the trimmed reads from prac3/nanofilt against the miniasm assembly and subsequently use the filtered reads and the mapping to build the consensus assembly

```shell
ttreska57's avatar
ttreska57 committed
89
$ racon racon [options ...] <correctedreads.fasta> <overlaps.sam|paf> <flyeassembly.fasta>
ttmgr's avatar
ttmgr committed
90
```
ttreska57's avatar
ttreska57 committed
91
-Medaka
ttmgr's avatar
ttmgr committed
92

ttreska57's avatar
ttreska57 committed
93
Medaka is a tool to create consensus sequences and variant calls from nanopore sequencing data. This task is performed using neural networks applied a pileup of individual sequencing reads against a draft assembly
ttmgr's avatar
ttmgr committed
94

ttreska57's avatar
ttreska57 committed
95
96
97
```shell
$ medaka_consensus –d flye_assembly.fasta -i filtered.fastq \ -o medaka_output –m R941_min_high_g303 –b 50
```
ttmgr's avatar
ttmgr committed
98

ttreska57's avatar
ttreska57 committed
99
The above command runs medaka on the flye assembly using the filtered reads. The –m model specifies the basecalling model used by guppy to produce the reads. Medaka will write its results to the folder medaka_output.
ttmgr's avatar
ttmgr committed
100

ttreska57's avatar
ttreska57 committed
101
102
103
104
105
106
107
108
## Sorting
-Samtools

```shell
$ samtools view -bS <aligned_reads.sam> > <aligned_reads.bam> [options]
$ samtools sort <aligned_sample.bam> -o <sample_reads_sorted.bam> [options]
```

ttmgr's avatar
ttmgr committed
109
110
## Binning
-MetaBat2
ttreska57's avatar
ttreska57 committed
111
112
113
114
115
116
117
118

MetaBAT2 takes a metagenome assembly and the reads that produced the assembly and organizes the contigs into putative genomes, called "bins".

```shell
$ runMetaBat.sh <options> <flye_assembly.fasta> <sample_reads_sorted.bam>
```
The above command runs medaka and the following files are the Output Object:The BinnedContig Object represents the directory of binned contigs created by MetaBAT2. This object can be used for downstream analysis. Output Bin Summary Report:The number of bins produced, the number of contigs that were binned and the total number of contigs in the assembly. Downloadable files: The enitre output of the MetaBAT2 run may be downloaded as a zip file. This zip file also contains a table of read-depth coverage per contig ("*.depth.txt")

ttmgr's avatar
ttmgr committed
119
-MaxBin2
ttreska57's avatar
ttreska57 committed
120
121
122
123
124
125
126
127
128
129

Let’s run a MaxBin binning on the flye assembly. First, we need to generate an abundance file from the mappes reads:

```shell
pileup.sh in=megahit.sam out=cov.txt | awk '{print $1"\t"$5}' cov.txt | grep -v '^#' > abundance.txt
qsub -cwd -pe multislot 24 -N maxbin -l mtc=1 -b y \ run_MaxBin.pl -thread 24 -contig final.contigs.fa -out maxbin -abund abundance.tx
```

MaxBin2 clusters metagenomic contigs (assembled contiguous genome fragments) into different "bins", each of which corresponds to a putative genome. It uses nucleotide composition information, source strain abundance (measured by depth-of-coverage by aligning the reads to the contigs), and phylogenetic marker genes to perform binning through an Expectation-Maximization (EM) algorithm.

ttmgr's avatar
ttmgr committed
130
-Vamb
ttreska57's avatar
ttreska57 committed
131
132
133
134
First we have to combine all our contigs to one fasta-list

```shell
concatenate.py /path/to/catalogue.fna.gz /path/to/assemblies/sample1/contigs.fasta
ttreska57's avatar
ttreska57 committed
135
136
137
/path/to/assemblies/sample2/contigs.fasta
```
Second we use our aligned and sorted reads together with the fasta catalogue to generate the corresponding bins.
ttreska57's avatar
ttreska57 committed
138
139
140
```shell
vamb --outdir path/to/outdir --fasta /path/to/catalogue.fna.gz --bamfiles /path/to/bam/*.bam -o C --minfasta 200000
```
ttmgr's avatar
ttmgr committed
141

ttreska57's avatar
ttreska57 committed
142
143
Vamb is a metagenomic binner which feeds sequence composition information from a contig catalogue and co-abundance information from BAM files into a variational autoencoder and clusters the latent representation. It performs excellently with multiple samples, and pretty good on single-sample data. Vamb is implemented purely in Python (with a little bit of Cython) and can be used both from command line and from within a Python interpreter.

ttmgr's avatar
ttmgr committed
144
145
146
## Bin Refining
-DAS Tool

ttreska57's avatar
ttreska57 committed
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
General command: 

```shell
DAS_Tool [options] -i <contig2bin> -c <contigs_fasta> -o <outputbasename>
```

Example command:

```shell
DAS_Tool  -i sample_data/sample.vamb_contigs2bin.tsv,\
sample_data/sample.maxbin2_contigs2bin.tsv,\
sample_data/sample.metabat_contigs2bin.tsv,\
-l vamb,maxbin,metabat \
-c sample_data/sample.contigs.fa \
-o sample_output/DASToolRun
```


DAS Tool is an automated method that integrates the results of a flexible number of binning algorithms to calculate an optimized, non-redundant set of bins from a single assembly


ttmgr's avatar
ttmgr committed
168
169
170
## Bin coverage
-CoverM

ttreska57's avatar
ttreska57 committed
171
172
173
174
175
176
177
178
CoverM calculates the coverage of a set of reads on a set of contigs

```shell
coverm contig --coupled read1.fastq.gz read2.fastq.gz --reference assembly.fna
```

CoverM calculates coverage of genomes/MAGs coverm genome (help) or individual contigs coverm contig (help). Calculating coverage by read mapping, its input can either be BAM files sorted by reference, or raw reads and reference genomes in various formats

ttmgr's avatar
ttmgr committed
179
180
181
## Bin QC
-CheckM

ttreska57's avatar
ttreska57 committed
182
183
184
185
186
187
188
189
190
191
192
193
194
CheckM works on a directory of genome bins in FASTA format. By default, CheckM assumes genomes consist of contigs/scaffolds in nucleotide space and that the files to process end with the extension fna. You can specify a different extension with the –x flag. CheckM calls genes internally using prodigal, taking care to identify genes with recoded stop codons

```shell
checkm lineage_wf -t 8 -x fa /home/bins /home/checkm
```
Or, to process files of called genes in amino acid space which have the extension faa, use:

```shell
checkm lineage_wf --genes -t 8 -x faa <bin folder> <output folder>
```

CheckM provides a set of tools for assessing the quality of genomes recovered from isolates, single cells, or metagenomes. It provides robust estimates of genome completeness and contamination by using collocated sets of genes that are ubiquitous and single-copy within a phylogenetic lineage. CheckM also provides tools for identifying genome bins that are likely candidates for merging based on marker set compatibility, similarity in genomic characteristics, and proximity within a reference genome tree.

ttmgr's avatar
ttmgr committed
195
196
197
## Bin Classification
-GDTB-Tk

ttreska57's avatar
ttreska57 committed
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
GTDB-Tk is a software toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes based on the Genome Database Taxonomy 


Step 1: Gene calling (identify)
```shell
gtdbtk identify --genome_dir /gtdbtk/genomes --out_dir /gtdbtk/identify --extension gz 
```

Step 2: Aligning genomes: The align step will align all identified markers, determine the most likely domain and output a concatenated MSA
```shell
gtdbtk align --identify_dir /gtdbtk/identify --out_dir /gtdbtk/align
``` 

Step 3: Classifying genomes (classify)
The classify step will place the genomes into a reference tree, then determine their most likely classification.
```shell
!gtdbtk classify --genome_dir /gtdbtk/genomes --align_dir /gtdbtk/align --out_dir /gtdbtk/classify -x gz 
```

ttmgr's avatar
ttmgr committed
217
218
219
## Bin Clustering
-dRep

ttreska57's avatar
ttreska57 committed
220
221
222
223
224
225
226
227
228
229
dRep is a python program which performs rapid pair-wise comparison of genome sets. One of it’s major purposes is for genome de-replication, but it can do a lot more.

It has in general three commands: compare, dereplicate, and check dependencies. For creating a set of non redundant genomes we use dRep dereplicate.

```shell
dRep dereplicate complete_only -g *.fna --S_algorithm gANI
```

De-replication is the process of identifying sets of genomes that are the “same” in a list of genomes, and removing all but the “best” genome from each redundant set. How similar genomes need to be to be considered “same”, how to determine which genome is “best”, and other important decisions are discussed in Important Concepts

ttreska57's avatar
ttreska57 committed
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
## Consensus error calculation
-QUAST

QUAST evaluates genome assemblies. QUAST works both with and without a reference genome. The tool accepts multiple assemblies, thus is suitable for comparison. 

```shell
quast.py test_data/contigs_1.fasta \
           test_data/contigs_2.fasta \
        -r test_data/reference.fasta.gz \
        -g test_data/genes.txt \
        -1 test_data/reads1.fastq.gz -2 test_data/reads2.fastq.gz \
        -o quast_test_output
```

QUAST stands for QUality ASsessment Tool. It evaluates genome/metagenome assemblies by computing various metrics. The current QUAST toolkit includes the general QUAST tool for genome assemblies, MetaQUAST, the extension for metagenomic datasets, QUAST-LG, the extension for large genomes (e.g., mammalians), and Icarus, the interactive visualizer for these tools.


ttreska57's avatar
ttreska57 committed
247
248
## Protein truncation interference
-IDEEL
ttreska57's avatar
ttreska57 committed
249
250


ttreska57's avatar
ttreska57 committed
251
252
## Protein Sequence Prediction
-Prodigal
ttreska57's avatar
ttreska57 committed
253

ttreska57's avatar
ttreska57 committed
254
Fast, reliable protein-coding gene prediction for prokaryotic genomes.
ttreska57's avatar
ttreska57 committed
255

ttreska57's avatar
ttreska57 committed
256
257
258
259
```shell
prodigal -i my.genome.fna -o my.genes -a my.proteins.faa
prodigal -i my.metagenome.fna -o my.genes -a my.proteins.faa -p meta
```
ttreska57's avatar
ttreska57 committed
260
261


ttreska57's avatar
ttreska57 committed
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
## Homopolymer Counting
-Counterr

Counterr is a light-weight command line tool that computes errors in sequencing data by comparing the reads to a reference genome.

```shell
counterr -bam file.bam -genome assembly.fa -output_dir output -verbose
```

Counterr takes as input an alignment of the reads (bam + bai) and the corresponding reference (fasta) and outputs summaries (figures/tables) of errors in the reads. The tool computes both context independent and context dependent error statistics of the reads. The latter can be particularly helpful for uncovering systematic errors latent in the sequencing technology (due to hardware, basecaller, etc.).


## Antimicrobial Resistance
-EPI2ME ARMA? 




ttmgr's avatar
ttmgr committed
280