5 Binning
In metagenomics, binning is the process of grouping reads or contigs and assigning them to individual genomes, each known as a Metagenome-Assembled Genome (MAG). Coverage-based binning approaches require you to map reads to the assembled contigs.
Comparison of different binning tools
1. Mapping
Mapping gives you a rough count of the total number of reads mapping to each contig, also referred to as the depth file or read coverage. This information is required by most binning algorithms. Raw reads from each sample are mapped to the contigs (assembled reads). The idea is that if two contigs come from the same genome in your sample, i.e. the same organism, they will have been sequenced to roughly the same depth; they will therefore have similar coverage, so they belong together. It gets even better if you have multiple samples that vary a bit and you sequence all of them: you do your assembly on one of the samples, then take the reads from the other samples, map them to that metagenomic assembly, and use that as additional information.
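The coverage-profile idea above can be illustrated with a toy depth table (all file names, contig names, and numbers here are made up for illustration): contigs from the same genome keep roughly the same coverage ratio across samples.

```shell
# Mock depth table: contig, length, mean depth, depth in sample 1 and sample 2
printf 'contigName\tcontigLen\ttotalAvgDepth\ts1.bam\ts2.bam\n' >  depth_demo.tsv
printf 'contig_A\t5000\t20.1\t10.2\t30.0\n'                     >> depth_demo.tsv
printf 'contig_B\t4800\t19.8\t10.0\t29.6\n'                     >> depth_demo.tsv
printf 'contig_C\t3000\t55.0\t80.0\t30.0\n'                     >> depth_demo.tsv

# contig_A and contig_B rise and fall together across samples (similar
# ratio), so a coverage-based binner would place them in the same bin;
# contig_C has a different profile and lands elsewhere.
awk 'NR>1 {printf "%s ratio_s2/s1=%.2f\n", $1, $5/$4}' depth_demo.tsv
```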
Different tools exist for mapping reads to genomic sequences. Here we use the Scons Wrapper, which maps FASTQ reads against an assembly (e.g. a genome) in FASTA format using BWA-MEM. This wrapper does not produce huge intermediate files (e.g. an unfiltered SAM).
I am currently looking for a way to loop this. I tried specifying only the input folder and the sample IDs as proposed in the README, but it does not work, so for now these steps must be repeated for every sample.
For each sample, create a new folder in the directory 05_binning and, inside each folder, copy the filtered assembled contigs FASTA and the interleaved-trimmed FASTQ for that sample.
- Activate the `scons_maps` conda environment.
- Execute scons from the `genome_mapping` directory (`/home/genomics/genome_mapping`). Modify the following parameters accordingly:
  - `--fastq_dir=` directory containing the filtered assembled contigs FASTA and the interleaved-trimmed FASTQ files.
  - `--assembly=` path to the filtered contig FASTA file, including the file name at the end.
  - `--sampleids=` name of the interleave.trim.fastq.

```
scons --fastq_dir=/home/genomics/user/05_binning/sample_01/ --assembly=/home/genomics/user/05_binning/sample_01/1kb_sample_01_contigs.fasta --outdir=/home/genomics/user/05_binning/sample_01/output --sampleids=sample_01.fastq.interleave.fastq.trim.fastq --align_thread=5 --samsort_thread=5 --samsort_mem=768M --nheader=0 --tmpdir=/home/genomics/tmp --logfile=mapping.log
```

In the specified output directory you will find a `.sorted.bam` file, which contains the depth information required by MetaBAT2. Create a directory called `sorted_bam` and move all the `.sorted.bam` files into this directory.
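Until the README's batch mode works, one workaround is a plain bash loop over the per-sample folders. This is a sketch only: the sample names are hypothetical, and the leading `echo` prints each scons command for inspection instead of running it (remove the `echo` to actually execute).

```shell
# Hypothetical per-sample loop; sample_01/sample_02 are placeholders for
# your real sample folder names under 05_binning.
BASE=/home/genomics/user/05_binning
for sample in sample_01 sample_02; do
  echo scons --fastq_dir="$BASE/$sample/" \
    --assembly="$BASE/$sample/1kb_${sample}_contigs.fasta" \
    --outdir="$BASE/$sample/output" \
    --sampleids="${sample}.fastq.interleave.fastq.trim.fastq" \
    --align_thread=5 --samsort_thread=5 --samsort_mem=768M \
    --nheader=0 --tmpdir=/home/genomics/tmp \
    --logfile="${sample}_mapping.log"
done
```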
2. Depth file
The depth tells you how many sequences align to each section of your contigs. Sections with very little depth (few sequences) are not reliable to use. We use the script jgi_summarize_bam_contig_depths.
- From the directory `sorted_bam`, create a bash script called `run_depthfile.sh` with the following commands and execute the script using nohup.

```
#!/bin/bash
for i in *.sorted.bam
do
  /home/SCRIPT/jgi_summarize_bam_contig_depths --outputDepth $i.depth.txt --pairedContigs $i.paired.txt $i
done
```

- Output: for each given file, the script generates a file ending with `.depth.txt`.
- Move these files into a directory called `metabat2`, and also copy the filtered contig FASTA into this folder.
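The `.depth.txt` output is a plain tab-separated table, so it is easy to sanity-check before binning. Below is a mock file whose layout mirrors the tool's header (the contig names and values are invented), plus a quick filter for the low-depth sections mentioned above.

```shell
# Mock jgi_summarize_bam_contig_depths output for illustration
printf 'contigName\tcontigLen\ttotalAvgDepth\tsample_01.sorted.bam\tsample_01.sorted.bam-var\n' >  example.depth.txt
printf 'contig_1\t12000\t35.2\t35.2\t4.1\n' >> example.depth.txt
printf 'contig_2\t900\t2.0\t2.0\t0.5\n'     >> example.depth.txt

# List contigs with very low mean depth (< 5x), i.e. the unreliable ones
awk 'NR>1 && $3 < 5 {print $1, $3}' example.depth.txt
```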
3. Create bins
MetaBAT2 is an efficient tool for accurately reconstructing single genomes from complex microbial communities.
- If required, first deactivate `scons_map` and then activate conda.
- In order to loop through all samples, the name of the filtered contig FASTA file must match exactly the beginning of the depth file. Example:
  - Filtered contig FASTA: `1kb_sample_01_contigs.fasta`
  - Depth file: `1kb_sample_01_contigs.fasta.depth.txt`
- Create a bash script called `run_metabat2.sh` with the following commands and execute the script using nohup.

```
#!/bin/bash
for i in *.fasta
do
  metabat2 -i $i -a $i.depth.txt -o bins_$i -t 0 --minCVSum 0 --saveCls -d -v --minCV 0.1 -m 1000
done
```

- Output: for each given file, metabat2 generates a directory containing all the bins created.
MetaBAT2 - Under user genomics2
This MetaBAT2 installation does not require conda
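A quick sanity check after binning is to count how many bins (`.fa` files) each run produced. This sketch assumes the bins land in directories named `bins_*` as above; the `bins_demo` directory and its files are created here purely for illustration.

```shell
# Fake bin directory standing in for real MetaBAT2 output
mkdir -p bins_demo
touch bins_demo/bin.1.fa bins_demo/bin.2.fa

# Report the number of bin FASTAs per bins_* directory
for d in bins_*/; do
  printf '%s\t%s bins\n' "${d%/}" "$(ls "$d"*.fa 2>/dev/null | wc -l | tr -d ' ')"
done
```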
4. Check quality
I recommend using CheckM2, which is now available under the user genomics2.
You have to go back one folder in the terminal, as CheckM runs on all the files in the folder you give it as input. CheckM automatically creates a folder called checkm in the specified directory; therefore, if you need to run CheckM again, make sure to delete the previously created checkm folder, otherwise CheckM will give you an error message.
- Activate the environment
- Create a bash script called `run_checkm.sh` with the following commands and execute the script using nohup.
- Open the `output.txt` document with Excel to verify the completeness and contamination of your bins. Standard: Completeness > 50 % and Contamination < 10 %.
- Remove all the spaces with Control+H.
- Filter the columns by Completeness, and separate the ones < 50 % by adding a line in Excel.
- Filter by Contamination, and highlight all the ones > 10 %. These are the bins you want to clean.
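If you prefer to skip Excel, the same completeness/contamination filter can be done on the command line. CheckM2 writes a tab-separated report (named quality_report.tsv in recent versions) whose first three columns are Name, Completeness, and Contamination; the mock file below imitates that layout with invented bins.

```shell
# Mock CheckM2 report for illustration (columns: Name, Completeness, Contamination)
printf 'Name\tCompleteness\tContamination\n' > quality_report.tsv
printf 'bin.1\t95.3\t1.2\n'  >> quality_report.tsv
printf 'bin.2\t40.0\t3.0\n'  >> quality_report.tsv
printf 'bin.3\t88.0\t15.5\n' >> quality_report.tsv

# Keep bins meeting the standard: Completeness > 50 and Contamination < 10
awk -F'\t' 'NR>1 && $2 > 50 && $3 < 10 {print $1}' quality_report.tsv
```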
Available under the user genomics2
- Activate the environment
- Set the path to the database
- Create a bash script called `run_checkm.sh` with the following commands and execute the script using nohup.

```
checkm2 predict --threads 30 --input path/to/folder/with/your/bins --output-directory path/to/folder/whereto/output --force -x fasta
```

5. Metagenome assembled genome statistics
This information is all generated by CheckM2, so if CheckM2 was used these steps can be skipped.
```
#!/bin/bash
for i in *.fa
do
  python /home/genomics/quast-5.2.0/quast.py $i -o /home/genomics/karine/stleonard/E4-2/quast_out/$i --threads 48 --no-check --no-plots --no-html --no-icarus
done
```

Then rename each per-bin report so they can all be collected in one place:

```
for folder in */; do
  folder="${folder%/}"
  mv "$folder/report.txt" "${folder}_report.txt"
done
```

6. Taxonomy
Activate the GTDBTK environment. In the folder with all your clean and completed genomes, run this command with nohup:

```
#!/bin/bash
gtdbtk classify_wf --cpus 20 --genome_dir /home/genomics/06_mags/ --out_dir /home/genomics/06_mags/gtdbk_output -x fa
```

Once it is done running, you can open the folder called gtdbk_output and copy the file gtdbtk.bac120.summary.tsv to your local computer in order to open it with Excel.
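The summary file can also be inspected directly in the terminal: it is tab-separated, with the genome (bin) name in column 1 and the full GTDB classification string in column 2. The file below is a one-bin mock with an invented classification, just to show the layout.

```shell
# Mock gtdbtk.bac120.summary.tsv for illustration
printf 'user_genome\tclassification\n' > gtdbtk.bac120.summary.tsv
printf 'bin.1\td__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__;f__;g__;s__\n' >> gtdbtk.bac120.summary.tsv

# Pull out just the genome name and classification without opening Excel
cut -f1,2 gtdbtk.bac120.summary.tsv
```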
gtdbtk-2.5.2 - Under user genomics2

```
conda activate gtdbtk_v2.5.2
gtdbtk classify_wf --genome_dir /home/genomics2/pierre/05_bins --out_dir /home/genomics2/pierre/05_bins -x fa --cpus 40
```

Activating the environment prints: `[INFO] - Added GTDBTK_DATA_PATH (/home/genomics2/miniconda3/envs/gtdbtk-2.5.2/share/gtdbtk-2.5.2/db) to the GTDB-Tk conda environment.`