5 Binning

In metagenomics, binning is the process of grouping reads or contigs and assigning them to individual genome known as Metagenome Assembled Genome (MAG). Coverage-based binning approaches will require you to map reads to assembled contigs.

Comparision of different binning tools

1. Mapping

Mapping allows you to get a rough count of the total number of reads mapping to each contig, also referred to as the depth file or read coverage. This information is required for most binning algorithm. Raw reads from each sample are mapped to the contigs (assembled reads).The idea being that if two contigs come from the same genome in your sample, so the same organism, then they would have been sequenced roughly to the same depth, they would have a similar coverage and they go together. It gets even better if you have multiple samples that vary a bit and you sequence all of them. You do your assembly on one of the samples, then you can take the reads from the other samples, map to that genome (metagenomic assembly) and you can use that as additional information.

Different tools exists for mapping reads to genomic sequences. Here we are using Scons Wrapper which maps FASTQ reads against an assembly (e.g. genome) in FASTA format using BWA-MEM. This wrapper does not produce huge intermediate files (e.g. unfiltered SAM).

I am currently looking for a way to loop this. Tried specifying only the input folder and the sampleids as proposed in the README but it does not work. Therefore these steps must be repeated for every sample. .

For each sample, create a new folder in the directory 05_binning and inside each folder copy the filtered assembled contigs FASTA and the interleaved-trimed FASTQ for that sample.

  1. Activate the scons_maps conda environment
conda activate scons_map
  1. Execute scons from the genome_mapping directory (/home/genomics/genome_mapping). Modify the following parameters accordingly :
  • --fastq_dir = directory where the filtered assembled contigs FASTA and the interleaved-trimed FASTQ files are.
  • --assembly = path to the filtered contig FASTA files including the name of the file at the end.
  • --sampleids = name of the interleave.trim.fastq.
scons --fastq_dir=/home/genomics/user/05_binning/sample_01/ --assembly=/home/genomics/user/05_binning/sample_01/1kb_sample_01_contigs.fasta --outdir=/home/genomics/user/05_binning/sample_01/output --sampleids=sample_01.fastq.interleave.fastq.trim.fastq --align_thread=5 --samsort_thread=5 --samsort_mem=768M --nheader=0 --tmpdir=/home/genomics/tmp --logfile=mapping.log

In the specified output directory you will find a .sorted.bam which contains the depth information required by MetaBAT2. Create a directory call sorted_bam and move all the sorted.bam files into this directory.

2. Depth file

The depth allows you to know how many sequence you can align with certain sections of your contigs. Section with very little depth (few sequences) are not reputable to use. We use the script jgi_summarize_bam_contig_depths.

  1. From the directory sorted_bam create a bash script called run_depthfile.sh with the following commands and execute the script using nohup.
#!/bin/bash
for i in *.sorted.bam
  do /home/SCRIPT/jgi_summarize_bam_contig_depths --outputDepth $i.depth.txt --pairedContigs $i.paired.txt $i
done 
  • Output : For each given file the script generates a file ending with .depth.txt as output.
  • Move these files into a directory call metabat2 and also copy into this folder the filtered contig FASTA.

3. Create bins

MetaBAT2 is an efficient tool for accurately reconstructing single genomes from complex microbial communities.

  1. If required, first deactivate scons_map and then activate conda.
conda deactivate
conda activate
  1. In order to be able to loop this for throught all samples, the name of filtered contig FASTA file must match exactly the beginning of the depths file. Example :
  • Filtered contig FASTA : 1kb_sample_01_contigs.fasta
  • Depth file : 1kb_sample_01_contigs.fasta.depth.txt
  1. Create a bash script called run_metabat2.sh with the following commands and execute the script using nohup.
#!/bin/bash
for i in *.fasta
  do metabat2 -i $i -a $i.depth.txt -o bins_$i -t 0 --minCVSum 0 --saveCls -d -v --minCV 0.1 -m 1000
done 
  • Output - for each given file metabat2 generates a directory containing all the bins created.

MetaBAT2 - Under user genomics2

This MetaBAT2 installation does not require conda

~/metabat/bin/metabat2 -i assmbly.fasta -o ~/folder/to/export/bins/ -a depth.txt -t 20

4. Check quality

I recommend using checkm2 which is now available under the user genomics2.

Checkm

You have to go back one folder in the terminal as checkm will run on all the files in the folder you give it as input. Checkm will automatically create a folder called checkm in the specified directory, therefor if you must run checkm again make sure to delete the newly created checkm folder, otherwise checkm will give you an error message.

  1. Activate the environment
conda activate checkm
  1. Create a bash script called run_checkm.sh with the following commands and execute the script using nohup.
checkm lineage_wf -x fa 2kbp_bins/ 2kbp_bins/checkm -f 2kbp_bins/output.txt -t 48 
  1. Open the output.txt document with excel to verify the completeness and contamination of your bins. Standard : Completeness > 50 % and Contamination < 10 %

  2. Remove all the spaces with control + H

  3. Filter the columns by Completeness, and separate the ones < 50 % by adding a line in excel

  4. Filter by Contamination, and highlight all the ones > 10 % - These are the bins you want to clean

checkm2

Available under the user genomics2

  1. Activate the environment
conda activate checkm2
  1. Set path to the databse
export CHECKM2DB="/home/SCRIPT/CheckM2_database/uniref100.KO.1.dmnd"
  1. Create a bash script called run_checkm.sh with the following commands and execute the script using nohup.
checkm2 predict --threads 30 --input path/to/folder/with/your/bins --output-directory path/to/folder/whereto/output --force -x fasta 

5. Metagenome assembled genome statistics

Quast

These informations are now all generate by Checkm2 so if Checkm2 was used these steps do not need to be used.

#!/bin/bash
for i in *.fa
  do  python /home/genomics/quast-5.2.0/quast.py $i -o /home/genomics/karine/stleonard/E4-2/quast_out/$i --threads 48 --no-check --no-plots --no-html --no-icarus
done
for folder in *; do
    (cd "$folder" && report.txt)  
    mv "$folder/report.txt" "${folder}_report.txt"
done

6. Taxonomy

GTDBTK

Activate the GTDBTK environment

conda activate gtdbtk-2.4.0

In the folder with all your clean and completed genomes run this command with nohup

#!/bin/bash
gtdbtk classify_wf --cpus 20 --genome_dir /home/genomics/06_mags/ --out_dir /home/genomics/06_mags/gtdbk_output -x fa

Once it is done running, you can open the folder called gtdbk_output and copy the file gtdbtk.bac120.summary.tsv to your local computer in order to open it with excel.

gtdbtk-2.5.2 - Under user genomics2

conda activate gtdbtk_v2.5.2
gtdbtk classify_wf --genome_dir /home/genomics2/pierre/05_bins --out_dir /home/genomics2/pierre/05_bins -x fa --cpus 40

[INFO] - Added GTDBTK_DATA_PATH (/home/genomics2/miniconda3/envs/gtdbtk-2.5.2/share/gtdbtk-2.5.2/db) to the GTDB-Tk conda environment.

conda activate gtdbtk_v2.5.2