2 Pre-process
1. Unzip
Copy all your raw fastq.gz files into the folder called 02_preprocess. Move into the folder and unzip the files using Parallel Implementation of GZip (pigz).
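A minimal sketch of this step, assuming the raw files end in .fastq.gz and that you are already inside 02_preprocess. The helper name decompress_fastq and the gunzip fallback are illustrative additions, not part of the original workflow.

```shell
#!/bin/bash
# Decompress every gzipped FASTQ file in a directory (default: current directory).
# Like gunzip, pigz -d replaces each .gz file with its uncompressed version.
decompress_fastq() {
    dir="${1:-.}"
    for f in "$dir"/*.fastq.gz; do
        [ -e "$f" ] || continue              # skip if the glob matched nothing
        if command -v pigz >/dev/null 2>&1; then
            pigz -d "$f"
        else
            gunzip "$f"                      # assumed fallback when pigz is not installed
        fi
    done
}

decompress_fastq .
```

If you want to keep the compressed originals, both pigz and gunzip accept -k alongside -d.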
2. Interleave
We are using a modified version of the script interleave.py from this GitHub gist to interleave the forward and reverse reads. Create a bash script called run_interleave.sh with the following commands and execute the script using nohup.
#!/bin/bash
# Interleave each R1 file with its matching R2 file
for R1 in *R1*.fastq
do
    python3 /home/SCRIPT/interleave.py "$R1" "${R1/R1/R2}" > "$R1.interleave.fastq"
done
- Output: For each sample given as input (not each file), the script generates a new file ending in .interleave.fastq.
- Create a new directory called interleave and move all the .interleave.fastq files into this new directory. Move into this new directory for downstream analysis.
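Before launching the script, it can help to confirm that every R1 file has a matching R2 file. The sketch below mirrors the name substitution used in run_interleave.sh, which replaces only the first occurrence of R1 in the file name. The helper name mate_of is hypothetical, and sed is used here only as a portable stand-in for the bash expansion.

```shell
#!/bin/bash
# Name of the R2 mate for a given R1 file.
# Same effect as "${R1/R1/R2}" in run_interleave.sh: only the FIRST "R1" is replaced.
mate_of() {
    printf '%s\n' "$1" | sed 's/R1/R2/'
}

# Pre-flight check: warn about any R1 file whose mate is missing.
for R1 in *R1*.fastq; do
    [ -e "$R1" ] || continue                 # skip if the glob matched nothing
    mate=$(mate_of "$R1")
    [ -e "$mate" ] || echo "missing mate for $R1 (expected $mate)" >&2
done
```

Because only the first R1 is replaced, a sample whose name itself contains R1 (for example SR1_R1.fastq) would be paired with the wrong file name. Once the check passes, the script can be started in the background with something like nohup bash run_interleave.sh > run_interleave.log 2>&1 &, where the log file name is an arbitrary choice.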
View the top of the new interleaved files to make sure your reads alternate between R1 and R2. Replace sample-name with the name of the sample you want to verify. For samples sequenced on the NovaSeq, replace @M with @A.
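One way to run this check, sketched under the assumption that each FASTQ record spans exactly four lines (no wrapped sequences). The helper name show_headers is hypothetical. It prints only the header lines, so consecutive lines should belong to the same read pair, first R1 and then R2.

```shell
#!/bin/bash
# Print the header line of the first N records (default 4) of a FASTQ file.
# Header lines are lines 1, 5, 9, ... when every record has exactly 4 lines.
show_headers() {
    awk -v n="${2:-4}" 'NR % 4 == 1 { print; if (++c >= n) exit }' "$1"
}
```

For example, show_headers sample-name.interleave.fastq prints the first four headers. A quicker but less strict alternative is grep -m 4 '^@M' sample-name.interleave.fastq, which may also match quality lines that happen to begin with @M.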
3. Trim
Sickle is a tool that uses sliding windows along with quality and length thresholds to determine when quality is low enough to trim the 3’-end of reads and when it is high enough to trim the 5’-end of reads.
Create a bash script called run_sickle.sh with the following commands and execute the script using nohup.
#!/bin/bash
# Trim each interleaved file with Sickle in paired-end mode
for i in *.interleave.fastq
do
    sickle pe -c "$i" -t sanger -m "$i.trim.fastq" -s "$i.singles.fastq"
done
Options: sickle pe (paired end) -c (interleaved input file) -t sanger (quality type, from Illumina) -m (output file name) -s (excluded reads file name)
- Output: For each given file, Sickle generates two files, one ending in .interleave.fastq.trim.fastq and the other ending in .interleave.fastq.singles.fastq.
- For downstream processing we only need the .interleave.fastq.trim.fastq files, so you can move all the .singles.fastq files into a new directory called single.
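The clean-up described above can be sketched as follows. The read count at the end is an optional addition for a quick sanity check and is not part of the original workflow. It assumes four lines per FASTQ record, and the helper name count_reads is hypothetical.

```shell
#!/bin/bash
# Move the singles files out of the way (does nothing if none are present).
mkdir -p single
for f in *.singles.fastq; do
    [ -e "$f" ] || continue
    mv "$f" single/
done

# Count reads in a FASTQ file, assuming 4 lines per record.
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Report how many reads remain in each trimmed file.
for f in *.trim.fastq; do
    [ -e "$f" ] || continue
    echo "$f: $(count_reads "$f") reads"
done
```

Comparing these counts with the counts of the untrimmed .interleave.fastq files gives a rough idea of how many reads Sickle discarded.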
4. Quality check
FastQC provides a modular set of analyses which you can use to get a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.
Create a new directory called fastqc, which is where the HTML output of FastQC will be saved. Create a bash script called run_fastqc.sh with the following commands and execute the script using nohup.
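A minimal sketch of what run_fastqc.sh could contain, assuming the trimmed files are in the current directory and the fastqc executable is on your PATH. The -o option sets the output directory, which FastQC expects to exist already, so the script creates it if needed.

```shell
#!/bin/bash
# run_fastqc.sh (sketch): run FastQC on every trimmed file and
# write the reports into the fastqc directory.
mkdir -p fastqc
for f in *.trim.fastq; do
    [ -e "$f" ] || continue                  # nothing to do if no trimmed files are present
    fastqc -o fastqc "$f"
done
```

It can be launched with something like nohup bash run_fastqc.sh > run_fastqc.log 2>&1 &, where the log file name is an arbitrary choice.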
- Output: For each given file, FastQC generates an HTML file. The quality of the samples can be assessed by transferring the HTML files to your local computer.
- To determine whether trimming caused any issues, you can also run FastQC on the unzipped raw files and compare the results with those of the interleaved, trimmed files.
5. Convert to FASTA
Seqtk is a fast and lightweight tool for processing sequences in the FASTA or FASTQ format.
Move all the .interleave.fastq.trim.fastq files to the folder 03_trim_interleave and move into this folder. Create a bash script called run_seqtk.sh with the following commands and execute the script using nohup.
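A minimal sketch of what run_seqtk.sh could contain, assuming seqtk is on your PATH. seqtk seq -a writes its input as FASTA. The output naming (replacing the final .fastq with .fasta) and the helper name fasta_name are assumptions made for illustration.

```shell
#!/bin/bash
# Output name for a trimmed FASTQ file: swap the trailing .fastq for .fasta.
fasta_name() {
    printf '%s\n' "${1%.fastq}.fasta"
}

# run_seqtk.sh (sketch): convert every trimmed, interleaved FASTQ file to FASTA.
for f in *.interleave.fastq.trim.fastq; do
    [ -e "$f" ] || continue                  # skip if the glob matched nothing
    seqtk seq -a "$f" > "$(fasta_name "$f")"
done
```

With this naming, sample.interleave.fastq.trim.fastq becomes sample.interleave.fastq.trim.fasta, so the FASTQ and FASTA versions sit side by side.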