2 Pre-process

1. Unzip

Copy all your raw fastq.gz files into the folder called 02_preprocess. Move into the folder and decompress the files using pigz (a parallel implementation of gzip).

pigz -d *

2. Interleave

We are using a modified version of the script interleave.py from this GitHub gist to interleave the forward and reverse reads. Create a bash script called run_interleave.sh with the following commands and execute the script using nohup.

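#!/bin/bash
# For each forward-read file (R1), derive the reverse-read file name by
# replacing the first "R1" in the name with "R2", then interleave the pair.
for R1 in *R1*.fastq
do
  python3 /home/SCRIPT/interleave.py "$R1" "${R1/R1/R2}" > "$R1.interleave.fastq"
done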
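
Launching a script with nohup keeps it running after you log out. A minimal sketch is shown below; the helper name run_detached and the log file name run_interleave.log are assumptions chosen for illustration, not part of the original workflow. The same pattern should work for the other run_*.sh scripts in this section.

```shell
#!/bin/sh
# Start a bash script in the background so that it survives logout.
# $1 = script to run, $2 = log file that captures stdout and stderr.
# The PID of the background job is stored in DETACHED_PID.
run_detached() {
    nohup bash "$1" > "$2" 2>&1 &
    DETACHED_PID=$!
}

# Example (uncomment to use; the log file name is hypothetical):
# run_detached run_interleave.sh run_interleave.log
# echo "started with PID $DETACHED_PID"
# tail -f run_interleave.log    # follow progress, Ctrl-C to stop watching
```

Redirecting the output to a named log file, rather than relying on the default nohup.out, keeps the logs of the different steps separate.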

Output: For each sample (that is, each R1/R2 pair, not each input file), the script generates one new file ending in .interleave.fastq.
Create a new directory called interleave and move all the .interleave.fastq files into it. Move into this new directory for downstream analysis.

View the top of the new interleaved files to make sure your reads alternate between R1 and R2. Replace sample-name with the name of the sample you want to verify. For samples sequenced on the NovaSeq, replace @M with @A.

grep '^@M' sample-name.interleave.fastq | head
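
The visual check above only covers the first few reads. A sketch of a whole-file check follows; it assumes four lines per FASTQ record and Illumina CASAVA 1.8+ style headers, where both mates share the same first space-separated field. The function name check_interleave is hypothetical.

```shell
#!/bin/sh
# Verify that every pair of consecutive records in an interleaved FASTQ
# shares the same read ID (first field of the header line).
# $1 = interleaved FASTQ file with 4 lines per record.
# Returns 0 if all pairs match, 1 otherwise.
check_interleave() {
    awk '
        NR % 8 == 1 { id = $1; next }      # header of the first read in a pair
        NR % 8 == 5 && $1 != id {          # header of the second read in a pair
            if (!bad) printf "first mismatch at line %d: %s vs %s\n", NR, id, $1
            bad++
        }
        END {
            if (NR % 8 != 0) { print "incomplete final pair"; exit 1 }
            if (bad) { print bad " mismatched pairs"; exit 1 }
            print "OK: " (NR / 8) " pairs"
        }
    ' "$1"
}

# Example (uncomment to use):
# check_interleave sample-name.interleave.fastq
```

Headers that mark the mates with a /1 and /2 suffix would be reported as mismatches by this sketch, so adapt the comparison if your files use that older style.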

3. Trim

Sickle is a tool that uses sliding windows along with quality and length thresholds to determine when quality is low enough to trim the 3’-end of reads and when quality is high enough to trim the 5’-end of reads.

Create a bash script called run_sickle.sh with the following commands and execute the script using nohup.

#!/bin/bash
# Trim each interleaved file; quoting protects file names that contain spaces.
for i in *.interleave.fastq
do
  sickle pe -c "$i" -t sanger -m "$i.trim.fastq" -s "$i.singles.fastq"
done

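sickle pe (paired-end mode) -c (interleaved input file) -t sanger (quality encoding; sanger is the setting for Illumina CASAVA 1.8 and later) -m (interleaved output file of trimmed pairs) -s (output file for singles, i.e. reads that passed trimming while their mate did not)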

Output: For each input file, Sickle generates two files (one ending in .interleave.fastq.trim.fastq and the other ending in .interleave.fastq.singles.fastq).
For downstream processing we only need the .interleave.fastq.trim.fastq files, so you can move all the .singles.fastq files into a new directory called single.
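
To see how many reads survived trimming, you can compare read counts before and after Sickle. The sketch below assumes four lines per FASTQ record (no wrapped sequences) and that the untrimmed and trimmed files sit in the current directory; count_reads is a hypothetical helper name.

```shell
#!/bin/sh
# Print the number of reads in a FASTQ file (4 lines per record).
# A non-integer result points to a truncated or malformed file.
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Report counts for every interleaved file that has a trimmed counterpart.
for f in *.interleave.fastq; do
    [ -e "$f.trim.fastq" ] || continue    # skip if no trimmed file exists
    echo "$f: $(count_reads "$f") reads before, $(count_reads "$f.trim.fastq") after trimming"
done
```

A large drop for a single sample is a hint to look at that sample's FastQC report more closely.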

4. Quality check

FastQC provides a modular set of analyses that give a quick impression of whether your data has any problems you should be aware of before doing further analysis.

Create a new directory called fastqc, which is where the HTML output of FastQC will be saved. Create a bash script called run_fastqc.sh with the following commands and execute the script using nohup.

#!/bin/bash 
fastqc *.interleave.fastq.trim.fastq --outdir=fastqc

Output: For each input file, FastQC generates an HTML report. To assess the quality of the samples, transfer the HTML files to your local computer and open them in a browser.
To determine whether trimming caused any issues, you can also run FastQC on the unzipped raw files and compare the results with those of the interleaved, trimmed files.

5. Convert to FASTA

Seqtk is a fast and lightweight tool for processing sequences in the FASTA or FASTQ format.

Move all the .interleave.fastq.trim.fastq files to the folder 03_trim_interleave and move into this folder. Create a bash script called run_seqtk.sh with the following commands and execute the script using nohup.

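#!/bin/bash
# The trimmed files are uncompressed, so loop over *.trim.fastq rather than *.gz.
# -a asks seqtk for FASTA output, discarding the quality lines.
for i in *.trim.fastq
do
  seqtk seq -a "$i" > "$i.fasta"
done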
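
After the conversion, a quick sanity check is to confirm that each FASTA file holds as many records as the FASTQ file it came from. The sketch below assumes four lines per FASTQ record and the `<input>.fasta` naming used by the script above; same_record_count is a hypothetical helper name.

```shell
#!/bin/sh
# Succeed (exit 0) only if the FASTA file has one ">" header per FASTQ record.
# $1 = FASTQ input (4 lines per record), $2 = FASTA produced from it.
same_record_count() {
    n_fq=$(awk 'END { print NR / 4 }' "$1")
    n_fa=$(grep -c '^>' "$2" || true)    # grep -c prints 0 when nothing matches
    [ "$n_fq" = "$n_fa" ]
}

# Check every trimmed FASTQ that already has a FASTA next to it.
for f in *.trim.fastq; do
    [ -e "$f.fasta" ] || continue
    if same_record_count "$f" "$f.fasta"; then
        echo "$f: OK"
    else
        echo "$f: record count differs"
    fi
done
```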