1 Introduction

This guide walks the user through a typical R workflow for processing raw amplicon sequencing data, from paired-end Illumina MiSeq reads to a table of the exact amplicon sequence variants (ASVs) present in each sample. Extensive documentation and tutorials are available for the packages used in this workflow, notably DADA2 and phyloseq. For issues or further questions about specific functions, I recommend consulting the R documentation and, where available, the relevant GitHub issues pages.

This guide assumes the following:

- The user has access to one of the Lazar lab servers (Orion, Hercules, Ulysse).
- The user has active VPN access.
- The paired-end fastq files from Illumina MiSeq sequencing have been transferred to the user’s home directory on their server. See section 1.2, Download fastqs, for details.
- The user has followed the Introduction to Linux guide (available on the lab’s TEAM) and is comfortable with basic command-line operations such as:
  - listing the files in the current directory (ls);
  - moving from one directory to another (cd);
  - creating a new directory (mkdir); and
  - moving or copying files from one directory to another (mv/cp).

1.1 Setting up your environment

Keeping your files organized is a skill that has a high long-term payoff. As you are in the thick of an analysis, you may underestimate how many files/folders you have floating around. But a short time later, you may return to your files and realize your organization was not as clear as you hoped, which can ultimately lead to significantly slower research progress. Furthermore, one must keep in mind that someone unfamiliar with your project should be able to look at your computer files and understand in detail what you did and why.

While there are many ways to keep your files organized, and no “one size fits all” solution, we propose below a simple organizational scheme that is project-oriented, maintainable and follows consistent patterns for amplicon sequence analysis. Please note that the rest of this workflow assumes this organization.

~ ------------------------------------ Your home directory
└── chapter1_16S_diversity ----------- Project with a short but meaningful name, with a
    │                                  second-level, domain-specific organization (if applicable)
    ├── archaea
    ├── eukaryotes
    ├── bacteria --------------------- Third-level organization within each domain:
    │   ├── data --------------------- final datasets generated by scripts, to be used for interpretation
    │   ├── int_data ----------------- intermediate data used by other scripts (DADA2 output, rarefied data, etc.)
    │   ├── figures ------------------ figures generated by scripts
    │   ├── raw_data ----------------- raw files (fastq files generated by sequencing, raw metadata table, etc.)
    │   │   ├── BAC_sample-1_R1.fastq.gz
    │   │   ├── BAC_sample-1_R2.fastq.gz
    │   │   ├── BAC_sample-2_R1.fastq.gz
    │   │   ├── BAC_sample-2_R2.fastq.gz
    │   │   └── ...
    │   └── scripts ------------------ executables and scripts
    │       ├── bac_DADA2.rmd
    │       ├── bac_rarefy.rmd
    │       ├── bac_alpha_div.rmd
    │       ├── bac_stacked_barchart.rmd
    │       └── ...
    ├── data ------------------------- Project-specific final datasets for publication
    ├── figures ---------------------- Project-specific final figures for publication
    └── scripts ---------------------- Project-specific scripts (only if applicable, i.e. multi-domain, explanatory variables, etc.)
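The layout above can be created in one step from the terminal. The following is a minimal sketch, assuming the project is named chapter1_16S_diversity as in the example and lives in your home directory. PROJECT_ROOT is a variable introduced here for illustration only, and giving archaea and eukaryotes the same sub-folders as bacteria is likewise an assumption, so adapt both to your own project.

```shell
#!/usr/bin/env bash
# Sketch: create the project skeleton described above.
# PROJECT_ROOT is an illustrative variable (not part of the workflow itself);
# override it to match your own project name and location.
PROJECT_ROOT="${PROJECT_ROOT:-$HOME/chapter1_16S_diversity}"

# Domain-level folders, each given the same five sub-folders (an assumption:
# the guide only spells them out for bacteria).
for domain in archaea eukaryotes bacteria; do
    for sub in data int_data figures raw_data scripts; do
        mkdir -p "$PROJECT_ROOT/$domain/$sub"
    done
done

# Project-level folders for publication-ready outputs.
mkdir -p "$PROJECT_ROOT/data" "$PROJECT_ROOT/figures" "$PROJECT_ROOT/scripts"

# List the directories that now exist, as a quick check.
find "$PROJECT_ROOT" -type d | sort
```

Because mkdir -p leaves existing directories untouched, the script can be re-run safely if you later add a domain to the loop.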

Further reading about organizing files and folders:

Organizing your project by the Johns Hopkins Data Science Lab
A Quick Guide to Organizing Computational Biology Projects by William Stafford Noble, 2009
Reddit post
Organizing your data by The Max Delbrück Center

1.2 Download fastqs

How to download fastq files from Illumina BaseSpace Sequence Hub

Open the link found in the email sent by Geneviève Bourret.

If this is your first time downloading your fastqs, create a new BaseSpace Sequence Hub account using the email address to which the email from Geneviève was addressed (normally your UQAM email).
In the pop-up window informing you that CERMO-FC has shared an item with you, click ACCEPT.
Click on the PROJECTS tab in the upper section of the page.
Select your project, then click on the second round logo from the left (it looks like a blank page) and, in the drop-down menu, select DOWNLOAD then PROJECT.
If required, download the Illumina BaseSpace downloader by clicking INSTALL DOWNLOAD and following the instructions. Otherwise, simply click DOWNLOAD to begin downloading your fastqs.
Once the download is complete, the downloaded folder will contain one folder per sample, and each sample folder holds a further folder containing the forward and reverse reads. Instead of opening each folder and copying the fastqs manually, we can let the terminal do the job. From a new local terminal window, navigate to the folder that contains all the sample folders and execute the following command, after replacing /path/to/directory/where/to/move/fastqs with the actual path where you wish to copy your fastqs.

find ./ -name "*.gz" -exec cp -prv "{}" "/path/to/directory/where/to/move/fastqs" ";"
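After gathering the files, it can be useful to confirm that every forward read has its reverse mate before going further. The helper below is a hypothetical sketch, not part of the original workflow. It assumes that file names contain _R1 and _R2 (for example BAC_sample-1_R1.fastq.gz) and that all fastqs sit in a single folder.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original workflow): list every forward
# read (_R1) in a folder that has no matching reverse read (_R2).
# Assumes file names contain _R1 / _R2, e.g. BAC_sample-1_R1.fastq.gz.
check_pairs() {
    local dir="$1"
    local missing=0
    local r1 base mate
    for r1 in "$dir"/*_R1*.fastq.gz; do
        [ -e "$r1" ] || continue            # the glob matched nothing
        base="$(basename "$r1")"
        # Swap the last "_R1" in the file name for "_R2".
        mate="${base%_R1*}_R2${base##*_R1}"
        if [ ! -e "$dir/$mate" ]; then
            echo "missing mate for: $base"
            missing=$((missing + 1))
        fi
    done
    echo "unpaired forward reads: $missing"
    [ "$missing" -eq 0 ]                    # non-zero exit status if any are unpaired
}
```

Once the function is defined in your terminal session, calling check_pairs "/path/to/directory/where/to/move/fastqs" prints one line per unpaired file followed by a total.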

Finally, you can transfer your fastqs to your assigned server using any File Transfer Protocol (FTP) client (such as FileZilla or Cyberduck) or the SCP (secure copy) command-line utility.

For SCP, you can copy an entire folder by opening a new local terminal window and navigating to the directory that contains the folder with the fastqs. From that directory, execute the following command. You will then be asked for your user's password on the server.

scp -r name_of_folder_with_fastqs username@server.bio.uqam.ca:/path/to/copy/folder
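Transfers can be interrupted, so it may be worth confirming that the files on the server are identical to the local ones. One possible approach, sketched below, records a SHA-256 checksum per file before the transfer and checks it afterwards. The function names and the checksums.sha256 file name are illustrative, and the sketch assumes that GNU coreutils sha256sum is installed on both machines (on macOS, shasum -a 256 is the usual substitute).

```shell
#!/usr/bin/env bash
# Illustrative helpers (names are hypothetical, not part of the original workflow).
# Assumes GNU coreutils `sha256sum` is available.

# Run locally, before scp: record one checksum per fastq.gz file in the folder.
write_manifest() {
    ( cd "$1" && sha256sum -- *.fastq.gz > checksums.sha256 )
}

# Run on the server, after scp: succeed only if every file matches the manifest.
verify_manifest() {
    ( cd "$1" && sha256sum --check --quiet checksums.sha256 )
}
```

Since the manifest is written inside the fastq folder, scp -r copies it along with the reads. On the server, define the same function and run verify_manifest on the copied folder. It prints nothing when every file matches and reports each file that fails.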

Example workflow of amplicon sequence data analysis
