Exercise 1. From Fastq data files to Read Count Matrix Part 1

Exercise 1. From Fastq data files to Read Count Matrix

Part 1. Prepare the working directory.

1. Find out the name of the computer that has been reserved for you

(https://cbsu.tc.cornell.edu/ww/machines.aspx?i=123 ).

2. Connect to the computer using Putty (Windows) or Terminal (Mac).

3. From the command line, create a working directory and copy all data files required for

this exercise to the working directory. (Replacing “<my_user_ID>” in the commands

with your actual BioHPC user ID).

Part 2. Examine the quality of the fastq data files

1. Run fastqc on the fastq file

2. The fastqc software would create a new file called “ERR458493_fastqc.html”. You can use

FileZilla to download the file to your laptop, and double click the file to check the results.

• The data used for this exercise are from this article: Schurch et al. (2016) RNA 22(6):839 PMID: PMID: 27022035

Part 3. Run read mapping software

We are going to use STAR to map the sequencing reads in the fastq files to the reference

genome. STAR is a fast alignment software, but it requires a computer with large memory

(30GB for the 3GB human genome).

1. Inspect the files in the working directory (/workdir/my_user_ID. If you are not in

working directory already, type “cd /workdir/usre_user_ID” first )

mkdir /workdir/<my_us er_ID>/

cd /workdir/<my_user_ID>/

cp /shared_data/RNAseq/exercise1/* ./

ls -la

fastqc ERR458493.fastq.gz

ls -la

Description of the files in the directory.

R64.fa

Reference genome sequence file, in fasta format.

R64.gtf

Genome annotation file, in gtf format.

ERR458493.fastq.gz

RNA-seq data file, wt_sample1

ERR458494.fastq.gz

RNA-seq data file, wt_sample2

ERR458495.fastq.gz

RNA-seq data file, wt_sample3

ERR458500.fastq.gz

RNA-seq data file, mu_sample1

ERR458501.fastq.gz

RNA-seq data file, mu_sample2

ERR458502.fastq.gz

RNA-seq data file, mu_sample3

If you are interested in finding out what are in the files, or number of reads in the fastq

files, use the following commands to examine the files.

• When inspecting files with “less” command, press “space” key to move on to the

next page, or press “q” key to exit.

• “wc -l” is the command to count the number of lines in a file. The command “gunzip

-c ERR458493.fastq.gz | wc -l” would tell you the number of lines in the file. As

every sequence read takes up 4 lines in the fastq file, the line number divided by 4

gives you the number of sequencing reads in the file.

2. Map the reads to reference genome using STAR.

On the BioHPC computers, STAR is installed in the directory “/programs/STAR”. The “export

PATH=/programs/STAR:$PATH” command would put STAR in your current path. Now you can

run the software by simply typing the command “STAR”.

gunzip -c ERR458493.fastq.gz | head

gunzip -c ERR458493.fastq.gz | wc -l

less R64.fa

less R64.gtf

export PATH=/programs/STAR:$PATH

Then index the reference genome with STAR:

The parameters:

--runMode genomeGenerate: Set runMode to “genomeGenerate” to index the genome;

--runThreadN: Number of CPU cores;

--genomeDir: Output directory for indexed genome database;

--genomeFastaFiles: Reference genome file

--sjdbGTFfile: Genome annotation file and it should be GTF format.

--sjdbOverhang: Use the value (reads_length -1), the read length is 51 for this exercise.

In the next step, we will align sequencing reads to the indexed genome.

--quantMode GeneCounts: Output a file with read counts per gene;

--genomeDir: Reference genome index directory;

--runThreadN: Number of CPU cores;

--readFilesIn: Sequence data file;

--readFilesCommand zcat: Input file is a decompressed .gz file;

--outFileNamePrefix: Prefix of the output file names;

mkdir genome

STAR --runMode genomeGenerate --runThreadN 2 --genomeDir genome \

--genomeFastaFiles R64.fa --sjdbGTFfile R64.gtf \

--sjdbOverhang 50

STAR --quantMode GeneCounts --genomeDir genome --runThreadN 2 \

--readFilesIn ERR458493.fastq.gz --readFilesCommand zcat \

--outFileNamePrefix wt1_ --outFilterMultimapNmax 1 --outFilterMismatchNmax 2 \

--outSAMtype BAM SortedByCoordinate

--outFilterMismatchNmax 2 : Only report alignment with up to 2 mismatches per read;

--outSAMtype BAM SortedByCoordinate: Output sorted bam files.

After running STAR software, many new files will be produced. The files you need to keep are:

1) wt1_Aligned.sortedByCoord.out.bam: BAM file with the alignment results;

2) wt1_Log.final.out: A report file that shows percentage of read can be mapped;

3) wt1_ReadsPerGene.out.tab: a tab delimited text file with read count per gene.

Inspect the files:

In the file wt1_ReadsPerGene.out.tab, there are three numbers for each gene. The four

columns are,

• column 1: gene ID

• column 2: counts for unstranded RNA-seq

• column 3: counts for reads aligned with plus strand of RNA

• column 4: counts for reads aligned with minus strand of RNA

Use column 2 if you use unstranded RNA-seq library prep kit. Use column 4 if you use stranded

RNA-seq. Use column 3 if you do 3’ RNA-seq. For this exercise, we will use column 2

(unstranded).

Part 4. Visualize the BAM file with IGV

1. Index the bam files

We are going to use the IGV software to visualize the BAM files. For IGV to read the

BAM files, the “.bam” files need to be indexed. We will use the samtools software:

After this step, you will see a “.bai” file created for each “.bam” file.

samtools index wt1_Aligned.sortedByCoord.out.bam

less

wt1_Log.final.out

less

wt1_ReadsPerGene.out.tab

2. Using FILEZILLA to download the “*.bam” ,“*.bai”, “R64.fa” and “R64.gtf” files to your

laptop computer.

3. IGV is a JAVA software that can be run on Windows, MAC or a Linux computer. To

launch IGV on your laptop, go to IGV web site

(https://software.broadinstitute.org/software/igv/ ), click “Download”, and download the

Windows or Mac version for your laptop. Double click the IGV installation tool to install

IGV. On Windows computer, the software is installed in the directory C:\Program

Files\IGV_2.6.3. Double click “igv.bat” to start IGV. After double clicking, it might take a

few seconds before you see the software starting.

4. Most commonly used genomes are already loaded in IGV. In this exercise, we will create

our own genome database. Click “Genomes”->”Create .genome” file. Fill out the

following fields:

Unique identifier: R64

Descript name: R64

Fasta: use the “Browse” button to find the R64.fa file

Gene file: use the “Browse” button to find the R64.gtf file

Then save the genome database on your computer.

5. From menu “File” -> “Load file”, open the

“wt1_Aligned.sortedByCoord.out.bam”.

Inspect the following regions by enter the text in the box next to “Go” and click “Go”.

II:265,593-282,726

Part 5. Run the commands in a shell script

In a typical RNA-seq experiment, you have many samples and it could take many hours to finish

the alignments. There are two things you can do to make computing faster.

1. Create a batch command ("a shell script") to process all files;

2. Use the “Shared Memory” feature of STAR. (We do not use it in workshop, I will explain

it at the end of this note.)

In order to do this, you can use a text editor to make a text file with the following lines. We

recommend Mac users to use “BBEdit” (The free version is fine).

( https://www.barebones.com/products/bbedit/), Windows users can use “Notepad++”

( http://notepad-plus-plus.org/ ). You can give the script a name, normally with the extension

“sh”, e.g.“runSTAR.sh”. If the file is made on a Windows computer, you need to make sure to

save the file as a LINUX style text file. From NotePad++, used the "Edit -> EOL Conversion ->

UNIX" option. If you are not sure about this, after uploading the script to Linux, run command

“dos2unit runSTAR.sh” to convert to LINUX text file.

You can use FileZilla(win & mac) to upload the file to your home directory. To make things

easier, both software include a function to directly save edited file to a remote LINUX machine.

Here are the lines in your shell script.

You can also use the shell script that we have prepared for you. It is located in the data

directory with the file name “runSTAR.sh”;

•

In these commands, I set --runThreadN to 2. You would want to increase the number

in real work.

•

You might want to run multiple jobs in parallel. Read the instructions at

https://biohpc.cornell.edu/lab/doc/using_BioHPC_CPUs.pdf for using BioHPC computer

efficiently, or get help form our office hours.

To run the shell script, start “screen”, and in a screen session run these commands: