DataCamp
R Tutorial: RNA-Seq Workflow
---
Now that you know a bit about the types of questions that RNA-Seq experiments can address, and how we use this technique to understand more about the genes important to a particular disease or condition, let's explore the steps required for the analysis workflow.
Prior to starting the RNA-Seq workflow, planning is essential. This step in the analysis is crucial for good results, as there is often no saving a poorly designed experiment.
There are several important considerations during planning, including replicates, batch effects, and confounding:
For RNA-Seq experiments there is generally low technical variation, so invest in biological replicates instead.
The more biological replicates you have, the better the estimates are for mean expression and variation, leading to more robust analyses; be sure to have at least 3.
Also, an experiment performed in different batches can confound your analysis. As much as possible, try to perform experimental steps across all conditions at the same time, and if you cannot avoid batches, distribute the samples from each sample group across every batch.
Finally, avoid confounding your experiment with major sources of variation. For example, if your animals are of different sexes, don't have all-male mice as control and all-female mice as treatment, as you won't be able to differentiate the treatment effect from the effect of sex.
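One quick way to check a planned design for these problems is to cross-tabulate the condition against each potential source of variation. The following is a minimal sketch in base R; the sample sheet, its column names, and the `is_balanced` helper are hypothetical and only meant to illustrate the idea.

```r
# Hypothetical sample sheet: 8 mice, two conditions, both sexes, two batches.
samples <- data.frame(
  sample    = paste0("mouse", 1:8),
  condition = rep(c("control", "treatment"), each = 4),
  sex       = rep(c("male", "female"), times = 4),
  batch     = rep(c("batch1", "batch1", "batch2", "batch2"), times = 2),
  stringsAsFactors = FALSE
)

# Number of biological replicates per condition (aim for at least 3).
replicates_per_group <- table(samples$condition)
print(replicates_per_group)

# Cross-tabulate condition against sex and against batch.
sex_by_condition   <- table(samples$condition, samples$sex)
batch_by_condition <- table(samples$condition, samples$batch)
print(sex_by_condition)
print(batch_by_condition)

# Hypothetical helper: a zero cell means some level (a sex, a batch)
# is missing from one condition, so its effect cannot be separated
# from the treatment effect.
is_balanced <- function(tab) all(tab > 0)

# A confounded design for comparison: all controls male, all treated female.
confounded <- table(
  c("control", "control", "treatment", "treatment"),
  c("male", "male", "female", "female")
)

is_balanced(sex_by_condition)
is_balanced(batch_by_condition)
is_balanced(confounded)
```

In the first design every sex and every batch appears in both conditions, whereas the confounded table has empty cells, which is exactly the all-male control versus all-female treatment situation described above.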
Once you have a well-planned experiment, you can begin with sample preparation.
When preparing RNA-Seq libraries, the samples are harvested, the RNA is isolated, and DNA contamination is removed. Then, either the rRNA is removed or mature mRNAs are selected by their polyA tails.
Next, the RNA is turned into cDNA, fragmented, and size selected, and adapters are added to generate the RNA-Seq libraries to be sequenced.
The sequencing generates millions of nucleotide sequences called reads. The reads correspond to ends of the fragments sequenced. The sequence of each read is output into FASTQ files.
After acquiring the FASTQ files, we can start with the computational analysis. The first step in the analysis is to assess the quality of the raw data.
At this step, we check that nothing went wrong at the sequencing facility and explore the data for contamination, such as vector, adapter, or ribosomal sequences.
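To make the raw data a little more concrete: a FASTQ file stores each read as four lines (an identifier starting with `@`, the sequence, a `+` separator, and a quality string). The sketch below writes a tiny made-up FASTQ file and summarizes per-read quality in base R. It assumes the common Phred+33 quality encoding, and the read names and sequences are invented for illustration. Real files are far too large for this approach, which is why dedicated quality-control tools are used in practice.

```r
# Hypothetical two-read FASTQ file, written to a temporary location.
fastq_lines <- c(
  "@read1", "ACGTACGTAC", "+", "IIIIIIIIII",
  "@read2", "TTGCATGCAA", "+", "IIIIHHHHGG"
)
fastq_file <- tempfile(fileext = ".fastq")
writeLines(fastq_lines, fastq_file)

# Each record occupies four consecutive lines.
lines <- readLines(fastq_file)
record_start <- seq(1, length(lines), by = 4)

reads <- data.frame(
  id       = sub("^@", "", lines[record_start]),
  sequence = lines[record_start + 1],
  quality  = lines[record_start + 3],
  stringsAsFactors = FALSE
)
reads$length <- nchar(reads$sequence)

# Assumption: Phred+33 encoding, so quality score = ASCII code - 33.
reads$mean_quality <- vapply(
  reads$quality,
  function(q) mean(utf8ToInt(q) - 33),
  numeric(1),
  USE.NAMES = FALSE
)

print(reads[, c("id", "length", "mean_quality")])
```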
The next step is alignment or mapping to the genome to determine the location on the genome where the reads originated.
Since mRNA contains only the exons needed to create the proteins, when reads derived from mRNA are aligned to a genome that contains introns, some of the reads will be split across introns. Therefore, tools for aligning RNA-Seq reads to the genome need to be able to align across introns.
The output of alignment gives the genome coordinates where each read most likely originated, along with information about the quality of the mapping.
Following alignment, the reads aligning to the exons of each gene are quantified to yield a matrix of gene counts.
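As a toy illustration of what this quantification step does, the sketch below counts reads overlapping the exons of each gene in base R. The exon coordinates, read coordinates, and the `assign_gene` helper are all hypothetical, and the example ignores chromosomes, strand, and split reads. In practice this step is done by dedicated command-line tools, not by code like this.

```r
# Hypothetical exon annotation on a single chromosome: one row per exon.
exons <- data.frame(
  gene  = c("geneA", "geneA", "geneB"),
  start = c(100, 300, 600),
  end   = c(200, 400, 800),
  stringsAsFactors = FALSE
)

# Hypothetical aligned reads: sample of origin and genome coordinates.
reads <- data.frame(
  sample = c("s1", "s1", "s1", "s2", "s2", "s2"),
  start  = c(110, 350, 650, 120, 700, 900),
  end    = c(160, 400, 700, 170, 750, 950),
  stringsAsFactors = FALSE
)

# Assign a read to a gene if it overlaps exons of exactly one gene;
# reads overlapping no gene (or several genes) are left unassigned.
assign_gene <- function(read_start, read_end) {
  hit <- unique(exons$gene[read_start <= exons$end & read_end >= exons$start])
  if (length(hit) == 1) hit else NA_character_
}
reads$gene <- mapply(assign_gene, reads$start, reads$end)

# Tabulate assigned reads into a gene-by-sample count matrix
# (unassigned reads are dropped by table()).
counts <- table(
  gene   = factor(reads$gene, levels = unique(exons$gene)),
  sample = reads$sample
)
print(counts)
```

The result has genes as rows and samples as columns, which is the shape of the count matrix used in the rest of the analysis.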
For the entire process up until this point in the workflow, we use command-line tools, which can handle the large sequencing files and computational demands. Therefore, we will not perform these steps in the current course.
This course will focus on the identification of differentially expressed genes using these count matrices as input. The analysis will be performed in R using predominantly Bioconductor packages.
We can read the count matrix into R using the read.csv() function, specifying the path to the file.
The gene count matrix is arranged with the samples as columns and gene IDs as rows. The count values represent the number of reads or fragments aligning to the exons of each gene.
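Reading the counts might look like the sketch below. So that it runs on its own, it first writes a small made-up counts file to a temporary location; the file contents, gene IDs, and sample names are hypothetical. The `row.names = 1` argument tells `read.csv()` to use the first column (the gene IDs) as row names, which assumes the file is laid out that way.

```r
# Hypothetical counts file: gene IDs in the first column, one column per sample.
counts_file <- tempfile(fileext = ".csv")
writeLines(c(
  "gene_id,control_1,control_2,treated_1,treated_2",
  "gene1,10,12,30,28",
  "gene2,200,180,190,210",
  "gene3,0,1,0,2"
), counts_file)

# Read the count matrix, using the gene IDs as row names.
raw_counts <- read.csv(counts_file, row.names = 1)

# Samples are columns, genes are rows.
dim(raw_counts)
head(raw_counts)
```

With your own data, you would replace `counts_file` with the path to your counts file.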
Once we have count data, differential expression analysis is performed by comparing the expression of each gene between the specified conditions.
The output of the statistical analysis includes the log2 fold changes of expression between conditions and the adjusted p-values for each gene. Genes that reach a threshold for significance can be subset to define a list of significant differentially expressed genes.
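The subsetting step can be sketched in base R as follows. The results table, its column names, the p-values, and the 0.05 cutoff are all made up for illustration, and the Benjamini-Hochberg adjustment via `p.adjust()` is just one common choice of multiple-testing correction, not necessarily the one a given package applies.

```r
# Hypothetical per-gene results: log2 fold changes and raw p-values.
res <- data.frame(
  gene             = paste0("gene", 1:5),
  log2_fold_change = c(2.1, -1.8, 0.2, 0.05, 1.2),
  pvalue           = c(0.0001, 0.002, 0.4, 0.9, 0.04),
  stringsAsFactors = FALSE
)

# A log2 fold change of 1 corresponds to a doubling of expression,
# and -1 to a halving.
res$fold_change <- 2^res$log2_fold_change

# Adjust the p-values for multiple testing (Benjamini-Hochberg here).
res$padj <- p.adjust(res$pvalue, method = "BH")

# Keep the genes that pass the (assumed) significance threshold.
padj_cutoff <- 0.05
sig_genes <- res[res$padj < padj_cutoff, ]
print(sig_genes)
```

Note that the last gene has a raw p-value below 0.05 but does not survive the adjustment, which is why the threshold is applied to the adjusted p-values.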
Now that we have a general understanding of the workflow and have the counts file loaded, we can get started.
Now let's explore the workflow.