Training @ Lu Lab
Lu Lab Docs
  • Home
    • Training @ Lu Lab
  • Drylab Training
    • Genomics
      • RNA Types in Genome
  • Wetlab Training
    • Wetlab Safety Guide
    • Wetlab FAQ
  • Archive
    • Archive 2021
      • cfDNA Methylation
      • Genomic Annotation
    • Archive 2019 - Wetlab Training
      • Class I. Basics
        • 1. Wet Lab Safety
        • 2. Wet Lab Regulation
        • 3. Wet Lab Protocols
        • 4. How to design sample cohort
        • 5. How to collect and manage samples
        • 6. How to purify RNA from blood
        • 7. How to check the quantity and quality of RNA
        • 8. RNA storage
        • 9. How to remove DNA contanimation
        • 10. What is Spike-in
      • Class II. NGS - I
        • 1. How to do RNA-seq
        • 2. How to check the quantity and quality of RNA-seq library
        • 3. What is SMART-seq2 and Multiplex
    • Archive 2019 - Drylab Training
      • Getting Startted
      • Part I. Programming Skills
        • Introduction of PART I
        • 1.Setup
        • 2.Linux
        • 3.Bash and Github
        • 4.R
        • 5.Python
        • 6.Perl
        • Conclusion of PART I
      • Part II. Machine Learning Skills
        • 1.Machine Learning
        • 2.Feature Selection
        • 3.Machine Learning Practice
        • 4.Deep Learning
      • Part III. Case studies
        • Case Study 1. exRNA-seq
          • 1.1 Mapping, Annotation and QC
          • 1.2 Expression Matrix
          • 1.3.Differential Expression
          • 1.4 Normalization Issues
        • Case Study 2. exSEEK
          • 2.1 Plot Utilities
          • 2.2 Matrix Processing
          • 2.3 Feature Selection
        • Case Study 3. DeepSHAPE
          • 3.1 Background
          • 3.2 Resources
          • 3.3 Literature
      • Part IV. Appendix
        • Appendix I. Keep Learning
        • Appendix II. Public Data
        • Appendix III. Mapping Protocol of RNA-seq
        • Appendix IV. Useful tools for bioinformatics
      • Part V. Software
        • I. Docker Manual
        • II. Local Gitbook Builder
        • III. Teaching Materials
  • Archive 2018
Powered by GitBook
On this page
  • Pipeline
  • Data Structure
  • Inputs
  • Outputs
  • Running Scripts
  • Software/Tools
  • Example of single case
  • Example of batch job
  • Draw Plots
  • 1. Heatmap for DESeq2 normalized count matrix
  • 2. PCA analysis
  • 3. MA plot
  • 4. Distance between samples
  • 5. Hierarchical clustering for differential expressed genes
  • Tips/Utilities
  • Homework and more
  • Video
  • 1. Differential expression
Edit on GitHub
  1. Archive
  2. Archive 2019 - Drylab Training
  3. Part III. Case studies
  4. Case Study 1. exRNA-seq

1.3.Differential Expression

Differential Expression for bulk RNAs

Last updated 3 years ago

Pipeline

Data Structure

Inputs

File format

Information contained in file

File description

Notes

txt

miRNA.homer.ct.mx

Raw counts matrix

Used for edgeR and DESeq2 analysis

txt

miRNA.homer.rpm.mx

RPM matrix

Used for wilcoxon test

txt

miRNA.NCvsHCC.DESeq2.ids.rpkm.mx

RPKM matrix

Used for clustering plot

txt

design.txt

Sample info.

Classification information

Outputs

File format

Information contained in file

File description

Notes

tsv

miRNA.NCvsHCC.edgeR.tsv

EdgeR result

LogFC, PValue and FDR

txt

miRNA.homer.DESeq2.rlog.mx

DESeq2 result

Rlog normalized counts

tsv

miRNA.NCvsHCC.DESeq2.tsv

DESeq2 result

LogFC, PValue and padj

tsv

miRNA.NCvsHCC.wilcox.tsv

Wilcox result

LogF, PVaule and FDR

pdf

miRNA.NCvsHCC.DESeq2.norm.pdf

Heatmap result

Rlog normalized count

pdf

miRNA.NCvsHCC.DESeq2.pca.pdf

Scatter plot

PCA analysis

pdf

miRNA.NCvsHCC.DESeq2.MAplot.pdf

Scatter plot

MAplot analysis

pdf

miRNA.NCvsHCC.DESeq2.Dist.pdf

Heatmap result

Sample distance

pdf

miRNA.NCvsHCC.DESeq2.ids.DE.pdf

Heatmap result

DE genes

Running Scripts

Software/Tools

Assumption for most normalization and differential expression analysis tools: The expression levels of most genes are similar, i.e., not differentially expressed.

a) DEseq: defines scaling factor (also known as size factor) estimates based on a pseudoreferencesample, which is built with the geometric mean of gene counts across all cells (samples).

c) Wilcox Test using RPM: Read counts Per Million of total mapped reads; alternatives: RPKM, TPM

Performance:

Example of single case

# experimential design
design <- read.table("design.txt",sep="\t",header=T)

# expression matrix
mx <- read.table("miRNA.homer.ct.mx",sep="\t",header=T)

# filter genes
filter_genes <- apply(
    mx[,2:ncol(mx)],
    1,
    function(x) length(x[x > 2]) >= 2
)
mx_filterGenes <- mx[filter_genes,]
library(edgeR)

#Read Data in
countData <- mx_filterGenes[,-1]
rownames(countData) <- mx_filterGenes[,1]
design <- read.table("design.txt",sep="\t",header=T)
colData <- design

# generate DGE object
y <- DGEList(countData, samples=colData, group=colData$Treatment)
y <- calcNormFactors(y)

#Estimate Error Model
design <-model.matrix(~Treatment, data=colData)
y <- estimateDisp(y, design)

# classic methods: compute p-values, then output
et <- exactTest(y)
res <- topTags(et,Inf)
tidyResult <- data.frame(Gene=rownames(res$table), res$table)
#write.table(tidyResult,file="hcc_example.miRNA.NCvsHCC.edgeR.classic.tsv", quote=F, sep="\t",row.names=FALSE)

# Generalized linear models 
fit <- glmFit(y,design)
# likelihood ratio test
lrt <- glmLRT(fit,contrast = c(1,-1))
FDR <- p.adjust(lrt$table$PValue, method="BH")
padj_lrt <- cbind(lrt$table,FDR)
fit_df <- lrt$fitted.values
#write.table(fit_df,file = "hcc_example.miRNA.homer.edgeR.TMM.mx",row.names = T, sep="\t", quote=F)
merged_lrt <- merge(fit_df,padj_lrt,by="row.names")
colnames(merged_lrt)[1] <- "Genes"
write.table(merged_lrt,file = "miRNA.NCvsHCC.edgeR.tsv",row.names = F, sep="\t", quote=F)
library(DESeq2)
#Read Data in
countData <- mx_filterGenes
colData <- design

# generate Dataset object 
dds <- DESeqDataSetFromMatrix(countData, colData, design=~Treatment, tidy=TRUE)

# normlize using rlog method
norm <- rlog(dds,blind=FALSE)
norm_matrix <- assay(norm)
norm_df <- data.frame(Gene=rownames(norm_matrix), norm_matrix)
write.table(norm_df, "miRNA.homer.DESeq2.rlog.mx", quote=F, row.names = FALSE,sep="\t")

deg <- DESeq(dds)
res <- results(deg,tidy=TRUE)
merged_res <- merge(norm_df,res,by.x="Gene",by.y="row")
write.table(merged_res,file="miRNA.NCvsHCC.DESeq2.tsv",quote=F, sep="\t",row.names=FALSE)
cpmMx <- read.table("miRNA.homer.rpm.mx",sep="\t",header=T)
filter_cpm <- apply(
    mx[,2:ncol(cpmMx)],
    1,
    function(x) length(x[x > 0]) >= 2
)
mx_filterCPM <- cpmMx[filter_cpm,]

myFun <- function(x){
  x = as.numeric(x)
  v1 = x[2:4]
  v2 = x[5:10]
  out <- wilcox.test(v1,v2)
  out <- out$p.value
}
p_value <- apply(mx_filterCPM,1,myFun)
p_value[is.nan(p_value)] <- 1
FDR <- p.adjust(p_value,method = "BH")
mx_filterCPM$avgNC <- apply(mx_filterCPM[,2:4],1,mean)
mx_filterCPM$avgHCC <- apply(mx_filterCPM[,5:10],1,mean)
mx_filterCPM$log2fc <- log2((mx_filterCPM$avgNC+0.25)/(mx_filterCPM$avgHCC+0.25))
results <- cbind(mx_filterCPM,p_value,FDR)
write.table(results,file = "miRNA.NCvsHCC.wilcox.tsv",row.names = F, sep="\t", quote=F)
# 1.heatmap for DESeq2 rlog normalized count matrix
pdf("miRNA.NCvsHCC.DESeq2.norm.pdf")
library("pheatmap")
select <- order(rowMeans(counts(deg,normalized=TRUE)), decreasing=TRUE)[1:25]
df <- as.data.frame(colData(deg)[,c("Treatment","Sample")])
pheatmap(assay(norm)[select,], cluster_rows=FALSE, show_rownames=FALSE, cluster_cols=TRUE, annotation_col=df)
dev.off()

# 2.PCA analysis
pdf("miRNA.NCvsHCC.DESeq2.pca.pdf")
plotPCA(norm, intgroup=c("Treatment"))
dev.off()

# 3.MA plot
pdf("miRNA.NCvsHCC.DESeq2.MAplot.pdf")
plotMA(deg, ylim=c(-5,5))
dev.off()

# 4.distance between samples
pdf("miRNA.NCvsHCC.DESeq2.Dist.pdf")
sampleDists <- dist(t(assay(norm)))
library("RColorBrewer")
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(norm$Treatment, norm$Sample, sep="-")
colnames(sampleDistMatrix) <- paste(norm$Treatment, norm$Sample, sep="-")
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows=sampleDists,
         clustering_distance_cols=sampleDists,
         col=colors)
dev.off()


# 5.Hierarchical clustering for differential expressed genes
awk 'BEGIN{FS=OFS="\t"}($12>1 && $15<=0.05){print $1}' hcc_example.miRNA.NCvsHCC.DESeq2.tsv > hcc_example.miRNA.NCvsHCC.DESeq2.NC_high.ids
awk 'BEGIN{FS=OFS="\t"}($12<-1 && $15<=0.05){print $1}' hcc_example.miRNA.NCvsHCC.DESeq2.tsv > hcc_example.miRNA.NCvsHCC.DESeq2.HCC_high.ids

mx <- read.table("miRNA.NCvsHCC.DESeq2.ids.DE.mx",sep="\t",head=T)
# log2 transfer the RPKM value
log2mx <- log2(mx+0.05)
pdf("miRNA.NCvsHCC.DESeq2.ids.rpkm.heatmap.pdf",width=4,height=6)
pheatmap(as.matrix(log2mx), Rowv=T, Colv=T, dendrogram = "both", trace = "none", density.info = c("none"), scale="row")
dev.off()

Draw Plots

1. Heatmap for DESeq2 normalized count matrix

2. PCA analysis

3. MA plot

4. Distance between samples

5. Hierarchical clustering for differential expressed genes

Tips/Utilities

Homework and more

  1. Identify differential expressed genes for other RNA types. between differential conditions, i.e. Normal Control (NC) V.S. HCC using three methods: edgeR, DESeq2 and Wilcox/Mann-Whitney-U Test.

  2. Draw Venn plot to show the difference among the above three methods.

More Reading and Practice

Video

1. Differential expression

b) EdgeR (): trimmed mean of M values

: 3. Differential Expression Analysis

TMM
Example of batch job
edgeR User Guide
DESeq2 User Guide
@Youtube
@Bilibili
Additional Tutorial
Normalizing single-cell RNA sequencing data: challenges and opportunities, Nature Methods, 2017