1.2 Expression Matrix
Pipeline

Data Structure
"genomes/hg38/" # reference genomes (i.e. genome sequence and annotation)
"shared_scripts/" # shared scripts by Lu Lab
~/github # I sync my own scripts to github
~/proj_exRNA/
|-- RNA_index # rerference transcriptomes (fasta and index)
|-- sample*_name # ...
|-- sample2_name # different samples
`-- sample1_name
|-- fastq # raw data: fastq files
|-- fastqc # QC of fastq
|-- trim # trimmed fastq (e.g. 3' adaptor cutted)
|-- mapped # mapped data: SAM/BAM files
`-- exp # expression of each gene/ncRNAInputs
File format
Information contained in file
File description
Notes
bam
alignments
Produced by mapping reads to the transcriptome.
Reads are trimmed using a proprietary version of cutAdapt. We map to transcriptome for a better sensitivity (see details in protocol and example).
Outputs
File format
Information contained in file
File description
Notes
bigWig
signal
Normalized RNA-seq signal
Signals are generated for transcriptome both the plus and minus strands and for unique reads and unique+multimapping reads.
tsv
gene (ncRNA) quantifications
Non-normalized counts.
Running Scripts
Software/Tools
RSEM
homer
HTseq
FeatureCounts
Example of single case
# genome seuqneces and annotaions
ln -s /BioII/lulab_b/shared/genomes ~/genomes
export hg38=~/genomes/human_hg38/sequence/GRCh38.p12.genome.fa
export gtf=~/genomes/human_hg38/gtf
# working space
cd ~/proj_exRNA# first convert relative positions to genomic coordinates
# then convert bam to bigWig, so that we can view it in genome browser
#####################################################################
#0.1 convert the relative positions on RNAs to genomic coordinates
#####################################################################
rsem-tbam2gbam miRNA.indexDir/ NC_1.miRNA.sam NC_1.miRNA.rsem.bam
#####################################################################
#0.2 two ways to convert bam/sam to bedGraph/bigwig for visulaization
#####################################################################
#0.2a Using homer
#Make tag directories for each sample (e.g. NC_1)
#works with sam or bam (samtools must be installed for bam)
makeTagDirectory NC_1.miRNA.tagsDir/ NC_1.miRNA.sorted.bam
## Add "-strand separate" for strand-specific sequencing
makeUCSCfile NC_1.miRNA.tagsDir/ -fragLength given -o auto
# (repeat for other tag directories)makeTagDirectory NC_1.miRNA.tagsDir/ NC_1.miRNA.sorted.bam analyzeRepeats.pl analyzeRepeats.pl /BioII/lulab_b/shared/genomes/human_hg38/gtf/miRNA.gtf hg38 -count exons -d NC_1.miRNA.tagsDir/ -gid -noadj > NC_1.miRNA.homer.rpmhtseq-count -f bam -m intersection-strict miRNA_primary_transcript NC_1.miRNA_hg38_mapped.sam /BioII/lulab_b/shared/genomes/human_hg38/gtf/miRNA.gff > NC_1.miRNA.htseq.countsfeatureCounts -s 1 -a /BioII/lulab_b/shared/genomes/human_hg38/gtf/miRNA.gff -o NC_1.miRNA.featureCounts.counts NC_1.miRNA.sorted.bamTips/Utilities
Merge multiple bams to one
# mutiple bams to one bam file, in Snakemake form
rule mergeBam:
input:
bam = "02.mapping/{library}/sequentialMap/{library}.hg38other.bam"
output:
bam = "02.mapping/{library}/sequentialMap/{library}.sequentialMap.bam"
params:
jobname = "{library}.mergeBam",
dir = "02.mapping/{library}/sequentialMap/"
shell:
"""
samtools merge {output.bam} {input.bam} `find {params.dir} -name "*.rsem.clean.bam"`Homework and more
Visualize your mapped reads with IGV (locally) and/or UCSC Genome Browser (on line).
Learn how to construct the expression matrix using HTSeq, featureCounts and homer; then compare the difference among these three methods.
More Reading and Practice
Additional Tutorial : 2. Construction of expression matrix
Bioinformatics Data Skills
11) Working with Alignment Data
Video
a) Expression matrix
Last updated