1.1 Mapping, Annotation and QC
Last updated
Last updated
File format
Information contained in file
File description
Notes
fastq
reads
small RNA-seq reads from single-end libraries.
This will be processed from long reads we got. Raw reads are trimmed as single-ended, 50nt clean reads (see details in the protocol and example).
fasta
genome/transcriptome sequences
sequences of the whole human genome or different RNAs (transcriptome)
gtf / gff
genome annotation
Default genome annotation file is from GENCODE, mirBase, etc.
Be careful about exon and intron for mRNA/lncRNA, precursor and mature product for miRNA.
File format
Information contained in file
File description
Notes
sam/bam
alignments, i.e. mapped reads (bam is a binary format of sam)
Produced by mapping reads to transcriptome
We map to transcriptome for a better sensitivity (see details in protocol and example).
bt2
indexed sequences
Indexed genome/ transcriptome
Different ncRNA types are indexed separately and mapped sequentially.
RSEM: index and many utility package for transcriptome analysis
fastqc: QC for fastq file
cutadapt: trim and cut adaptor
bowtie2: mapping
Not needed for sRNA-seq.
Use the following criteria to QC the mapped samples for cfRNAs (exRNAs):
Check point
Threshold
Notes
Raw reads quality
reads quality >28 (median lines in green area)
Check fastqc results(*.html)
Clean reads number
> 10 million
Adaptors and too-short sequences removed reads
rRNAs%
< 10%
Reads mapped to rRNAs (all following % are divided by the total number of clean reads)
HG%
> 60% (optional)
Reads mapped to Human Genome except rRNAs
Transcriptome%
> 50%
Reads mapped to Human Transcriptome (including rRNA, miRNA, piRNA, Y RNA, srpRNA, snRNA, snoRNA, tRNA, mRNA exons, lncRNA exons, TUCP exons)
Y RNA%
10%~65%
Reads mapped to Y RNA
miRNA%
10%~65% (up to 80% for exoRNAs)
Reads mapped to miRNA
3.1 Why use this method: We want to determine a satisfying mapping order by examine the mapping ratio and length distribution. It takes hours to map and collect statistics from a specific order. We use a special method, using python (numpy) to manipulate matrix. By using this method, we can test hundreds of mapping order within few minutes.
3.2 Scripts
Introduction to exRNA-seq;
Mapping, Annotation and QC
Week V - PARTII.1.Mapping etc - Wang
Week VI - PARTII. 2. run your job - example 1 bash
Bioinformatics Data Skills
10) Working with Sequence Data
Map the example fastq files and summarize the distribution of 1) reads' length and 2) reads' percentage for each type of RNA (e.g. miRNA, piRNA, Y RNA, srp RNA, etc) **. (You may find some script examples for this statistics calculation from .)
At first we do mapping for all RNAs at one time, we use this snakefile and config file to do mapping at one time. (. .)
Secondly we do some matrix manipulation tricks. Using numpy to do matrix indexing and sorting is ultra fast. By using some tricks, we can have the desired statistics within few minutes for hundreds of different mapping orders: (. )(You can find some tries and tests of the mapping order methods and statistics and plot jupyter notebook here: )
At last we also extend our algorithm, we can create a big dictionary to store all the possible mapping assignment in every order and every one map condition. We encode the binary condition sequence into 10-bit digits. And store the whole dictionary together. It can test up to 100,000 kinds of mapping order within few hours. The jupyter-notebook presents the main idea of this method. ()
: 1.preprocessing_mapping_QC
: *
: