RNA Types in Genome

Files curated and organized by Yuhuan

Reference bed files: /BioII/lulab_b/taoyuhuan/RNA_biotype/bed_by_biotype

0. Total RNA Clean Reads vs. Human Transcriptome

  • I. Total RNA Clean Reads (rRNA depleted) = Clean Reads (after removing adaptors and short reads) - spikeIn - UniVec - nucleus_rRNA - Mt_rRNA

    • (1) Mapped to Human Nucleus Genome

    • (2) Mapped to MT Genome

    • (3) Mapped to Microbial Genomes

    • (4) Unmapped

  • II. Human Transcriptome: see sections 2 and 3 below
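
As a minimal sketch of the subtraction in item I, the snippet below uses made-up read counts. Every number and variable name is hypothetical and only illustrates the arithmetic.

```shell
# Hypothetical read counts for one sample (made-up numbers, for illustration only).
clean_reads=1000000    # reads left after adaptor and short-read removal
spikein=5000           # reads assigned to spike-in sequences
univec=2000            # reads assigned to UniVec
nucleus_rrna=300000    # reads assigned to nuclear rRNA
mt_rrna=50000          # reads assigned to mitochondrial rRNA

# Total RNA clean reads = clean reads minus each removed category.
total_rna_clean=$(( clean_reads - spikein - univec - nucleus_rrna - mt_rrna ))
echo "Total RNA clean reads: ${total_rna_clean}"
```

With these example values the script prints 643000, which is the pool of reads that is then split into groups (1) to (4).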

1. Mitochondrial Transcriptome

Mitochondrial_Transcriptome: each entry below gives, where available, its Definition, Biotype or name annotation in gtf, Command for gtf_by_biotype, Note, and Command for gtf2bed.

MT_tRNA

  • Biotype or name annotation in gtf: transcript_type "Mt_tRNA"

  • Command for gtf_by_biotype:

    awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep 'transcript_type "Mt_tRNA' > MT_tRNA.gtf

  • Command for gtf2bed:

    awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' MT_tRNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/MT_tRNA.bed
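
To make the gtf2bed step concrete, here is a small sketch that runs the same awk/sed/tr pipeline on a single made-up GTF record. The record and its IDs are hypothetical. It shows that `$4-1` converts the 1-based GTF start into a 0-based BED start and that the first attribute (gene_id) becomes the BED name column.

```shell
# One made-up GTF exon record (tab-separated; GTF coordinates are 1-based, inclusive).
gtf_line=$(printf 'chrM\tENSEMBL\texon\t577\t647\t.\t+\t.\tgene_id "ENSG_EXAMPLE.1"; transcript_id "ENST_EXAMPLE.1"; transcript_type "Mt_tRNA";')

# Same conversion as the gtf2bed command: start-1 gives the 0-based BED start,
# a[1] is the first attribute (gene_id), score is fixed to 0, strand is kept.
bed_line=$(printf '%s\n' "$gtf_line" \
  | awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' \
  | sed 's/gene_id //g' \
  | tr -d '"')

printf '%s\n' "$bed_line"
```

For this record the six tab-separated output fields are chrM, 576, 647, ENSG_EXAMPLE.1, 0 and +.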

MT_lncRNA

  • Biotype or name annotation in gtf: transcript_type "Mt_lncRNA"

  • Command for gtf_by_biotype:

    awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep -E 'transcript_type "3prime_overlapping_ncRNA|transcript_type "antisense|transcript_type "bidirectional_promoter_lncRNA|transcript_type "lincRNA|transcript_type "macro_lncRNA|transcript_type "non_coding|transcript_type "processed_transcript|transcript_type "sense_intronic|transcript_type "sense_overlapping' | grep chrM > MT_lncRNA_gencode.gtf
    awk -F "\t" '$3=="exon"' ../gtf/mitranscriptome.gtf | grep 'tcat "lncrna' | grep chrM > MT_lncRNA_mitranscriptome.gtf
    cat MT_lncRNA_gencode.gtf MT_lncRNA_mitranscriptome.gtf > MT_lncRNA.gtf

  • Command for gtf2bed:

    awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' MT_lncRNA_gencode.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/MT_lncRNA_gencode.bed
    awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[3],0,$7}' MT_lncRNA_mitranscriptome.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/MT_lncRNA_mitranscriptome.bed
    cat ../bed_by_biotype/MT_lncRNA_gencode.bed ../bed_by_biotype/MT_lncRNA_mitranscriptome.bed > ../bed_by_biotype/MT_lncRNA.bed
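
A pattern that joins several biotypes with `|` only works as an "or" when grep uses extended regular expressions (`grep -E`). In basic mode the `|` is a literal character. The sketch below checks this on three made-up attribute strings, and the shortened pattern is only for illustration.

```shell
# Three made-up attribute strings, one per biotype.
sample=$(printf '%s\n' \
  'gene_id "G1"; transcript_type "lincRNA";' \
  'gene_id "G2"; transcript_type "antisense";' \
  'gene_id "G3"; transcript_type "protein_coding";')

pattern='transcript_type "lincRNA|transcript_type "antisense'

# Basic regex: "|" is a literal character, so no line matches.
basic_hits=$(printf '%s\n' "$sample" | grep -c "$pattern" || true)
# Extended regex: "|" means "or", so both lncRNA biotypes match.
extended_hits=$(printf '%s\n' "$sample" | grep -c -E "$pattern")

echo "basic: ${basic_hits}, extended: ${extended_hits}"
```

If an extracted GTF comes out empty, a missing `-E` on a multi-biotype pattern is one likely cause to check first.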

MT_mRNA

  • Biotype or name annotation in gtf: transcript_type "Mt_mRNA"

  • Command for gtf_by_biotype:

    awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep 'transcript_type "protein_coding' | grep chrM > MT_mRNA.gtf

  • Command for gtf2bed:

    awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' MT_mRNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/MT_mRNA.bed

Note: MT_exon (for reference, not included)

  • Biotype or name annotation in gtf: $3=="exon"

  • Command for gtf_by_biotype:

    awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep chrM > MT_exon_gencode.gtf
    awk -F "\t" '$3=="exon"' ../gtf/mitranscriptome.gtf | grep chrM > MT_exon_mitranscriptome.gtf
    cat MT_exon_gencode.gtf MT_exon_mitranscriptome.gtf > MT_exon.gtf

  • Note: tcat "mixed_read_through"

  • Command for gtf2bed:

    awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' MT_exon_gencode.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/MT_exon_gencode.bed
    awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[3],0,$7}' MT_exon_mitranscriptome.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/MT_exon_mitranscriptome.bed
    cat ../bed_by_biotype/MT_exon_gencode.bed ../bed_by_biotype/MT_exon_mitranscriptome.bed > ../bed_by_biotype/MT_exon.bed

Note: MT_rRNA (needs to be removed)

  • Biotype or name annotation in gtf: transcript_type "Mt_rRNA"

  • Command for gtf_by_biotype:

    awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep 'transcript_type "Mt_rRNA' > MT_rRNA.gtf

  • Command for gtf2bed:

    awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' MT_rRNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/MT_rRNA.bed

2. Human Transcriptome for common usage

Human_Transcriptome for common usage: each entry below gives, where available, its Definition, Biotype or name annotation in gtf, Command for gtf_by_biotype, Note, and Command for gtf2bed.

lncRNA

  • Definition: Generic long non-coding RNA biotype that replaced the following biotypes: 3prime_overlapping_ncRNA, antisense, bidirectional_promoter_lncRNA, lincRNA, macro_lncRNA, non_coding, processed_transcript, sense_intronic and sense_overlapping.

  • Biotype or name annotation in gtf: transcript_type "3prime_overlapping_ncRNA", transcript_type "antisense", transcript_type "bidirectional_promoter_lncRNA", transcript_type "lincRNA", transcript_type "macro_lncRNA", transcript_type "non_coding", transcript_type "processed_transcript", transcript_type "sense_intronic", transcript_type "sense_overlapping", tcat "lncrna"

  • Command for gtf_by_biotype:

    awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep -E 'transcript_type "3prime_overlapping_ncRNA|transcript_type "antisense|transcript_type "bidirectional_promoter_lncRNA|transcript_type "lincRNA|transcript_type "macro_lncRNA|transcript_type "non_coding|transcript_type "processed_transcript|transcript_type "sense_intronic|transcript_type "sense_overlapping' | grep -v chrM > lncRNA_gencode.gtf
    awk -F "\t" '$3=="exon"' ../gtf/mitranscriptome.gtf | grep 'tcat "lncrna' | grep -v chrM > lncRNA_mitranscriptome.gtf
    cat lncRNA_gencode.gtf lncRNA_mitranscriptome.gtf > lncRNA.gtf

  • Command for gtf2bed:

    awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' lncRNA_gencode.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/lncRNA_gencode.bed
    awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[3],0,$7}' lncRNA_mitranscriptome.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/lncRNA_mitranscriptome.bed
    cat ../bed_by_biotype/lncRNA_gencode.bed ../bed_by_biotype/lncRNA_mitranscriptome.bed > ../bed_by_biotype/lncRNA.bed

mRNA

  • Biotype or name annotation in gtf: transcript_type "protein_coding"

  • Command for gtf_by_biotype:

    awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep 'transcript_type "protein_coding' | grep -v chrM > mRNA.gtf

  • Command for gtf2bed:

    awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' mRNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/mRNA.bed

pseudogene

  • Definition: Pseudogenes have homology to proteins but generally suffer from a disrupted coding sequence, and an active homologous gene can be found at another locus. Sometimes these entries have an intact coding sequence or an open but truncated ORF, in which case other evidence (for example genomic polyA stretches at the 3' end) is used to classify them as pseudogenes. They can be further classified as one of the following: processed_pseudogene, polymorphic_pseudogene, retrotransposed, transcribed_processed_pseudogene, transcribed_unprocessed_pseudogene, transcribed_unitary_pseudogene, translated_processed_pseudogene, translated_unprocessed_pseudogene, unitary_pseudogene, unprocessed_pseudogene.

  • Biotype or name annotation in gtf: transcript_type "polymorphic_pseudogene", transcript_type "processed_pseudogene", transcript_type "retrotransposed", transcript_type "transcribed_processed_pseudogene", transcript_type "transcribed_unitary_pseudogene", transcript_type "transcribed_unprocessed_pseudogene", transcript_type "translated_processed_pseudogene", transcript_type "translated_unprocessed_pseudogene", transcript_type "unitary_pseudogene", transcript_type "unprocessed_pseudogene"

  • Command for gtf_by_biotype:

    awk -F '\t' '$3=="exon"' ../gtf/gencode.gtf | grep -E 'transcript_type "polymorphic_pseudogene|transcript_type "processed_pseudogene|transcript_type "retrotransposed|transcript_type "transcribed_processed_pseudogene|transcript_type "transcribed_unitary_pseudogene|transcript_type "transcribed_unprocessed_pseudogene|transcript_type "translated_processed_pseudogene|transcript_type "translated_unprocessed_pseudogene|transcript_type "unitary_pseudogene|transcript_type "unprocessed_pseudogene' > pseudogene.gtf

  • Command for gtf2bed:

    awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' pseudogene.gtf | sed 's/gene_id //g' | tr -d '"' > pseudogene.bed

snoRNA

  • Biotype or name annotation in gtf: transcript_type "snoRNA"

  • Command for gtf_by_biotype:

    awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep 'transcript_type "snoRNA' > snoRNA.gtf

  • Command for gtf2bed:

    awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' snoRNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/snoRNA.bed

snRNA

  • Biotype or name annotation in gtf: transcript_type "snRNA"

  • Command for gtf_by_biotype:

    awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep 'transcript_type "snRNA' > snRNA.gtf

  • Command for gtf2bed:

    awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' snRNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/snRNA.bed
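
The single-biotype entries share one extraction pattern: keep exon rows, then keep rows with a given transcript_type. The sketch below wraps that pattern in a hypothetical helper, `extract_biotype`, which is not part of the original files. Unlike the grep patterns above, which omit the closing quote and therefore match any biotype that starts with the given text, the helper matches the full quoted value.

```shell
# Hypothetical helper: print exon records of exactly one transcript_type.
# Usage: extract_biotype <gtf> <biotype> > out.gtf
extract_biotype() {
  local gtf=$1 biotype=$2
  # Keep exon rows whose attribute column contains the exact quoted biotype.
  awk -F '\t' -v pat="transcript_type \"${biotype}\"" \
    '$3 == "exon" && index($9, pat) > 0' "$gtf"
}

# Tiny made-up GTF to exercise the helper.
demo_gtf=$(mktemp)
printf 'chr1\tX\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_type "snRNA";\n' >  "$demo_gtf"
printf 'chr1\tX\tgene\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_type "snRNA";\n' >> "$demo_gtf"
printf 'chr2\tX\texon\t300\t400\t.\t-\t.\tgene_id "G2"; transcript_type "snoRNA";\n' >> "$demo_gtf"

snrna_count=$(extract_biotype "$demo_gtf" snRNA | wc -l | tr -d ' ')
echo "snRNA exon records: ${snrna_count}"
rm -f "$demo_gtf"
```

Only the first record is kept: the second is a gene row rather than an exon, and the third carries a different biotype.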

srpRNA

  • Biotype or name annotation in gtf: gene_name "RN7SL*"

  • Command for gtf_by_biotype:

    awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep 'gene_name "RN7SL' > srpRNA.gtf

  • Note: belongs to misc_RNA

  • Command for gtf2bed:

    awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' srpRNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/srpRNA.bed

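tRNA

  • Biotype or name annotation in gtf: gene_type "Pseudo_tRNA"

  • Command for gtf_by_biotype: gencode_tRNA.gtf (used directly as input to gtf2bed)

  • Command for gtf2bed:

    awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' gencode_tRNA.gtf | sed 's/gene_id //g' | tr -d '"' | sed '1,5d' | grep -v chrM > ../bed_by_biotype/tRNA.bed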
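
The `sed '1,5d'` step in the tRNA command drops the first five output lines by position. This presumably removes the header lines of gencode_tRNA.gtf, which is an assumption here, not something stated in the original. The sketch below uses a made-up file with five `##` header lines and compares the positional filter with a comment-based one (`grep -v '^#'`), which does not depend on the number of header lines.

```shell
# Made-up file: five "##" header lines followed by two records.
demo=$(mktemp)
{
  printf '##description: example\n##provider: example\n##contact: example\n##format: gtf\n##date: example\n'
  printf 'chr1\tX\ttRNA\t10\t80\t.\t+\t.\tgene_id "T1";\n'
  printf 'chr2\tX\ttRNA\t20\t90\t.\t-\t.\tgene_id "T2";\n'
} > "$demo"

by_position=$(sed '1,5d' "$demo" | wc -l | tr -d ' ')    # drop lines 1-5 by position
by_comment=$(grep -v '^#' "$demo" | wc -l | tr -d ' ')   # drop lines starting with '#'
echo "sed: ${by_position}, grep: ${by_comment}"
rm -f "$demo"
```

Both filters keep the same two records here. They would differ if a file had more or fewer than five header lines.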

Y_RNA

  • Biotype or name annotation in gtf: gene_name "Y_RNA"

  • Command for gtf_by_biotype:

    awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep 'gene_name "Y_RNA' > Y_RNA.gtf

  • Note: belongs to misc_RNA

  • Command for gtf2bed:

    awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' Y_RNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/Y_RNA.bed

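misc (srpRNA and Y_RNA are extracted independently; the other types may sometimes be ignored)

  • Definition: misc_RNA + ribozyme + scaRNA + vaultRNA/vault_RNA, etc.

  • Biotype or name annotation in gtf: misc_RNA: 7SK, etc. (XIST etc. need to be removed)

  • Command for gtf_by_biotype: -

  • Note: -

  • Command for gtf2bed: -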

tucpRNA

  • Biotype or name annotation in gtf: tcat "tucp"

  • Command for gtf_by_biotype:

    awk -F "\t" '$3=="exon"' ../gtf/mitranscriptome.gtf | grep 'tcat "tucp' > tucpRNA.gtf

  • Command for gtf2bed:

    awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[3],0,$7}' tucpRNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/tucpRNA.bed
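
The mitranscriptome commands take the name from `a[3]`, the third `;`-separated attribute, while the GENCODE commands use `a[1]`. Picking by position depends on the attribute order of each file. When attributes are separated by "; ", any attribute after the first also keeps a leading space. As a sketch of an order-independent alternative, the snippet below looks up gene_id by key. The record and its attribute order are made up for illustration.

```shell
# Made-up record whose gene_id is the third attribute (this order is an assumption).
line=$(printf 'chr1\tX\texon\t100\t200\t.\t+\t.\ttcat "tucp"; transcript_id "T1"; gene_id "G1";')

# Look up gene_id by key instead of by position in the attribute list.
gene_id=$(printf '%s\n' "$line" | awk -F '\t' '{
  n = split($9, attrs, ";")
  for (i = 1; i <= n; i++) {
    if (attrs[i] ~ /^ *gene_id /) {
      sub(/^ *gene_id /, "", attrs[i])   # drop the key and any leading space
      gsub(/"/, "", attrs[i])            # drop the surrounding quotes
      print attrs[i]
    }
  }
}')
echo "$gene_id"
```

The same loop works for both annotation sources, at the cost of a longer awk program than the one-liners used above.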

circRNA

Note: exon (for reference, not included)

  • Biotype or name annotation in gtf: $3=="exon"

  • Command for gtf_by_biotype:

    awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep -v chrM > exon_gencode.gtf
    awk -F "\t" '$3=="exon"' ../gtf/mitranscriptome.gtf | grep -v chrM > exon_mitranscriptome.gtf
    cat exon_gencode.gtf exon_mitranscriptome.gtf > exon.gtf

  • Command for gtf2bed:

    awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' exon_gencode.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/exon_gencode.bed
    awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[3],0,$7}' exon_mitranscriptome.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/exon_mitranscriptome.bed
    cat ../bed_by_biotype/exon_gencode.bed ../bed_by_biotype/exon_mitranscriptome.bed > ../bed_by_biotype/exon.bed

3. More Potential Human RNAs

More Potential Human RNAs: each entry below gives, where available, its Definition, Biotype or name annotation in gtf, Command for gtf_by_biotype, Note, and Command for gtf2bed.

enhancer

repeats

promoter

How to obtain annotation information for common gene/RNA types