RNA Types in Genome

Files curated and organized by Yuhuan

Reference bed files: /BioII/lulab_b/taoyuhuan/RNA_biotype/bed_by_biotype

0. Total RNA Clean Reads vs. Human Transcriptome

  • I. Total RNA Clean Reads (rRNA depleted) = Clean Reads (adaptors and short reads removed) - spikeIn - Univec - nucleus_rRNA - Mt_rRNA

    • (1) Mapped to Human Nucleus Genome

    • (2) Mapped to MT Genome

    • (3) Mapped to Microbial Genomes

    • (4) Unmapped

  • II. Human Transcriptome: see Sections 2-3 below
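The subtraction in item I can be sketched as plain shell arithmetic. All counts below are made-up placeholders, and the variable names are hypothetical rather than taken from any pipeline.

```shell
#!/bin/sh
# Hypothetical read counts for one sample (illustrative numbers only).
clean_reads=1000000   # after adaptor trimming and short-read removal
spikein=20000         # reads assigned to spike-in sequences
univec=5000           # reads assigned to UniVec contaminants
nucleus_rrna=150000   # reads assigned to nuclear rRNA
mt_rrna=25000         # reads assigned to mitochondrial rRNA

# Total RNA clean reads (rRNA depleted), following the definition in item I.
total_rna_clean=$((clean_reads - spikein - univec - nucleus_rrna - mt_rrna))
echo "total_rna_clean=$total_rna_clean"   # prints total_rna_clean=800000
```

The remaining reads are the ones that are then split into categories (1) to (4).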

1. Mitochondrial Transcriptome

Mitochondrial_Transcriptome | Definition | Biotype or name annotation in gtf | Command for gtf_by_biotype | Note | Command for gtf2bed

MT_tRNA

transcript_type "Mt_tRNA"

awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep 'transcript_type "Mt_tRNA' > MT_tRNA.gtf

awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' MT_tRNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/MT_tRNA.bed
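The gtf2bed command shifts the start coordinate because GTF intervals are 1-based and inclusive, while BED intervals are 0-based and half-open. A minimal sketch on one made-up exon record follows; the function name gtf2bed and the identifiers GENE_A and TX_A are hypothetical.

```shell
#!/bin/sh
# gtf2bed: hypothetical wrapper around the conversion command above.
# Reads GTF records on stdin and writes 6-column BED to stdout.
gtf2bed() {
    awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' \
        | sed 's/gene_id //g' | tr -d '"'
}

# One made-up exon record covering bases 577..647 (1-based, inclusive).
sample=$(printf 'chrM\tHAVANA\texon\t577\t647\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A"; transcript_type "Mt_tRNA";')

# In BED the same interval is start=576, end=647; the name is the gene_id.
bed=$(printf '%s\n' "$sample" | gtf2bed)
printf '%s\n' "$bed"
```

The interval length is unchanged: 647 - 577 + 1 = 71 bases in GTF terms, and 647 - 576 = 71 in BED terms.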

MT_lncRNA

transcript_type "Mt_lncRNA"

awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep -E 'transcript_type "3prime_overlapping_ncRNA|transcript_type "antisense|transcript_type "bidirectional_promoter_lncRNA|transcript_type "lincRNA|transcript_type "macro_lncRNA|transcript_type "non_coding|transcript_type "processed_transcript|transcript_type "sense_intronic|transcript_type "sense_overlapping' | grep chrM > MT_lncRNA_gencode.gtf; awk -F "\t" '$3=="exon"' ../gtf/mitranscriptome.gtf | grep 'tcat "lncrna' | grep chrM > MT_lncRNA_mitranscriptome.gtf; cat MT_lncRNA_gencode.gtf MT_lncRNA_mitranscriptome.gtf > MT_lncRNA.gtf

awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' MT_lncRNA_gencode.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/MT_lncRNA_gencode.bed; awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[3],0,$7}' MT_lncRNA_mitranscriptome.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/MT_lncRNA_mitranscriptome.bed; cat ../bed_by_biotype/MT_lncRNA_gencode.bed ../bed_by_biotype/MT_lncRNA_mitranscriptome.bed > ../bed_by_biotype/MT_lncRNA.bed
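The two conversions differ only in which attribute they read (a[1] for gencode, a[3] for mitranscriptome), which suggests that gene_id sits at a different position in each file. If attributes are separated by "; ", a[3] also keeps a leading space in the BED name field. A possible alternative, sketched here, looks gene_id up by name. The function name gtf2bed_by_key, the sample identifiers, and the attribute order of the second record are assumptions for illustration.

```shell
#!/bin/sh
# gtf2bed_by_key: hypothetical helper. Finds gene_id in column 9 by name
# rather than by position, so the same command serves both annotation files.
gtf2bed_by_key() {
    awk 'BEGIN{FS="\t";OFS="\t"}
         match($9, /gene_id "[^"]+"/) {
             # skip the 9 leading characters (gene_id, space, quote)
             # and drop the closing quote
             id = substr($9, RSTART + 9, RLENGTH - 10)
             print $1, $4-1, $5, id, 0, $7
         }'
}

# Made-up records: gene_id first (gencode-like) and gene_id third (assumed layout).
gencode_line=$(printf 'chrM\tHAVANA\texon\t100\t200\t.\t-\t.\tgene_id "GENE_A"; transcript_id "TX_A";')
mitr_line=$(printf 'chrM\tmitranscriptome\texon\t100\t200\t.\t-\t.\ttcat "lncrna"; tstatus "unannotated"; gene_id "G1"; transcript_id "T1";')

bed_gencode=$(printf '%s\n' "$gencode_line" | gtf2bed_by_key)
bed_mitr=$(printf '%s\n' "$mitr_line" | gtf2bed_by_key)
printf '%s\n%s\n' "$bed_gencode" "$bed_mitr"
```

Records without a gene_id attribute are skipped by this sketch rather than written with an empty name.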

MT_mRNA

transcript_type "Mt_mRNA"

awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep 'transcript_type "protein_coding' | grep chrM > MT_mRNA.gtf

awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' MT_mRNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/MT_mRNA.bed
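The chrM filter above is a text match on the whole line, so it would also keep a record from another chromosome if the string chrM appeared anywhere in its attributes. Whether that ever happens depends on the annotation; the sketch below uses a contrived record to show the difference from a filter on column 1. All records and identifiers are made up.

```shell
#!/bin/sh
# Toy GTF: a chrM exon, a chr1 exon whose attributes mention chrM, a chrM gene line.
toy=$(mktemp)
{
    printf 'chrM\tENSEMBL\texon\t101\t200\t.\t+\t.\tgene_id "GENE_MT"; transcript_type "protein_coding";\n'
    printf 'chr1\tHAVANA\texon\t501\t900\t.\t-\t.\tgene_id "GENE_N"; transcript_type "protein_coding"; note "chrM-like";\n'
    printf 'chrM\tENSEMBL\tgene\t101\t200\t.\t+\t.\tgene_id "GENE_MT"; gene_type "protein_coding";\n'
} > "$toy"

# Text match on the line: the chr1 record is counted too.
n_grep=$(awk -F '\t' '$3=="exon"' "$toy" | grep 'transcript_type "protein_coding' | grep -c chrM || true)

# Column match: only exons whose first column is exactly chrM are kept.
n_awk=$(awk -F '\t' '$3=="exon" && $1=="chrM" && $9 ~ /transcript_type "protein_coding/' "$toy" | grep -c . || true)

echo "grep=$n_grep awk=$n_awk"   # prints grep=2 awk=1
```

The same idea applies to the grep -v chrM filters used for the nuclear biotypes, with $1!="chrM" as the column test.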

Note: MT_exon (for reference, not included)

$3=="exon"

awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep chrM > MT_exon_gencode.gtf; awk -F "\t" '$3=="exon"' ../gtf/mitranscriptome.gtf | grep chrM > MT_exon_mitranscriptome.gtf; cat MT_exon_gencode.gtf MT_exon_mitranscriptome.gtf > MT_exon.gtf

tcat "mixed_read_through"

awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' MT_exon_gencode.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/MT_exon_gencode.bed; awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[3],0,$7}' MT_exon_mitranscriptome.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/MT_exon_mitranscriptome.bed; cat ../bed_by_biotype/MT_exon_gencode.bed ../bed_by_biotype/MT_exon_mitranscriptome.bed > ../bed_by_biotype/MT_exon.bed

Note: MT_rRNA (needs to be removed)

transcript_type "Mt_rRNA"

awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep 'transcript_type "Mt_rRNA' > MT_rRNA.gtf

awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' MT_rRNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/MT_rRNA.bed

2. Human Transcriptome for common usage

Human_Transcriptome for common usage | Definition | Biotype or name annotation in gtf | Command for gtf_by_biotype | Note | Command for gtf2bed

lncRNA

Generic long non-coding RNA biotype that replaced the following biotypes: 3prime_overlapping_ncRNA, antisense, bidirectional_promoter_lncRNA, lincRNA, macro_lncRNA, non_coding, processed_transcript, sense_intronic and sense_overlapping.

transcript_type "3prime_overlapping_ncRNA", transcript_type "antisense", transcript_type "bidirectional_promoter_lncRNA", transcript_type "lincRNA", transcript_type "macro_lncRNA", transcript_type "non_coding", transcript_type "processed_transcript", transcript_type "sense_intronic", transcript_type "sense_overlapping", tcat "lncrna"

awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep -E 'transcript_type "3prime_overlapping_ncRNA|transcript_type "antisense|transcript_type "bidirectional_promoter_lncRNA|transcript_type "lincRNA|transcript_type "macro_lncRNA|transcript_type "non_coding|transcript_type "processed_transcript|transcript_type "sense_intronic|transcript_type "sense_overlapping' | grep -v chrM > lncRNA_gencode.gtf; awk -F "\t" '$3=="exon"' ../gtf/mitranscriptome.gtf | grep 'tcat "lncrna' | grep -v chrM > lncRNA_mitranscriptome.gtf; cat lncRNA_gencode.gtf lncRNA_mitranscriptome.gtf > lncRNA.gtf

awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' lncRNA_gencode.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/lncRNA_gencode.bed; awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[3],0,$7}' lncRNA_mitranscriptome.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/lncRNA_mitranscriptome.bed; cat ../bed_by_biotype/lncRNA_gencode.bed ../bed_by_biotype/lncRNA_mitranscriptome.bed > ../bed_by_biotype/lncRNA.bed
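The lncRNA pattern joins its biotypes with |, which acts as alternation only under extended regular expressions (grep -E). Plain grep reads | as a literal character. A minimal sketch on two made-up attribute strings shows the difference; the pattern is shortened to two biotypes for readability.

```shell
#!/bin/sh
# Two made-up attribute strings: one lincRNA, one protein_coding.
lines=$(printf '%s\n' \
    'gene_id "G1"; transcript_type "lincRNA";' \
    'gene_id "G2"; transcript_type "protein_coding";')

pattern='transcript_type "lincRNA|transcript_type "antisense'

# Basic regular expression: | is literal, so no line matches.
n_basic=$(printf '%s\n' "$lines" | grep -c "$pattern" || true)

# Extended regular expression: | is alternation, so the lincRNA line matches.
n_extended=$(printf '%s\n' "$lines" | grep -E -c "$pattern" || true)

echo "basic=$n_basic extended=$n_extended"   # prints basic=0 extended=1
```

An empty output file after the biotype grep is therefore worth checking against the -E flag first.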

mRNA

transcript_type "protein_coding"

awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep 'transcript_type "protein_coding' | grep -v chrM > mRNA.gtf

awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' mRNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/mRNA.bed

pseudogene

Have homology to proteins but generally suffer from a disrupted coding sequence and an active homologous gene can be found at another locus. Sometimes these entries have an intact coding sequence or an open but truncated ORF, in which case there is other evidence used (for example genomic polyA stretches at the 3' end) to classify them as a pseudogene. Can be further classified as one of the following: processed_pseudogene, polymorphic_pseudogene, retrotransposed, transcribed_processed_pseudogene, transcribed_unprocessed_pseudogene, transcribed_unitary_pseudogene, translated_processed_pseudogene, translated_unprocessed_pseudogene, unitary_pseudogene, unprocessed_pseudogene.

transcript_type "polymorphic_pseudogene", transcript_type "processed_pseudogene", transcript_type "retrotransposed", transcript_type "transcribed_processed_pseudogene", transcript_type "transcribed_unitary_pseudogene", transcript_type "transcribed_unprocessed_pseudogene", transcript_type "translated_processed_pseudogene", transcript_type "translated_unprocessed_pseudogene", transcript_type "unitary_pseudogene", transcript_type "unprocessed_pseudogene"

awk -F '\t' '$3=="exon"' ../gtf/gencode.gtf | grep -E 'transcript_type "polymorphic_pseudogene|transcript_type "processed_pseudogene|transcript_type "retrotransposed|transcript_type "transcribed_processed_pseudogene|transcript_type "transcribed_unitary_pseudogene|transcript_type "transcribed_unprocessed_pseudogene|transcript_type "translated_processed_pseudogene|transcript_type "translated_unprocessed_pseudogene|transcript_type "unitary_pseudogene|transcript_type "unprocessed_pseudogene' > pseudogene.gtf

awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' pseudogene.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/pseudogene.bed

snoRNA

transcript_type "snoRNA"

awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep 'transcript_type "snoRNA' > snoRNA.gtf

awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' snoRNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/snoRNA.bed

snRNA

transcript_type "snRNA"

awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep 'transcript_type "snRNA' > snRNA.gtf

awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' snRNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/snRNA.bed
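The single-biotype rows (snoRNA and snRNA here, and Mt_tRNA and Mt_rRNA in Section 1) all follow one template, so they can be generated in a loop. The sketch below runs on a tiny made-up GTF in a temporary directory; toy.gtf stands in for ../gtf/gencode.gtf, and the gene identifiers are hypothetical.

```shell
#!/bin/sh
# Assumption: a two-record toy GTF replaces the real annotation file.
workdir=$(mktemp -d)
printf 'chr1\tENSEMBL\texon\t101\t200\t.\t+\t.\tgene_id "G_SNO"; transcript_type "snoRNA";\n' >  "$workdir/toy.gtf"
printf 'chr2\tENSEMBL\texon\t301\t400\t.\t-\t.\tgene_id "G_SN"; transcript_type "snRNA";\n'  >> "$workdir/toy.gtf"

for biotype in snoRNA snRNA; do
    # Step 1: exon records of this biotype (same prefix match as the commands above).
    awk -F '\t' '$3=="exon"' "$workdir/toy.gtf" \
        | grep "transcript_type \"$biotype" > "$workdir/$biotype.gtf"
    # Step 2: GTF to 6-column BED, name taken from the first attribute (gene_id).
    awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' "$workdir/$biotype.gtf" \
        | sed 's/gene_id //g' | tr -d '"' > "$workdir/$biotype.bed"
done

cat "$workdir/snoRNA.bed" "$workdir/snRNA.bed"
```

For real use, the toy path would be replaced by the annotation file and the output paths by ../bed_by_biotype/, as in the commands above.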

srpRNA

gene_name "RN7SL*"

awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep 'gene_name "RN7SL' > srpRNA.gtf

belongs to misc_RNA

awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' srpRNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/srpRNA.bed

tRNA

gene_type "Pseudo_tRNA"

gencode_tRNA.gtf

awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' gencode_tRNA.gtf | sed 's/gene_id //g' | tr -d '"' | sed '1,5d' | grep -v chrM > ../bed_by_biotype/tRNA.bed
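The sed '1,5d' step drops the first five output lines, presumably the records produced from header comment lines at the top of gencode_tRNA.gtf; that reading is an assumption. Under it, removing lines that begin with # before the conversion gives the same result without depending on the header length. The two-line header and the record below are made up.

```shell
#!/bin/sh
# Made-up input: two comment lines followed by one record.
toy=$(printf '##description: toy\n##format: gtf\nchr1\tENSEMBL\texon\t11\t80\t.\t+\t.\tgene_id "TRNA_A"; gene_type "Pseudo_tRNA";')

# Drop comment lines first, then convert and exclude chrM as in the command above.
bed=$(printf '%s\n' "$toy" | grep -v '^#' \
    | awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' \
    | sed 's/gene_id //g' | tr -d '"' | grep -v chrM)

printf '%s\n' "$bed"
```

If the five dropped lines are something other than header comments, the original sed step should be kept instead.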

Y_RNA

gene_name "Y_RNA"

awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep 'gene_name "Y_RNA' > Y_RNA.gtf

belongs to misc_RNA

awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' Y_RNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/Y_RNA.bed

misc (srpRNA and Y_RNA are extracted independently; the others may sometimes be ignored)

misc_RNA + ribozyme + scaRNA + vaultRNA/vault_RNA etc.

misc_RNA: 7SK etc. (Xist etc. need to be removed)

-

-

-
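No command is given for the misc row. One possible reading, sketched below, keeps exon records of the listed biotypes and excludes the RN7SL and Y_RNA genes that are extracted separately. The function name misc_exons, the exact biotype list, and the toy records are assumptions. The sketch does not attempt the Xist removal, since no rule for it is stated.

```shell
#!/bin/sh
# misc_exons: hypothetical helper. Reads GTF on stdin, keeps exon records of
# the misc biotypes, and drops srpRNA (RN7SL*) and Y_RNA genes.
misc_exons() {
    awk -F '\t' '$3=="exon"' \
        | grep -E 'transcript_type "(misc_RNA|ribozyme|scaRNA|vaultRNA|vault_RNA)"' \
        | grep -v -E 'gene_name "(RN7SL|Y_RNA)'
}

# Toy records: 7SK and a scaRNA should be kept, the other three dropped.
toy=$(mktemp)
{
    printf 'chr6\tENSEMBL\texon\t1\t300\t.\t+\t.\tgene_id "G1"; transcript_type "misc_RNA"; gene_name "RN7SK";\n'
    printf 'chr14\tENSEMBL\texon\t1\t300\t.\t+\t.\tgene_id "G2"; transcript_type "misc_RNA"; gene_name "RN7SL1";\n'
    printf 'chr7\tENSEMBL\texon\t1\t100\t.\t+\t.\tgene_id "G3"; transcript_type "misc_RNA"; gene_name "Y_RNA";\n'
    printf 'chr1\tENSEMBL\texon\t1\t130\t.\t+\t.\tgene_id "G4"; transcript_type "scaRNA"; gene_name "SCARNA_X";\n'
    printf 'chr1\tENSEMBL\texon\t1\t160\t.\t+\t.\tgene_id "G5"; transcript_type "snRNA"; gene_name "U_X";\n'
} > "$toy"

misc_exons < "$toy" > "$toy.misc.gtf"
cat "$toy.misc.gtf"
```

The resulting GTF can then go through the same gtf2bed conversion as the other rows.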

tucpRNA

tcat "tucp"

awk -F "\t" '$3=="exon"' ../gtf/mitranscriptome.gtf | grep 'tcat "tucp' > tucpRNA.gtf

awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[3],0,$7}' tucpRNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/tucpRNA.bed

circRNA

Note: exon (for reference, not included)

$3=="exon"

awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep -v chrM > exon_gencode.gtf; awk -F "\t" '$3=="exon"' ../gtf/mitranscriptome.gtf | grep -v chrM > exon_mitranscriptome.gtf; cat exon_gencode.gtf exon_mitranscriptome.gtf > exon.gtf

awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' exon_gencode.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/exon_gencode.bed; awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[3],0,$7}' exon_mitranscriptome.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/exon_mitranscriptome.bed; cat ../bed_by_biotype/exon_gencode.bed ../bed_by_biotype/exon_mitranscriptome.bed > ../bed_by_biotype/exon.bed
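Since every row in this section ends in a 6-column BED file, a quick sanity check can catch conversion slips such as a missing field or an empty interval. The helper name check_bed6 and the two toy files are hypothetical; the check covers only the column count and start/end order.

```shell
#!/bin/sh
# check_bed6: hypothetical helper. Exits non-zero if any line of the file is
# not a 6-column, tab-separated record with 0 <= start < end.
check_bed6() {
    awk -F '\t' 'NF != 6 || $2 < 0 || $2 >= $3 { bad++ } END { exit (bad > 0) }' "$1"
}

# Toy files: one valid record, one record with start equal to end.
tmp=$(mktemp -d)
printf 'chr1\t100\t200\tG1\t0\t+\n' > "$tmp/good.bed"
printf 'chr1\t200\t200\tG2\t0\t+\n' > "$tmp/bad.bed"

check_bed6 "$tmp/good.bed" && echo "good.bed ok"
check_bed6 "$tmp/bad.bed"  || echo "bad.bed has malformed records"
```

Running it over the files in ../bed_by_biotype/ after regeneration would be one way to use it.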

3. More Potential Human RNAs

More Potential Human RNAs | Definition | Biotype or name annotation in gtf | Command for gtf_by_biotype | Note | Command for gtf2bed

enhancer

repeats

promoter
