RNA Types in Genome
Files curated and organized by Yuhuan
Reference bed files: /BioII/lulab_b/taoyuhuan/RNA_biotype/bed_by_biotype
0. Total RNA Clean Reads vs. Human Transcriptome
I. Total RNA Clean Reads (rRNA depleted) = Clean Reads (adaptors and short reads removed) - spikeIn - Univec - nucleus_rRNA - Mt_rRNA
(1) Mapped to Human Nucleus Genome
(2) Mapped to MT Genome
(3) Mapped to Microbial Genomes
(4) Unmapped
II. Human Transcriptome: see sections 2 and 3 below
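The read accounting in item I is a plain subtraction. A minimal sketch with made-up counts (every number below is hypothetical and only illustrates the formula):

```shell
# Hypothetical read counts, for illustration only; not taken from any real sample.
clean_reads=1000000    # reads left after adaptor and short-read removal
spikein=20000          # reads mapped to spike-in sequences
univec=5000            # reads mapped to Univec
nucleus_rrna=150000    # reads mapped to nuclear rRNA
mt_rrna=25000          # reads mapped to mitochondrial rRNA

# Total RNA clean reads (rRNA depleted), following the formula in item I.
total_rna_clean=$((clean_reads - spikein - univec - nucleus_rrna - mt_rrna))
echo "$total_rna_clean"
```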
1. Mitochondrial Transcriptome
MT_tRNA
transcript_type "Mt_tRNA"
awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep 'transcript_type "Mt_tRNA' > MT_tRNA.gtf
awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' MT_tRNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/MT_tRNA.bed
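To see what the GTF-to-BED step does, it can help to run it on a single record. The sketch below uses one hypothetical exon line (the gene and transcript IDs are placeholders): GTF coordinates are 1-based and inclusive, BED coordinates are 0-based and half-open, which is why the start is written as $4-1 while the end stays $5.

```shell
tmpdir=$(mktemp -d)

# One hypothetical tab-separated GTF exon record; IDs are placeholders.
printf 'chrM\tENSEMBL\texon\t577\t647\t.\t+\t.\tgene_id "ENSG_EXAMPLE.1"; transcript_id "ENST_EXAMPLE.1"; transcript_type "Mt_tRNA";\n' > "$tmpdir/example.gtf"

# Same conversion as in the recipe: take the first attribute (gene_id) as the BED name,
# shift the start by one, then strip the "gene_id " label and the quotes.
bed_line=$(awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' "$tmpdir/example.gtf" | sed 's/gene_id //g' | tr -d '"')

printf '%s\n' "$bed_line"
rm -r "$tmpdir"
```

The printed line has six tab-separated columns: chrM, 576, 647, ENSG_EXAMPLE.1, 0, +.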
MT_lncRNA
transcript_type "Mt_lncRNA"
awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep -E 'transcript_type "3prime_overlapping_ncRNA|transcript_type "antisense|transcript_type "bidirectional_promoter_lncRNA|transcript_type "lincRNA|transcript_type "macro_lncRNA|transcript_type "non_coding|transcript_type "processed_transcript|transcript_type "sense_intronic|transcript_type "sense_overlapping' | grep chrM > MT_lncRNA_gencode.gtf
awk -F "\t" '$3=="exon"' ../gtf/mitranscriptome.gtf | grep 'tcat "lncrna' | grep chrM > MT_lncRNA_mitranscriptome.gtf
cat MT_lncRNA_gencode.gtf MT_lncRNA_mitranscriptome.gtf > MT_lncRNA.gtf
awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' MT_lncRNA_gencode.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/MT_lncRNA_gencode.bed
awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[3],0,$7}' MT_lncRNA_mitranscriptome.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/MT_lncRNA_mitranscriptome.bed
cat ../bed_by_biotype/MT_lncRNA_gencode.bed ../bed_by_biotype/MT_lncRNA_mitranscriptome.bed > ../bed_by_biotype/MT_lncRNA.bed
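A pattern that lists several transcript types separated by | only acts as an alternation under extended regular expressions (grep -E); under POSIX basic regular expressions the | is matched as a literal character. A quick check on three hypothetical attribute lines:

```shell
# Three hypothetical attribute strings, one per line.
types='transcript_type "lincRNA";
transcript_type "antisense";
transcript_type "protein_coding";'

pattern='transcript_type "lincRNA|transcript_type "antisense'

# Basic regex: "|" is literal, so no line matches (grep -c prints 0 and exits non-zero).
plain_hits=$(printf '%s\n' "$types" | grep -c "$pattern" || true)

# Extended regex: "|" separates alternatives, so the two lncRNA-type lines match.
ere_hits=$(printf '%s\n' "$types" | grep -E -c "$pattern")

echo "plain grep: $plain_hits, grep -E: $ere_hits"
```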
MT_mRNA
transcript_type "Mt_mRNA"
awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep 'transcript_type "protein_coding' | grep chrM > MT_mRNA.gtf
awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' MT_mRNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/MT_mRNA.bed
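The grep chrM / grep -v chrM split matches the string anywhere on the line. A stricter variant, shown here only as a sketch on a hypothetical two-record GTF, tests the chromosome column directly in awk, so text elsewhere in the record cannot affect the split:

```shell
tmpdir=$(mktemp -d)

# Hypothetical miniature GTF: one mitochondrial and one nuclear protein-coding exon.
{
  printf 'chrM\tENSEMBL\texon\t3307\t4262\t.\t+\t.\tgene_id "G_MT"; transcript_type "protein_coding";\n'
  printf 'chr1\tHAVANA\texon\t1001\t1200\t.\t+\t.\tgene_id "G_NUC"; transcript_type "protein_coding";\n'
} > "$tmpdir/mini.gtf"

# Column 1 holds the chromosome name, so compare it exactly instead of grepping the whole line.
awk -F "\t" '$3=="exon" && $1=="chrM"' "$tmpdir/mini.gtf" | grep 'transcript_type "protein_coding' > "$tmpdir/MT_mRNA.gtf"
awk -F "\t" '$3=="exon" && $1!="chrM"' "$tmpdir/mini.gtf" | grep 'transcript_type "protein_coding' > "$tmpdir/mRNA.gtf"

mt_chrom=$(cut -f1 "$tmpdir/MT_mRNA.gtf")
nuc_chrom=$(cut -f1 "$tmpdir/mRNA.gtf")
echo "MT file: $mt_chrom, nuclear file: $nuc_chrom"
rm -r "$tmpdir"
```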
Note: MT_exon (for reference, not included)
$3=="exon"
awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep chrM > MT_exon_gencode.gtf
awk -F "\t" '$3=="exon"' ../gtf/mitranscriptome.gtf | grep chrM > MT_exon_mitranscriptome.gtf
cat MT_exon_gencode.gtf MT_exon_mitranscriptome.gtf > MT_exon.gtf
tcat "mixed_read_through"
awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' MT_exon_gencode.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/MT_exon_gencode.bed
awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[3],0,$7}' MT_exon_mitranscriptome.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/MT_exon_mitranscriptome.bed
cat ../bed_by_biotype/MT_exon_gencode.bed ../bed_by_biotype/MT_exon_mitranscriptome.bed > ../bed_by_biotype/MT_exon.bed
Note: MT_rRNA (needs to be removed)
transcript_type "Mt_rRNA"
awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep 'transcript_type "Mt_rRNA' > MT_rRNA.gtf
awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' MT_rRNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/MT_rRNA.bed
2. Human Transcriptome for common usage
lncRNA
Generic long non-coding RNA biotype that replaced the following biotypes: 3prime_overlapping_ncRNA, antisense, bidirectional_promoter_lncRNA, lincRNA, macro_lncRNA, non_coding, processed_transcript, sense_intronic and sense_overlapping.
transcript_type "3prime_overlapping_ncRNA", transcript_type "antisense", transcript_type "bidirectional_promoter_lncRNA", transcript_type "lincRNA", transcript_type "macro_lncRNA", transcript_type "non_coding", transcript_type "processed_transcript", transcript_type "sense_intronic", transcript_type "sense_overlapping", tcat "lncrna"
awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep -E 'transcript_type "3prime_overlapping_ncRNA|transcript_type "antisense|transcript_type "bidirectional_promoter_lncRNA|transcript_type "lincRNA|transcript_type "macro_lncRNA|transcript_type "non_coding|transcript_type "processed_transcript|transcript_type "sense_intronic|transcript_type "sense_overlapping' | grep -v chrM > lncRNA_gencode.gtf
awk -F "\t" '$3=="exon"' ../gtf/mitranscriptome.gtf | grep 'tcat "lncrna' | grep -v chrM > lncRNA_mitranscriptome.gtf
cat lncRNA_gencode.gtf lncRNA_mitranscriptome.gtf > lncRNA.gtf
awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' lncRNA_gencode.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/lncRNA_gencode.bed
awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[3],0,$7}' lncRNA_mitranscriptome.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/lncRNA_mitranscriptome.bed
cat ../bed_by_biotype/lncRNA_gencode.bed ../bed_by_biotype/lncRNA_mitranscriptome.bed > ../bed_by_biotype/lncRNA.bed
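The recipes take the BED name from a fixed position in column 9 (a[1] for the gencode file, a[3] for the mitranscriptome file), which assumes gene_id always sits at that position. A possible alternative, sketched below on two hypothetical records whose attribute order is made up for illustration, looks the attribute up by name so the same command works for both layouts:

```shell
tmpdir=$(mktemp -d)

# Two hypothetical records with gene_id at different positions in column 9.
printf 'chr1\tsrcA\texon\t101\t200\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A";\n' > "$tmpdir/mixed.gtf"
printf 'chr2\tsrcB\texon\t301\t450\t.\t-\t.\ttcat "lncrna"; transcript_id "TX_B"; gene_id "GENE_B";\n' >> "$tmpdir/mixed.gtf"

# Scan every attribute, keep the one named gene_id, and strip its label and quotes.
bed=$(awk 'BEGIN{FS="\t";OFS="\t"}
{
  id="NA"                           # fallback name if no gene_id attribute is present
  n=split($9,a,";")
  for(i=1;i<=n;i++){
    if(a[i] ~ /^ *gene_id "/){
      id=a[i]
      sub(/^ *gene_id "/,"",id)     # drop the leading label and opening quote
      sub(/".*$/,"",id)             # drop the closing quote and anything after it
    }
  }
  print $1,$4-1,$5,id,0,$7
}' "$tmpdir/mixed.gtf")

printf '%s\n' "$bed"
rm -r "$tmpdir"
```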
mRNA
transcript_type "protein_coding"
awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep 'transcript_type "protein_coding' | grep -v chrM > mRNA.gtf
awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' mRNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/mRNA.bed
pseudogene
Have homology to proteins but generally suffer from a disrupted coding sequence and an active homologous gene can be found at another locus. Sometimes these entries have an intact coding sequence or an open but truncated ORF, in which case there is other evidence used (for example genomic polyA stretches at the 3' end) to classify them as a pseudogene. Can be further classified as one of the following: processed_pseudogene, polymorphic_pseudogene, retrotransposed, transcribed_processed_pseudogene, transcribed_unprocessed_pseudogene, transcribed_unitary_pseudogene, translated_processed_pseudogene, translated_unprocessed_pseudogene, unitary_pseudogene, unprocessed_pseudogene.
transcript_type "polymorphic_pseudogene", transcript_type "processed_pseudogene", transcript_type "retrotransposed", transcript_type "transcribed_processed_pseudogene", transcript_type "transcribed_unitary_pseudogene", transcript_type "transcribed_unprocessed_pseudogene", transcript_type "translated_processed_pseudogene", transcript_type "translated_unprocessed_pseudogene", transcript_type "unitary_pseudogene", transcript_type "unprocessed_pseudogene"
awk -F '\t' '$3=="exon"' ../gtf/gencode.gtf | grep -E 'transcript_type "polymorphic_pseudogene|transcript_type "processed_pseudogene|transcript_type "retrotransposed|transcript_type "transcribed_processed_pseudogene|transcript_type "transcribed_unitary_pseudogene|transcript_type "transcribed_unprocessed_pseudogene|transcript_type "translated_processed_pseudogene|transcript_type "translated_unprocessed_pseudogene|transcript_type "unitary_pseudogene|transcript_type "unprocessed_pseudogene' > pseudogene.gtf
awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' pseudogene.gtf | sed 's/gene_id //g' | tr -d '"' > pseudogene.bed
snoRNA
transcript_type "snoRNA"
awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep 'transcript_type "snoRNA' > snoRNA.gtf
awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' snoRNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/snoRNA.bed
snRNA
transcript_type "snRNA"
awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep 'transcript_type "snRNA' > snRNA.gtf
awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' snRNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/snRNA.bed
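The single-biotype recipes (snoRNA, snRNA and similar) share one shape: keep exon records, filter on transcript_type, convert to BED. One way to avoid repeating the command is a small helper function. The sketch below is an illustration only: the function name extract_biotype and the miniature GTF are hypothetical, and the filter closes the quote after the biotype name, so it matches that biotype exactly rather than by prefix.

```shell
tmpdir=$(mktemp -d)

# Hypothetical miniature GTF standing in for ../gtf/gencode.gtf.
{
  printf 'chr1\tHAVANA\texon\t11\t50\t.\t+\t.\tgene_id "G_SNO"; transcript_type "snoRNA";\n'
  printf 'chr1\tHAVANA\tgene\t11\t50\t.\t+\t.\tgene_id "G_SNO"; gene_type "snoRNA";\n'
  printf 'chr2\tHAVANA\texon\t101\t260\t.\t-\t.\tgene_id "G_SN"; transcript_type "snRNA";\n'
} > "$tmpdir/mini.gtf"

# extract_biotype GTF BIOTYPE OUT_BED
# Writes the exon records of one transcript_type as six-column BED.
extract_biotype() {
  awk -F "\t" '$3=="exon"' "$1" | grep "transcript_type \"$2\"" \
    | awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' \
    | sed 's/gene_id //g' | tr -d '"' > "$3"
}

for biotype in snoRNA snRNA; do
  extract_biotype "$tmpdir/mini.gtf" "$biotype" "$tmpdir/$biotype.bed"
done

sno_bed=$(cat "$tmpdir/snoRNA.bed")
sn_bed=$(cat "$tmpdir/snRNA.bed")
printf '%s\n%s\n' "$sno_bed" "$sn_bed"
rm -r "$tmpdir"
```

With the real annotation, the first argument would be ../gtf/gencode.gtf and the third a path under ../bed_by_biotype/.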
srpRNA
gene_name "RN7SL*"
awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep 'gene_name "RN7SL' > srpRNA.gtf
belong to misc_RNA
awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' srpRNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/srpRNA.bed
tRNA
gene_type "Pseudo_tRNA"
gencode_tRNA.gtf
awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' gencode_tRNA.gtf | sed 's/gene_id //g' | tr -d '"' | sed '1,5d' | grep -v chrM > ../bed_by_biotype/tRNA.bed
Y_RNA
gene_name "Y_RNA"
awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep 'gene_name "Y_RNA' > Y_RNA.gtf
belong to misc_RNA
awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' Y_RNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/Y_RNA.bed
misc (extract srpRNA, YRNA independently, the others may be ignored sometimes)
misc_RNA + ribozyme + scaRNA + vaultRNA/vault_RNA etc
misc_RNA: 7SK etc. (Xist etc. need to be removed)
tucpRNA
tcat "tucp"
awk -F "\t" '$3=="exon"' ../gtf/mitranscriptome.gtf | grep 'tcat "tucp' > tucpRNA.gtf
awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[3],0,$7}' tucpRNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/tucpRNA.bed
circRNA
Note: exon (for reference, not included)
$3=="exon"
awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep -v chrM > exon_gencode.gtf
awk -F "\t" '$3=="exon"' ../gtf/mitranscriptome.gtf | grep -v chrM > exon_mitranscriptome.gtf
cat exon_gencode.gtf exon_mitranscriptome.gtf > exon.gtf
awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' exon_gencode.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/exon_gencode.bed
awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[3],0,$7}' exon_mitranscriptome.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/exon_mitranscriptome.bed
cat ../bed_by_biotype/exon_gencode.bed ../bed_by_biotype/exon_mitranscriptome.bed > ../bed_by_biotype/exon.bed
3. More Potential Human RNAs
enhancer
repeats
promoter