RNA Types in Genome
Files curated and organized by Yuhuan
Reference bed files: /BioII/lulab_b/taoyuhuan/RNA_biotype/bed_by_biotype
0. Total RNA Clean Reads vs. Human Transcriptome
I. Total RNA Clean Reads (rRNA depleted) = Clean Reads (after removing adaptors and short reads) - spikeIn - Univec - nucleus_rRNA - Mt_rRNA
(1) Mapped to Human Nucleus Genome
(2) Mapped to MT Genome
(3) Mapped to Microbial Genomes
(4) Unmapped
II. Human Transcriptome: see sections 2-3 below
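
The subtraction in item I can be sketched as plain shell arithmetic. The function name `total_rna_clean_reads` and every count below are hypothetical illustration values, not results from any real sample.

```shell
#!/usr/bin/env bash
# Hypothetical helper: subtract the reads removed at each filtering step
# (spike-in, UniVec, nuclear rRNA, mitochondrial rRNA) from the clean-read count.
total_rna_clean_reads() {
    local clean="$1" spikein="$2" univec="$3" nucleus_rrna="$4" mt_rrna="$5"
    echo $(( clean - spikein - univec - nucleus_rrna - mt_rrna ))
}

# Invented counts, for illustration only.
total_rna_clean_reads 1000000 5000 2000 300000 50000
```

The resulting number is the pool that is then split into the four categories (1)-(4) above, so those four counts should add up to it.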
1. Mitochondrial Transcriptome
Mitochondrial_Transcriptome | Definition | Biotype or name annotation in gtf | Command for gtf_by_biotype | Note | Command for gtf2bed | ||||
---|---|---|---|---|---|---|---|---|---|
MT_tRNA | transcript_type "Mt_tRNA" | awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep 'transcript_type "Mt_tRNA' > MT_tRNA.gtf | awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' MT_tRNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/MT_tRNA.bed | ||||||
MT_lncRNA | transcript_type "Mt_lncRNA" | awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep -E 'transcript_type "3prime_overlapping_ncRNA|transcript_type "antisense|transcript_type "bidirectional_promoter_lncRNA|transcript_type "lincRNA|transcript_type "macro_lncRNA|transcript_type "non_coding|transcript_type "processed_transcript|transcript_type "sense_intronic|transcript_type "sense_overlapping' | grep chrM > MT_lncRNA_gencode.gtf; awk -F "\t" '$3=="exon"' ../gtf/mitranscriptome.gtf | grep 'tcat "lncrna' | grep chrM > MT_lncRNA_mitranscriptome.gtf; cat MT_lncRNA_gencode.gtf MT_lncRNA_mitranscriptome.gtf > MT_lncRNA.gtf | awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' MT_lncRNA_gencode.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/MT_lncRNA_gencode.bed; awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[3],0,$7}' MT_lncRNA_mitranscriptome.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/MT_lncRNA_mitranscriptome.bed; cat ../bed_by_biotype/MT_lncRNA_gencode.bed ../bed_by_biotype/MT_lncRNA_mitranscriptome.bed > ../bed_by_biotype/MT_lncRNA.bed | ||
MT_mRNA | transcript_type "Mt_mRNA" | awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep 'transcript_type "protein_coding' | grep chrM > MT_mRNA.gtf | awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' MT_mRNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/MT_mRNA.bed | ||||||
Note: MT_exon (for reference, not included) | $3=="exon" | awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep chrM > MT_exon_gencode.gtf; awk -F "\t" '$3=="exon"' ../gtf/mitranscriptome.gtf | grep chrM > MT_exon_mitranscriptome.gtf; cat MT_exon_gencode.gtf MT_exon_mitranscriptome.gtf > MT_exon.gtf | tcat "mixed_read_through" | awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' MT_exon_gencode.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/MT_exon_gencode.bed; awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[3],0,$7}' MT_exon_mitranscriptome.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/MT_exon_mitranscriptome.bed; cat ../bed_by_biotype/MT_exon_gencode.bed ../bed_by_biotype/MT_exon_mitranscriptome.bed > ../bed_by_biotype/MT_exon.bed | ||
Note: MT_rRNA (needs to be removed) | transcript_type "Mt_rRNA" | awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep 'transcript_type "Mt_rRNA' > MT_rRNA.gtf | awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' MT_rRNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/MT_rRNA.bed |
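
Every "Command for gtf2bed" cell in the table repeats the same awk/sed/tr pipeline. It shifts the 1-based GTF start to a 0-based BED start (`$4-1`), takes one `;`-separated attribute as the name, and strips the `gene_id ` prefix and the quotes. A minimal sketch of that pipeline as a reusable function follows. The name `gtf2bed`, its second argument, and the toy record are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# gtf2bed: hypothetical wrapper around the awk | sed | tr pipeline used in the table.
#   $1 = GTF file
#   $2 = index of the ';'-separated attribute that holds the ID
#        (the table uses 1 for gencode.gtf and 3 for mitranscriptome.gtf)
gtf2bed() {
    local gtf="$1" idx="${2:-1}"
    awk -v i="$idx" 'BEGIN{FS="\t";OFS="\t"}
        {split($9,a,";"); print $1,$4-1,$5,a[i],0,$7}' "$gtf" |
        sed 's/gene_id //g' | tr -d '"'
}

# Toy one-record GTF (invented coordinates and ID, not taken from GENCODE).
toy=$(mktemp)
printf 'chrM\tTOY\texon\t101\t170\t.\t+\t.\tgene_id "GENE1"; transcript_type "Mt_tRNA";\n' > "$toy"

# Prints the six BED columns: chrom, start (0-based), end, name, score, strand.
gtf2bed "$toy" 1
rm -f "$toy"
```

With the function in place, a row such as MT_tRNA would reduce to `gtf2bed MT_tRNA.gtf 1 > ../bed_by_biotype/MT_tRNA.bed`.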
2. Human Transcriptome for common usage
Human_Transcriptome for common usage | Definition | Biotype or name annotation in gtf | Command for gtf_by_biotype | Note | Command for gtf2bed | ||||
---|---|---|---|---|---|---|---|---|---|
lncRNA | Generic long non-coding RNA biotype that replaced the following biotypes: 3prime_overlapping_ncRNA, antisense, bidirectional_promoter_lncRNA, lincRNA, macro_lncRNA, non_coding, processed_transcript, sense_intronic and sense_overlapping. | transcript_type "3prime_overlapping_ncRNA", transcript_type "antisense", transcript_type "bidirectional_promoter_lncRNA", transcript_type "lincRNA", transcript_type "macro_lncRNA", transcript_type "non_coding", transcript_type "processed_transcript", transcript_type "sense_intronic", transcript_type "sense_overlapping", tcat "lncrna" | awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep -E 'transcript_type "3prime_overlapping_ncRNA|transcript_type "antisense|transcript_type "bidirectional_promoter_lncRNA|transcript_type "lincRNA|transcript_type "macro_lncRNA|transcript_type "non_coding|transcript_type "processed_transcript|transcript_type "sense_intronic|transcript_type "sense_overlapping' | grep -v chrM > lncRNA_gencode.gtf; awk -F "\t" '$3=="exon"' ../gtf/mitranscriptome.gtf | grep 'tcat "lncrna' | grep -v chrM > lncRNA_mitranscriptome.gtf; cat lncRNA_gencode.gtf lncRNA_mitranscriptome.gtf > lncRNA.gtf | awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' lncRNA_gencode.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/lncRNA_gencode.bed; awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[3],0,$7}' lncRNA_mitranscriptome.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/lncRNA_mitranscriptome.bed; cat ../bed_by_biotype/lncRNA_gencode.bed ../bed_by_biotype/lncRNA_mitranscriptome.bed > ../bed_by_biotype/lncRNA.bed | |
mRNA | transcript_type "protein_coding" | awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep 'transcript_type "protein_coding' | grep -v chrM > mRNA.gtf | awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' mRNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/mRNA.bed | ||||||
pseudogene | Have homology to proteins but generally suffer from a disrupted coding sequence and an active homologous gene can be found at another locus. Sometimes these entries have an intact coding sequence or an open but truncated ORF, in which case there is other evidence used (for example genomic polyA stretches at the 3' end) to classify them as a pseudogene. Can be further classified as one of the following: processed_pseudogene, polymorphic_pseudogene, retrotransposed, transcribed_processed_pseudogene, transcribed_unprocessed_pseudogene, transcribed_unitary_pseudogene, translated_processed_pseudogene, translated_unprocessed_pseudogene, unitary_pseudogene, unprocessed_pseudogene. | transcript_type "polymorphic_pseudogene", transcript_type "processed_pseudogene", transcript_type "retrotransposed", transcript_type "transcribed_processed_pseudogene", transcript_type "transcribed_unitary_pseudogene", transcript_type "transcribed_unprocessed_pseudogene", transcript_type "translated_processed_pseudogene", transcript_type "translated_unprocessed_pseudogene", transcript_type "unitary_pseudogene", transcript_type "unprocessed_pseudogene" | awk -F '\t' '$3=="exon"' ../gtf/gencode.gtf | grep -E 'transcript_type "polymorphic_pseudogene|transcript_type "processed_pseudogene|transcript_type "retrotransposed|transcript_type "transcribed_processed_pseudogene|transcript_type "transcribed_unitary_pseudogene|transcript_type "transcribed_unprocessed_pseudogene|transcript_type "translated_processed_pseudogene|transcript_type "translated_unprocessed_pseudogene|transcript_type "unitary_pseudogene|transcript_type "unprocessed_pseudogene' > pseudogene.gtf | awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' pseudogene.gtf | sed 's/gene_id //g' | tr -d '"' > pseudogene.bed | |||||
snoRNA | transcript_type "snoRNA" | awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep 'transcript_type "snoRNA' > snoRNA.gtf | awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' snoRNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/snoRNA.bed | ||||||
snRNA | transcript_type "snRNA" | awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep 'transcript_type "snRNA' > snRNA.gtf | awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' snRNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/snRNA.bed | ||||||
srpRNA | gene_name "RN7SL*" | awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep 'gene_name "RN7SL' > srpRNA.gtf | belongs to misc_RNA | awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' srpRNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/srpRNA.bed | |||||
tRNA | gene_type "Pseudo_tRNA" | gencode_tRNA.gtf | awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' gencode_tRNA.gtf | sed 's/gene_id //g' | tr -d '"' | sed '1,5d' | grep -v chrM > ../bed_by_biotype/tRNA.bed | ||||||
Y_RNA | gene_name "Y_RNA" | awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep 'gene_name "Y_RNA' > Y_RNA.gtf | belongs to misc_RNA | awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' Y_RNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/Y_RNA.bed | |||||
misc (srpRNA and Y_RNA are extracted independently; the others may sometimes be ignored) | misc_RNA + ribozyme + scaRNA + vaultRNA/vault_RNA etc. | misc_RNA: 7SK etc. (need to remove XIST etc.) | - | - | - | ||||
tucpRNA | tcat "tucp" | awk -F "\t" '$3=="exon"' ../gtf/mitranscriptome.gtf | grep 'tcat "tucp' > tucpRNA.gtf | awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[3],0,$7}' tucpRNA.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/tucpRNA.bed | ||||||
circRNA | |||||||||
Note: exon (for reference, not included) | $3=="exon" | awk -F "\t" '$3=="exon"' ../gtf/gencode.gtf | grep -v chrM > exon_gencode.gtf; awk -F "\t" '$3=="exon"' ../gtf/mitranscriptome.gtf | grep -v chrM > exon_mitranscriptome.gtf; cat exon_gencode.gtf exon_mitranscriptome.gtf > exon.gtf | awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[1],0,$7}' exon_gencode.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/exon_gencode.bed; awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";");print $1,$4-1,$5,a[3],0,$7}' exon_mitranscriptome.gtf | sed 's/gene_id //g' | tr -d '"' > ../bed_by_biotype/exon_mitranscriptome.bed; cat ../bed_by_biotype/exon_gencode.bed ../bed_by_biotype/exon_mitranscriptome.bed > ../bed_by_biotype/exon.bed |
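
Several rows above select many biotypes at once with a `|`-separated pattern. In grep, `|` acts as alternation only under extended regular expressions (`grep -E`); in the default basic syntax it is a literal character. The sketch below builds such a pattern from a list of biotype names. The function name `filter_biotypes` and the toy records are hypothetical. Like the commands in the table, it matches on the prefix `transcript_type "<biotype>`.

```shell
#!/usr/bin/env bash
# filter_biotypes: hypothetical helper that keeps exon records whose
# transcript_type starts with any of the given biotype names.
#   $1 = GTF file, remaining arguments = biotype names
filter_biotypes() {
    local gtf="$1"; shift
    local pattern
    # Join the names with '|' and drop the trailing '|'.
    pattern=$(printf '%s|' "$@")
    pattern="transcript_type \"(${pattern%|})"
    awk -F '\t' '$3=="exon"' "$gtf" | grep -E "$pattern"
}

# Toy GTF with invented records, for illustration only.
toy=$(mktemp)
{
    printf 'chr1\tTOY\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_type "lincRNA";\n'
    printf 'chr1\tTOY\texon\t300\t400\t.\t-\t.\tgene_id "G2"; transcript_type "protein_coding";\n'
    printf 'chr1\tTOY\tgene\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_type "lincRNA";\n'
    printf 'chr2\tTOY\texon\t500\t600\t.\t+\t.\tgene_id "G3"; transcript_type "antisense";\n'
} > "$toy"

# Keeps the two exon records for G1 and G3; the gene record and G2 are dropped.
filter_biotypes "$toy" lincRNA antisense
rm -f "$toy"
```

Appending `| grep -v chrM` (or `| grep chrM`) to the call reproduces the nuclear versus mitochondrial split used in sections 1 and 2.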
3. More Potential Human RNAs
More Potential Human RNAs | Definition | Biotype or name annotation in gtf | Command for gtf_by_biotype | Note | Command for gtf2bed |
---|---|---|---|---|---|
enhancer | |||||
repeats | |||||
promoter |