Abstract:
Background: MicroRNAs (miRNAs) participate in
diverse cellular and physiological processes through the post-transcriptional
gene regulatory pathway. The hairpin is a crucial structural feature for the
computational identification of precursor miRNAs (pre-miRs), as its formation is
critically associated with the early stages of mature miRNA biogenesis. Our
incomplete knowledge of the number of miRNAs present in the genomes of
vertebrates, worms, plants, and even viruses necessitates a thorough understanding
of their sequence motifs, hairpin structural characteristics, and topological
descriptors. Such findings will support more accurate guidelines and more
distinctive criteria for the prediction of novel pre-miRs with improved
performance.
Results: In this in-depth study, we investigate a
comprehensive and heterogeneous collection of 2241 published (non-redundant)
pre-miRs across 41 species (miRBase 8.2), 8494 pseudo hairpins extracted from
the human RefSeq genes, 12387 (non-redundant) ncRNAs spanning 457 types (Rfam
7.0), 31 full-length mRNAs randomly selected from GenBank, and four sets of
synthetically generated genomic background corresponding to each of the native
RNA sequences. Our large-scale characterization analysis reveals that pre-miRs
are significantly different from other types of ncRNAs, pseudo hairpins, mRNAs,
and genomic background according to the non-parametric Kruskal-Wallis ANOVA (p
< 0.001). We examine the intrinsic and global features at the sequence,
structural, and topological levels including %G+C content, normalized base
pairing propensity P(S), normalized Minimum Free Energy of folding MFE(s),
normalized Shannon Entropy Q(s), normalized base pair distance D(s), and degree
of compactness F(S), as well as their corresponding Z-scores of P(S), MFE(s),
Q(s), D(s), and F(S).
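
As a rough illustration of the simpler features listed above, the following minimal Python sketch computes %G+C content, a length-normalized base pairing propensity from a dot-bracket structure, and a Z-score against a set of background values. The function names, the dot-bracket input, and the definitions used here (base pairs divided by sequence length; Z-score as the observed value minus the background mean, divided by the background sample standard deviation) are assumptions made for illustration and are not taken from this abstract. The structure and background values are expected to come from an external folding and shuffling step that is not shown.

```python
import statistics


def gc_content(seq):
    """Return the %G+C content of an RNA or DNA sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def pairing_propensity(dot_bracket):
    """Return base pairs per nucleotide for a dot-bracket structure.

    Assumed definition: number of base pairs divided by sequence length.
    Each '(' in the dot-bracket string opens exactly one base pair.
    """
    return dot_bracket.count("(") / len(dot_bracket)


def z_score(observed, background):
    """Return the Z-score of an observed feature value.

    `background` holds the same feature computed on control sequences
    (for example, shuffled versions of the native sequence). The sample
    standard deviation is used; a constant background yields 0.0.
    """
    mu = statistics.mean(background)
    sd = statistics.stdev(background)
    return (observed - mu) / sd if sd > 0 else 0.0


if __name__ == "__main__":
    # Toy hairpin, hypothetical and not drawn from any database.
    seq = "GGCCAAAAGGCC"
    structure = "((((....))))"
    print(gc_content(seq))
    print(pairing_propensity(structure))
    print(z_score(3.0, [1.0, 2.0, 3.0]))
```

The thermodynamic and ensemble features (MFE, Shannon entropy, base pair distance) would need an RNA folding program and are deliberately left out of this sketch.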
Conclusions: A definitive criterion for accurately
identifying and classifying promising precursor transcripts as bona
fide pre-miRs within a single genome has not yet been discovered. Moreover,
the discriminative features used in existing (quasi) de novo classifiers
have yielded specificity and sensitivity that are far from satisfactory.
Our findings have been incorporated into the development of a new,
better-performing de novo classifier that is wholly independent of
phylogenetic conservation.
Keywords: precursor microRNAs; Minimum Free Energy of folding; Shannon Entropy; Z-scores; second eigenvalue