Research Abstract
In this research, we focused on gene models capable of finding genes in genome sequences. First, we developed a general-purpose algorithm that finds genes by combining multiple existing gene-finders. The algorithm has been implemented in a novel gene-finder named DIGIT. An outline of the algorithm is as follows. First, existing gene-finders are applied to an uncharacterized genomic sequence (the input sequence). Next, DIGIT produces all possible exons from the results of the gene-finders and assigns each an exon type, a reading frame and an exon score. Finally, DIGIT searches for the set of exons whose additive score is maximized under their reading-frame constraints. A Bayesian procedure and a hidden Markov model (HMM) are used to infer the exon scores and to search for the exon set, respectively. We designed DIGIT to combine the results of FGENESH, GENSCAN and HMMgene, and assessed its prediction accuracy using recently compiled benchmark data sets. For all data sets,
DIGIT successfully discarded many false-positive exons predicted by the individual gene-finders and yielded remarkable improvements in sensitivity and specificity at the gene level compared with the best gene-level accuracies achieved by any single gene-finder. Second, we developed a novel index that precisely derives protein-coding regions from cross-species genome alignments. The index is closely related to the frame recovery observed in coding sequence alignments: if insertions or deletions of nucleotides cause frame shifts in coding regions, other indels that restore the reading frame are often observed in the vicinity. In contrast, such frame recoveries are not observed in other conserved regions. We prepared two gene models: a model that finds genes using sequence similarity and intrinsic gene measures (the basic model), and a model that finds genes using the frame recovery index in addition to sequence similarity and intrinsic gene measures (the frame recovery model). We evaluated the prediction accuracies of the two models, and our benchmark test revealed that the frame recovery model significantly improved prediction accuracy compared with the basic model. Third, we developed GeneDecoder, an HMM-based gene-finding technology for eukaryotes. Using dynamic programming and statistical models trained on annotated genome sequences, the algorithm divides the input nucleic acid sequence into meaningful segments. In addition, GeneDecoder has several further features: (1) a multi-stream architecture, (2) incorporation of similarity search and (3) SVM-driven screening of putative splice sites. (1) In addition to nucleic acid sequences, GeneDecoder allows any other data streams to be added. Typically, dicodon bigram values can be calculated in advance and aligned on a 'Direct' stream, which makes the state transition networks much simpler. Any other meaningful features extracted in advance can be incorporated into the gene-finding process using this scheme. (2) Combining the calculation of coding potential with similarity search against a database of known sequences yields more reliable putative exons. For this purpose, GeneDecoder can both embed known motif models in exon models and use segments for which similarity to known sequences was found by BLAST search. (3) The support vector machine (SVM) is a pattern recognition technique known for its high classification capability and has been successfully applied to splice site prediction. In GeneDecoder, this feature is implemented alongside PWM-based splice site models. While parsing, putative splice sites that are derived from the PWM-based models but poorly supported by the SVMs designed as splice site classifiers are excluded.
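As an illustration of the exon-selection step outlined for DIGIT (choosing the set of candidate exons whose additive score is maximal under reading-frame constraints), the following is a minimal sketch and not DIGIT's implementation. It swaps in plain dynamic programming over sorted exons for the HMM-based search, ignores exon types and strands, and takes exon scores as given rather than inferring them with a Bayesian procedure. The names `Exon`, `start_phase`, `compatible` and `best_chain` are hypothetical and introduced here only for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Exon:
    """Hypothetical candidate exon on one strand (coordinates are half-open)."""
    start: int        # 0-based, inclusive
    end: int          # exclusive; assumed greater than start
    start_phase: int  # leading bases that complete a codon begun upstream (0, 1 or 2)
    score: float      # exon score, assumed to be given

    @property
    def end_remainder(self) -> int:
        # Trailing bases that begin a codon left incomplete at the exon end.
        return (self.end - self.start - self.start_phase) % 3


def compatible(prev: Exon, nxt: Exon) -> bool:
    """True if nxt can follow prev: no overlap and the reading frame is preserved."""
    in_order = prev.end <= nxt.start
    in_frame = (prev.end_remainder + nxt.start_phase) % 3 == 0
    return in_order and in_frame


def best_chain(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Return the maximum additive score and one frame-consistent chain achieving it."""
    if not exons:
        return 0.0, []
    # Sorting by end guarantees every valid predecessor appears earlier in the list.
    order = sorted(exons, key=lambda e: (e.end, e.start))
    best: List[float] = []          # best[i]: top score of a chain ending at order[i]
    back: List[Optional[int]] = []  # back[i]: index of the predecessor in that chain
    for i, exon in enumerate(order):
        top, prev_index = exon.score, None
        for j in range(i):
            candidate = best[j] + exon.score
            if compatible(order[j], exon) and candidate > top:
                top, prev_index = candidate, j
        best.append(top)
        back.append(prev_index)
    # Trace back from the highest-scoring chain end.
    i: Optional[int] = max(range(len(order)), key=best.__getitem__)
    total = best[i]
    chain: List[Exon] = []
    while i is not None:
        chain.append(order[i])
        i = back[i]
    return total, chain[::-1]
```

With four toy candidates, `Exon(0, 10, 0, 5.0)`, `Exon(20, 30, 2, 4.0)`, `Exon(20, 32, 0, 6.0)` and `Exon(40, 50, 1, 3.0)`, the sketch selects the first, second and fourth exons. The third exon scores higher than the second, but it is rejected because it would break the reading frame carried over from the first exon.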