2004 Fiscal Year Final Research Report Summary

Modeling and prediction of genome sequence information by using information representation models

Research Project

Project/Area Number	12208010
Research Category	Grant-in-Aid for Scientific Research on Priority Areas
Allocation Type	Single-year Grants
Review Section	Biological Sciences
Research Institution	Kyoto University (2003-2004), The University of Tokyo (2001-2002), The Institute of Physical and Chemical Research (2000)
Principal Investigator	YADA Tetsushi Kyoto University, Graduate School of Informatics, Associate Professor (10322728)
Co-Investigator(Kenkyū-buntansha)	ASAI Kiyoshi The University of Tokyo, Graduate School of Frontier Science, Professor (30356357)
Project Period (FY)	2000 – 2004
Keywords	bioinformatics / sequence analysis / gene finding / stochastic model / machine learning
Research Abstract	This research focused on gene models capable of finding genes in genome sequences. First, we developed a general-purpose algorithm that finds genes by combining multiple existing gene-finders, and implemented it in a novel gene-finder named DIGIT. The algorithm works as follows. First, existing gene-finders are applied to an uncharacterized genomic sequence (the input sequence). Next, DIGIT produces all possible exons from the gene-finders' results and assigns each an exon type, a reading frame and an exon score. Finally, DIGIT searches for the set of exons whose additive score is maximal under the reading-frame constraints. A Bayesian procedure is used to infer the exon scores, and a hidden Markov model (HMM) is used to search for the exon set. We designed DIGIT to combine the results of FGENESH, GENSCAN and HMMgene, and assessed its prediction accuracy on recently compiled benchmark data sets. For all data sets, DIGIT successfully discarded many false-positive exons predicted by individual gene-finders and yielded remarkable improvements in sensitivity and specificity at the gene level compared with the best gene-level accuracies achieved by any single gene-finder. Second, we developed a novel index that precisely derives protein-coding regions from cross-species genome alignments. The index is closely related to the frame recovery observed in coding sequence alignments: if an insertion or deletion of nucleotides causes a frame shift in a coding region, other indels that restore the reading frame are often observed in the vicinity. In contrast, such frame recoveries are not observed in other conserved regions.
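As an illustration of the final DIGIT step described in this abstract (searching for the set of exons whose additive score is maximal under reading-frame constraints), the following is a minimal sketch only. The report states that DIGIT performs this search with an HMM; the sketch swaps in a plain dynamic-programming chain over candidate exons. The `Exon` fields, the phase convention and the example scores are assumptions made for illustration, not DIGIT's actual data model.

```python
from dataclasses import dataclass
from typing import List, Tuple

NEG_INF = float("-inf")

@dataclass(frozen=True)
class Exon:
    start: int    # 0-based inclusive start on the input sequence
    end: int      # exclusive end
    phase: int    # bases (0-2) at the exon start that complete the previous codon
    score: float  # exon score (hypothetical values; the report infers them by a Bayesian procedure)

def leftover(exon: Exon) -> int:
    """Number of bases at the exon end that begin an unfinished codon."""
    return (exon.end - exon.start - exon.phase) % 3

def compatible(prev: Exon, nxt: Exon) -> bool:
    """nxt may follow prev if it lies downstream and completes prev's open codon."""
    return prev.end <= nxt.start and nxt.phase == (3 - leftover(prev)) % 3

def best_exon_chain(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Return the maximal additive score and one frame-consistent chain achieving it."""
    order = sorted(exons, key=lambda e: (e.end, e.start))
    best = [NEG_INF] * len(order)  # best[i]: best score of a chain ending at order[i]
    back = [-1] * len(order)       # back[i]: index of the previous exon in that chain
    for i, e in enumerate(order):
        if e.phase == 0:           # a chain may only start on a codon boundary
            best[i] = e.score
        for j in range(i):
            if best[j] > NEG_INF and compatible(order[j], e):
                if best[j] + e.score > best[i]:
                    best[i] = best[j] + e.score
                    back[i] = j
    # A complete chain must also end on a codon boundary.
    ends = [i for i, e in enumerate(order) if best[i] > NEG_INF and leftover(e) == 0]
    if not ends:
        return 0.0, []
    i = max(ends, key=lambda k: best[k])
    total, chain = best[i], []
    while i != -1:
        chain.append(order[i])
        i = back[i]
    return total, chain[::-1]

# Hypothetical candidate exons pooled from several gene-finders.
a = Exon(0, 10, 0, 2.0)
b = Exon(20, 31, 2, 3.0)
c = Exon(20, 31, 0, 5.0)  # high score, but its frame fits no complete chain
d = Exon(0, 9, 0, 1.0)
total, chain = best_exon_chain([a, b, c, d])
```

In this toy input the highest-scoring single exon `c` is dropped because no frame-consistent chain can end with it, which mirrors how a frame-constrained search can discard false-positive exons.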
We prepared two gene models: a basic model, which finds genes using sequence similarity and intrinsic gene measures, and a frame recovery model, which additionally uses the frame recovery index. We evaluated the prediction accuracy of both models, and our benchmark test showed that the frame recovery model significantly improved prediction accuracy over the basic model. Third, we developed GeneDecoder, an HMM-based gene-finding technology for eukaryotes. Using dynamic programming and statistical models trained on annotated genome sequences, the algorithm divides the input nucleic acid sequence into meaningful segments. GeneDecoder also has three additional features: (1) a multi-stream architecture, (2) incorporation of similarity search and (3) SVM-driven screening of putative splice sites. (1) In addition to nucleic acid sequences, GeneDecoder allows any other data streams to be added. Typically, dicodon bigram values can be calculated in advance and aligned on a 'Direct' stream, which makes the state transition networks much simpler. Any other meaningful features extracted in advance can be incorporated into the gene-finding process using this scheme. (2) Combining the calculation of coding potential with similarity search against a database of known sequences yields more reliable putative exons. For this purpose, GeneDecoder can both embed known motif models in exon models and use segments for which a BLAST search found similarity to known sequences. (3) The Support Vector Machine (SVM) is a pattern recognition technique known for its high classification capability and has been successfully applied to splice site prediction. GeneDecoder implements this feature alongside PWM-based splice site models.
During parsing, putative splice sites that are derived from the PWM-based models but are poorly supported by the SVMs designed as splice site classifiers are excluded.
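The splice site screening in feature (3) can be pictured as a two-stage filter: a PWM proposes candidate sites, and a second classifier vetoes the weakly supported ones. The sketch below is only an illustration under stated assumptions: the PWM values, the threshold and the `toy_decision` function are hypothetical. In particular, `toy_decision` is a hand-written stand-in for the decision value of a trained SVM and is not GeneDecoder's classifier.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical toy PWM for a 4-base window starting at a candidate donor "GT".
# Each column maps base -> probability; the values are illustrative, not trained.
PWM: List[Dict[str, float]] = [
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},
    {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},
    {"A": 0.60, "C": 0.05, "G": 0.30, "T": 0.05},
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
]
BACKGROUND = 0.25  # uniform background base frequency (assumption)

def pwm_score(window: str) -> float:
    """Log-odds score of a window; unknown bases contribute zero."""
    return sum(math.log(col.get(b, BACKGROUND) / BACKGROUND)
               for col, b in zip(PWM, window))

def candidate_sites(seq: str, threshold: float) -> List[Tuple[int, float]]:
    """Stage 1: positions whose PWM score reaches the threshold."""
    width = len(PWM)
    hits = []
    for i in range(len(seq) - width + 1):
        s = pwm_score(seq[i:i + width])
        if s >= threshold:
            hits.append((i, s))
    return hits

def screen(seq: str,
           hits: List[Tuple[int, float]],
           decision: Callable[[str, int], float],
           margin: float = 0.0) -> List[Tuple[int, float]]:
    """Stage 2: keep only candidates whose classifier decision value reaches the margin."""
    return [(i, s) for i, s in hits if decision(seq, i) >= margin]

def toy_decision(seq: str, pos: int) -> float:
    """Stand-in for an SVM decision value: positive if the exon side ends in 'AG'."""
    return 1.0 if seq[max(0, pos - 2):pos] == "AG" else -1.0

seq = "CCAGGTAAGTCCTTGTAACC"
hits = candidate_sites(seq, threshold=3.0)
kept = screen(seq, hits, toy_decision)
```

Here the PWM accepts two `GTAA` windows, and the second-stage check removes the one that lacks upstream support, so only candidates backed by both models would reach the parser.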

Research Products
(26 results)

Journal Article (25 results), Book (1 result)

[Journal Article] Genome sequencing and analysis of Aspergillus oryzae (2005)
- Author(s)
  Machida M., Asai K., et al.
- Journal Title
  
  Nature 438
  
  Pages: 1157-1161
- Description
  From "Research Report Summary (Japanese)"
[Journal Article] Sequencing of Aspergillus nidulans and comparative analysis with A. fumigatus and A. oryzae (2005)
- Author(s)
  Galagan JE, Calvo SE, Cuomo C, et al.
- Journal Title
  
  Nature 438
  
  Pages: 1105-1115
- Description
  From "Research Report Summary (Japanese)"
[Journal Article] Genome sequence of pathogenic and allergenic filamentous fungus Aspergillus fumigatus (2005)
- Author(s)
  Nierman WC, Pain A, Anderson MJ, et al.
- Journal Title
  
  Nature 438
  
  Pages: 1151-1155
- Description
  From "Research Report Summary (Japanese)"
[Journal Article] Extracting relations between promoter sequences and their strengths from microarray data (2005)
- Author(s)
  Kiryu, H., Oshima, T., Asai, K.
- Journal Title
  
  Bioinformatics 21
  
  Pages: 1062-1068
- Description
  From "Research Report Summary (Japanese)"
[Journal Article] Finishing the euchromatic sequence of the human genome (2004)
- Author(s)
  International Human Genome Sequencing Consortium
- Journal Title
  
  Nature 431
  
  Pages: 931-945
- Description
  From "Research Report Summary (Japanese)"
[Journal Article] Complete sequencing and characterization of 21,243 full-length human cDNAs (2004)
- Author(s)
  Ota, T., Suzuki, Y., Nishikawa, T., et al.
- Journal Title
  
  Nat. Genet. 36
  
  Pages: 40-45
- Description
  From "Research Report Summary (Japanese)"
[Journal Article] Minimizing the Cross Validation Error to Mix Kernel Matrices of Heterogeneous Biological Data (2004)
- Author(s)
  Tsuda, K., Uda, S., Kin, T., Asai, K.
- Journal Title
  
  Neural Processing Letters 19
  
  Pages: 63-72
- Description
  From "Research Report Summary (Japanese)"
[Journal Article] DIGIT: a novel gene finding program by combining gene-finders (2003)
- Author(s)
  Yada, T., Totoki, Y., Takaeda, Y., Sakaki, Y., Takagi, T.
- Journal Title
  
  Proc. of Pacific Symposium on Biocomputing '03
  
  Pages: 375-387
- Description
  From "Research Report Summary (Japanese)"
[Journal Article] Whole-genome screening indicates a possible burst of formation of processed pseudogenes and Alu repeats by particular L1 subfamilies in ancestral primates (2003)
- Author(s)
  Ohshima, K., Hattori, M., Yada, T., Gojobori, T., Sakaki, Y., Okada, N.
- Journal Title
  
  Genome Biol. 4
  
  Pages: R74
- Description
  From "Research Report Summary (Japanese)"
[Journal Article] DIGIT: a novel gene finding program by combining gene-finders (2003)
- Author(s)
  Yada, T., Totoki, Y., Takaeda, Y., Sakaki, Y., Takagi, T.
- Journal Title
  
  Proc. of Pacific Symposium on Biocomputing '03
  
  Pages: 375-387
- Description
  From "Research Report Summary (English)"
[Journal Article] Statistics for Biological Sequences (2003)
- Author(s)
  Kishino, H., Asai, K.
- Journal Title
  
  Iwanami Publisher
- Description
  From "Research Report Summary (English)"
[Journal Article] A novel index which precisely derives protein coding regions from cross-species genome alignments (2002)
- Author(s)
  Noguchi, H., Yada, T., Sakaki, Y.
- Journal Title
  
  Proc. of Genome Informatics Workshop 2002
  
  Pages: 183-191
- Description
  From "Research Report Summary (Japanese)"
[Journal Article] Marginalized kernels for biological sequences (2002)
- Author(s)
  Tsuda, K., Kin, T., Asai, K.
- Journal Title
  
  Bioinformatics 18
  
  Pages: 268S-275S
- Description
  From "Research Report Summary (Japanese)"
[Journal Article] Marginalized Kernels for RNA Sequence Data Analysis (2002)
- Author(s)
  Kin, T., Tsuda, K., Asai, K.
- Journal Title
  
  Proc. of Genome Informatics Workshop 2002
  
  Pages: 112-122
- Description
  From "Research Report Summary (Japanese)"
[Journal Article] Modeling Splicing Sites with Pairwise Correlations (2002)
- Author(s)
  Arita, M., Tsuda, K., Asai, K.
- Journal Title
  
  Bioinformatics 18
  
  Pages: 27S-34S
- Description
  From "Research Report Summary (Japanese)"
[Journal Article] Chromosome-wide assessment of replication timing for human chromosomes 11q and 21q: disease-related genes in timing-switch regions (2002)
- Author(s)
  Watanabe, Y., Fujiyama, A., Ichiba, Y., Hattori, M., Yada, T., Sakaki, Y., Ikemura, T.
- Journal Title
  
  Human Molecular Genetics 11
  
  Pages: 13-21
- Description
  From "Research Report Summary (Japanese)"
[Journal Article] A novel index which precisely derives protein coding regions from cross-species genome alignments (2002)
- Author(s)
  Noguchi, H., Yada, T., Sakaki, Y.
- Journal Title
  
  Proc. of Genome Informatics Workshop 2002
  
  Pages: 183-191
- Description
  From "Research Report Summary (English)"
[Journal Article] Marginalized Kernels for RNA Sequence Data Analysis (2002)
- Author(s)
  Kin, T., Tsuda, K., Asai, K.
- Journal Title
  
  Proc. of Genome Informatics Workshop 2002
  
  Pages: 112-122
- Description
  From "Research Report Summary (English)"
[Journal Article] Initial sequencing and analysis of the human genome (2001)
- Author(s)
  International Human Genome Sequencing Consortium
- Journal Title
  
  Nature 409
  
  Pages: 860-921
- Description
  From "Research Report Summary (Japanese)"
[Journal Article] A physical map of the human genome (2001)
- Author(s)
  The International Human Genome Mapping Consortium
- Journal Title
  
  Nature 409
  
  Pages: 934-941
- Description
  From "Research Report Summary (Japanese)"
[Journal Article] A novel bacterial gene-finding system with improved accuracy in locating start codons (2001)
- Author(s)
  Yada, T., Totoki, Y., Takagi, T., Nakai, K.
- Journal Title
  
  DNA Res. 8
  
  Pages: 97-106
- Description
  From "Research Report Summary (Japanese)"
[Journal Article] Differential display analysis of mutants for the transcription factor pdr1p regulating multidrug resistance in the budding yeast (2001)
- Author(s)
  Miura, F., Yada, T., Nakai, K., Sakaki, Y., Ito, T.
- Journal Title
  
  FEBS Letters 505
  
  Pages: 103-108
- Description
  From "Research Report Summary (Japanese)"
[Journal Article] Differential display analysis of mutants for the transcription factor pdr1p regulating multidrug resistance in the budding yeast (2001)
- Author(s)
  Miura, F., Yada, T., Nakai, K., Sakaki, Y., Ito, T.
- Journal Title
  
  FEBS Letters 505
  
  Pages: 103-108
- Description
  From "Research Report Summary (English)"
[Journal Article] The DNA sequence of human chromosome 21 (2000)
- Author(s)
  The Chromosome 21 Mapping and Sequencing Consortium
- Journal Title
  
  Nature 405
  
  Pages: 311-319
- Description
  From "Research Report Summary (Japanese)"
[Journal Article] The DNA sequence of human chromosome 21 (2000)
- Author(s)
  The Chromosome 21 Mapping and Sequencing Consortium
- Journal Title
  
  Nature 405
  
  Pages: 311-319
- Description
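  From "Research Report Summary (English)"
[Book] Frontiers of Statistical Science 9: Statistics for Biological Sequences: Reading Information from Nucleic Acids and Proteins (2003)
- Author(s)
  Kishino, H., Asai, K.
- Total Pages
  264
- Publisher
  Iwanami Shoten
- Description
  From "Research Report Summary (Japanese)"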
