Detection of protein coding sequences using a mixture model for local protein amino acid sequence

Title: Detection of protein coding sequences using a mixture model for local protein amino acid sequence
Publication Type: Journal Article
Year of Publication: 2000
Authors: Thayer, E. C., Bystroff, C., & Baker, D.
Journal: Journal of computational biology : a journal of computational molecular cell biology
Date Published: 2000 Feb-Apr
Keywords: Algorithms; Amino Acid Sequence; Biometry; DNA; DNA, Fungal; Fungal Proteins; Genome, Fungal; Humans; Models, Genetic; Primary Publication; Proteins; Saccharomyces cerevisiae; Sequence Analysis, Protein

Locating protein coding regions in genomic DNA is a critical step in accessing the information generated by large-scale sequencing projects. Current methods for gene detection depend on statistical measures of content differences between coding and noncoding DNA, in addition to the recognition of promoters, splice sites, and other regulatory sites. Here we explore the potential value of recurrent amino acid sequence patterns 3-19 amino acids in length as a content statistic for use in gene-finding approaches. A finite mixture model incorporating these patterns can partially discriminate protein sequences that have no detectable known homologs from randomized versions of these sequences, and from short (≤ 50 amino acids) noncoding segments extracted from the S. cerevisiae genome. The mixture-model-derived scores for a collection of human exons were not correlated with the GENSCAN scores, suggesting that the addition of our protein pattern recognition module to current gene recognition programs may improve their performance.
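
To make the idea of a mixture-model content statistic concrete, the following is a minimal sketch, not the authors' model. It assumes a toy mixture of two hand-written position-specific profiles of width 3 (the shortest pattern length mentioned above), a uniform amino acid background, and a sequence score defined as the mean window log-odds. The names `make_profile`, `window_log_odds`, `sequence_score`, and `TOY_COMPONENTS`, and all numeric values, are hypothetical.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: uniform background; a real model would use observed frequencies.
BACKGROUND = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}


def make_profile(consensus, peak=0.6):
    """Toy position-specific profile that favors the residues in `consensus`."""
    rest = (1.0 - peak) / (len(AMINO_ACIDS) - 1)
    return [{a: (peak if a == c else rest) for a in AMINO_ACIDS} for c in consensus]


# Hypothetical mixture: (mixing weight, profile) pairs; weights sum to 1.
TOY_COMPONENTS = [(0.5, make_profile("GDS")), (0.5, make_profile("KLE"))]


def window_log_odds(window, components):
    """log P(window | mixture) - log P(window | background)."""
    mixture_likelihood = 0.0
    for weight, profile in components:
        p = weight
        for residue, column in zip(window, profile):
            p *= column[residue]
        mixture_likelihood += p
    background_log = sum(math.log(BACKGROUND[r]) for r in window)
    return math.log(mixture_likelihood) - background_log


def sequence_score(seq, components, width=3):
    """Mean log-odds over all overlapping windows of the given width."""
    windows = [seq[i:i + width] for i in range(len(seq) - width + 1)]
    return sum(window_log_odds(w, components) for w in windows) / len(windows)


def randomized(seq, seed=0):
    """Shuffle residues, preserving composition but destroying local patterns."""
    residues = list(seq)
    random.Random(seed).shuffle(residues)
    return "".join(residues)


pattern_rich = "GDSKLEGDSKLE"
score_original = sequence_score(pattern_rich, TOY_COMPONENTS)
score_randomized = sequence_score(randomized(pattern_rich), TOY_COMPONENTS)
```

Comparing `score_original` with `score_randomized` mirrors, in miniature, the discrimination test described in the abstract: both sequences have the same composition, so any difference in score comes from local residue order alone.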

Alternate Journal: J. Comput. Biol.
Attachment: thayer00A.pdf (244.63 KB)