Research

Thousands of missed genes found in bacterial genomes and their analysis with COMBREX

Derrick E Wood1,2,3*, Henry Lin2, Ami Levy-Moonshine4, Rajiswari Swaminathan4, Yi-Chien Chang5, Brian P Anton6, Lais Osmani4, Martin Steffen4,7, Simon Kasif4,5 and Steven L Salzberg3,8

  • * Corresponding author: Derrick E Wood

  • † Equal contributors

Author Affiliations

1 Department of Computer Science, University of Maryland, College Park, MD, 20742, USA

2 Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, 20742, USA

3 McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA

4 Department of Biomedical Engineering, Boston University, Boston, MA, 02215, USA

5 Bioinformatics Program, Boston University, Boston, MA, 02215, USA

6 New England Biolabs, 240 County Road, Ipswich, MA, 01938, USA

7 Department of Pathology and Laboratory Medicine, Boston University School of Medicine, Boston University, Boston, MA, 02218, USA

8 Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD, 21205, USA


Biology Direct 2012, 7:37  doi:10.1186/1745-6150-7-37

Published: 30 October 2012



The dramatic reduction in the cost of sequencing has allowed many researchers to join in the effort of sequencing and annotating prokaryotic genomes. Annotation methods vary considerably and may fail to identify some genes. Here, using common tools such as Glimmer and BLAST, we draw attention to a large number of likely genes that are missing from annotations.


By analyzing 1,474 prokaryotic genome annotations in GenBank, we identify 13,602 likely missed genes that are homologs of non-hypothetical proteins, and 11,792 likely missed genes that are homologs only of hypothetical proteins yet have supporting evidence of their protein-coding nature from COMBREX, a newly created gene function database. Using COMBREX, we also estimate the likelihood that each potential missed gene is a genuine protein-coding gene.


Our analysis of the causes of missed genes suggests that larger annotation centers tend to produce annotations with fewer missed genes than smaller centers do, and that many of the missed genes are short (<300 bp). Over 1,000 of the likely missed genes could be associated with phenotype information available in COMBREX. Of these, 359 genes found in pathogenic organisms may be potential targets for pharmaceutical research. The newly identified genes are available on COMBREX’s website.


This article was reviewed by Daniel Haft, Arcady Mushegian, and M. Pilar Francino (nominated by David Ardell).