Published September 24, 2016 | Version v1
Journal article Open

COMPARISON OF POPULAR BIOINFORMATICS DATABASES

  • 1. Bioresources Development Centre, Kano National Biotechnology Development Agency (NABDA), Abuja - Nigeria
  • 2. Faculty of Informatics and Computing, University Sultan Zainal Abidin (UniSZA), Terengganu, Malaysia

Description

Bioinformatics is the application of computational tools to capture and interpret biological data. It has wide applications in drug development, crop improvement, agricultural biotechnology, and forensic DNA analysis. Various databases are available to bioinformatics researchers. Each is customized for a specific need, and they range in size, scope, and purpose. The main drawbacks of bioinformatics databases include redundant information, constant change, data spread over multiple databases, incomplete information, errors, and sometimes incorrect links. In addition, database standards, naming conventions, and nomenclature are not clearly defined for many aspects of biological information. Together, these problems make information extraction more difficult. This paper presents the most widely used bioinformatics databases, which are notable for their level of redundancy and annotation, structure coverage, and accessibility: GenBank, Protein Information Resource (PIR), DNA Data Bank of Japan (DDBJ), European Molecular Biology Laboratory (EMBL), Protein Data Bank (PDB), Universal Protein Resource (UniProt), Swiss-Prot, Structural Classification of Proteins (SCOP), and Class, Architecture, Topology, Homology (CATH). The key features of each database are described, the databases are compared in detail according to whether they are primary or secondary databases, and the unique characteristics of each are highlighted. These databases are foundation stones of bioinformatics and are useful for rigorous benchmarking.

Files

4.pdf (588.9 kB)
md5:e5d3b1444e7bc4c5ae0788dc501dcdaa
