Dataset Open Access

PhenoMiner database

Collier, Nigel

Phenotypes play a key role in inferring the complex relationships between genes and human heritable diseases. PhenoMiner is a research project aimed at the capture and encoding of phenotypes in the scientific literature. This should provide insights into the complex processes involved in human diseases as well as enabling semantic interoperability with existing biomedical ontologies such as those that describe human anatomy, genetics and behaviours.

The PhenoMiner database contains the results of an FP7 Marie Curie fellowship project on text/data-mining technology - natural language processing, machine learning and conceptual analysis. It builds on insights gained from semantic parsing to extract structured information about phenotypes from whole sentences - in contrast to existing techniques which often apply string matching. The system exploits the wealth of scientific data locked within the scientific literature in databases such as PubMed Central and Europe PMC to extract the semantic vocabulary of phenotypes that scientists use. The system will provide scientists, clinicians and informaticians with the data and tools they need to gain new insights into Mendelian diseases.

The database currently contains over 4800 phenotype terms automatically mined from full-text scientific articles and then associated with Online Mendelian Inheritance in Man (OMIM) disorders. All data is provided without manual filtering.

Please contact the author for further information and comments/suggestions.

- Nigel Collier

Files (14.7 MB)
  • Collier, N., Tran, M. V., Le, H. Q., Oellrich, A., Kawazoe, A., Hall-May, M. and Rebholz-Schuhmann, D. (2012), "A hybrid approach to finding phenotype candidates in genetic texts", in Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), Mumbai, India, December 10-14.

  • Collier, N., Oellrich, A. and Groza, T. (2013), "Toward knowledge support for analysis and interpretation of complex traits", Genome Biology 14(9):214.

  • Groza, T., Oellrich, A. and Collier, N. (2013), "Using silver and semi-gold standard corpora to compare open named entity recognisers", in 2013 IEEE International Conference on Bioinformatics and Biomedicine, IEEE BIBM 2013, pp. 481-485.

  • Collier, N., Tran, M., Le, H., Ha, Q., Oellrich, A. and Rebholz-Schuhmann, D. (2013), "Learning to recognize phenotype candidates in the auto-immune literature using SVM re-ranking", PLoS One 8(10): e72965.

  • Collier, N., Paster, F. and Tran, M. V. (2014), "The impact of near domain transfer on biomedical named entity recognition", in Proceedings of the 5th International Workshop on Health Text Mining and Information Analysis (Louhi) at EACL, pp. 11-20.

  • Collier, N., Oellrich, A. and Groza, T. (2014), "Concept selection for phenotypes and disease-related annotations using support vector machines", in Proc. PhenoDay and Bio-Ontologies at ISMB 2014.
