Published January 18, 2020 | Version v1
Dataset Open

NCI Semantic Competency Query Review

  • 1. Oregon Health & Science University
  • 2. RENCI, University of North Carolina
  • 3. National Cancer Institute
  • 4. Johns Hopkins University
  • 5. University of Pittsburgh
  • 6. Oregon State University
  • 7. University of Chicago

Description

Overview

NCI held a Workshop on Semantics to support the NCI Cancer Research Data Commons (CRDC) in May 2018 at the National Cancer Institute in Rockville, MD. This workshop brought together experts in various areas of semantics, data integration and harmonization, Natural Language Processing (NLP) and other relevant areas to discuss and gather recommendations on semantic support for the CRDC.

The workshop goals were to:
1) Identify high-level requirements to address semantic needs and potential approaches for evaluation testing of the Cancer Data Aggregator (CDA)

2) Identify a set of options for using and/or extending current methods and resources (e.g., NLP) to:

  • support semantic query capabilities

  • facilitate metadata annotation

  • minimize efforts for data validation and submission

3) Develop recommendations to support ongoing engagement with the community to ensure the semantics underlying the CDA improve and evolve as people contribute to and use the CDA

Participants

In total, 33 participants attended the meeting, including clinicians, ontologists, bioinformaticians, data scientists, and project managers. Participants had expertise in semantic technologies, software and infrastructure development, data standards, data integration, clinical research, and open-source tool development.

Competency Queries

At the workshop, participants were asked to brainstorm ‘competency queries’: potential queries or questions they would ask the future Cancer Data Aggregator (CDA) in order to retrieve data from across the CRDC. The breakout groups documented 237 queries for the CDA.

Competency queries are often used to inform requirements for building a data model and/or ontology. They can help inform: 1) the scope of the model: what queries the model should support; 2) the content of the model, in terms of what entity types and attributes are needed to answer these queries; 3) the structure of the model, in terms of what relationships between entities are needed to efficiently answer queries; 4) the semantics of the data, meaning which terminologies/ontologies would be useful for representing the data to support query needs; and 5) how to test and improve a completed model to ensure it can efficiently support the queries determined to be in scope.

After the workshop, a small subgroup assessed, organized, and summarized the 237 queries noted at the workshop. A spreadsheet was created containing the queries, and the keywords in each query were highlighted. From the highlighted keywords, a column was added to capture the core search parameters or classifications for each query. In evaluating the queries, it was observed that some were not actually queries but rather observations about the data that one might hope to make, for example, “Patients with a certain temporal pattern of diagnoses, both cancer and comorbidities”. Such a submission indicates that the returned data would need to include diagnoses and other conditions, and it can help inform the requirements for CRDC data models.

Using the information from the keyword analysis, the queries were initially categorized across various classifications, such as queries that included exposure information, diagnosis or cancer type, anatomical location of the tumor, etc. In total, the queries were classified among 25 different parameters, plus an ‘other’ category for queries that did not fit the classification scheme or were out of scope. To reduce this to a more manageable list, 82 representative queries were pulled out, with at least two examples from every classification parameter; the goal was to identify a minimal, or at least smaller, subset that was still representative. This list of 82 queries was then reviewed with a larger group of experts and further refined. Additional classification parameters were added, for a total of 31 parameters plus an ‘other’ category, and some classifications were subdivided into more granular ones; for example, treatment was subdivided into surgery/radiation and protocols/regimens.

In classifying each query, the exact words that fit the classification scheme were noted. For example, consider the query, “What environmental exposures are typically associated with the development of salivary gland cancer?”; this query is classified as an exposure (environmental exposure), a diagnosis or specific cancer type (salivary gland cancer), and a tumor location (salivary gland).
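The keyword-based classification described above can be sketched as a simple lookup. This is a minimal illustration only: the parameter names and keyword lists below are hypothetical simplifications, not the actual 31-parameter scheme used by the subgroup.

```python
# Minimal sketch of keyword-based query classification.
# The parameters and keyword lists are illustrative, not the real scheme.
CLASSIFICATION_KEYWORDS = {
    "exposure": ["environmental exposure", "smoking"],
    "diagnosis_or_cancer_type": ["cancer", "carcinoma", "diagnosis"],
    "tumor_location": ["salivary gland", "lung", "breast"],
}

def classify_query(query: str) -> dict:
    """Return, per classification parameter, the exact query words that matched."""
    q = query.lower()
    matches = {}
    for parameter, keywords in CLASSIFICATION_KEYWORDS.items():
        hits = [kw for kw in keywords if kw in q]
        if hits:
            matches[parameter] = hits
    return matches

query = ("What environmental exposures are typically associated with "
         "the development of salivary gland cancer?")
print(classify_query(query))
# {'exposure': ['environmental exposure'],
#  'diagnosis_or_cancer_type': ['cancer'],
#  'tumor_location': ['salivary gland']}
```

In practice the classification was done manually in a spreadsheet; a lookup like this only approximates that process for exact keyword matches.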

For a central query capability to be effective, we felt that preferred terms for each classification parameter would be useful, so each parameter was mapped to a relevant terminology or ontology. For example, exposure data is represented in the Environmental and Exposures Ontology (ECTO), as well as in NCIt. The Uber Anatomy Ontology (Uberon) contains classifications of anatomical structures, which can be used to classify tumor locations. Many of the parameters covered by specialized terminologies are also covered by NCIt, and in some cases a parameter was covered by multiple ontologies. At some point a preferred terminology will need to be selected for each parameter, perhaps informed by an assessment of what is being used in the CRDC data. It is likely that these terminologies and ontologies will need to be extended to provide full coverage, and mappings between them and the terminologies used in the CRDC data will need to be developed.
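The parameter-to-ontology mapping can be represented as a simple table of candidate sources, pending selection of a preferred terminology per parameter. A minimal sketch, using only the sources named above (the parameter keys themselves are illustrative):

```python
# Illustrative mapping of classification parameters to candidate
# terminologies/ontologies mentioned in the text. Several parameters
# have multiple candidates, and NCIt overlaps most specialized sources;
# a single preferred source per parameter is still to be chosen.
PARAMETER_ONTOLOGIES = {
    "exposure": ["ECTO", "NCIt"],
    "tumor_location": ["Uberon", "NCIt"],
}

def candidate_sources(parameter: str) -> list:
    """Return the candidate terminologies for a parameter (empty if unmapped)."""
    return PARAMETER_ONTOLOGIES.get(parameter, [])

print(candidate_sources("exposure"))  # ['ECTO', 'NCIt']
```

Once a preferred terminology is selected for each parameter, each list would collapse to a single entry, with cross-terminology mappings maintained separately.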

Finally, the categories were prioritized based on their relevance and feasibility for the CRDC as high priority, nice-to-have, or low priority.

Files

Files (231.5 kB)

md5:1ea674787ddcb4e9df6a987f017c6a31 (231.5 kB)