Using logical constraints to validate information in collaborative knowledge graphs: a study of COVID-19 on Wikidata

Urgent global research demands real-time dissemination of precise data. Wikidata, a collaborative and openly licensed knowledge graph available in RDF format, provides an ideal forum for exchanging structured data that can be verified and consolidated using validation schemas and bot edits. In this research paper, we catalog an automatable task set necessary to assess and validate the portion of Wikidata relating to the COVID-19 disease, its causative virus, and key aspects of the resulting pandemic. These tasks assess relational and statistical data and are implemented in SPARQL, a query language for semantic databases. We demonstrate the efficiency of our methods for evaluating structured information on COVID-19 in Wikidata, and its applicability in collaborative ontologies and knowledge graphs more broadly. We show the advantages and limitations of our proposed approach by comparing it to other methods for validation of linked web data.


Introduction
Since December 2019, the COVID-19 disease has spread to become a global pandemic. This disease is caused by a zoonotic coronavirus called SARS-CoV-2 (Severe Acute Respiratory Syndrome CoronaVirus 2) and is characterized by the onset of acute pneumonia and respiratory distress. The global impact, with more than 77 million infections and almost 1.7 million deaths globally (as of December 21, 2020), is frequently compared to the 1918 Spanish Flu [1]. Emerging mRNA vaccines entail serious distribution and storage challenges, and no therapies are especially effective against late stages of the disease. As with all zoonotic diseases, its abrupt introduction to humans demands an outsized effort for data acquisition, curation and integration to drive evidence-based medicine, predictive modeling and public health policy [2,3].
Agile data sharing and computer-supported reasoning about the COVID-19 pandemic and the SARS-CoV-2 virus allow us to quickly understand more about the disease's epidemiology, pathogenesis, and physiopathology. This understanding can then inform the required clinical, scholarly and public health measures to fight the condition and handle its nonmedical ramifications [4][5][6]. Consequently, initiatives have rapidly emerged to create datasets, web services and tools to analyse and visualise COVID-19 data. Examples include Johns Hopkins University's COVID-19 dashboard [2] and the Open COVID-19 Data Curation Group's epidemiological data [3]. Some of these resources are interactive and return their results based on combined clinical and epidemiological information, scholarly information and social network analysis [7][8][9]. However, a significant shortfall in interoperability is common: although these dashboards facilitate examination of their own slice of the data, most lack general integration with other sites or datasets. The lack of technical support for interoperability is exacerbated by legal restrictions: despite being free to access, most are issued under All Rights Reserved terms or licenses. Similarly, more than 80% of the 96,608 COVID-19-related projects on GitHub are under All Rights Reserved terms (80,002 of 96,608 as of 21 December 2020; https://github.com/search?q=covid-19+OR+covid19+OR+coronavirus+OR+cord19+OR+cord-19). Restrictive licensing of data sets and applications severely impedes their dissemination and integration, ultimately undermining their value. For complex and multifaceted phenomena such as the COVID-19 pandemic, there is a particular need for a collaborative, free, machine-readable, interoperable and open knowledge graph to integrate the varied data.
Wikidata (https://www.wikidata.org/) fits this need as a CC0-licensed, large-scale, multilingual knowledge graph used to represent human knowledge in a structured format (Resource Description Framework, or RDF) [10,11]. CC0 is a rights waiver, similar to Creative Commons licenses, used to publish material into the public domain; it waives as much copyright as possible within a given jurisdiction (https://creativecommons.org/publicdomain/zero/1.0/). Wikidata therefore has the advantage of being inherently findable, accessible, interoperable, and reusable, i.e. FAIR [12]. It was initially developed in 2012 as an adjunct to Wikipedia but has grown significantly beyond its initial parameters. As of now, it is a centralized, cross-disciplinary meta-database and knowledge base for storing structured information in a format optimized to be easily read and edited by both machines and humans [13]. Thanks to its flexible representation of facts, Wikidata can be automatically enriched using information retrieved from multiple public domain sources or inferred from synthesised data [11]. This database includes a wide variety of pandemic-related information, including clinical knowledge, epidemiology, biomedical research, software development, geographic, demographic and genetics data. It can consequently be a vital large-scale reference database to support research and medicine during the COVID-19 pandemic [11,12].
The key hurdle to overcome for projects such as Wikidata is that several of their features can make them at risk of inconsistent structure or coverage: 1) collaborative projects use decentralised contribution rather than central oversight, 2) large-scale projects operate at a scale where manual checking is not possible, and 3) interdisciplinary projects script the acquisition of data to integrate a wide variety of data sources. To maximise usability of the data, it is therefore important to minimise inconsistencies in its structure and coverage. As a result, methods of evaluating the existing knowledge graphs and ontologies, integral to knowledge graph maintenance and development, are of crucial importance. Such an evaluation is particularly relevant in the case of collaborative semantic databases, such as Wikidata.
Knowledge graph evaluation is therefore necessary to assess the quality, correctness, or completeness of a given knowledge graph against a set of predetermined criteria [14]. There are a number of possible approaches to evaluating a knowledge graph based on external information (so-called extrinsic evaluation), including: comparing its structure to a paragon ontology, comparing its coverage to source data, applying it to a test problem and judging the outcomes, and manual expert review of its ontology [15]. Different systematic approaches have been proposed for the comparison of ontologies and knowledge graphs, including NLP techniques, machine learning, association rule mining, and other methods [16][17][18]. The criteria for evaluating ontologies typically include: Accuracy, which determines whether definitions, classes, properties and individual entries in the evaluated ontology are correct; Completeness, referring to the scope of coverage of a given knowledge domain in the evaluated ontology; Adaptability, determining the range of different anticipated uses of the evaluated ontology (versatility); and Clarity, determining the effectiveness of communication of the intended meanings of defined terms by the evaluated ontology [14,[19][20][21]. However, extrinsic methods are not the only ones used for evaluating such a set of criteria. Knowledge graphs can also be assessed through an intrinsic evaluation that examines the structure of the analyzed knowledge graph thanks to the inference of internal description logics and consistency rules [14].
In this research paper, we emphasize the usefulness of intrinsic methods to evaluate knowledge graphs by presenting our solution to the quality assurance checks and corrections of COVID-19 semantic data in Wikidata. This consists of a catalogue of automatable tasks based on logical constraints expected of the knowledge graph. Most of these constraints were not explicitly available in the RDF validation resources of Wikidata before the pandemic and are designed in this work to support new types of COVID-19 information in the assessed knowledge graph, including epidemiological and social data. We implement these constraints with SPARQL and test them on Wikidata using the SPARQL endpoint of this knowledge graph, available at https://query.wikidata.org. We introduce the value of Wikidata as a multi-purpose collaborative knowledge graph for the flexible and reliable representation (Section 2) and validation (Section 3) of COVID-19 knowledge. Furthermore, we cover the use of SPARQL to query this knowledge graph (Section 4). Then, we demonstrate how logical constraints can be captured in structural schemas and consequently used to validate and encourage the consistent usage of relation types to represent COVID-19 knowledge (Section 5), and we show how statistical constraints can be applied to verify epidemiological data related to the pandemic (Section 6). Finally, we compare our constraint-based approaches with other methods through the analysis of the outcomes of previous research papers related to knowledge graph validation (Section 7), and draw conclusions for future directions (Section 8).

Wikidata as a collaborative knowledge graph
Wikidata currently serves as a semantic framework for a variety of scientific initiatives, such as GeneWiki [22], allowing different teams of scholars to upload valuable academic data into a collective and standardized pool. Its versatility and interconnectedness are making it a standard for interdisciplinary data integration and dissemination across fields as diverse as linguistics, information technology, film studies, and medicine [11,[23][24][25][26][27][28], although its popularity and recognition across fields still vary significantly [29].
Wikidata contains concepts linked by their taxonomic relations, allowing items to be embedded and instantiated as instances or subclasses of classified data and linked to one another. Its multilingual nature enables fast-updating dynamic data reuse across different language versions of a resource such as Wikipedia [30], with fewer inconsistencies from local culture [31] or language biases [32,33].
The data structure employed by Wikidata is intended to be highly standardized, whilst maintaining the flexibility to be applied across highly diverse use-cases. There are mainly two essential components: items, which represent objects, concepts or topics; and properties, which describe how one item relates to another. A statement, therefore, consists of a subject item (S), a property that describes the nature of the statement (P), and an object (O) that can be an item, a value, an external ID, a string, etc. While items can be freely created, new properties require community discussion and a vote, with 7851 properties currently available. Statements can be further modified by any number of qualifiers to make them more specific and be supported by references to indicate the source of the information.
Thus, Wikidata forms a continuously growing, single, unified network graph, with 88M items forming the nodes and 1127M statements forming the edges. A live SPARQL endpoint and query service, regular RDF dumps, as well as linked data APIs and visualization tools, form a backbone of Wikidata uses [34,35].
Importantly, Wikidata is based on a free and open-source philosophy and software and is a database that anyone can edit, similarly to the very popular online encyclopedia, Wikipedia [36]. As a result, the emerging ontologies are created entirely collaboratively, without centralized coordination [37], and developed in a community-driven fashion [38]. This approach allows for the dynamic development of areas of interest for the user community but poses challenges, e.g., to systematize and balance class completeness across topics [39]. Also, since the edit history is available to anyone, tracing human and non-human contributions, as well as detecting and reverting vandalism, is available by design and relies on community management [40] as well as on software tools like ORES [41].
Other ontological databases and knowledge graphs exist [42,43]. However, much like the factors that led Wikipedia to rise to be a dominant encyclopedia [33,44], Wikidata's close connection to Wikimedia volunteer communities and the wide readership provided by Wikipedia have quickly given it a competitive edge. The system, therefore, aims to combine the wisdom of the crowds with advanced algorithms. For instance, Wikidata editors are assisted by a property-suggesting system, proposing additional properties to be added to entries [45]. Wikidata has subsequently exhibited the highest growth rate of any Wikimedia project and was the first amongst them to pass one billion contributions [46].
As a collaborative venture, its governance model is similar to Wikipedia's [47], but with some important differences. Wide permissions to edit Wikidata are manually granted to approved bots and to Wikimedia accounts that are at least 4 days old and have made at least 50 edits using manual modifications or semi-automated tools for editing Wikidata (for an overview of the semi-automated editing tools for Wikidata, see https://www.wikidata.org/wiki/Wikidata:Tools). These accounts are supervised by a limited number of experienced administrators to prevent misleading editing behaviors (such as vandalism, harassment, and abuse) and to ensure a sustainable consistency of the information provided by Wikidata (further information about the rights and governance of users in Wikidata is available at https://www.wikidata.org/wiki/Wikidata:User_access_levels). As such, Wikidata is highly relevant to the computer-supported collaborative work (CSCW) field, yet the number of studies of Wikidata from this perspective is still very limited [48]. To understand the value of using SPARQL to validate the usage of relation types in collaborative ontologies and knowledge graphs, it is important to understand the main distinctive features of Wikidata as a collaborative project.
Much as Wikidata is developed collaboratively by international editors, it is also designed to be language-neutral. As a result, it is quite possible to contribute to Wikidata with only a limited command of English and to effectively collaborate whilst sharing no common human language, an aspect unique even in the already rich ecosystem of collaborative projects [49]. It may well be an early sign of other language-independent cooperative knowledge creation initiatives, such as Wikilambda, an abstract Wikipedia currently developed on the basis of Wikidata [50].
It is also possible to build Wikipedia articles, especially in underrepresented languages, based on Wikidata data only, and to create article placeholders to stimulate encyclopedia articles' growth [51]. This stems from combining concepts that are relatively easily intertranslatable between languages (e.g. professions, causes of death, capitals) with language-agnostic data (e.g. numbers, geographical coordinates, dates). As a result, Wikidata is a paragon example not only of cross-cultural cooperation but also of human-bot collaborative efforts [37,52]. Given the large-scale crowdsourcing efforts in Wikidata and the use of bots and semi-automated tools to mass edit Wikidata, its current volume is higher than what can be reviewed and curated by administrators manually. It is quite intuitive: as the general number of edits created by bots grows, so grows the number of administrative tasks to be automated. Automation may include simplifying alerts, fully and semi-automated reverts, better user tracking, or automated corrections. However, the creation of automated methods for the verification and validation of the ontological relations it contains is required most.

Knowledge graph validation of Wikidata
As Wikidata properties are assigned labels, descriptions and aliases in multiple languages (Red in Fig. 6), multilingual information about these properties can be used alongside the labels, descriptions, and aliases of Wikidata items to verify and find sentences supporting biomedical statements in scholarly outputs [53]. Such a process can be based on various natural language processing techniques, including word embeddings [53,54] and semantic similarity [55]. These techniques are robust enough to achieve an interesting level of accuracy, and some of them can achieve better accuracy when the Wikidata classes of the subject and object of semantic relations are given as inputs [56,57].
The subjects and objects of Wikidata relations can likewise be aligned to other biomedical semantic resources such as MeSH and the UMLS Metathesaurus [11]. Thus, benchmarks for relation extraction based on one of the major biomedical ontologies can be converted into a Wikidata-friendly format and used to automatically enrich Wikidata with novel biomedical relations or to automatically find statements supporting existing biomedical Wikidata relations [58]. Furthermore, MeSH keywords of scholarly publications can be converted into their Wikidata equivalents, refined using citation and co-citation analysis [59], and used to verify and add biomedical Wikidata relations, e.g. by applying deep learning-based bibliometric-enhanced information retrieval techniques [60,61].
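A minimal sketch of such an alignment in SPARQL could rely on the MeSH descriptor ID property (P486); this choice of mapping property and of the example class are assumptions made purely for illustration, as the text does not name the exact properties used:

  SELECT ?item ?meshId WHERE {
    ?item wdt:P31 wd:Q12136 ;   # restrict to instances of disease (Q12136), purely as an example
          wdt:P486 ?meshId .    # MeSH descriptor ID linked to the Wikidata item
  }

The resulting mapping table can then be used to translate MeSH-annotated relations or keywords into Wikidata identifiers.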
Another option for validating biomedical statements based on the labels of their subjects, predicates, and objects in Wikidata is the use of these labels to reformulate a query to search bibliographic databases and consequently to find appropriate references for the assessed Wikidata statements (example in Fig. 5). Several bots and bot frameworks have been successfully built using this principle, such as Wikidata Integrator, which extracts the Wikidata statements of a given gene or protein using SPARQL, compares them with their equivalents in other structured databases like NCBI's Gene resources and UniProt, and adjusts them if needed, and RefB (Fig. 1), which extracts biomedical Wikidata statements not supported by references using SPARQL and identifies the sentences supporting them in scholarly publications using the PubMed Central search engine and a variety of techniques such as concept proximity analysis. In addition to their multilingual set of labels and descriptions, Wikidata properties are assigned object types using wikibase:propertyType relations (Blue in Fig. 2). These relations allow the assignment of appropriate objects to statements, so that non-relational statements cannot have a Wikidata item as an object, while objects of relational statements are not allowed to have data types like a value or a URL [10]. Just like a Wikidata item, a property can be described by statements (Green in Fig. 2). The predicates of these statements link a property to its class (instance of [P31]), to its corresponding Wikidata item (subject item of this property [P1629]), to example usages (Wikidata property example [P1855]), to equivalents in other IRIs (equivalent property [P1628]), to Wikimedia categories that track its usage on a given wiki (property usage tracking category [P2875]), to its inverse property (inverse property [P1696]), or to its proposal discussion (property proposal discussion [P3254]), etc.
These statements can be interesting for various knowledge graph validation purposes. In fact, the class, the usage examples and the proposal discussion of a Wikidata property can be used, through several natural language processing techniques, particularly semantic similarity, to derive features of the use of the property such as its domain of application (e.g. the subject or object of a statement using a Wikidata property related to medicine should be a medical item) and consequently to eliminate some erroneous uses by screening the property usage tracking category. The class of the Wikidata item corresponding to the property can be used to identify the field of work of the property and thus flag some inappropriate applications. In addition, the external identifiers of such an item can be used for the verification of biomedical relations by their identification within the semantic annotations of scholarly publications built using the SAT+R (Subject, Action, Target, and Relations) model [62]. The inverse property relations can identify missing Wikidata statements (C_1, P, C_2), which are implied by the presence of inverse statements (C_2, P⁻¹, C_1) in other Wikidata resources. Here, P⁻¹ is the Wikidata property that is the inverse of P, C_S is a common class of the subjects of P, and C_O is a common class of the objects of P.
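As a minimal sketch of this inverse-property check, and taking drug used for treatment (P2176) and medical condition treated (P2175) as an assumed inverse pair for illustration, the following query lists pairs whose inverse statement is missing:

  SELECT ?disease ?drug WHERE {
    ?disease wdt:P2176 ?drug .              # existing statement (disease, drug used for treatment, drug)
    MINUS { ?drug wdt:P2175 ?disease . }    # the inverse statement (drug, medical condition treated, disease) is absent
  }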
Despite the importance of these statements defining properties, property constraint [P2302] relations (Brown in Fig. 2) are the semantic relations primarily used for the validation of the usage of a property. In essence, they define a set of conditions for the use of a property, including several heuristics for the type and format of the subject or the object, information about the characteristics of the property, and several description logics for the usage of the property, as shown in Table 1. Property constraints are either manually added by Wikidata users or inferred with excellent accuracy from the Wikidata knowledge graph or from the history of human changes to Wikidata statements [63,64]. As shown in Fig. 2, a property constraint is defined as a relation where the constraint type is featured as the object and the detailed conditions of the constraint to be applied to Wikidata statements are integrated as qualifiers of the relation. When a property constraint is violated, the corresponding statement is automatically included in a report of property constraint violations (https://www.wikidata.org/wiki/Wikidata:Database_reports/Constraint_violations) and is marked by an exclamation mark on the page of the subject item (Fig. 3), so that it can be quickly processed and adjusted by the community or by Wikidata bots if applicable. Although these methods are important to verify and validate Wikidata, they are not the only ones used for these purposes. In 2019, Wikidata announced the adoption of the Shape Expressions language (ShEx) as part of the MediaWiki EntitySchema extension (https://www.mediawiki.org/wiki/Extension:EntitySchema). ShEx was proposed, following an RDF validation workshop organized by the W3C in 2014 (https://www.w3.org/2012/12/rdf-val/report), as a concise, high-level language to describe and validate RDF data [65]. This MediaWiki extension uses ShEx to store structure definitions (EntitySchemas or shapes) for sets of Wikidata entities which are selected by some query pattern (frequently the involvement of said entities in a Wikidata class). This provides collaborative quality control where the community can iteratively develop a schema and refine the data to conform to that schema. For those familiar with XML, ShEx is analogous to XML Schema or RelaxNG. SHACL (Shapes Constraint Language), another language used to constrain RDF data models, uses a flat list of constraints, analogous to XML's Schematron. It was adapted from SPIN (SPARQL Inferencing Notation) by the W3C Data Shapes working group in 2014 and became a W3C recommendation in 2017 [66]. However, ShEx was chosen to represent EntitySchemas in Wikidata, as it has a compact syntax which makes it more human-friendly, supports recursion, and is designed to support distributed networks of reusable schemas [67].
In addition to the possibility of inferring ShEx expressions from the screening of a large set of relevant items, they can be easily written by humans in an intuitive way.
In Wikidata, ShEx-based EntitySchemas are assigned an identifier (a number beginning with an E) as well as labels, descriptions, and aliases in multiple languages, so that they can be easily identified by users. Entity schemas are defined using the ShEx compact syntax, which is a concise, human-readable syntax. A schema usually begins with some prefix declarations similar to SPARQL. An optional start definition declares the shape which will be used by default. In the example (Fig. 4), the shape <app> will be used, and its declaration contains a list of properties, possible values, and cardinalities. By default, shapes are open, which means that properties other than the ones declared are allowed. In this example, the values of property wdt:P31 are declared to be either a COVID-19 dashboard (wd:Q90790055), a search engine (wd:Q91136116) or a dataset (wd:Q91137337). The EXTRA directive indicates that there can be additional values for property wdt:P31 that differ from the specified ones. The value for property wdt:P1476 is declared to be zero or more literals. The cardinality indicators come from regular expressions, where '?' means zero or one, '*' means zero or more, and '+' means one or more. The values for the other properties are declared to be anything (the dot indicates no constraint) zero or more times, except for the properties wdt:P577 and wdt:P7103, which are marked as optional using the question mark. Further documentation about ShEx can be found at http://shex.io/ and in Labra Gayo et al. (2017) [67]. Due to the ease of using ShEx to define EntitySchemas, it has been used successfully to validate Danish lexemes in Wikidata [68] and biomedical Wikidata statements [69]. During the COVID-19 pandemic, Wikidata's data model of every COVID-19-related class, as well as of all major biomedical classes, has been converted to an EntitySchema, so that it can be used to validate the representation of COVID-19 Wikidata statements [12]. These EntitySchemas were successfully used to enhance the development and the robustness of the semantic structure of the data model underlying the COVID-19 knowledge graph in Wikidata and are accordingly made available at a subpage of Wikidata's WikiProject COVID-19, accessible via https://www.wikidata.org/wiki/Wikidata:WikiProject_COVID-19/Data_models.

SPARQL as a semantic query language
SPARQL was officially created in 2008 as a query language and protocol to search, add, modify or delete RDF data available over the Internet. Its name is a recursive acronym which stands for "SPARQL Protocol and RDF Query Language". SPARQL allows a query to be composed of triple patterns, conjunctions, disjunctions, and optional patterns, and can consequently be used to retrieve contextualized information from knowledge graphs. As it has been designed to extract a searched pattern from a semantic graph [70], SPARQL queries have also been used to query competency questions (a set of requirements ensuring the consistency of a knowledge graph, i.e. constraints determining what knowledge should be involved in a knowledge graph [71]), so as to evaluate ontologies and knowledge graphs in a context-sensitive way [72][73][74]. Indeed, a sister project presents how SPARQL can be used to generate data visualisations [35,75]. Validating RDF data portals using SPARQL queries has been regularly proposed as an approach that gives great flexibility and expressiveness [76]. However, academic literature is still far from revealing a consensus on methods and approaches to evaluate ontologies using this query language [77], and other approaches have been proposed for validation [69,78].
SPARQL is a human-friendly language based on defining triples as conditions [70] and defines prefixes to abbreviate IRIs, similarly to ShEx (Blue in Fig. 4). It also uses the skeleton of SQL to define queries to knowledge graphs in RDF format [79]. For example, SPARQL shares most of SQL's clauses used to retrieve variables and aggregate functions used to compute new variables, as shown in Table 2 [79][80][81]. SPARQL also defines new aggregate function-based variables in the SELECT clause using the (function(variable) AS new_variable) format, and constant values and strings are put between quotation marks. It defines logical conditions in the HAVING clause for variables based on aggregate functions, or in the WHERE clause, as FILTER (condition), for variables to be retrieved from the source database [80,81]. In contrast to SQL, the variables in SPARQL are preceded by a question mark and are not separated by commas in the SELECT clause [81,82], and even the declaration of statements in the WHERE clause using SPARQL differs from the one using SQL. In the latter, the declaration of the statements in a WHERE clause can only be done in a single line [79]. When multiple statement conditions should be fulfilled, they have to be linked using the INTERSECT operator [83]. When a unique condition from a list of statements should be respected, the list's statements should be linked using the UNION operator [83]. Where results fulfilling a given condition should be eliminated, the condition must be preceded by the MINUS operator [83]. In SPARQL, the WHERE clause can include multiple lines between curly brackets, where each line is in the form of a subject-predicate-object triple [79]. When the statements between brackets are in the form of a triple, they should end with a period. When two successive statements have the same subject, the first statement can end with a semicolon; in this particular situation, the subject of the second statement can be omitted [81,82]. An exception to this triple form is the FILTER() function, allowing the definition of a logical condition to be considered, or the BIND() function, allowing the creation of a new variable based on the retrieved characteristics of a single result row [81,82]. Although the MINUS and UNION operators can be used as in SQL, the INTERSECT operator is not needed and does not exist in SPARQL, and the MINUS and UNION operators should be preceded and followed by groups of statements between curly brackets, like the WHERE clause [81,82]. SPARQL also has the advantage of allowing the inclusion of entries where a set of statements in the WHERE clause is not satisfied, by putting these statements after the OPTIONAL operator between curly brackets [81,82].
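The following query is a purely illustrative sketch (the item and property identifiers are arbitrary examples) that combines the clauses described above:

  SELECT ?disease (COUNT(?drug) AS ?drugCount) WHERE {
    ?disease wdt:P31 wd:Q12136 ;                  # instance of disease; the semicolon reuses the subject
             wdt:P2176 ?drug .                    # drug or therapy used for treatment
    OPTIONAL { ?disease wdt:P780 ?symptom . }     # symptom statements are returned when they exist
    FILTER (isIRI(?drug))                         # illustrative condition on the retrieved values
  }
  GROUP BY ?disease
  HAVING (COUNT(?drug) > 5)                       # condition on an aggregate value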
In Wikidata, the Wikidata Query Service (https://query.wikidata.org) allows users to query the knowledge graph using SPARQL [11,34]. The required Wikidata prefixes are already supported in the backend of the service and do not need to be defined [34]. What the user needs to do is to formulate their SPARQL query (Black in Fig. 5) and click on the Run button (Blue in Fig. 5). After a compilation period, the results will appear (Green in Fig. 5) and can be downloaded in different formats (Brown in Fig. 5), including JSON, TSV, CSV, HTML, and SVG. Different modes for the visualization of the query results can be chosen (Purple in Fig. 5), particularly table, charts (line, scatter, area, bubble), image grid, map, tree, timeline, and graph. The query service also allows users to use a query helper (Red in Fig. 5) that can generate basic SPARQL queries and to get inspired by sample queries (Yellow in Fig. 5), especially when they lack experience. It also allows users to generate a short link for the query (Pink in Fig. 5) and code to embed the query results in web pages and computer programs (Brown in Fig. 5) [34]. The statements in the WHERE clause should be defined such that known subjects and objects are preceded by the wd prefix, whether they are Wikidata items or properties, and that the predicate should be a Wikidata property preceded by the wdt prefix, as clearly shown in Fig. 6. Other Wikidata prefixes can be used to parse Wikidata qualifiers (pq and pqv) and references (pr and prv) or to link a Wikidata statement to one of its components (p, prov, ps, and psv). The wikibase prefix can be used to return the characteristics of an item, a property or a statement. For example, wikibase:directClaim and wikibase:Claim can shift a property from one Wikidata prefix to another (e.g. shifting Wikidata properties from wdt to wd), and wikibase:rank can be useful to return the level of importance assigned by the community to a statement.
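As a sketch of how these prefixes combine, the following query, which uses the number of cases property (P1603) and the point in time qualifier (P585) purely as examples, returns each case count statement together with its date and its rank:

  SELECT ?outbreak ?cases ?date ?rank WHERE {
    ?outbreak p:P1603 ?statement .        # statement node of a "number of cases" claim
    ?statement ps:P1603 ?cases ;          # main value of the statement
               pq:P585 ?date ;            # qualifier: point in time
               wikibase:rank ?rank .      # rank assigned to the statement by the community
  }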

Constraint-driven inference of biomedical property constraints
As described above, Wikidata properties are assigned property constraints and statements as logical conditions for the use of the types of triples to represent knowledge in Wikidata (Fig. 2). Screening Wikidata items in a class to identify common features of the assessed entities based on a set of formal rules has been previously proposed [64,84]. Once retrieved, the common inverse property statements (C_O, P⁻¹, C_S) of the given Wikidata property P will be used to identify relations that use P in an uncommon and probably wrong way, to identify missing inverse relations of P(S,O) corresponding to the most used (C_O, P⁻¹, C_S) scheme, and to identify the Wikidata items missing statements using P, as shown in Table 3. The assessment of the usage of the given Wikidata property will not be restricted to these tasks, as it also involves the identification of relations using P not supported by references and the identification of Wikidata properties used to define references for relations using P. Here, a use case is a set of conditions for the use of a relation type P.

Table 3
Tasks for the assessment of the usage of a given Wikidata property P
Identifying the scheme of the studied Wikidata property:
T1: Identify common use cases of P: (C_S, P, C_O) statements
T2: Identify inverse properties of P corresponding to each common use case: (C_O, P⁻¹, C_S) statements
Identifying the deficiencies of the scheme:
T3: For each returned P⁻¹, identify P(S,O) relations supported by references and corresponding to the most common (C_O, P⁻¹, C_S) statement but not available in Wikidata
T4: Identify P(S,O) relations not corresponding to the most common scheme of P
Assessing the reference support of relations using the studied Wikidata property:
T5: Identify Wikidata properties used to define references for relations using P

This task set is useful to assess and adjust the reference support, the language support, the quantity, and the quality of the relations using P and P⁻¹ at a given point in time and can be easily completed using the Wikidata Query Service. The SPARQL query of each task is given in Appendix A, where <PropertyID> is the Wikidata ID of the studied property P, <SubjectID> is the Wikidata ID of the subject class C_S that is most used with this property, and <ObjectID> is the Wikidata ID of the most used object class C_O.
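As an illustration of the shape of these queries, a simplified T1-style query (a sketch rather than the exact Appendix A formulation) counts the subject and object classes observed for a property, here with drug used for treatment (P2176) standing in for <PropertyID>:

  SELECT ?subjectClass ?objectClass (COUNT(*) AS ?uses) WHERE {
    ?s wdt:P2176 ?o .            # replace P2176 with the studied property <PropertyID>
    ?s wdt:P31 ?subjectClass .   # candidate subject class C_S
    ?o wdt:P31 ?objectClass .    # candidate object class C_O
  }
  GROUP BY ?subjectClass ?objectClass
  ORDER BY DESC(?uses)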
For Tasks T1 and T2, we eliminated property use cases where classes C_S and C_O are first-order metaclasses (Q24017414), so that we do not get non-specific use cases. Additionally, we only considered use cases applied more than a defined usage threshold (here set to 100, but this can change according to context) in order to omit statements that are not widely used in Wikidata. For Task T4, we used logical constraints to find statements where the subject is not an instance of the most used subject class C_S (G1), then to find statements where the object is not an instance of the most used object class C_O (G2). After that, we identified the statements that exist in both G1 and G2 as the most likely wrong statements (G1 ∩ G2), as they correspond neither to the most used subject class nor to the most used object class of the studied property. Such a task can either identify an accurate relation where the subject and object are not assigned to the corresponding Wikidata class due to the lack of completeness of the Wikidata taxonomy, or recognize a wrong Wikidata statement. For Task T5, Wikidata properties used to define fewer than a threshold number of references using P were not considered (again, here set to 100). Our analysis was performed on September 20, 2019, following the Zika outbreak, as a proactive action to build the data model infrastructure to support clinical information about future infectious epidemics in Wikidata (the date is relevant due to the rapidly expanding nature of the database).
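A corresponding T4-style query can be sketched as follows, again with example identifiers standing in for the <PropertyID>, <SubjectID>, and <ObjectID> placeholders of Appendix A (here P2176, disease [Q12136], and medication [Q12140]):

  SELECT ?s ?o WHERE {
    ?s wdt:P2176 ?o .                                # statements using the studied property
    FILTER NOT EXISTS { ?s wdt:P31 wd:Q12136 . }     # G1: subject is not an instance of the most used subject class
    FILTER NOT EXISTS { ?o wdt:P31 wd:Q12140 . }     # G2: object is not an instance of the most used object class
  }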
To assess the effectiveness of the use of logical constraints to generate conditions for the verification and validation of the use of relation types to enrich the Wikidata ontology, we applied our method to the six main Wikidata properties that can be used to represent COVID-19-related knowledge (Table 4). Task T1 was effective at sorting the common use cases of the studied Wikidata properties, as shown in Table 5. All the retrieved use cases were proven to be logically accurate when compared to the descriptions of the Wikidata properties available in Table 4. The most common use cases for drug used for treatment [P2176], therapeutic area [P4044], significant drug interaction [P769], or medical condition treated [P2175] corresponded to 72 percent or more of the supported statements. However, there was a significant lack of availability of common use cases for route of administration [P636] and symptoms [P780]. This data deficiency may be due to human limitations (inexperience with Wikidata or with the medical logic being entered) or to inconsistencies between languages (which often derive from slight differences in the naming and framing of articles in different language Wikipedias). These shortfalls could be alleviated by a clearer taxonomy attributing Wikidata items to their corresponding classes. These statements can be directly added to Wikidata using tools for the automatic enrichment of Wikidata, particularly QuickStatements [11], as they are supported by external references and are already stated in a Wikidata-friendly format.
Task T4 efficiently identified the statements not corresponding to the most common use case of the related Wikidata property, as shown in Table 8. In fact, 11236 statements not corresponding to the most used subject class of the studied Wikidata properties and 7354 statements not corresponding to the most used object class of the studied Wikidata properties were identified. Among these statements, 5217 relations corresponded neither to the most common subject class nor to the most common object class of the considered properties. When applying expert validation to 800 randomly selected relations among the 5217 studied ones, we found that only 6.6% of these relations (53) were truly inaccurate and that the remaining 93.4% (747) were accurate but were flagged due to the lack of assignment of their subjects and objects to their hypernyms (i.e. a significant lack in defining relations between Wikidata items and their corresponding classes). An example of such accurate relations is (alcohol withdrawal syndrome [Q2914873], drug used for treatment [P2176], (RS)-baclofen [Q413717]), where alcohol withdrawal syndrome is not an instance of disease [Q12136] and (RS)-baclofen is not declared as an instance of medication [Q12140]. The precision of the identification of deficient relations using this method seems to vary depending on the studied property but does not exceed 10% (Fig. 8). Accordingly, the results sorted by Task T4 should be manually verified and validated by experts, so that users can use the accurately flagged relations (false positives) to enrich their respective subject and object Wikidata items with the corresponding missing classes, and find the reasons behind the deficiency of the wrongly stated relations (true positives) in order to develop automatic methods to solve them. The insufficiencies of wrong relations can be due either to ontological reasons (64%) or to medicine-related reasons (36%), as shown in Fig. 9. Efforts in crowdsourcing ontology verification of other biomedical ontologies such as SNOMED-CT confirmed the existence of both types of errors and stipulated that not adjusting these lexical resources and using them in clinical decision support can generate harmful recommendations [85]. Task T5 was efficient in finding the Wikidata properties used to define the references of the statements for each studied relation type (Table 9). For the studied Wikidata relation types, we found that references are mainly defined using three properties: stated in [P248], retrieved [P813], and reference URL [P854] (example in Fig. 10). One of the highest priority tasks on Wikidata is for experts to find and add appropriate references, using these three properties, to currently unsupported Wikidata relations. Once the references are in the system, further refinement is possible: e.g., if a reference URL [P854] contains (or points to a page that contains) an external identifier for which Wikidata has a suitable property, such as Digital Object Identifier [P356], then that property could be added to an item about the cited reference, and the P854 statement replaced by a P248 statement pointing to that item.

Constraint-driven heuristics-based validation of epidemiological data
The characterization of epidemiological data is possible using a variety of statistical measures that show the acuteness, the dynamics, and the prognosis of a given disease outbreak. These measures include the simple cumulative count of cases (P1603 [199569 statements, Orange in Fig. 11], noted c, as defined before) and deaths (P1120 [243250 statements], noted d), the basic reproduction number (P3492, noted R0), the minimal incubation period in humans (P3488, noted mn), and the maximal incubation period in humans (P3487, noted mx) [86]. For all these statistical data, every value should be coupled with a point in time (P585, noted Z) qualifier defining the date of the stated measurement and with a determination method (P459, noted Q) qualifier identifying the measurement method of the given information, as these variables are subject to change over days or according to the used methods of computation. From simple count statistics (c, t, d, h, and r statements), it is possible to compare regional epidemiological variables and their variance for a given date (Z) or date range, and relate these to the general disease outbreak (each component defined as a part of [P361] of the general outbreak), as shown in Table 10. Tasks V1 and V2 have been generated from the evidence that COVID-19 started in late 2019 and that its clinical discovery can only be done through medical diagnosis techniques [87]. Tasks V3 and V4 have been derived from the fact that c, d, r, and t are cumulative counts; consequently, these variables can only remain constant or increase over days. Task V5 is motivated by the fact that a simple epidemiological count cannot return negative values. Tasks V6, V7, V8, and V9 are due to the evidence that d, r, and h cannot exceed c, as a patient needs to be affected by SARS-CoV-2 to die or be hospitalized due to the contraction of COVID-19 [86], and that a patient needs to undergo COVID-19 testing to be confirmed as a case of the disease [87]. V10 is built upon the assumption that c, d, r, h, and t values can be geographically aggregated [86].
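As an example, a V6-style check (the cumulative number of deaths must not exceed the cumulative number of cases for the same outbreak item and date) can be sketched as follows; the actual Appendix B queries may differ in form:

  SELECT ?outbreak ?date ?cases ?deaths WHERE {
    ?outbreak p:P1603 ?caseStmt .
    ?caseStmt ps:P1603 ?cases ; pq:P585 ?date .      # cumulative cases and their point in time
    ?outbreak p:P1120 ?deathStmt .
    ?deathStmt ps:P1120 ?deaths ; pq:P585 ?date .    # cumulative deaths at the same date
    FILTER (?deaths > ?cases)                        # violation of the d <= c heuristic
  }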
This task set has easily been applied using ten simple SPARQL queries that can be found in Appendix B, where <PropertyID> is the Wikidata property to be analyzed, and has returned 5496 deficiencies in the COVID-19 epidemiological information, as shown in Table 11. Among these mistaken statements, 2856 were number of cases statements, 2467 were number of deaths statements, 189 were number of recoveries statements, 9 were number of clinical tests statements, and 10 were number of hospitalized cases statements. This distribution of the deficiencies among epidemiological properties is explained by the dominance of number of cases and number of deaths statements in the COVID-19 epidemiological information. Most of these mistakes are linked to a violation of the cumulative pattern of major variables. These deficiencies can be removed using tools for the automatic enrichment of Wikidata like QuickStatements (cf. Turki et al., 2019 [11]) or adjusted one by one by active members of WikiProject COVID-19.

Table 10
Tasks for the heuristics-based evaluation of epidemiological data using the Wikidata SPARQL endpoint

Concerning the variables issued from the integration of basic epidemiological counts (m, R0, mn and mx statements), they give a summary overview of the statistical behavior of the studied infectious pandemic, and that is why they can be useful to identify whether the stated evolution of the morbidity and mortality caused by the outbreak is reasonable [88]. However, the validation of these variables is more complicated due to the complexity of their definition [88][89][90]. The basic reproduction number (R0) is meant to be a constant that characterizes the dissemination power of an infectious disease. It is defined as the expected number of people (within a community with no prior exposure to the disease) that can contract a disease via the same infected individual. This variable should exceed the threshold of 1 to define a contagious disease [88]. Although R0 can give an idea about the general behavior of an outbreak of a given disease, any calculated value depends on the model used for its computation (e.g. the SIR model) as well as on the underlying data and is consequently somewhat imprecise and variable from one study to another [88]. That is why it is not reliable to use this variable to evaluate the accuracy of simple epidemiological counts for a given pandemic. The only heuristic that can be applied to this variable is to verify whether its value exceeds 1 for diseases causing large outbreaks. The incubation period of a disease gives an overview of the silent time required by an infectious agent to become active in the host organism and cause notable symptoms [89,90]. This variable is very important as it reveals how many days an inactive case can spread the disease in the host's environment before the host is symptomatically identified. As a result, it can give an idea about the contagiousness of the infectious disease and its basic reproduction number (R0). However, the determination of the incubation period, especially for a novel pathogen, is challenging, as a patient often cannot identify with precision the day when they had been exposed to the disease, at least if they did not travel to an endemic region or had not been in contact with a person they knew to be infected. This factor was behind the measurement of falsely small incubation periods for COVID-19 at the beginning of the COVID-19 epidemic in China [89]. Furthermore, the use of minimal (mn) and maximal (mx) incubation periods in Wikidata to epidemiologically describe a disease, instead of the median incubation period, is a source of a lack of accuracy of the extracted values [89,90]. In fact, minimal and maximal incubation periods for a given disease are obtained as a function of the mean (μ) and standard deviation (σ) of the observed incubation periods in patients, as the bounds of a confidence interval. Effectively, mn is equal to μ − z × σ / √n and mx is equal to μ + z × σ / √n, where n is the number of analyzed observations and z is a characteristic of the hypothetical statistical distribution and of the statistical confidence level adopted for the estimation [91]. As a consequence, the mn and mx variables are modified according to the number of observations (n), with a smaller difference between the two variables for higher values of n. As well, the two measures vary according to the used statistical distribution, and that is why different values of mn and mx were reported for COVID-19 when applying different distributions (Weibull, gamma and log-normal distributions) using a confidence level of 0.95 on the same set of observed cases [89]. Similarly, the two variables can change according to the adopted confidence level (1 − p) when using the same statistical distribution, where a higher confidence level is correlated with a higher difference between the calculated mn and mx values, as shown in Fig. 12 [91,92]. Given these reasons, and despite the significant importance of the two measures, these two statistical variables cannot be used to evaluate statistical epidemiological counts for COVID-19 due to their lack of precision and difficulty of determination. As for the reported case fatality rate (m), its definition is less intricate than the ones of the basic reproduction number and of the incubation period, as m is only the quotient of the cumulative number of deaths (d) by the cumulative number of cases (c) as stated in official reports. It is consequently easy to validate for a given disease by comparing its values with simple reported counts of cases and deaths [86].

Here, two simple heuristics can be applied using SPARQL queries, as shown in Appendix C. As the number of deaths is less than or equal to the number of cases of a given disease, m values should lie between 0 and 1. That is why Task M1 is defined to extract m statements where m > 1 or m < 0. Also, as m is the quotient of the cumulative number of deaths by the cumulative number of cases, Task M2 extracts m statements that do not match the quotient of the corresponding d and c values for the same item and date. In addition, Task M3 identifies (case count, death count) pairs for which the corresponding m statement is missing. As a result of these three tasks, we interestingly identified 143 deficient m statements and 7116 missing m statements. 133 of the mistaken statements were identified thanks to Task M2 and concern 25 Wikidata items and 31 distinct dates, while only 10 deficient statements, related to 3 Wikidata items and 8 distinct dates, were found using Task M1. These statements should be checked against reference datasets to verify their values and to determine the reason behind their deficiency. Such a reason can be the integration of wrong case and death counts in Wikidata, or a bug or inaccuracy within the source code of the bot making or updating such statements. The verification process can be automatically done using an algorithm that compares Wikidata values (c, d and m statements) with their corresponding ones in other databases (using file or API reading libraries) and subsequently adjusts statements using the Wikidata API directly or via tools like QuickStatements [11]. As for the missing m statements returned by M3, they are linked to 395 disease outbreak items and to 205 distinct dates and concern 70% (7116/10168) of the (case count, death count) pairs available in Wikidata. The outcome of M3 proves the efficiency of comparative constraints to enrich and assess the completeness of epidemiological data available in a knowledge graph, particularly Wikidata, based on existing information. Consequently, derivatives of Task M3 can be built to infer d values based on c and m statements or to find c values based on d and m statements. The missing statements found by such tasks can be integrated into Wikidata using a bot based on the Wikidata API and the Wikidata Query Service to improve the completeness and integrity of available mortality data for epidemics, mainly the COVID-19 pandemic [11].
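A sketch of the M1 and M2 checks might look as follows, with <FatalityRateID> kept as a placeholder for the property holding the m values (in line with the placeholders used in the appendices) and 0.001 an arbitrary tolerance:

  SELECT ?outbreak ?date ?m ?expected WHERE {
    ?outbreak p:<FatalityRateID> ?mStmt .
    ?mStmt ps:<FatalityRateID> ?m ; pq:P585 ?date .   # reported case fatality rate and its date
    ?outbreak p:P1120 ?dStmt .
    ?dStmt ps:P1120 ?d ; pq:P585 ?date .              # cumulative deaths at the same date
    ?outbreak p:P1603 ?cStmt .
    ?cStmt ps:P1603 ?c ; pq:P585 ?date .              # cumulative cases at the same date
    BIND (?d / ?c AS ?expected)                       # expected case fatality rate
    FILTER (?m < 0 || ?m > 1 || ABS(?m - ?expected) > 0.001)   # M1 bounds violation or M2 mismatch
  }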

Discussion
The results presented here demonstrate the value of our relational and statistical constraint-based validation approach for knowledge graphs like Wikidata across a range of features. In particular: identifying use cases of key relation types (Tables 5 and 6), verifying the completeness of inverse statements (Table 7), and aiding experts in finding deficiencies within the taxonomy and the non-taxonomic relations to manually address (Table 8 and Figures 8 and 9). These tasks successfully address most of the competency questions, particularly conceptual orientation (clarity), coherence (consistency), strength (precision) and full coverage (completeness). Combined with previous findings in the context of bioinformatics [84,[93][94], this shows that rule-based approaches for evaluating semantic information from scratch achieve an accuracy similar to that of other available ontology evaluation algorithms [95,96].
The efficiency of these constraint-based assessment methods can be further enhanced by using machine learning techniques to perform imputations and adjustments on deficient data [97]. The scope of rule-based methods can similarly be expanded to cover other competency questions, such as non-redundancy (conciseness), through the proposal of further logical constraints to tackle them, such as a condition that finds taxonomic relations to trim in a knowledge graph (examples can be found at https://www.wikidata.org/wiki/Wikidata:Database_evaluation). The main limitation of applying logical constraints using SPARQL in the context of Wikidata is that the runtime of a query that infers or verifies a complex condition, or that analyzes a huge number of class items or property use cases, can exceed the timeout limit of the used endpoint [34].
The evaluation assignments covered by our approach can also be performed by other rule-based (structure-based and semantic-based) ontology evaluation methods. Structure-based methods verify whether a knowledge graph is defined according to a set of formatting constraints, and semantic-based methods check whether the concepts and statements of a knowledge graph meet logical conditions [14]. Some of these methods are software tools, particularly Protégé extensions such as OWLET [98] and OntoCheck [99]. OWLET infers the JSON schema logics of a given knowledge graph, converts them into OWL-DL axioms, and uses the semantic rules to validate the assessed ontological data [98]. OntoCheck screens an ontology to identify structural conventions and constraints for the definition of the analyzed relational information and consequently to homogenize the data structure and quality of the ontology by eliminating typos and pattern violations [99]. Here, the advantage of applying constraints using SPARQL is that its runtime is faster, as it does not require the download of the full dumps of the evaluated knowledge graph [34]. The benefit of our method, and of other structure-based and semantic-based web-based tools for knowledge graph validation like OntoKeeper [95] and adviseEditor [100], when compared to software tools is that the maximal size of the knowledge graphs that can be assessed by web services is larger than the one that can be evaluated by software tools, because the latter depend on the requirements and capacities of the host computer [98,99]. It is true that these drawbacks of other structure-based tools can be solved through the simplification of the knowledge graph by reducing redundancies using techniques like ontology trimming [101], or through the construction of an abstraction network to decrease the complexity of the analyzed knowledge graph [14,102]. However, knowledge graph simplification processes are time-consuming, and the resulting time gain can consequently be insignificant [14,[101][102].
Such tasks can also be solved using data-driven ontology evaluation methods. These techniques process texts in natural languages to validate the concepts and statements of a knowledge graph and currently include intrinsic (lexical-based) and extrinsic (cross-validation, big data-based and corpus-based) methods [14].
Lexical-based methods use rules implemented in SQL or SPARQL to retrieve items and glosses corresponding to a concept and their semantic relations (mostly subclass of statements) [103,104]. These items are then compared against a second set of rules to identify inconsistencies in their labels, descriptions or semantic relations [14]. The output can then be analyzed using natural language processing techniques such as Hamming distance measures [104], semantic annotation tools [103] and semantic similarity measures [14] to comparatively identify deficiencies in the semantic representation, labelling and symmetry of the assessed knowledge graph.
Conversely, extrinsic data-based methods extract usage and linguistic patterns from raw text corpora such as bibliographic databases and clinical records (corpus-based methods), from gold standard semantic resources like large ontologies and knowledge graphs (cross-validation methods), or from social media posts and interactions, Internet of Things data or web service statistics (big data-based methods) [14,[105][106][107], using structure-based and semantic-based ontology evaluation methods as explained above [106], as well as a range of techniques including machine learning [58,108], topic modelling using latent Dirichlet allocation [109], word embeddings [53], statistical correlations [110] and semantic annotation methods [111]. The returned features of the analyzed resources are compared to those of the analyzed knowledge graph to assess the accuracy and completeness of the definition and use of concepts and properties [14].
When compared to our proposed approach, lexical-based methods have the advantage of identifying and adjusting the characteristics of a knowledge graph item based on its natural language information, particularly terms and glosses [103,104]. The drawback of using semantic similarity, word embeddings and topic modelling in such methods is that these techniques are sensitive to the used parameters, to input characteristics and to the chosen models of computation, and can consequently give different results according to the context of determination [56,57]. The current role of constraints in the extraction of lexical information and respective semantic relations [103,104] proves that the scope of constraint-based validation should not be restricted to rule-based evaluation but should also extend to lexical-based evaluation. Yet, the function of logical conditions should be expanded to refine the list of (lexical information, semantic relation) pairs, to more accurately identify deficient and missing semantic relations and defective lexical data, and to support multilingual lexical-based methods. This would build on the many SPARQL functions that analyze strings in knowledge graphs, such as STRLEN (length of a string), STRSTARTS (verification of a substring beginning a given string), STRENDS (verification of a substring ending a given string), and CONTAINS (verification of a substring included in a given string) [81,82].
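For instance, a lexical check on labels might use these functions as in the following sketch, where the class and the substring are arbitrary examples:

  SELECT ?item ?label WHERE {
    ?item wdt:P31 wd:Q12136 ;                        # instances of disease, as an example class
          rdfs:label ?label .
    FILTER (LANG(?label) = "en")                     # keep English labels only
    FILTER (CONTAINS(LCASE(?label), "covid"))        # label contains the substring "covid"
  }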
As for the extrinsic data-driven methods, they rely mainly on large-scale resources that are regularly curated and enriched. Raw-text corpora are mainly composed of scholarly publications [21] and blog posts [112]. Information in scholarly publications is ever-changing, following the dynamic advances in scholarly knowledge, particularly medical data [113]. This expansion of scientific information is especially visible in the context of COVID-19, where detailed information about the COVID-19 disease and the SARS-CoV-2 virus was published within less than six months [9]. Big data is the set of real-time statistical and textual information generated by web services, including search engines and social media, and by Internet of Things objects such as sensors [105]. This data is characterized by its value, variety, variability, velocity, veracity and volume [105] and can consequently be used to track the changes of community knowledge and awareness over time [109,114]. Large semantic resources are ontologies and knowledge graphs that are built and curated by a community of specialists and that are regularly verified, updated and enriched using human efforts and computer programs [115]. These resources represent broad and reliable information about a given specialty through machine learning techniques [58] and the crowdsourcing of scientific efforts [85] and can consequently be compared to other semantic databases for validation purposes. Examples of these resources are the COVID-19 Disease Map [8] and SNOMED CT (Systematized Nomenclature of Medicine - Clinical Terms) [115].
Large-scale knowledge graphs are dynamic corpora. Changes in the logical and semantic conditions for defining knowledge in a particular domain need to be identified so that the assessed knowledge graph can be adjusted accordingly. Rule-based and lexical-based approaches (especially constraint-based methods) are therefore less straightforward to apply than extrinsic data-driven methods [14]. Nonetheless, the growing and changing nature of gold-standard resources requires continuous human effort and an advanced software architecture to maintain (e.g. structure-based and semantic-based methods), process (e.g. word embeddings and latent Dirichlet allocation) and store (e.g. Hadoop and MapReduce) these reference resources [85,105,115]. Such an architecture has advanced hardware requirements, and its results are subject to change according to the parameters used [105].
These tasks are in line with the usage of Shape Expressions as well as property constraints and relations for the validation of data quality and completeness of the semantic information of class items in knowledge graphs, as shown in the "Knowledge graph validation of Wikidata" section. A ShEx ShapeMap is a pair of a triple pattern for selecting entities to validate and a shape against which to validate them. This allows for the definition of the properties to be used for the items of a given class [12,65] and of property constraints and relations based on the meta-ontology (i.e. the data skeleton) of Wikidata. Expressions written in shape-based property usage validation languages for RDF (e.g. SHACL) can be used to state conditions and formatting restrictions for the usage of relational and non-relational properties [13,69,107]. SPARQL can be more efficient in inferring such information than currently existing techniques that screen all the items and statements of a knowledge graph one by one to identify the conditions for the usage of properties (e.g. SQID), mainly because SPARQL directly extracts information according to a pattern without having to evaluate all conditions against all items of a knowledge graph [64,70,84]. The separate execution of value-based constraints is common in the quality control of XML data: typically, structural constraints are managed by RELAX NG or XML Schemas, while value-based constraints are captured in Schematron. Much as Schematron rules are typically embedded in RELAX NG, the consistency constraints presented above can be embedded in Shape Expressions semantic actions or in SHACL-SPARQL, as shown in Fig. 11 [116]. These supplement structural schema languages with mechanisms to capture value-based constraints and, in doing so, provide context for the enforcement of those constraints. The value-based constraints shown in the "Constraint-driven heuristics-based validation of epidemiological data" section can likewise be implemented in a shapes language [78]. Parsing the rules in Tables 3 and 10 would allow the mechanical generation or augmentation of shapes, providing flexibility in how the rules are expressed while still exploiting the power of shapes languages for validation. More generally, ontology-based and knowledge-graph-based software tools have the potential to provide wide data and platform interoperability, and their semantic interoperability is thus relevant for a range of downstream applications such as IoT and WoT technologies [117].
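As a minimal sketch of this pattern-based extraction, and under the assumption that drug or therapy used for treatment (P2176) is the assessed relation type and that instance of (P31) approximates the subject and object classes, the following query counts the most frequent (subject class, object class) combinations for that property; the paper's actual task definitions are those listed in Table 3.

```sparql
# Sketch: infer the dominant usage pattern of a relation type directly from the graph.
# Assumptions: P2176 ("drug or therapy used for treatment") as the assessed property
# and "instance of" (P31) as the class membership relation; the LIMIT is illustrative.
SELECT ?subjectClass ?objectClass (COUNT(*) AS ?statements) WHERE {
  ?s wdt:P2176 ?o .            # statements using the assessed property
  ?s wdt:P31 ?subjectClass .   # class of the subject
  ?o wdt:P31 ?objectClass .    # class of the object
}
GROUP BY ?subjectClass ?objectClass
ORDER BY DESC(?statements)
LIMIT 10
```

The most frequent combination returned by such a query can then serve as the reference scheme (CS, P, CO) against which individual statements are validated.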

Conclusion
In this paper, we investigate how best to assess COVID-19 knowledge in collaborative ontologies and knowledge graphs (particularly Wikidata) using relational and statistical constraints. Collaborative databases produced through the cumulative edits of thousands of users can generate huge amounts of structured information [11], but because of their largely uncoordinated development, they often suffer from uneven coverage of crucial information and inconsistent expression of that information. The resulting gaps are a significant problem (false negatives, false positives, reasoning deficiencies, and missing references). Avoiding, identifying, and closing these gaps is therefore of prime importance. We presented a standardized methodology for auditing key aspects of data quality and completeness for these resources 26.
This approach complements and informs shape-based methods for data conformance to community-decided schemas. The SPARQL execution does not require any pre-processing and is not restricted to validating the representation of a given item against a reference data model: it also supports the comparison of the assessed relational and statistical statements. Our method is demonstrated as useful for measuring the overall accuracy and data quality of a subset of Wikidata and is consequently a necessary first step in any pipeline for detecting and fixing issues in collaborative ontologies and knowledge graphs.
This work has shown the state of the knowledge graph as a snapshot in time. Future work will extend this to investigate how the knowledge base evolves as biomedical knowledge is integrated into it over time. This will require incorporating the edit history into the SPARQL endpoint APIs of knowledge graphs [40,118] so that time-resolved SPARQL queries can be dynamically visualized. We will also couple the information inferred using this method 27 with Shape Expressions and explicit constraints on relation types to provide more effective enrichment, refinement, and adjustment of collaborative ontologies and knowledge graphs.

Fig. 2.
Fig. 2. Wikidata page of a clinical property [Source: https://w.wiki/aeF, Derived from: https://w.wiki/aeG, License: CC BY-SA 4.0]. It includes the labels, descriptions and aliases of the property in multiple languages (red), the object data type (blue), statements where the property is the subject (green), as well as property constraints (brown).

Fig. 6.
Fig. 6. RDF data structure of the Wikidata knowledge graph [Available at: https://w.wiki/any, adapted from source: https://w.wiki/ZUA, Michael F. Schönitzer, CC BY 4.0]. These features involve common characteristics of the data model of the concerned class as well as patterns of used Wikidata properties, such as symmetry, and are later used to verify the completeness of the class and to validate the statements related to the evaluated class using SPARQL queries. In this work, we propose a similar protocol, fully based on logical constraints and fully implementable using SPARQL queries, to infer constraints for the assessment of the usage of relation types (P) on Wikidata based on the most frequently used corresponding inverse statements (CO, P-1, CS). These constraints can later be used to define COVID-19-related Wikidata statements and to generate ShEx schemas for COVID-19-related Wikidata classes. Fig. 7 represents the scheme of the given relation type that will be used to assess and validate the use of Wikidata properties. If we consider COVID-19 <drug used for treatment> tocilizumab as an accurate relational statement in Wikidata, COVID-19 is the subject (S), drug used for treatment is the relation type (P), tocilizumab is the object (O), medical condition treated is the inverse relation type (P-1), disease is the subject class (CS) and medication is the object class (CO).

Fig. 7.
Fig. 7. Scheme of a given Wikidata property [Source: https://w.wiki/anw, License: CC BY 4.0]: S and O are respectively the subject and the object of the statement, P is the predicate of the statement, P-1 is the inverse property of P, CS is the class of the subject, and CO is the class of the object.

Fig. 8.
Fig. 8. Relations returned by Task T4 for the studied Wikidata properties [Available at: https://w.wiki/ao2, License: CC BY 4.0]. Extracted relations verified by expert validation as deficient are represented in red. Note: logarithmic x-axis.
and cannot consequently be handled only by computer scientists. Ontological reasons include the substitution of the accurate subject or object of a given relation with a semantically related item (e.g. the object of (Renvela [Q29006419], therapeutic area [P4044], hemodialysis [Q391744]) should be renal dialysis [Q202301]), the subject-object inversion (e.g. the subject and object in (botulism [Q154845], medical condition treated [P2175], heptavalent botulism antitoxin [Q17148719]) have to be permuted), and the use of a wrong property (e.g. the Wikidata property in (MK-608 [Q23309937], significant drug interaction [P769], Zika virus RNA-dependent RNA polymerase NS5 [Q22954521]) should be target of action).
Tasks for validating COVID-19 epidemiological statements (columns: Description; Sample filtered deficient statement):
Validating qualifiers of COVID-19 epidemiological statements
V1. Verify Z as a date > November 01, 2019. Sample: COVID-19 pandemic in X <number of cases> 5 <point in time> March 25, 20.
V2. Verify Q as any subclass of (P279*) medical diagnosis (Q177719). Sample: COVID-19 pandemic in X <number of cases> 5 <point in time> March 25, 2020 <determination method> COVID-19 Dashboard.
Ensuring the cumulative pattern of c, d, r, and t
V3. Identify c, d, r and t statements having a value at date Z+1 that is not superior or equal to the one at date Z (verify that cZ ≤ cZ+1, dZ ≤ dZ+1, rZ ≤ rZ+1, and tZ ≤ tZ+1). Sample: (COVID-19 pandemic in X <number of cases> 5 <point in time> March 25, 2020) AND (COVID-19 pandemic in X <number of cases> 6 <point in time> March 24, 2020).
V4. Find missing values of c, d, r and t at date Z+1 where the corresponding values at dates Z and Z+2 are equal. Sample: (COVID-19 pandemic in X <number of cases> 5 <point in time> March 24, 2020) AND (COVID-19 pandemic in X <number of cases> 6 <point in time> March 26, 2020) AND (COVID-19 pandemic in X <number of cases> no value <point in time> March 25, 2020).
Validating values of epidemiological data for a given date
V5. Identify c, d, r, h, and t statements with negative values. Sample: COVID-19 pandemic in X <number of cases> -5 <point in time> March 25, 2020.
V6. Identify h statements having a value superior to the number of cases for a date Z. Sample: (COVID-19 pandemic in X <number of hospitalized cases> 15 <point in time> March 25, 2020) AND (COVID-19 pandemic in X <number of cases> 5 <point in time> March 25, 2020).
V7. Identify c statements having a value superior or equal to the number of clinical tests for a date Z. Sample: (COVID-19 pandemic in X <number of clinical tests> 4 <point in time> March 25, 2020) AND (COVID-19 pandemic in X <number of cases> 5 <point in time> March 25, 2020).
V8. Identify c statements having a value inferior to the number of deaths for a date Z. Sample: (COVID-19 pandemic in X <number of deaths> 10 <point in time> March 25, 2020) AND (COVID-19 pandemic in X <number of cases> 5 <point in time> March 25, 2020).
V9. Identify c statements having a value inferior to the number of recoveries for a date Z. Sample: (COVID-19 pandemic in X <number of recoveries> 10 <point in time> March 25, 2020) AND (COVID-19 pandemic in X <number of cases> 5 <point in time> March 25, 2020).
V10. Compare the epidemiological variables of a general outbreak with those of its components. Sample: (COVID-19 pandemic in X <number of cases> 10 <point in time> March 25, 2020) AND (COVID-19 pandemic in Y <number of cases> 5 <point in time> March 25, 2020) WHERE X is a district of Y.
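A check in the style of V8 can be expressed directly against the Wikidata SPARQL endpoint; the sketch below is illustrative rather than the paper's published query, and assumes the identifiers number of cases (P1603), number of deaths (P1120) and the point in time qualifier (P585).

```sparql
# Sketch of a V8-style check: outbreak items whose number of cases is lower than
# their number of deaths for the same date. Assumed identifiers: number of cases
# (P1603), number of deaths (P1120), point in time qualifier (P585).
# Prefixes (p:, ps:, pq:) are predefined on the Wikidata Query Service.
SELECT ?outbreak ?date ?cases ?deaths WHERE {
  ?outbreak p:P1603 ?caseStmt .
  ?caseStmt ps:P1603 ?cases ;
            pq:P585  ?date .
  ?outbreak p:P1120 ?deathStmt .
  ?deathStmt ps:P1120 ?deaths ;
             pq:P585  ?date .     # same point in time for both statements
  FILTER(?cases < ?deaths)        # deficient: fewer cases than deaths
}
LIMIT 100
```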
Since m = d / c for a date Z, m values that are not close to the corresponding quotients of deaths by disease cases should be identified as deficient, and m values should be stated for a given date Z whenever mortality and morbidity counts exist. Thus, Task M2 is created to extract m values where the absolute value of (m - d/c) is superior to 0.001, and Task M3 is developed to identify (item, date) pairs where m statements are missing while c and d statements are available in Wikidata. Absolute values for Task M2 are obtained using SPARQL's ABS function, and deficient (item, date) pairs are eliminated in Task M3 where m > 1 and c < d.
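A sketch of a Task-M2-style tolerance check is given below; it assumes that the rate m is stored under case fatality rate (P3457) together with the qualifier and count properties used above, all of which are assumptions for illustration and should be checked against the actual Wikidata model.

```sparql
# Sketch of a Task-M2-style check with SPARQL's ABS function: flag dates where the
# stated rate m deviates from deaths/cases by more than 0.001. Assumed identifiers:
# case fatality rate (P3457, assumed to hold m), number of cases (P1603),
# number of deaths (P1120), point in time qualifier (P585).
SELECT ?outbreak ?date ?m ?cases ?deaths WHERE {
  ?outbreak p:P3457 ?mStmt .
  ?mStmt ps:P3457 ?m ;
         pq:P585  ?date .
  ?outbreak p:P1603 ?caseStmt .
  ?caseStmt ps:P1603 ?cases ;
            pq:P585  ?date .
  ?outbreak p:P1120 ?deathStmt .
  ?deathStmt ps:P1120 ?deaths ;
             pq:P585  ?date .
  FILTER(?cases > 0)
  FILTER(ABS(?m - (?deaths / ?cases)) > 0.001)   # deviation beyond the 0.001 tolerance
}
LIMIT 100
```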

Table 1
Constraint types for the usage of Wikidata properties (classes of constraints on the value of a statement with a given property; each constraint uses specific items, e.g. "value type constraint", "value requires statement constraint", "format constraint", etc.)

Table 3
Tasks for quality assessment of the usage of Wikidata relation types using the Wikidata SPARQL endpoint

Table 4
Wikidata properties assessed in this study

Table 5
Common use cases of the studied Wikidata properties

Table 6
Inverse properties corresponding to each common use case of the studied Wikidata relation types

Table 7
Number of missing inverse statements of Wikidata relations supported by references and corresponding to the most used scheme of each Wikidata property

Note to Table 7: Only relations corresponding to the most common use case of the related Wikidata property and supported by references are considered.

Table 8
Number of statements not corresponding to the most common use case of each Wikidata property: Statements where the subject class is not the most used one (G1), statements where the object class is not the most used one (G2)

Table 9
Wikidata properties used to define references for the studied Wikidata relation types: stated in (P248), retrieved (P813), language of work or name (P407), National Drug File Reference Terminology ID (P2115), reference URL (P854), and European Medicines Agency product number (P3637)

Table 11
Number of deficient statements for every type of epidemiological Wikidata property identified by each task (as of August 8, 2020)