Open Government Data: Usage trends and metadata quality

Open Government Data (OGD) have the potential to support social and economic progress. However, this potential can be frustrated if these data remain unused. Although the literature suggests that OGD data sets’ metadata quality is one of the main factors affecting their use, to the best of our knowledge, no quantitative study has provided evidence of this relationship. Considering about 400,000 data sets of 28 national, municipal and international OGD portals, we have programmatically analysed their usage, their metadata quality and the relationship between the two. Our analysis has highlighted three main findings. First, regardless of their size, the software platform adopted, and their administrative and territorial coverage, most OGD data sets are underutilised. Second, OGD portals pay varying attention to the quality of their data sets’ metadata. Third, we did not find clear evidence that data sets’ usage is positively correlated with better metadata publishing practices. Finally, we have considered other factors, such as data sets’ category and some demographic characteristics of the OGD portals, and analysed their relationship with data sets’ usage, obtaining partially affirmative answers.


Introduction
Open Government Data 1 (OGD), data which traditionally originate from governments [1], usually refer to public records (e.g. on transportation, infrastructure, education, health and environment) that can be re-used and redistributed by anyone, either for free or at a marginal cost. Access to and free use of government data are believed to support unprecedented social and economic developments [2,3]. A study commissioned by the European Commission and carried out by Deloitte foresees the total direct economic value of public sector information increasing from 52 billion in 2018 to 215 billion by 2028 [4] (p. 402). OGD have found application in many sectors, such as environmental protection, security, mobility and agriculture, and several success stories report promising results, thus promoting the spread of best practices that can be adopted in other, similar contexts [5][6][7].
However, many administrations still rely on the number of published data sets to measure the success of their Open Data programs [8], although these numbers say nothing about data sets' quality or their actual reuse [8,9]. Moreover, although the exponential growth of OGD offers consumers an enormous amount of data, it also forces them to question the value of these often unknown sources in meeting their information needs, curbing their use [9]. This fact leaves data providers with the uncomfortable feeling that a large part of their data remains unused [10,11]. This concern transpires in the reports of some US Chief Data Officers presented in Stone [12], with some civic leaders claiming that 'We counted the clicks and we saw that these portals just weren't being used' (n.p.). 2 Although OGD are considered a driving force for transparency [13,14] and for stimulating economic growth [9], they have limited value if not utilised [15]. However, knowledge about Open Data use is scarce, with only a few studies addressing citizen usage of OGD [16,17]. These observations were anecdotally reinforced in the course of our past activities [18], where, having accessed various Open Data portals, we often noticed that many of their data sets are little used. For these reasons, in Quarati and De

OGD portals
By adopting dedicated software platforms, OGD portals make public data available under the release policies in force in their administrations. Through these platforms, portal managers publish their data sets, assigning specific metadata with which to organise them into categories. Based on these metadata, users can access the data of their interest through more or less advanced search and browsing functionalities. These platforms usually provide APIs with which it is possible to programmatically query the portals to download both metadata and data sets [31]. Among the platforms adopted in OGD portals, the open-source CKAN and the commercial Socrata are the most widely used [31,32].
A key feature distinguishing CKAN portals from Socrata ones is that the latter allow users to visualise the data sets of their catalogue directly in the browser, in tabular form. These tables can further be downloaded in different formats. By contrast, in CKAN-based portals, the actual content is not (generally) immediately displayed by the browser.
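As a concrete illustration of this programmatic access, the sketch below queries a CKAN portal's standard action API to list its data set identifiers. It is a minimal example, assuming a reachable CKAN endpoint; the helper names are ours, not part of CKAN.

```python
import json
import urllib.request

CKAN_ACTION = "/api/3/action/package_list"

def parse_package_list(payload: dict) -> list:
    """Extract data set identifiers from a CKAN package_list response;
    CKAN wraps every answer in {"success": bool, "result": ...}."""
    return payload["result"] if payload.get("success") else []

def ckan_dataset_names(portal_url: str) -> list:
    """Query a CKAN portal for the identifiers of all its data sets."""
    with urllib.request.urlopen(portal_url.rstrip("/") + CKAN_ACTION) as resp:
        return parse_package_list(json.load(resp))
```

The same two-step pattern (fetch, then parse the JSON envelope) applies to the richer package_show and package_search actions used to harvest full metadata.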

OGD quality and use
From database systems to multimedia information delivered by Web technologies, poor data quality affects information retrieval, knowledge discovery and data reuse [33]. Data quality is acknowledged as a 'multifaceted concept' [34] (p. 6) involving different dimensions (e.g. correctness, completeness, relevancy, availability and consistency). Several methodological frameworks and technological proposals have been defined to evaluate the various quality dimensions of information resources, appropriately measured by quality metrics [35][36][37][38][39][40]. More recently, some of these works have focused on assessing and monitoring Open Data portals' performance, examining the quality of the data as well as of the associated metadata. Almost all these works clearly highlight deficits in metadata quality [26][27][28], pointing out that most of the evaluated data sets lacked metadata [15,41] or presented a naive version of them [42]. Looking at specific factors affecting OGD data and metadata quality, Vetró et al. [41] found that centralised data disclosure (i.e. with standardised data structures) yields better quality data and metadata than decentralised disclosure (i.e. with no common data structure). Máchová and Lnenicka [15] observed that national portals powered by CKAN achieved better quality scores than the others; by contrast, according to Zhu and Freeman [43], US cities' Socrata-based portals show better performance based on their overall portal score. The population of the city is also deemed to affect OGD portals' qualitative performance [43,44].
Based on their findings, some authors make suggestions to improve OGD portals' effectiveness: recommending the creation of automatic evaluation platforms with which to constantly monitor OGD metadata quality [26,28,32]; suggesting that the results of quality assessments be used to improve the performance of OGD portals, making them more usable and thereby increasing their adoption [15]; and encouraging portal managers to make better use of their platforms for greater user participation [43]. Finally, Zhu and Freeman [43] conclude that 'more research is needed to understand who actually uses the portals and the data, and for what purposes' (p. 36).
None of these works provides empirical evidence of a relationship between (metadata) quality and OGD data sets' usage. Nor is this relationship considered in papers that have focused on other barriers to the release of OGD [21,22]. At present, only a limited number of studies have examined data on user experience. In their survey, Safarov et al. [11] highlighted that most of the examined studies focus on aspects related to the release of OGD, taking for granted, but not empirically testing, its use in its various forms. They call for empirical research to evaluate 'if the estimated effects of OGD are actually measurable' (p. 16). Wirtz et al. [17] found that citizens' intention to use OGD is primarily driven by its usefulness, and that collaboration is a usage-motivating factor, as is the ease of use of OGD-enabling platforms (i.e. the effort required to consume OGD). Lassinantti et al. [16] carried out a document analysis to clarify what motives urge people to use OGD and what the important user types are. They identified five main relevant social groups according to the motives that drive people towards Open Data use: (1) exploring for creativity; (2) creating business value; (3) enabling local citizen value; (4) addressing global societal challenges; (5) advocating the Open Data agenda. Ruijer et al. [45] posit that 'in order to better understand the complexity of OGD usage' (p. 6), it is necessary to examine the interactions between governments and citizens on OGD platforms and what 'people actually do with Open Data' (p. 2) in their daily lives. They observed a discrepancy between user needs and what is offered by the existing data sets, and pointed out that 'A more participatory approach to the development of OGD platforms is needed to prevent a situation where there are great technological opportunities but no usage' (p. 14).
These works analyse the use of OGD either theoretically or through survey-based empirical research, reaching qualitative conclusions on the types of users, their motivations and the relationships between them. However, besides our mentioned paper [19], to the best of our knowledge, the only other work providing a quantitative analysis of the use of OGD data sets dates back to 2014 and focused on 20 US Socrata-based city portals [46]. The reported results, on the trend of the number of views, agree with those we found in Quarati and De Martino [19].

Material and methods
The main objective of this work is to analyse the use of open government data sets, the quality of their metadata and the relationship between the two. Furthermore, in light of the current literature, we examine other factors that can influence the use of OGD data sets. Accordingly, the research questions guiding this study are:
RQ1. What are the forms of OGD data sets' usage?
RQ2. What are the OGD data sets' metadata quality and compliance with the FAIR principles, as assessed by two existing tools?
RQ3. Do the OGD data sets' metadata quality and compliance with the FAIR principles affect data sets' usage?
RQ4. Does the categorisation of OGD data sets influence their use?
RQ5. Can some demographic features of the OGD portals drive users' attention towards their data sets?
To answer these questions, our experimental investigation first identified the metrics to measure data sets usage and the criteria for selecting the 28 OGD portals. The collection and processing phase was carried out programmatically. For each selected portal, we have extracted its data sets' usage information, and we have assessed its data sets' metadata quality and their compliance with the FAIR principles.

Usage metrics
According to Dawes et al. [47], 'Data use includes activities to search, identify, and download data for a variety of purposes including analysis and application development' (p. 18). However, monitoring the use of data sets, as well as which applications use them, is still an open challenge [48]. Therefore, to gain insight into users' data demand, we analysed two parameters associated with data sets in some portals: the total number of views and the total number of downloads [46,[49][50][51]. By Views we mean the total number of times the Web page of a data set has been loaded in users' browsers, and by Downloads the total number of users' requests for retrieving the full content of a particular data set [52] (Table 1). These values, which can be found on the data set access page and are returned by the portal API, supply an indicator of use for the activities of direct users, that is, those users who consult a data set directly [11].
Views and Downloads can certainly help measure the impact of OGD initiatives [53], but they are not exhaustive. In fact, several government portals encourage other forms of OGD use [54] through the provision of data analytics tools [55], the deployment of citizen services [56] or the publication of reports based on OGD. A more focused usage indicator should, for example, evaluate the number of these indirect users, together with the number of these applications [11]. However, while it can be difficult to collect the number of users of an application and assign it directly to a data set [48,54], the applications reusing a given data set could be catalogued, as done by the French 5 and Portuguese 6 portals. This parameter could help improve the perception of data sets' usefulness, also providing an indicator for the evaluation of the indirect use of the portal. To facilitate this recognition, third-party applications that use a data set should always be encouraged to cite it among their sources [57]. In addition to facilitating the discovery of data and supporting the monitoring of its impact and reuse, data citation 7 helps users to know the provenance of the original data, thus making the products of these applications more reliable [39].

Portals' selection and data collection
The experimental investigation considered three types of OGD portals based on their administrative coverage: national, municipal and international. Among these portals, we have selected those that publish usage information, in particular the number of data sets' views and downloads, and which have APIs for programmatic retrieval of this information.
As for the national portals, we initially considered the 94 countries ranked, in 2017, by the Global Open Data Index (GODI) 8 based on their Open Data publication practices. To these, we added Korea, Spain, Ireland and Estonia, not included by GODI but ranked by the OURdata Index 9 compiled by the Organisation for Economic Co-operation and Development (OECD). Among these 98 portals, we first selected the 15 portals that display at least one usage metric for each data set. From these, we further picked the eight portals providing APIs to collect the usage values programmatically (Table 2). Following a similar approach, we selected 16 US municipal portals. Our choice to focus on US city portals is due to the interest shown towards these portals in the literature, besides the fact that many of these city governments, under the impetus of the Obama administration, were among the first to adhere to Open Data practices [58]. In fact, in recent years, following the considerable efforts made by US cities to publish Open Data and encourage citizens to use them [43,46], several scholars have investigated their characteristics [8,59,60] and analysed their performance [43,44,46]. For example, Zhu and Freeman [43], in a study on 34 US cities, found that city population is positively related both to the number of a portal's data sets and to an overall score based on a set of portal-level metrics they defined and manually tested. A strong correlation between the number of data sets available for a city and its population was also verified by Barbosa et al. [46] in the case of 20 US municipalities. Similar results have been reported by Thorsby et al.
[44] in a study on a set of 37 US cities' portals, which found that population size was related both to the number of a portal's data sets and to an 'Open Government Data Portal Index', based on some features of Open Data portals, designed and manually evaluated by the authors. Based on the results that emerged from these studies, we deemed it appropriate to contribute with an analysis of these portals' usage, also examining its relationship with their metadata quality. Table 2 shows that these portals are mainly based on Socrata, which is, in fact, the software platform most used by US cities [32].
The sample of four international portals considered, albeit small, can be deemed of significant public utility and interest, thanks to the heterogeneity of the thematic coverage offered by these portals' data sets in the financial, aerospace, legislative and humanitarian sectors. The Humanitarian Data Exchange (HDX) portal, promoted by the United Nations Office for the Coordination of Humanitarian Affairs, is aimed at sharing data on the areas of crisis in the world, thus supporting humanitarian aid organisations in making 'quick, life-saving and informed decisions'. 10 The European Union Open Data Portal (EUODP), primarily aimed at EU citizens and organisations, gives access to more than 14,000 data sets published by EU institutions, agencies and bodies. The DATA.NASA.GOV catalogue collects and makes available to the public about 10,000 NASA data sets, aggregating data harvested from different archives (e.g. the Planetary Data System and the National Oceanographic and Atmospheric Agency). Although available to everyone, this portal primarily targets space and environment scientists. The Open Finances portal makes available public financial data and portfolio information from across all World Bank Group entities in one place. The data were collected in the period 18 December 2019 to 3 January 2020. These data provide a snapshot of the overall use of the data sets of the 28 portals, in terms of Views and Downloads, up to that moment. We have published the data from these portals and the analyses carried out in this study as Open Data in the Zenodo repository [61].

Gathering usage information
By means of Python code, we retrieved the portals' usage data by appropriately exploiting the metadata discovery APIs provided by the various portals' platforms. Each data set's metadata content was then extracted and stored in an internal database for subsequent analysis. The various discovery and retrieval APIs act as follows.
The CKAN API provides information on the number of views in a tracking_summary metadata field, containing two values: total and recent (i.e. Views in the last 14 days). In accordance with the definition of the Views metric in Table 1, we took the total value for our analysis. It must be said that the presence of usage information in the GET response is not guaranteed by default: it has to be explicitly requested in the GET call and has to be enabled server side. 11 Besides, not all CKAN-based portals return usage metadata in the tracking_summary field. Instead, as in the case of the Irish, EUODP and HDX portals, usage information is contained in other fields or in slightly different metadata structures. Furthermore, there are cases, such as the HDX portal, based on a CKAN extension, in which the metadata retrieval API returns, in addition to the number of views, the total downloads as well.
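The retrieval just described can be sketched as follows; the helper names are ours, and the include_tracking request parameter is only honoured on portals where CKAN's page-view tracking has been enabled server side (an assumption to check per portal).

```python
import json
import urllib.request

def extract_total_views(dataset_meta: dict):
    """Return tracking_summary['total'] from a package_show result,
    or None when the portal does not expose tracking information."""
    tracking = dataset_meta.get("tracking_summary")
    return tracking.get("total") if tracking else None

def total_views(portal_url: str, dataset_id: str):
    """Fetch a data set's metadata via package_show, explicitly asking
    for tracking information, and return its total view count."""
    url = (portal_url.rstrip("/")
           + "/api/3/action/package_show?id=" + dataset_id
           + "&include_tracking=True")
    with urllib.request.urlopen(url) as resp:
        return extract_total_views(json.load(resp)["result"])
```

Portals that store views elsewhere (e.g. HDX) would need a portal-specific variant of extract_total_views.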
The RESTful Socrata Open Data API (SODA 12 ) allows retrieving data sets' metadata, but returns a different and somewhat smaller set of metadata fields compared with those retrieved by CKAN. For instance, the downloadable formats of a data set's content are not reported. Unlike CKAN, SODA returns usage information without it having to be explicitly requested. Furthermore, this information includes, in addition to the total number of views (page_views), the total number of downloads (download_count) at the current date.
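A corresponding parsing step for Socrata catalog entries might look as follows; the exact nesting of the page_views and download_count fields is our assumption, mirroring the naming in the text, and should be verified against each portal's responses.

```python
def extract_usage(asset: dict):
    """Pull (views, downloads) out of one Socrata catalog entry.
    The field paths mirror the page_views / download_count naming used
    by SODA, but their exact location is an assumption to verify."""
    resource = asset.get("resource", {})
    views = resource.get("page_views", {}).get("page_views_total", 0)
    downloads = resource.get("download_count", 0)
    return views, downloads
```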
Finally, some portals, like the French and Polish ones, use other software platforms, which provide APIs different from those of CKAN and Socrata, both in the call format and in the type of information returned. In particular, the French portal returns the total number of views at the current date in the API JSON response (in a sub-field named views), but not in the metadata visible to users. Besides, the API returns two other usage indicators, namely Reuse number and Number of followers.

Evaluating metadata quality
According to Batini et al. [34], data quality is 'a multifaceted concept' (p. 6) involving several dimensions, where a quality dimension can be seen as a set of 'quality attributes that represents a single aspect or construct of data quality', as observed by Wang et al. [35] (p. 6). A quality metric serves to measure a specific aspect of a given dimension. Quality dimensions and metrics are central to evaluating whether a piece of data meets users' information needs in a specific situation [62]. However, while methodological frameworks [35,63] and surveys [34,38] provide the theoretical background and general guidelines to deploy data quality assessments in several fields [41,62,64], as observed by Reiche et al. [28], the availability of data quality platforms to monitor 'the quality of different public government data repositories' (p. 241) is crucial. To develop our quantitative analysis of OGD portals, driven by our five research questions, we sought already-tested solutions that could be quickly integrated with our portal-usage retrieval code. Therefore, for the metadata quality assessment, we relied on the 'Open Data Portal Watch' platform code 13 presented by Neumaier et al. [26], which implements a 'concrete computation and automated assessment of quality metrics based on the DCAT metadata schema' (p. 23). At the core of the platform implementation there is a mapping between the heterogeneous data sets' metadata, retrieved via APIs from the OGD portals, and the W3C DCAT 14 vocabulary. The platform implements 17 quality metrics (see Table 8 in Appendix 1 for the complete list) to assess the compliance of ingested metadata with DCAT requirements. These metrics concern three quality dimensions: (1) Existence, that is, do specific metadata fields exist?; (2) Conformance, that is, do metadata values adhere to a certain format?; (3) Data Open, that is, do the specified format and licence information classify a data set as open?
The eight Existence metrics evaluate whether metadata supply useful information to discover (i.e. is there a data set description, a title and some keywords?) and access (i.e. are there URIs to access and download?) the associated data set, and to contact the owner or the publisher. The presence of licence information is also evaluated, as well as the dates of creation and modification of the metadata and the data set. The Preservation metric assesses the existence of metadata information regarding the format, size and update frequency of the data sets. The Spatial and Temporal metrics 15 ascertain whether some spatial (e.g. polygon and shape) or temporal (e.g. start or end of the period of time covered by the data set) information exists. The six Conformance metrics assess the syntactical validity of the access URI, the contact email address and URI, and the date format; licence conformance is checked by analysing a list of licence descriptions provided by the Open Definition; 16 the validity of the file format is checked against a list of registered formats and media types supplied by IANA. 17 The three Data Open metrics ascertain the data sets' compliance with the Open (Knowledge) Definition, 18 assessing whether the data sets are supplied in a machine-readable and open format, under an open licence.
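To make the flavour of such checks concrete, here are two toy Conformance-style validators; the rules are our own simplifications, not the ones implemented by 'Open Data Portal Watch': a syntactic test on a contact email address and an ISO 8601 test on a date field.

```python
import re
from datetime import datetime

# A deliberately simple syntactic rule for email addresses (assumption:
# the real Conformance metrics may apply stricter or different checks).
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def email_conformant(value) -> bool:
    """Boolean, Conformance-style check on a contact email address."""
    return bool(EMAIL_RE.match(value or ""))

def date_conformant(value) -> bool:
    """Check that a date string parses as ISO 8601 (e.g. '2019-12-18')."""
    try:
        datetime.fromisoformat(value)
        return True
    except (TypeError, ValueError):
        return False
```

Each validator returns a Boolean, matching the per-metric qv values described below.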
The quality assessment has been carried out on each portal's data sets, resulting in a single qv value, Boolean or floating point (in the [0, 1] range), for each metadata quality metric. For each data set, after converting Boolean values into 0 and 1, we aggregated the 17 metrics according to the Simple Additive Weighting (SAW) decision-making method, assigning equal weight (w_j = 1/17) to every metric, thus obtaining an overall metadata quality value omq = Σ_{j=1..17} qv_j · w_j, with omq ∈ [0, 1]. For each data set, we stored its computed omq value in an internal database along with the metadata and usage information. The 'Open Data Portal Watch' platform code was integrated with our usage-extraction code and extended to produce analytics and reports.
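The SAW aggregation can be sketched in a few lines; the function name is ours, and the input is the list of 17 per-metric qv values.

```python
def overall_metadata_quality(metric_values):
    """Aggregate per-metric quality values qv_j into a single omq score
    via Simple Additive Weighting with equal weights w_j = 1/len(values).
    Booleans are mapped to 0/1 first; the result lies in [0, 1]."""
    numeric = [float(v) for v in metric_values]   # True -> 1.0, False -> 0.0
    weight = 1.0 / len(numeric)
    return sum(v * weight for v in numeric)
```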
We point out that, to give as analytical a picture as possible of the quality of the portals and their use, the quality assessment we carried out is intrinsically objective (aka structural), measurable through impartial physical characteristics (e.g. item counts and ratios) of the OGD portals [65]. It ignores subjective (aka contextual) aspects, capable of taking into account users' needs and purposes and informing their usage choices [39,65], which are out of the scope of an experimental investigation such as ours, which programmatically evaluates a large number of data sets, belonging to different public administrations and organisations, based on their metadata.
As a final remark, we point out that the 'Open Data Portal Watch' platform was designed and implemented based on the previous version of W3C DCAT. For this reason, it does not evaluate some new properties introduced in the 2020 revision that extends the original DCAT vocabulary. In particular, following DWBP practices Five and Six, which recommend publishing data sets by providing provenance and quality metadata, the new version of DCAT supplies 'more details for the ways of representing data set provenance and quality', suggesting the use of the W3C Provenance Ontology (PROV-O) 19 in the first case, and of the W3C Data Quality Vocabulary (DQV) 20 in the second. In our previous work [39], we exemplified the adoption of these two technologies by providing W3C-compliant metadata to document the quality assessment of a set of e-Government controlled vocabularies.

Assessing compliance with FAIR
Based on the FAIR principles [29], Wilkinson et al. have designed a framework [30] and developed a 'FAIR Evaluator' 21 tool implementing a series of metrics to test the compliance of a Web resource with those principles. We relied on this tool to check how (a sample of) the selected OGD portals support 'machines to automatically find and use' [29] (p. 1) the published data sets, 'in addition to supporting its reuse by individuals' [29] (p. 1). The framework implements '22 Maturity Indicators Tests', grouped by the four principles into eight (Findable), five (Accessible), seven (Interoperable) and two (Reusable) tests (see Table 9 in Appendix 1). The tool allows users to select all 22 tests, or one of the four subgroups, for assessing the FAIRness of a given (Web) resource by supplying its Globally Unique Identifier (GUID). At the end of the evaluation process, an assessment report summarises the successes and failures of the resource for the selected metrics. Users may interact with the evaluator either directly via the Web interface or via APIs.
Following the latter option, we submitted a sample of the selected OGD data sets, gathered the evaluation results' metadata and extracted from them the values of interest, finally integrating them with the other data set metadata stored in our internal database. The integration and normalisation step was carried out following the same approach presented in the previous section.
The decision to carry out the compliance assessment on a sample rather than on all 400,000 data sets is due to the long waiting times required to assess a single data set. In fact, after performing several checks on data sets belonging to different portals, we realised that the waiting time for each evaluation carried out (on the 22 available tests) by the 'FAIR Evaluator' tool was never less than 5 min, with peaks of 30 min or more. The reason for this delay is probably that, for every single data set GUID, a series of calls to remote sites 22 is necessary to check whether a given element of the data set's metadata conforms to the underlying principle. For this reason, we restricted our analysis to a subset of our initial sample, consisting of the US cities' portals. However, as these portals amount to about 15,000 data sets and the expected completion time would have been in the order of weeks, we gave up assessing all of their data sets. Instead, for each portal we examined a sample of 80 data sets taken in a (pseudo-)random way based on the number of Views: we randomly selected 40 data sets among those with a number of Views below the first quartile and 40 among those above the third quartile. In this way, we have tried to bring out, as far as possible, the potential relationship between the usage of a data set and its compliance with the FAIR principles.

RQ1: what are the forms of OGD data sets' usage?

Views.

Figure 1 shows the distribution of the number of Views for all the portals: national, US cities and international. On the X-axis, the number of Views is grouped into five non-linear classes (i.e. 0-10, 10-100, 100-1000, 1000-10,000 and > 10,000). On the Y-axis, the percentage of data sets in each class is reported.
We can see that Views are not distributed normally, and that all portals show that just a very low percentage of their data sets is viewed more than a thousand times. By contrast, several of them show very low (10-100) to extremely low ( < 10) numbers of Views for the great majority of their data sets. Figure 1 shows that this is particularly true for six portals, namely those of the United States, France, Ireland and Slovenia, as well as the HDX and NASA ones, for which more than 90% of the data sets have been viewed no more than 100 times. These results are further clarified by the Views' descriptive statistics (i.e. mean, standard deviation, first, second and third quartile) listed in Table 3.
By examining the first two quartiles, we notice that almost 50% of those six portals' data sets are barely viewed by users (with the highest median, equal to 12, for the US portal), and another 25% only slightly more visited (the highest third quartile is 26, for HDX). This fact is particularly unexpected considering that these portals belong to countries with non-negligible numbers of inhabitants (particularly the United States and France), with a full-blown tradition of attention to Open Data (i.e. the United States), or of noticeable interest for the whole scientific community (i.e. NASA). It is worth noting that four out of six of these portals are based on the CKAN platform, one on the French platform and just one on Socrata (i.e. NASA). The Views' numbers for the other 22 portals are more promising, showing higher values, even by two or three orders of magnitude, for all quartiles. Looking at the median values, the World Bank Finances portal and some US cities' portals, like Chicago, Nashville, Boston, San Francisco and New Orleans, mostly based on Socrata, achieve quite high numbers.
The values in Table 3 suggest quite clearly that Socrata-based portals are viewed more than others. To test the general hypothesis that differences in the usage of data sets are due to the underlying portal platform, we carried out the Kruskal-Wallis non-parametric test, analysing whether the independent variable 'platform' (taking the values 'CKAN', 'Socrata' and 'Others') affects the dependent variable 'number of views'. The values reported by the test execution (chi-squared = 74.437, df = 2, p-value < 2.2e-16) confirmed the hypothesis. Besides, the post hoc pairwise test confirmed that there are significant usage differences between all three platforms.
We also checked the hypothesis that differences in the usage of data sets are due to the administrative coverage of their portals. Indeed, nations, municipalities and international organisations may have different policies or follow diverse administrative procedures to publish Open Data. Through the Kruskal-Wallis test, we analysed whether the independent variable 'administrative coverage' (taking the values 'national', 'municipal' and 'international') affects the dependent variable 'number of views'. The values reported by the test (chi-squared = 32.762, df = 2, p-value < 2.2e-16) confirmed the hypothesis. The post hoc pairwise test confirmed that there are significant usage differences between all three types of administration.
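For three groups, the Kruskal-Wallis H statistic and its p-value can be sketched in pure Python. The view counts below are made-up, not the study's; for simplicity the sketch omits the tie correction (so it assumes all values are distinct) and exploits the fact that for df = 2 the chi-squared survival function is exactly exp(-H/2).

```python
from math import exp

def kruskal_wallis_3groups(groups):
    """Kruskal-Wallis H-test over k groups of observations, without tie
    correction (assumes all values are distinct). With k = 3 groups the
    statistic has df = 2, for which the chi-squared survival function is
    exactly exp(-H/2), giving the p-value in closed form."""
    pooled = sorted(v for g in groups for v in g)
    rank = {v: i + 1 for i, v in enumerate(pooled)}       # ranks 1..N
    n = len(pooled)
    h = 12.0 / (n * (n + 1)) * sum(
        sum(rank[v] for v in g) ** 2 / len(g) for g in groups
    ) - 3 * (n + 1)
    return h, exp(-h / 2)

# Made-up view counts per platform group ('CKAN', 'Socrata', 'Others'):
h, p = kruskal_wallis_3groups([
    [3, 8, 12, 5, 40, 2],
    [150, 900, 420, 1300, 75],
    [20, 60, 11, 95],
])
```

In practice a statistics library would be used instead, which also handles ties and arbitrary numbers of groups.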

Downloads.
Considering that the number of downloads is not present for all 28 portals, we show in Figure 2 the total amount of Views and Downloads only for the 14 US cities' portals based on Socrata that also include the download numbers in their metadata. We do not show the Downloads' distribution graphs, as they follow a distribution similar to that of the Views. For the same reason, and for the sake of brevity, we do not include detailed statistics of Downloads, although we report that the Downloads' statistics are in line with those of the Views. 23 The Downloads' values for the three quartiles are always lower than the respective Views'. Nevertheless, for all portals there is a percentage of data sets with a higher number of Downloads than Views (this can explain, for instance, the two histograms for Chicago in Figure 2). According to the Socrata support desk, this situation is not inconsistent: automatic tools such as bots, or APIs, can be used to download the data sets autonomously, without visiting the portal sites. We also note that, in the case of Chicago, the total value of the Downloads is heavily influenced by a single data set, 24 which has been downloaded over 14 million times but viewed only one-tenth as many times (1,457,673).

RQ2: what are the OGD data sets' metadata quality and compliance to FAIR principles, as assessed by two existing tools?

5.2.1. Metadata quality assessment.
As can be seen from Table 4, which reports the descriptive statistics of the overall metadata quality (omq), plus the mean values for the three quality dimensions, for the data sets of the 28 portals, the portals can essentially be subdivided into two main groups according to the adopted platform.
The CKAN-based portals form the higher-quality group. By inspecting the returned metadata, we notice that the main reason for the lower quality values of the Socrata-based portals is the less rich metadata returned by the SODA APIs compared with the metadata fields returned by CKAN and by the French platform. For example, information about the format of the downloadable files is not present in the Socrata metadata. For this reason, the four metrics assessing the existence (Preservation), conformity (FileFormat) and openness (OpenFormat and MachineRead) of data set formats always return a value of 0. This is ironic considering that, from a user perspective, Socrata-based portals allow downloading data sets' content in a multitude of formats, thus satisfying one of the golden Open Data rules. 25 Despite the common platform, there are differences between the Socrata-based portals. Given their common type of administration, it is interesting to examine the case of US cities, for which one would expect similar quality values, considering that the respective portals can be managed in similar ways. For some of them, such as New York, Baltimore and Chicago, the mean and median values are around 0.3, while other portals such as Seattle and Los Angeles have median values around 0.45. By examining the metadata, we realised that this difference largely depends on the presence or absence of licence information. In fact, in the case of New York, licence information is absent for 97% of the data sets, while more than 96% of Seattle's data sets have this information. Moreover, Seattle's licences are well-formed and open, thus further increasing the total quality value. However, if we compare the usage statistics for the same two portals (Table 3), we observe that New York has higher values than Seattle for each statistical indicator. This may be due to the larger population of the former.
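Existence-style checks of this kind can be sketched minimally as below, in the spirit of the 'Open Data Portal Watch' metrics. The field names are CKAN-like and illustrative, and the aggregation into a dimension score is simplified to an unweighted mean of binary checks.

```python
def existence_scores(meta: dict) -> dict:
    """Simplified existence checks: each metric returns 1.0 if the
    information is present in the metadata, else 0.0.
    Field names are CKAN-like and purely illustrative."""
    return {
        "Access":  float(bool(meta.get("url") or meta.get("resources"))),
        "Contact": float(bool(meta.get("maintainer_email") or meta.get("author_email"))),
        "Rights":  float(bool(meta.get("license_id"))),
        "Date":    float(bool(meta.get("metadata_modified"))),
    }

def dimension_mean(scores: dict) -> float:
    # Unweighted mean of the metric values for one dimension.
    return sum(scores.values()) / len(scores)

# A record lacking licence and contact info scores lower than a
# richer one, mirroring the Socrata vs CKAN gap discussed above.
rich_record = {"url": "https://example.org/d1", "maintainer_email": "x@y.org",
               "license_id": "cc-by", "metadata_modified": "2019-12-01"}
poor_record = {"url": "https://example.org/d2"}

print(dimension_mean(existence_scores(rich_record)))
print(dimension_mean(existence_scores(poor_record)))
```

The gap between the two scores comes entirely from missing metadata fields, which is exactly the mechanism penalising the Socrata-based portals in our assessment.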
In the next section, trying to answer the research question RQ5, we explore possible relationships between the use of OGD, the city population, and the number of data sets of a portal.
More generally, looking at how each dimension contributes to the overall quality score, we notice from Table 4 that the Open Data dimension presents the lowest values. With the sole exception of Ireland, Poland, Latvia and Boston, with mean values over 0.5, and, partly, of the France, HDX and Albuquerque portals, the other OGD portals seem to overlook the importance of fully adhering to the Open Data principles. In other words, 75% of the portals appear not to supply a large part of their data sets in open and machine-readable formats and under open licences. Or, if they do, they do not declare such information in the data sets' metadata, as discussed for the format information of the Socrata-based portals. Altogether, 11 portals have an average Existence value equal to or greater than 0.5; as to Conformance, this average value is reached (or exceeded) by 8 out of 28 portals. Within the Existence dimension, two metrics, Access and Contact, outperformed all others with values over 0.9 for all portals, except Poland, whose metadata contains neither any URL to download the associated resource files nor any contact information. More than two-thirds of the portals have at least half of their data sets providing discovery information and some kind of licence. Data set creation or modification times are present in at least 50% of the metadata of all portals. Except for four CKAN-based portals (i.e. the United States, Ireland, Latvia and Boston), the always-zero Spatial and Temporal metrics contribute heavily to lowering the means of the Existence dimension. In addition to the actual lack of spatio-temporal information, these null values can depend on the difficulty of identifying such information in the metadata. As noted by Neumaier et al. [26], spatio-temporal chunks can be spread across heterogeneously distributed and portal-dependent metadata fields, which are not easily identifiable automatically.
The mean values for the Conformance dimension are, for 23 out of 28 portals, lower than those of Existence, thus suggesting for these portals a possible lack of care, or inattention, in compiling the information in the metadata fields. This is confirmed, for instance, by looking at the contact information (i.e. contact email and URL). Only ten portals have at least half of their data sets with a correct contact email address, while just France, Ireland and Latvia provide a conformant URL to contact the data sets' provider. Furthermore, only 11 portals have most of their data sets provided with a compliant licence, that is, a licence description that can be mapped onto one in the list of licence descriptions provided by the Open Definition.
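Conformance-style checks go one step beyond existence: the value must not only be present but well-formed. A minimal sketch follows; the email pattern is deliberately simplified, and `OPEN_LICENCES` is a tiny illustrative subset standing in for the Open Definition licence list.

```python
import re

# Illustrative subset of open-licence identifiers, standing in for
# the Open Definition's full licence list.
OPEN_LICENCES = {"cc-by", "cc0-1.0", "cc-by-sa", "odc-odbl"}

def conforms_email(value: str) -> bool:
    # Simplified well-formedness check for a contact address.
    return bool(re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value or ""))

def conforms_licence(licence_id: str) -> bool:
    # Conformant if the declared licence maps onto a known open
    # licence identifier (case-insensitively).
    return (licence_id or "").lower() in OPEN_LICENCES

print(conforms_email("opendata@city.gov"))
print(conforms_email("n/a"))
print(conforms_licence("CC-BY"))
```

A metadata record can thus pass the existence check (some licence string is present) while failing conformance (the string maps onto no recognised open licence), which is how the two dimensions diverge in Table 4.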
This analysis highlights that, although the platform's metadata schema affects the quality results (such as the lack of file format information in the case of Socrata-based portals), it is the responsibility of portal managers and metadata compilers to supply published data sets with the most accurate and complete metadata descriptions.

Metadata FAIRness.
The boxplots in Figure 3 show the statistics of the absolute number of successes (i.e. number of tests passed) for the FAIR compliance assessment, carried out by applying the 22 metrics (see Table 9 in Appendix 1) implemented by the 'FAIR Evaluator' tool [30] on 1120 data sets. No data set of any portal passed all 22 tests. Eight out of fourteen portals have a median value equal to 9, and the other six have higher medians (11 for one portal and 14 for the other five). The third quartile is often higher, up to 16 hits for San Francisco and New Orleans (i.e. a 72% success rate).
To get a glimpse of the differences in the values obtained by the assessment tool, even within a single portal, let us consider two data sets of the New Orleans portal. The first 26 passed 16 tests, while the second 27 passed just 9. By comparing the two assessment reports, 28 we note that, for the first data set, both the weak and the strong test checking whether the metadata contains an explicit pointer to the licence (sub-principle R1.1) succeed. These tests failed for the second data set, reporting the message 'No License property was found in the metadata'. A quick look at the data sets' pages confirms that a 'CC0 1.0 Universal' licence is supplied with the former, while no licence is available for the latter. Similarly, the test checking whether the metadata contains the unique identifier to the data (sub-principle F3) found an identifier (i.e. 'https://data.nola.gov/api/views/d2is-2r79/rows.rdf?accessType=DOWNLOAD') only for the first data set.
To get insights on the extent of the results obtained for each FAIR principle, we report in Table 5 the aggregate statistics for the 22 metrics (see Table 9 in Appendix 1) grouped by the four principles. Table 5 shows both the absolute and relative values of the number of passed tests for each principle. Only the two metrics related to the R(eusable) principle obtain very low values. The other 20 metrics, measuring compliance to the F(indable), A(ccessible) and I(nteroperable) principles, achieve higher values, with a mean success rate of 0.5, medians around 0.5 (i.e. 0.5 for F, 0.4 for A and 0.6 for I) and still higher values, between 60% and 80%, for the data sets in the third quartile. These results show a non-negligible overall degree of compliance with FAIR, even considering that 'Partly FAIR may be fair enough', as remarked by several of the original authors of the FAIR principles in Mons et al. [66] (p. 52 and Figure 1, p. 53).
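The per-principle aggregation can be sketched as below. The outcomes are toy data, not the 1120-sample results, and the split of the 22 tests across the four principles is an assumption for illustration only (the actual split follows Table 9).

```python
# Toy per-data-set FAIR outcomes: number of tests passed per
# principle (illustrative, not the study's results).
results = [
    {"F": 4, "A": 3, "I": 4, "R": 0},
    {"F": 6, "A": 2, "I": 3, "R": 1},
    {"F": 3, "A": 4, "I": 5, "R": 0},
]
# Assumed split of the 22 tests across principles (illustrative).
tests_per_principle = {"F": 7, "A": 5, "I": 8, "R": 2}

# Relative success rate per principle = passed / available tests,
# averaged over the sampled data sets.
for principle, total in tests_per_principle.items():
    rel = [r[principle] / total for r in results]
    print(principle, f"mean relative success = {sum(rel) / len(rel):.2f}")
```

Normalising by the number of tests per principle is what makes the R(eusable) scores comparable with the others despite R having only two metrics.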

Discussion
The first objective of this work was to verify the reception of OGD portals by users, based on a representative set of them (RQ1). Our findings revealed a general trend of low OGD data set usage, with differences partly due to the specificities of the platforms underlying the portals and to the portals' administrative coverage. Answering RQ2, we reported relevant differences in metadata quality between the data sets of one portal and another, possibly due to platform implementation choices and to different portal managers' publishing practices.
In the following, we first discuss the implications of those findings by analysing the relationship between data set usage and metadata quality (RQ3). Then, to explore other factors that could affect OGD data set usage, we supply answers to RQ4 (does the OGD data sets categorisation influence their use?) and RQ5 (can some OGD portals' demographic features drive users' attention towards their data sets?).
6.1. RQ3: do the OGD data sets' metadata quality and compliance to FAIR principles affect data sets usage?

6.1.1. Metadata quality versus usage.
We verified the correlation between the number of views and metadata quality through Spearman's rho (ρ) non-parametric test, instead of Pearson's test, since the Views' frequencies do not follow a normal distribution (see Figure 1). By applying Spearman to all the collected data sets, independently of their portals, we obtained ρ = −0.32 with p ≈ 0, indicating a low, even if significant, negative correlation. We then analysed the behaviour of rho on the individual portals; the results reported in the first column of Table 6 show that only in a few cases are the ρ values close to 0.4 (i.e. Poland, Providence, New Orleans and Albuquerque), a correlation generally considered medium-low [67]. However, in most cases, significant values are far lower, often near 0, and in two cases (i.e. Colombia and Puerto Rico) not significant at all. Besides, for the 26 portals with a significant rho, the sign of the correlation is alternately positive (16) or negative (10).

Table 6. Correlation values between the number of Views and overall metadata quality (omq), and between the number of Views and the three quality dimensions for the selected portals.

To get further insights, in Table 6, we also report the correlations between Views and each single quality dimension. The table shows that the ρ values and signs vary from one dimension to the others for each portal, although with a prevalence of positive signs. Besides, the contribution of each quality dimension to data set usage also seems to vary case by case. For instance, the US portal data sets' usage is positively related to the Existence and Open Data dimensions and inversely to Conformance, while the correlation signs of these dimensions are inverted for HDX.
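The per-portal correlation step can be sketched as follows. The two portals and their (views, omq) pairs are toy data chosen to show that the sign of ρ can differ between portals, as observed in Table 6.

```python
from scipy.stats import spearmanr

# Toy (views, omq) pairs for two hypothetical portals: correlations
# can differ in sign from one portal to another, as in Table 6.
portal_a = {"views": [10, 40, 90, 200, 500], "omq": [0.2, 0.3, 0.4, 0.5, 0.7]}
portal_b = {"views": [800, 300, 120, 60, 10], "omq": [0.1, 0.2, 0.35, 0.5, 0.8]}

# Spearman's rho is rank-based, so it is appropriate for the
# non-normally distributed view counts.
for name, data in {"A": portal_a, "B": portal_b}.items():
    rho, p = spearmanr(data["views"], data["omq"])
    print(f"portal {name}: rho = {rho:+.2f} (p = {p:.3f})")
```

Portal A's views rise monotonically with quality (positive ρ), while portal B's most-viewed data sets have the poorest metadata (negative ρ): the same mixed picture the table reports across the 28 portals.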
Furthermore, the signs of the three dimensions (with significant ρ) agree for just eight portals: they are all positive for Poland, Dallas, Los Angeles, New Orleans and Nashville, and all negative for Austin, Buffalo and Albany.
These contrasting values indicate that data sets' metadata quality, as measured by our approach, is not always positively correlated with their use. Rather, the negative signs suggest that, in many cases, users prefer viewing data sets with low metadata quality. This result seems to contradict the common assumption that good-quality metadata is a prerequisite for data sets' usage [26,28]. Indeed, comparing the Views' usage statistics (Table 3) and the metadata quality statistics (Table 4), it is clear that the most used portals are those based on Socrata, with the sole exception of the NASA portal. One possible reason for Socrata's success is suggested by Barbosa et al. [46], who noted how the visualisation features of this platform allow users to quickly inspect the data sets, in tabular form, using their web browsers (without necessarily having to download them). Even if the improved usability of Socrata-based portals can facilitate re-visits of data sets, or extend users' stay, it will not necessarily attract new users to a portal they have never seen before. Unfortunately, we cannot quantify the accuracy of Barbosa et al.'s observation based on the data in our possession, since the reported number of Views also includes re-visits, without our being able to distinguish them. We hypothesise that a second explanation of the greater use of Socrata-based portals compared with the others, in any case linked to their direct availability in a tabular form, is the presence of structural information in the returned metadata, also visible to users, that describes the structure (i.e. the columns of the table) and the type of content (i.e. column descriptions and formats) of these tables.
This observation resonates with the DWBP Best Practice 3: Provide structural metadata, which recommends providing metadata that helps data consumers better understand the meaning of data and its structure, such as human-readable information related to 'the properties or columns of the data set schema' (see also the implementation example29). Structural information will also enable software agents 'to automatically process distributions'. However, the 17 'Open Data Portal Watch' metrics we adopted do not gauge the presence of structural metadata, somehow penalising the metadata quality evaluation of the Socrata-based portals compared with that of the others that usually lack structural metadata.
Based on these mixed results, we note that, while important, the quality of OGD data sets' metadata alone cannot fully explain their usage trends. Other factors of a social, political and not only technological nature can come into play and deserve to be studied [68]. These factors cannot be analysed by the type of quantitative investigation we conducted, which, like that of many other authors [26,[41][42][43], is based on an objective evaluation [65] of the quality of the metadata. However, as we observed in Quarati et al. [62], data quality assessment tasks frequently involve providing judgements on quality dimensions that are not generally measurable through a procedure alone, but require qualitative assessments of their importance for a given scenario. For instance, contextual factors concerning users' needs [69], competencies and skills [45] can affect their approach to OGD resources, and hence the resulting number of views and downloads. As discussed in section 'Usage metrics', our data set usage analysis is based exclusively on direct portal users, whom we measure with the two metrics Views and Downloads. However, other users who exploit the data sets indirectly, for example, through citizen services [54,56] or data analytics applications [55] developed by third parties, could more easily discover and access these tools thanks to ad hoc metadata associated with them.

FAIRness versus usage.
Due to the non-normality of the Views distribution, we adopted the non-parametric Spearman's rho test also to analyse the correlation between FAIR compliance and data set use. By applying Spearman to the 1120 data sets of the 14 US cities, independently of their portals, we obtained ρ = 0.04 with p = 0.17, indicating no correlation between FAIR compliance and data sets' popularity. We reached a similar conclusion after separately analysing the correlation data of each portal. From Figure 4, we notice that half of the portals show no correlation between usage and FAIR compliance, while the other seven show alternately positive (four portals) or negative (three portals) correlation values, ranging from negligible to medium, as in the case of New Orleans (ρ = 0.521) and Boston (ρ = 0.571).
Analogously to what we reported for the effect of metadata quality on data set usage, the correlation findings seem to suggest that the lack of compliance with all, or a great part, of the FAIR principles does not prevent users from discovering an OGD data set. However, this cannot exclude that findability and access problems may arise in the case of an automatic search for information made by a non-human actor. On the sidelines of this discussion, it is interesting to note, as Wilkinson et al. do [70], that to maximise the discovery and reuse of OGD resources it is useful to adopt metrics that evaluate metadata quality from a FAIR perspective, but 'metrics that assess the popularity of a digital resource are not measuring its FAIRness' (p. 6). And yet we believe, following Sasse et al. [31], that providing information on the popularity of a data set, also through direct usage measurements such as views and downloads, can attract users' interest towards a portal.

RQ4: does the OGD data sets categorisation influence their use?
Organising a portal's data sets into thematic categories is one of the main features enabling querying and navigation of OGD portals and is implemented by all platforms. However, as noted in the 2017 EU report 'Reuse of Open Data' [71], not all categories attract the same attention from users. Usually, the categories describing the activities of a community refer to a few major themes such as transportation, economics, government, education and public safety. Nevertheless, noticeable variations emerge in the way OGD portals select the number and terminology of the categories with which to group their data sets [60]. Zencey [59], analysing the popularity of the topics among the data sets published by 141 US public bodies, noted that 'popular data sets varied significantly based on location' 30 and that this variation 'expresses local preferences and needs'. Unlike our contribution, their work does not discuss portal usage data per se, nor does it identify trends based on detailed usage statistics. Furthermore, their work does not mention any of the quality aspects examined in this article.
In the wake of these results, it makes sense to ask whether and to what extent there is a relationship between the use of a data set and its category. To answer RQ4, we focused on the 14 US cities' Socrata-based portals, both for the probable similarity of issues inherent in city life and for their use of the same technology for managing the published data [58]. The number of categories of these portals ranges from 3 in Albany to 25 in Nashville, with an average value of 10.4.
To check the hypothesis that differences in the usage of data sets are due to their category, we applied the non-parametric Kruskal-Wallis test separately to each portal. The test analysed whether the independent variable 'category' (taking values in the categories list of the portal) affects the dependent variable 'number of views'. Except for Los Angeles and Buffalo, the results have shown that the differences are significant; this was confirmed for all portals by the post hoc analysis checking the pairwise differences among categories, which we carried out by applying the Conover test.
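The two-step procedure (omnibus test, then pairwise post hoc comparisons) can be sketched as below on toy per-category view counts. Note the paper applies the Conover test for the post hoc step; to keep this sketch dependency-light, we approximate it with pairwise Mann-Whitney U tests plus a Bonferroni correction, which is a different but related post hoc strategy.

```python
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

# Toy view counts per category for one hypothetical portal.
views = {
    "Transportation": [5000, 12000, 8000, 9500, 20000],
    "Education":      [40, 15, 90, 60, 25],
    "Public Safety":  [700, 450, 1200, 300, 950],
}

# Step 1: omnibus Kruskal-Wallis test across all categories.
h, p = kruskal(*views.values())
print(f"Kruskal-Wallis: H = {h:.2f}, p = {p:.4f}")

# Step 2: post hoc pairwise comparisons (Mann-Whitney U with
# Bonferroni correction, standing in for the Conover test).
pairs = list(combinations(views, 2))
adjusted = {}
for a, b in pairs:
    _, p_ab = mannwhitneyu(views[a], views[b], alternative="two-sided")
    adjusted[(a, b)] = min(1.0, p_ab * len(pairs))
for (a, b), p_adj in adjusted.items():
    print(f"{a} vs {b}: adjusted p = {p_adj:.4f}")
```

With categories this well separated, both the omnibus test and every pairwise comparison come out significant, mirroring the pattern we found for most portals.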
To give an idea of the type of differences between categories, we report some usage data for the New York portal, as it is the portal with the highest number of data sets, as well as belonging to the city with the largest population. Figure 5 shows the percentage of data sets, views and downloads for the 12 categories of the NYC portal.
The figure shows that some categories attract more interest than others. In particular, about two-thirds of the total views concentrate on two categories: Transportation (38.7%) and City Government (28.4%). Furthermore, for some categories, the number of downloads is considerably higher than the number of views (e.g. Social Services, Public Safety, Environment and Education). This can be explained in light of what the Socrata support desk reported. Figure 5 also reveals another interesting fact about the disparity between the number of data sets published and their use. This aspect is particularly evident in the case of Transportation, Public Safety and Education. Transportation, representing only 7.8% of the published data sets, accounts for 38.7% of the views and 22% of the downloads. Public Safety, although covering only 3.5% of the total data sets, is downloaded about three times as much (10.1%). Conversely, Education, which covers 33.1% of NYC data sets, accounts for only 4.8% of the views and about twice that share of the downloads (10.8%). These observations highlight how governments' efforts to release Open Data do not always correspond to proportional user feedback. Berends et al. [71], after having examined the European Data Portal's data sets and their reuse, arrived at a similar conclusion, recognising that there is a 'mismatch between available datasets and re-used data' (p. 26).

6.3. RQ5: can some OGD portals' demographic features drive users' attention towards their data sets?
As noted by Conradie and Choenni [21], cities have invested considerably in the publication of their data and, also through promotional initiatives such as hackathons, tend to encourage private and civil actors to use OGD to develop innovative services. Some previous analyses of US city portals revealed possible relationships between the number of a portal's data sets and other factors such as its qualitative performance or the population of the city [43,44,46]. To answer RQ5, we considered our sample of 16 US cities and analysed possible correlations between the population of a city, the size of its portal (i.e. the number of data sets), its data sets' usage and their overall metadata quality (omq). We chose the median value for the last two parameters, considering it sufficiently robust to denote the usage and quality of a portal. The results are shown in Table 7.
We found a significant, moderately negative, Spearman's correlation, ρ = −0.5, p = 0.04, only between the number of a portal's data sets and their use. This is consistent with the usage values reported in Table 3 where, except for Albuquerque and Albany, the portals with a smaller number of data sets have a higher median number of visits than the others. All other tests reported non-significant p-values, suggesting that neither does portal size influence metadata quality, nor does city population seem to affect the use of a portal or its quality. Finally, no correlation was found between portals' size and city populations. This seems to contradict previous studies [43,44,46]. We believe that this may be due to two reasons: (1) our sample involves fewer cities than those analysed by other studies and (2) the different data collection times. We think that this second aspect can affect the correlation results, considering that in 3-4 years the number of data sets of an OGD portal can increase significantly. For instance, the HDX portal doubled its number of data sets from the time of our previous work (March 2019) to the current one (December 2019). The continuous evolution of a portal's number of data sets and its effects on replication studies have been noted by others [72].

Conclusion
The OGD paradigm promises to unleash the potential of transparency and participation policies and to support economic and social development actions. However, many scholars lament that portal managers still pay little attention to the quality of published data sets and associated metadata, as well as to their reuse, also claiming that low quality can hamper OGD usage. To shed light on these aspects, which are crucial to understanding the progress of OGD policies, we carried out an exploratory study in which we collected and analysed the metadata quality and the usage, in terms of views and downloads, of approximately 400,000 data sets from 28 OGD portals. Our analysis, based on five research questions, focused on: (1) the usage trends of OGD data sets measured in terms of Views and Downloads; (2) the data sets' metadata quality and compliance to FAIR; (3) the relationship between data sets' usage and their metadata quality and FAIR compliance; (4) the relationship between data sets' usage and data sets' categorisation; (5) the relationship between data sets' usage and some of their portals' features.
Our investigation revealed three main findings. The first, relating to RQ1, shows that data sets are mostly underused, albeit with relevant differences between the portals. As to RQ2, on the whole, OGD portals pay varying attention to the quality of their data sets' metadata and to their compliance with the FAIR principles. As to RQ3, by adopting our methodology, based on two existing quality assessment tools and on direct usage metrics, and applied to the 28 portals examined, we did not find a clear positive relationship between data set usage and metadata quality. Accordingly, we recommend that such a relationship should not be taken for granted. These findings led us to consider other factors that may influence the use of OGD data sets. To answer RQ4, taking into consideration the portals of 14 US cities, we examined whether data sets' categorisation can be a factor capable of directing users' interest. Statistical analysis confirmed this hypothesis, also highlighting a disparity between the percentage of data sets published in a certain category and their use. Moreover, answering RQ5, we found a negative correlation between OGD portals' usage and the number of their data sets, while no other demographic feature of the portals we considered correlates with data sets' usage. Our analysis, therefore, highlights the need to examine further factors that can determine the success of OGD data set publishing policies and practices. This research can move in two directions. On the one hand, it may be useful to continue investigating the maturity and effectiveness of technologies supporting the publication of data (e.g. the portals' platforms, metadata modification tools and integration with social tools), correlating it with access to and reuse of these data.
On the other hand, to supply a more exhaustive usage assessment, surveys in the field could be carried out to analyse the effect on data sets usage of contextual aspects such as users' needs and competencies, also investigating the role and influence of indirect OGD usage modes.
A possible limitation of this work concerns the assessment of data sets' FAIR compliance and its relationship with their use. For the technical reasons mentioned, the analysis was applied to the US city portals only, and involved a small sample of their data sets, for a total of 1120 data sets examined. That said, we believe that the sample of 80 data sets selected for each portal, equally randomised between data sets below the first and above the third quartile, provides reasonable population coverage and supports our findings.
From a practical perspective, based on our findings and following the literature, we recommend that portal managers and their political counterparts constantly monitor the use of the published data sets through, at least, the basic parameters (usually provided by their platforms) that we used: Views and Downloads. Accurately monitoring and analysing user behaviour would have a twofold advantage for portal managers: first, having timely and punctual information on the success of individual data sets; second, being able to direct their publication efforts based on the greater or lesser popularity of certain categories. However, we point out that the two metrics considered, while able to provide an overview of the use of OGD data sets, are not sufficient on their own to attest to the impact of OGD policies. This impact could be better captured if detailed information on third-party applications reusing a data set were also collected and published together with the data set. Our work also aims to spur portal managers to publish usage data along with the data sets. In fact, during the selection of the portals of our sample, we noticed a low propensity to publish such data, especially in most of the national portals. However, as noted by some authors, posting usage information can sometimes encourage potential users to access one portal's data sets at the expense of those from other portals with undisclosed usage data or low usage values.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding
The author(s) received no financial support for the research, authorship and/or publication of this article.

Appendix 1
The metadata quality assessment has been carried out by applying the 17 metrics defined and described by Neumaier et al. [26], which we summarise in Table 8. We followed their nomenclature for dimension and metric names, except for the Spatial and Temporal metrics, which they declared and implemented only in the 'Open Data Portal Watch' code. The FAIR assessment has been carried out through the 'FAIR Evaluator' tool [30] via the 22 'Compliance Tests' (aka metrics) implementing the 'Maturity Indicator Tests' defined and described in note 31 and summarised in Table 9. For each test, the corresponding FAIR sub-principle [73] is also reported. Table 8. Metadata quality dimensions and metrics.

Existence dimension:
- Access: tests if metadata contains URIs to access and download the data set.
- Discovery: tests if metadata contains a description and a title for the data set and its resource files, and some keywords helping to discover the data set.
- Contact: tests if there is information to contact the owner or the publisher.
- Rights: tests if licence information is provided.
- Preservation: tests the existence of metadata information regarding the format, size, media type and update frequency of the data sets.
- Date: tests if the dates of creation and modification of the metadata and the data set are provided.
- Spatial: tests the existence of some spatial information (e.g. polygon and shape).
- Temporal: tests the existence of some temporal information (e.g. …).
…

Table 9. FAIR compliance tests and corresponding sub-principles.

Findable:
- Identifier persistence (F1): tests if the unique identifier of the data resource is likely to be persistent.
- Structured metadata (F2): tests whether a machine is able to find structured metadata.
- Grounded metadata (F2): tests whether a machine is able to find 'grounded' metadata.
- Data identifier explicitly in metadata (F3): tests if the metadata contains the unique identifier to the data.
- Metadata identifier explicitly in metadata (F3): tests if the metadata contains the unique identifier to the metadata itself.
- Searchable in major search engine (F4): tests whether a machine is able to discover the resource by search, using Microsoft Bing.

Accessible:
…
- Data authentication and authorisation (A1.2): tests a discovered data GUID for the ability to implement authentication and authorisation in its resolution protocol.
- Metadata authentication and authorisation (A1.2): tests the metadata GUID for the ability to implement authentication and authorisation in its resolution protocol.
- Metadata persistence (A2): tests if the metadata contains a persistence policy, explicitly identified by a persistence policy key.

Interoperable:
- Metadata knowledge representation language (weak) (I1): tests if the metadata uses a formal language broadly applicable for knowledge representation (anything that can be represented as structured data will be accepted).
- Metadata knowledge representation language (strong) (I1): tests if the metadata uses a formal language broadly applicable for knowledge representation (any form of RDF will pass this test).
- Data knowledge representation language (weak) (I1): tests if the data use a formal language broadly applicable for knowledge representation (any form of structured data will pass this test).
- Data knowledge representation language (strong) (I1): tests if the data use a formal language broadly applicable for knowledge representation (any form of ontologically grounded linked data will pass this test).
- Metadata uses FAIR vocabularies (weak) (I2): tests if the linked data metadata contains an explicit pointer to the licence.
…

GUID: globally unique identifier.