Anticorruption technologies to face a global economy: Federated learning, open data catalogues, and "blockchain"

Fragmentation of production across national boundaries results in global supply chains. Governing effectively in this type of economy requires combining public data with private initiatives (from businesses, banks and NGOs). There are, however, three major challenges: a) data comprises distributed and isolated data sets; b) analytics requires models to be trained across these independent data sets; and c) data sovereignty and specific privacy legislation make it difficult to collect, share and analyze data. Our research proposes three complementary paths to overcome these problems: 1) the use of federated learning; 2) the creation of open data catalogues; and 3) the design of distributed ledger technologies.


Introduction
In market-oriented economies, organizations compete for consumers, inputs, fewer government restrictions, subsidies and many other forms of leverage. Most of the time such competition is perfectly legal. In other instances, it takes other forms, such as bribery, corruption, smuggling, and black markets. The more a country controls its economy, the more space there is for rent seeking, and therefore also for illegal forms of competing for it (Krueger 1974). The welfare losses associated with such misconduct are pervasive.
Mounting inequality, severe poverty, extreme hunger and other human rights deficits have among their causal factors the massive illicit financial flows that are often motivated by the aim of dodging taxes, draining economies and government budgets, especially in the less developed countries (Pogge and Mehta 2016). These outflows can be substantially reduced through structural reforms, better policies, and modifications of existing procedures to curtail murkier arrangements. Secrecy facilitates illicit flows, and technology can enable the transparency needed to mitigate their enormous cost to the global community.
Malfunctioning government institutions that constitute a severe obstacle to investment, entrepreneurship, and innovation (North et al. 2012) need to be exposed before they can receive effective judicial oversight to enforce contracts and secure property rights over physical capital, profits, patents, and other rights essential to create incentives and opportunities to invest, innovate, and obtain foreign technology. Transparency is therefore the weapon to curtail corruption, which has been found to lower investment and thereby economic growth (Mauro 1995).
But official oversight of data at an aggregate level falls short of shedding light on the mismanagement of resources because, in the end, it is individuals who are corrupt, not offices, companies, regions or countries. Anticorruption investigation should therefore surveil individual behavior. This has not been previously possible, yet "the age of bewilderment is starting to give way to greater enlightenment" (The Economist 2021).
The granularity of data is increasing, and will continue to increase, as digital devices, sensors, digital payments and other online interactions become ubiquitous. The ability to spot corruption accurately and speedily will improve accordingly.
The private sector has long benefited from technological supply-chain management, and timely data has been a source of competitive advantage in the global value chain economy (Baldwin 2016). The public sector, on the other hand, has been slower to adopt the same tools and to reform its bureaucracy. But the pandemic has become a catalyst for change.
Many governments have started tracking mobile phones, contactless payments and other real-time data being produced. There is no doubt that data about health, location, purchases, money transfers, facial biometrics and digital certificates, for example, can be powerful weapons to fight the virus. However, as the famous Brazilian composer Caetano Veloso (2021) warns, terror may follow the euphoria, as happened after the Arab Spring.
If the data collected and parameterized are not open and there is no transparency about how the algorithms produce their analyses and to what extent decisions are based on technological suggestions, we may be heading towards an urgent crisis of digital rights (Reventlow 2020). With the implementation of extensive new electronic and data-centric technology that will clearly help save lives, we must also be careful about what we lose (Claypoole 2020). Real-time granular data can make decisions more transparent, accurate and rules-based, but if the infrastructure is not rightly chosen it can favor misinterpretation, injustice and pervasive controls that can lead to the deterioration of liberties and democracy.
(Part of Veloso's lyrics, originally in Portuguese, freely translated to English by the authors: "Some mutilated angels from Silicon Valley; those who live in the dark in full light; They tell you to be virtuous in addiction; From blue screens more than blue; Now my story is a dense algorithm; That sells sales to real sellers; My neurons gained a new rhythm; And more and more and more and more and more; Arab Spring and soon the horror; Some want the world to end; Shadows of love; Clown leaders sprouting macabre; In the empire and in their vast backyards; Seeking to revive millenary empires; Equipped with total control; Angels already mi or bi or trillionaires; They only command their mi, bi, trillions (…)".)
To combat this conundrum, our paper recommends privacy-preserving data access and AI algorithm audits to support government collaboration in fighting corruption.
Fragmentation of production across national boundaries results in global supply chains. Governing effectively in this type of economy requires combining public data with private initiatives (from businesses, banks and NGOs). In essence, there are three major challenges: a. data comprises distributed and isolated data sets; b. analytics requires models to be trained across these independent data sets; and c. data sovereignty and specific privacy legislation make it difficult to collect, share and analyze data.
Our research therefore proposes three complementary paths to overcome these problems: 1) the use of federated learning, a new class of machine learning models trained on distributed datasets and, equally important, a key privacy-preserving data technology; 2) the creation of open data catalogues, to ensure that the data collected and parameterized are kept open, respecting both data regulation concerning the privacy and protection of personal and sensitive information and the trustworthiness (fairness, lawfulness and transparency) of the algorithmic suggestions that eventually inform an array of political decisions; and 3) the design of distributed ledger technology to manage authorizations of data collection; store data; enable the transparent selling of data or the clearance to access data; manage authorizations for governments and other entities to process data for social usages; use payment tokens to optimize capital deployment; or program smart contracts, all in a way that can be fully audited.
The method used was the descriptive analysis of technologies and a literature review of some core topics: public governance finding ways to harvest the gains of co-existing with private and social governance (Mayer and Posthuma 2012); key privacy-preserving data technologies (Śmietanka, Pithadia, and Treleaven 2020); open data catalogues; and the design of distributed ledger technology projects (Treleaven, Gendal Brown, and Yang 2017).
Fighting corruption in a globalized economy requires using data from public and private initiatives. The significance of this research for policymakers is that it demonstrates technologies that enable public governance to harvest the gains of co-existing with private and social governance. It recommends federated learning, open data catalogues, and "blockchain" to overcome the challenges that governments face in implementing impactful anticorruption policies in a global economy. The goal is to contribute new knowledge that assists policymakers and researchers in addressing and overcoming the ethical and technical challenges associated with data quality, privacy and access in the anticorruption field.
In this way, the research aims to contribute directly to UN SDGs 16 (justice and effective institutions) and 17 (partnerships and means of implementation), and indirectly to many other UN SDGs, because corruption has a disproportionate impact on the most vulnerable, increasing costs and reducing access to products and services, including food, health and education, and because it is pervasive in less efficient, more polluting and more unequal societies.

Federated learning
Federated learning is a class of machine learning models trained on distributed datasets using privacy-preserving data technology. With the large amount of data to process, organisations are challenged by: a) distributed and isolated data sets; b) analytics that requires models to be trained across many types of data sets; and c) data sovereignty/privacy legislation that makes collecting, sharing and analysing data increasingly difficult (Śmietanka, Pithadia, and Treleaven 2020). To overcome those obstacles there are two types of federated learning: the first uses a federated data infrastructure for privacy-preserving data access; the second applies federated machine learning to distributed data sets. Federated learning enables a multitude of participants to construct a joint model without exposing their private training data, while handling the unbalanced and non-independent and identically distributed (non-IID) data found in the real world.
Examples of federated learning applications include next-word prediction and visual object detection for safety.
Google's federated learning was the first infrastructure to enable mobile phones to collaboratively learn a shared prediction model while keeping all the training data on the respective devices, decoupling the ability to do machine learning from the need to store the data centrally: federated learning is 'taking the algorithm to the data'. It enables multiple participants to train a globally shared model by exchanging model information without the risk of exposing private data (Yang, Fan, and Yu 2020, v).
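The mechanics of 'taking the algorithm to the data' can be illustrated with a minimal sketch of the aggregation step used in this style of training, often called federated averaging; the client weight vectors and dataset sizes below are hypothetical, and real deployments add secure aggregation, encryption and many training rounds:

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Combine locally trained weight vectors into a global model.

    Each client trains on its own data and shares only model
    parameters; the raw records never leave the client. Clients
    with more data receive proportionally more influence.
    """
    total = sum(client_sizes)
    # Weight each client's parameters by its share of the data.
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Three institutions with differently sized (non-IID) local datasets.
weights = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
sizes = [100, 200, 700]
global_model = federated_average(weights, sizes)  # -> [4.2, 5.2]
```

In an anticorruption setting, the "institutions" could be agencies or banks in different jurisdictions, each barred by sovereignty or privacy law from pooling raw records.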
Because this technology enables more data usage without having to gather the data centrally, and thus without risking data leakages, it decreases the exposure to liability for poor data management or privacy violations, which is increasing with the spread of data regulations across countries.
Our team recently experimented with this, building an automatic translation of contracts into computer-understandable rules through Natural Language Processing (Dolga, Treleaven, and Denny 2020). The most challenging aspect found was to understand the legal meaning of the contract and express it in a structured, computer-readable format. This problem was reduced to the Named Entity Recognition and Rule Extraction tasks, the latter handling the extraction of terms and conditions. These two problems are difficult, but deep learning models can tackle them.
That work was one of the first to approach rule extraction with deep learning. The method is data-hungry, and many details in the data sets could be under the protection of data regulations and privacy laws, so using technology that takes the algorithms to the data was very important to avert data leakage risks and the legal responsibilities relating to them. Additionally, the work contributes to the literature by introducing Law-Bert, a model based on BERT that is pre-trained on unlabeled contracts. The results obtained on Named Entity Recognition and Rule Extraction show that pre-training on contracts has a positive effect on performance for the downstream tasks (Dolga, Treleaven, and Denny 2020).
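As a purely illustrative stand-in for the Rule Extraction task (not the Law-Bert model itself, which is a deep network), a hand-written pattern can show the kind of structured output such a system targets; the sample clause and the (party, modal, action) schema below are invented for illustration:

```python
import re

# A deep model learns these patterns from data; this regex merely
# demonstrates the target output format for an obligation clause.
OBLIGATION = re.compile(
    r"(?P<party>[A-Z][\w ]+?) (?P<modal>shall|must) (?P<action>[^.]+)\."
)

def extract_rules(contract_text):
    """Return structured (party, modal, action) triples from raw text."""
    return [m.groupdict() for m in OBLIGATION.finditer(contract_text)]

clause = "The Supplier shall deliver the goods within 30 days."
rules = extract_rules(clause)
# rules[0] -> {'party': 'The Supplier', 'modal': 'shall',
#              'action': 'deliver the goods within 30 days'}
```

Once terms and conditions exist in this machine-readable form, compliance with them can be checked automatically, which is the bridge to the auditable infrastructure discussed below.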

Open data catalogues
Health data, location, purchases, money transfers, facial biometrics and digital certificates can be powerful tools to manage a business or govern a society, but if the data collected and parameterized are not open, and there is no transparency about how the algorithms produce their analyses and to what extent decisions are based on technological suggestions, we may be heading towards an urgent crisis of digital rights.
A strategy for opening the data that is being collected, shared and fed into artificial intelligence tools is therefore necessary. Only with planning and a clear, very transparent strategy is it possible to take the necessary considerations into account and organize the actions required to achieve the social benefits.
The objective of creating a data catalogue is that the massive stock of data that has been the private property of an institution, or is owned by a government agency, becomes public and free-to-use information that complies with appropriate technical characteristics and can serve different usages for users. Less susceptible to the particular voluntarism of a group or a particular government, it can enable the deployment of new solutions by different actors.
But before creating the catalogue, it is necessary to know the interests and needs of the different potential users of the data, in order to establish the parameters of the information demand in terms of the amount of information, the quality that is feasible, and the thematic offers that can be satisfied by this data catalogue. Several institutions work with data, and the major challenge is the political will to take the necessary steps to compile a significant quantity of publicly open data. Developing the proposal for the open data catalogue requires going beyond identifying the demand for data: assessing the status of the information, prioritizing some data sets over others, and cleaning and parameterizing the data. The investigation, the production of the catalogue and the opening process allow policymakers, citizens and private companies to better understand "why data are important" for society and how data can be applied to understand current policies and options for future reforms (Data Foundation, 2020).
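A minimal sketch of what one machine-readable catalogue entry might look like, loosely inspired by DCAT-style dataset metadata; every field name, and the dataset itself, is hypothetical rather than a binding to any formal standard:

```python
import json

# One illustrative entry: the parameterization (format, frequency,
# privacy screening, quality checks) is what makes the data usable
# and trustworthy for third parties, not just its publication.
entry = {
    "identifier": "public-procurement-contracts-2023",
    "title": "Public procurement contracts, 2023",
    "publisher": "Ministry of Finance (hypothetical)",
    "licence": "open, free to use",
    "format": "CSV",
    "update_frequency": "monthly",
    "personal_data": False,  # screened for privacy before release
    "quality_checks": ["deduplicated", "schema-validated"],
}

catalogue = [entry]
serialized = json.dumps(catalogue, indent=2)  # publishable listing
```

Keeping such entries in a common machine-readable listing lets different actors discover, link and combine data sets without depending on the goodwill of any single data holder.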
Building the technical capacity to improve data quality, the cultural dynamics and the understanding of the value proposition for the use of data can fulfill the established objectives only if the technical capacities exist to turn high-quality data to societal usages. Applying standards and specific requirements on how to publish data in particular formats based on modern technology can improve data quality. Fostering policies and practices that efficiently allow the collection and revision of data can similarly guarantee that the data shared by the government and by private companies are more useful for society. This transparency contributes to the ethical use of data; the effective value of the data can only be realized if the data are actually used in practice, and there are multiple ways and strategies to extract value from data, including policy investigation, statistics, program evaluation and data science, among others. In each of these options, the ability to link, combine and share data is increasingly relevant. This pillar focuses on strategies for accessing and sharing government data, as well as for guaranteeing public confidence in the protection of personal and sensitive data.
The evaluation of the impact of an algorithm is centered on decision making, mainly on the robustness, fairness and explainability of the system. This corresponds to defining usage limits and an expiration date to generate trust between the interested parties, making the system creators responsible for the results of their programs, and providing more transparency, the possibility of holding people accountable, and openness to audits.
Robustness means that systems must be safe and protected against failures, intrusions or compromise of the data on which they rely. Fairness means that systems should use data and training models without bias, to avoid the unfair treatment of certain groups. And explainability means that systems must provide decisions or suggestions that can be understood by their users and developers.
Our team recently conducted a case study about Brazil, concluding that a different strategy for opening the data being collected, shared and fed into artificial intelligence tools would be necessary. Only with data planning and a clear, very transparent strategy is it possible to take the necessary considerations into account and organize the actions required to achieve the social benefits. The problem became blatant during Covid-19: for example, the federal government prohibited disclosing health data that had been public for years, which we called a "datanapping".
Creating a data catalogue shifts the massive stock of data from being the private property of an institution or a government agency to being immediately public and free-to-use information that complies with appropriate technical characteristics and is of interest and utility to users. It would therefore be less susceptible to the particular voluntarism of a group or a particular government.

"Blockchains"
Blockchain, a word trending since the spread of cryptocurrencies, is a type of distributed ledger technology (DLT). But this kind of infrastructure can have applications far beyond financial markets. An analogy with the accounting systems already in place, but better, can roughly summarize its benefits. In a few words, it is a new type of database that allows multiple users to share information and modify it in a secure and reliable way, even if they do not trust each other.
This decentralized form of record keeping can store many different kinds of information, for example to improve the management of global supply chains: tracing products, their logistics and the financial transactions, eventually from the raw material to the retail operations. DLT can transform business models and processes, reshaping the set of stakeholders and their roles (Deloitte et al. 2017). It is applicable in cases that depend on transparency, accurate monitoring and audits. The advantages are greater speed, efficiency and automation, less human error and lower data reconciliation costs. There are already some ongoing governmental initiatives relating to land titles, voting, managing assets, and digital identities. DLT has the potential to become the new layer for economic value transfer, particularly by enabling extra layers of programmatically executable rules (also known as smart contracts) that can automate many processes.
Many practicalities will have to be sorted out before an automatic system can be put in place, but the gains seem worth the effort of coordinating international cooperation. The outcome of these DLTs remains uncertain, but they have the potential to lead to greater decentralization and to the possibility of different, totally independent and even foreign institutions cooperating and sharing authority in the informational sphere. Notwithstanding, flaws will emerge, and a great deal of work will be necessary to oversee the rules written in the code and the automated processes put in place by the core developer teams.
The free flow of data could bring economic benefit, contribute to better assessment and mitigate corruption. A DLT infrastructure would enable better tracking of the provenance of data, analyses and decision-making systems, and it would help track the origin and destination of resources. Organizations would have traceable data, could develop models that are safe from unauthorized usage, and could therefore commercialize their models in a reported and auditable way, able at any time to publish where their model data came from. This traceability per se would be a huge asset to those mapping these flows and looking to understand the data economy and algorithmic accountability within it.
DLT would enable the creation of organizations governed with greater transparency, dependent on neither local government regulations nor central authority decisions, and unlimited by geographic or jurisdictional borders.
Governance in this decentralized infrastructure would rely on group consensus achieved through cryptographic systems and could additionally embed smart contracts to grant special entitlements or preferences to certain token holders. For example, government agencies could have more access to information in order to provide public goods.
DLT could become a gap-filling mechanism by removing many of the practical barriers that stand in the way of adopting a variety of ex ante governance arrangements (Wright and Rohr 2019). It has the potential to increase network collaboration speed and efficiency with greater transparency and reliability. The main feature of this technology is to form distributed, consensus-based networks built around cryptography across the world. It is a peer-to-peer network without any manager or coordinator that enables a consistent and reliable agreement among participants on a record of inputs.
The fact that these networks are distributed dilutes the risk of failures and of the typical controls that exist when there is a single controller; encryption protects data security, and multiple validations mean that data are checked by several parties. Agreements are verifiable: anyone can attest whether they were actually signed with a private key, so there is an authenticity record registered perennially, identified with date and time. In addition, the registered data can only be modified according to strict rules of general agreement. Participants in the network can achieve consensus on changes in the state of this shared database without having to rely on the integrity of any of the participants or on an external authority.
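The tamper-evidence property described above can be sketched with a toy hash-chained ledger in a few lines; this illustrates only the chaining of records, not distributed consensus, signatures or networking, and the supply-chain transactions are invented:

```python
import hashlib
import json

def block_hash(record, prev_hash):
    """Hash a record together with the previous block's hash."""
    payload = json.dumps({"record": record, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def append(chain, record):
    """Add a record, linking it to the hash of the previous block."""
    prev = chain[-1]["hash"] if chain else "0" * 64
    chain.append({"record": record, "prev": prev,
                  "hash": block_hash(record, prev)})

def verify(chain):
    """Recompute every hash; any retroactive edit breaks the chain."""
    prev = "0" * 64
    for block in chain:
        if block["prev"] != prev or block["hash"] != block_hash(block["record"], prev):
            return False
        prev = block["hash"]
    return True

ledger = []
append(ledger, {"tx": "raw material shipped", "qty": 100})
append(ledger, {"tx": "customs cleared", "fee": 40})
assert verify(ledger)
ledger[0]["record"]["qty"] = 90  # attempted retroactive tampering
assert not verify(ledger)        # detected by any auditor
```

Because every block commits to the hash of its predecessor, an auditor who recomputes the chain detects any alteration of past records, which is precisely what makes such ledgers useful for oversight.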
Depending on the case, if the agents are at least minimally reliable and trust each other, the consensus can be simplified so that computational capacity is saved. The features of permissioned DLT are more applicable where there is no need for anonymity or proof of work, since the nodes are reliable: they know each other and even use digital identification certificates. In a permissioned DLT, transactions are pre-validated so that they are computationally more efficient, saving time, reducing costs and risks, and increasing system reliability and transparency (Mearian 2018).
Civil society organizations that engage in advocacy related to data-driven technologies, together with government agencies, can use a variety of strategies and tactics to audit the DLT and spot misuses. The existence of a single DLT would ease their capacity to intervene on behalf of affected groups of people and claim fairer rewards or different tax values. DLT would also help dissolve the silo structures that characterize the current landscape, in which each data holder adopts different values, strategies, tactics, networks, languages and policies. In this sense DLT would enable more democratic control over automated systems. The distributed nature of DLT would empower executives, governments and society to verify and act, moving the values and assumptions of tech development towards transparency and freedom of information and favoring the investigation of these practices by oversight bodies and the media.
A DLT transformation would be data-led rather than organizationally led. It would remove the silos of existing data structures and networks, creating a secure and auditable record of data collection and its effective usages: a history that can be accessed in near real time and scrutinized by users, as well as by civil society groups and government agencies, to verify the rightness of the system. Our team has also conducted some studies contemplating the usages of DLT for good.

Auditing
In our view, a significant method by which to ensure that systems are transparent and accountable is the auditing of algorithms. We define this elsewhere (Koshiyama et al. 2021) as the 'formal assurance that algorithms are legal, ethical and safe'. In that work we provide a framework with which systems can be audited. This includes the levels of access that need to be granted (from 'black box' to full inspection of code and data), the stage of the audit (at project inception, during development, after deployment, etc.) and what exactly is going to be audited (model, data, and team). At the core of any audit are the risk verticals being assessed. Drawing from the framework, these are:
• Privacy: quality of a system to mitigate personal or critical data leakage.
• Fairness: quality of a system to avoid unfair treatment of individuals or organizations.
• Explainability: quality of a system to provide decisions or suggestions that can be understood by their users and developers.
• Robustness: quality of a system to be safe, not vulnerable to tampering.
With this work in mind, we argue that a demand should be in place to ensure that systems are built auditable, i.e. open to third-party inspection (see also Kazim & Koshiyama 2021).
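As a minimal sketch of one concrete check an auditor might run under the fairness vertical, consider the demographic parity gap, the difference in positive-decision rates between two groups; the decisions, group labels and tolerance threshold below are hypothetical, and real audits combine several such metrics with legal and domain context:

```python
def demographic_parity_gap(decisions, groups):
    """Absolute difference in positive-decision rates between groups A and B.

    decisions: 1 for a positive outcome (e.g. approval), 0 otherwise.
    groups: the group label of the person behind each decision.
    """
    def rate(g):
        members = [d for d, grp in zip(decisions, groups) if grp == g]
        return sum(members) / len(members)
    return abs(rate("A") - rate("B"))

# Hypothetical audit sample: e.g. loan approvals by an automated system.
decisions = [1, 0, 1, 1, 0, 0, 1, 0]
groups    = ["A", "A", "A", "A", "B", "B", "B", "B"]

gap = demographic_parity_gap(decisions, groups)  # 0.75 - 0.25 = 0.5
audit_flag = gap > 0.2  # flag for human review above a set tolerance
```

A check like this only requires the system's inputs and outputs, so it can be run even at 'black box' access levels; the other verticals (privacy, robustness, explainability) demand progressively deeper access to data and code.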

Final considerations
To fight corruption globally, and thereby contribute to the goals of ending extreme poverty and boosting shared prosperity, the World Bank should provide resources and capacity building to initiatives that implement a technological infrastructure that is useful, safe and just. Our recommendation is to seek solutions that combine 1) the use of federated learning; 2) the creation of open data catalogues; and 3) the design of distributed ledger technologies for sustainable development.
Knowing the theories and techniques is one thing, but successfully applying them in practice requires non-trivial effort. The present paper, with its recommendation of best practices and of implementing certain types of technologies, can therefore only serve to an extent. Much more study and further research are needed to point out more practical usages on a global scale. Notwithstanding, every tool brings both solutions and new challenges, so an even deeper analysis of each practical application is needed to ensure the benefits are worth the trouble of change.