Conference paper Open Access

Unsupervised Anomaly Detection in Data Quality Control

Poon, Lex; Farshidi, Siamak; Li, Na; Zhao, Zhiming


MARC21 XML Export

<?xml version='1.0' encoding='UTF-8'?>
<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nam##2200000uu#4500</leader>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">data quality</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">unsupervised learning</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">data quality control</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">data quality assessment</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">anomaly detection</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">automated data quality control</subfield>
  </datafield>
  <controlfield tag="005">20220119014910.0</controlfield>
  <controlfield tag="001">5872438</controlfield>
  <datafield tag="711" ind1=" " ind2=" ">
    <subfield code="d">15-18 Dec 2021</subfield>
    <subfield code="g">MIDP-2021</subfield>
    <subfield code="a">7th International Workshop on Methods to Improve Big Data Science Projects (MIDP-2021), in IEEE BigData 2021</subfield>
    <subfield code="c">Virtual</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">University of Amsterdam</subfield>
    <subfield code="a">Farshidi, Siamak</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">University of Amsterdam</subfield>
    <subfield code="a">Li, Na</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">University of Amsterdam</subfield>
    <subfield code="0">(orcid)0000-0002-6717-9418</subfield>
    <subfield code="a">Zhao, Zhiming</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">2899991</subfield>
    <subfield code="z">md5:51cca856e3286b38b7b738f7d43e9a86</subfield>
    <subfield code="u">https://zenodo.org/record/5872438/files/2021.workshop.bigdata.midp21.camera.pdf</subfield>
  </datafield>
  <datafield tag="542" ind1=" " ind2=" ">
    <subfield code="l">open</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="y">Conference website</subfield>
    <subfield code="u">http://www.midp-info.org/</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">2021-12-15</subfield>
  </datafield>
  <datafield tag="909" ind1="C" ind2="O">
    <subfield code="p">openaire</subfield>
    <subfield code="o">oai:zenodo.org:5872438</subfield>
  </datafield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="u">University of Amsterdam</subfield>
    <subfield code="a">Poon, Lex</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">Unsupervised Anomaly Detection in Data Quality Control</subfield>
  </datafield>
  <datafield tag="536" ind1=" " ind2=" ">
    <subfield code="c">860627</subfield>
    <subfield code="a">CLoud ARtificial Intelligence For pathologY</subfield>
  </datafield>
  <datafield tag="536" ind1=" " ind2=" ">
    <subfield code="c">862409</subfield>
    <subfield code="a">Blue-Cloud: Piloting innovative services for Marine Research &amp; the Blue Economy</subfield>
  </datafield>
  <datafield tag="536" ind1=" " ind2=" ">
    <subfield code="c">825134</subfield>
    <subfield code="a">smART socIal media eCOsytstem in a blockchaiN Federated environment</subfield>
  </datafield>
  <datafield tag="536" ind1=" " ind2=" ">
    <subfield code="c">824068</subfield>
    <subfield code="a">ENVironmental Research Infrastructures building Fair services Accessible for society, Innovation and Research</subfield>
  </datafield>
  <datafield tag="540" ind1=" " ind2=" ">
    <subfield code="u">https://creativecommons.org/licenses/by/4.0/legalcode</subfield>
    <subfield code="a">Creative Commons Attribution 4.0 International</subfield>
  </datafield>
  <datafield tag="650" ind1="1" ind2="7">
    <subfield code="a">cc-by</subfield>
    <subfield code="2">opendefinition.org</subfield>
  </datafield>
  <datafield tag="520" ind1=" " ind2=" ">
    <subfield code="a">&lt;p&gt;Data is one of the most valuable assets of an&lt;/p&gt;

&lt;p&gt;organization and has a tremendous impact on its long-term&lt;/p&gt;

&lt;p&gt;success and decision-making processes. Typically, organizational&lt;/p&gt;

&lt;p&gt;data error and outlier detection processes perform manually and&lt;/p&gt;

&lt;p&gt;reactively, making them time-consuming and prone to human errors.&lt;/p&gt;

&lt;p&gt;Additionally, rich data types, unlabeled data, and increased&lt;/p&gt;

&lt;p&gt;volume have made such data more complex. Accordingly, an&lt;/p&gt;

&lt;p&gt;automated anomaly detection approach is required to improve&lt;/p&gt;

&lt;p&gt;data management and quality control processes. This study&lt;/p&gt;

&lt;p&gt;introduces an unsupervised anomaly detection approach based&lt;/p&gt;

&lt;p&gt;on models comparison, consensus learning, and a combination of&lt;/p&gt;

&lt;p&gt;rules of thumb with iterative hyper-parameter tuning to increase&lt;/p&gt;

&lt;p&gt;data quality. Furthermore, a domain expert is considered a&lt;/p&gt;

&lt;p&gt;human in the loop to evaluate and check the data quality and to&lt;/p&gt;

&lt;p&gt;judge the output of the unsupervised model. An experiment has&lt;/p&gt;

&lt;p&gt;been conducted to assess the proposed approach in the context of&lt;/p&gt;

&lt;p&gt;a case study. The experiment results confirm that the proposed&lt;/p&gt;

&lt;p&gt;approach can improve the quality of&lt;/p&gt;</subfield>
  </datafield>
  <datafield tag="024" ind1=" " ind2=" ">
    <subfield code="a">10.1109/BigData52589.2021.9671672</subfield>
    <subfield code="2">doi</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">publication</subfield>
    <subfield code="b">conferencepaper</subfield>
  </datafield>
</record>
