Project deliverable Open Access

BigDataStack - D2.3 Requirements & State of the Art Analysis – III

Orlando Avila-García; Paula Ta-Shma; Yosef Moatti; Mauricio Fadel; Bin Chen; Ismael Cuadrado; Ana Belén González; Bernat Quesada; Alberto Soler; Stathis Plitsos; Anestis Sidiropoulos; Amaryllis Raouzaiou; Jose María Zaragoza; Jesus Gallego; Sophia Karagiorgou; Panagiotis Gouvas; Dimitris Poulopoulos; Stavroula Meimetea; Maria Kanakari; Christos Doulkeridis; Giannis Poulakis; Dimosthenis Kyriazis; Marta Patino; Richard McCreadie; Miki Kenneth; Luis Tomas; Nikos Drosos; Maurizio Megliola

The requirements analysis presented in this document takes a top-down approach to the user requirements, which were collected from the BigDataStack use case providers. This is complemented by a bottom-up approach that identifies, collects, and analyses the remaining stakeholder requirements as well as the technical requirements arising from the BigDataStack technology.
