Published August 27, 2020 | Version v1
Project deliverable Open

BigDataStack - D2.2 Requirements & State of the Art Analysis – II

Description

This is the second in a series of three deliverables specifying the stakeholder and technical (software and technology) requirements for BigDataStack. The requirements analysis in this document takes a top-down approach to the user requirements, which were collected through the BigDataStack use case providers. This is complemented by a bottom-up approach that aims to identify, collect, and analyse the remaining stakeholder requirements, as well as the technical requirements from the BigDataStack technology providers.

Files

BigDataStack_D2.2_v1.0.pdf (2.5 MB)
md5:dfb378c1ec13fcc19240885344216933

Additional details

Funding

European Commission: BigDataStack – High-performance data-centric stack for big data applications and operations (grant no. 779747)
