Project deliverable Open Access
Orlando Avila-García; Paula Ta-Shma; Yosef Moatti; Everton Luís Berz; Ana Juan Ferrer; Ana Belén González Méndez; Bernat Quesada; Alberto Soler; Stathis Plitsos; Konstantinos Giannakakis; Amaryllis Raouzaiou; Pavlos Kranas; Sophia Karagiorgou; Panagiotis Gouvas; Anastasios Zafeiropoulos; Dimitris Poulopoulos; Timoleon Labrinos; Stavroula Meimetea; Dimosthenis Kyriazis; Valerio Vianello; Richard McCreadie; Gal Hammer; Miki Kenneth; Luis Tomas; Nikos Drosos; Maurizio Megliola
This is the second version in a series of three deliverables specifying the stakeholder and technical (software and technology) requirements for BigDataStack. The requirements analysis in this document takes a top-down approach to the user requirements, which were collected through the BigDataStack use case providers. It is complemented by a bottom-up approach that identifies, collects, and analyses the remaining stakeholder requirements, as well as the technical requirements, from the BigDataStack technology providers.