Project deliverable Open Access

BigDataStack - D2.1 State of the art and Requirements analysis - I

Orlando Avila-García; Paula Ta-Shma; Yosef Moatti; Everton Luís Berz; Ana Juan Ferrer; Ana Belén González Méndez; Bernat Quesada; Alberto Soler; Stathis Plitsos; Konstantinos Giannakakis; Amaryllis Raouzaiou; Pavlos Kranas; Sophia Karagiorgou; Panagiotis Gouvas; Anastasios Zafeiropoulos; Dimitris Poulopoulos; Giorgos Kousiouris; Stavroula Meimetea; Dimosthenis Kyriazis; Valerio Vianello; Richard McCreadie; Gal Hammer; Miki Kenneth; Nikos Drosos; Maurizio Megliola

This is the first version of the state-of-the-art and requirements analysis that drives the architecture and research effort in BigDataStack. User requirements have been collected through BigDataStack’s use case providers and complemented with emerging technical requirements. They have also been tracked over the project lifetime so far to ensure that the BigDataStack platform fully addresses and properly considers them.

Files (2.1 MB)
Name: BigDataStack_D2.1_v1.0.pdf
Size: 2.1 MB
md5:1b1ca712b87a0d1489173399a27c5555
