Knowledge Graph-Driven Real-Time Data Engineering for Context-Aware Machine Learning Pipelines
Description
Context-aware machine learning depends on real-time data engineering that can keep up with shifting relationships between entities. To this end, this paper presents an architecture that combines knowledge graph construction with real-time stream processing to support context-aware machine learning pipelines. The proposed system uses graph neural networks (GNNs) to update the graph and compute embeddings in real time, so that contextual information is integrated dynamically into downstream machine learning models. Changes in the relations between entities are therefore captured in near real time, and models remain valid as the data evolves.
The effectiveness of the architecture is illustrated by two use cases: customer profiling and equipment failure prediction. In customer profiling, profiles must be updated continually as new interactions arrive, which supports more effective targeting and better personalization. In predictive maintenance, the system tracks changing equipment information to predict future failures. For these applications, the paper reports a 40% improvement in model accuracy and 50% less time than conventional feature engineering methods.
This research bridges computer science, particularly graph theory, and real-world data engineering by demonstrating the value of knowledge graphs and GNNs within machine learning pipelines. By incorporating contextual features, the system offers a feasible and flexible solution for current data trends and a basis for developing smarter, more responsive ML systems. The study identifies real-time context sensitivity as central to the advancement of machine learning, which it presents as its key finding.
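The idea of a graph that is updated per event and re-embedded on demand can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the paper's implementation: the `StreamingGraph` class, its method names, the entity names, and the two-dimensional feature vectors are hypothetical, and the `embed` method is a single untrained round of mean-aggregation message passing, whereas a real GNN layer would apply learned weights and a nonlinearity.

```python
from collections import defaultdict


class StreamingGraph:
    """Toy knowledge graph updated one event at a time (hypothetical helper)."""

    def __init__(self, dim):
        self.dim = dim
        self.neighbors = defaultdict(set)  # entity -> adjacent entities
        self.features = {}                 # entity -> raw feature vector

    def add_entity(self, entity, features):
        if len(features) != self.dim:
            raise ValueError("feature vector has wrong dimension")
        self.features[entity] = list(features)

    def add_relation(self, src, dst):
        # Relations are treated as undirected to keep the sketch small.
        self.neighbors[src].add(dst)
        self.neighbors[dst].add(src)

    def embed(self, entity):
        """One untrained round of mean-aggregation message passing."""
        own = self.features[entity]
        nbrs = [self.features[n] for n in self.neighbors[entity]
                if n in self.features]
        if not nbrs:
            return list(own)
        mean = [sum(col) / len(nbrs) for col in zip(*nbrs)]
        # Average the entity's own features with its neighborhood mean.
        return [(o + m) / 2 for o, m in zip(own, mean)]


graph = StreamingGraph(dim=2)
graph.add_entity("customer_1", [1.0, 0.0])
graph.add_entity("product_a", [0.0, 1.0])

before = graph.embed("customer_1")             # no relations yet
graph.add_relation("customer_1", "product_a")  # a new interaction arrives
after = graph.embed("customer_1")              # context now reflects it

print(before, after)
```

Because the embedding is computed from the current neighborhood, a newly arrived relation changes the entity's representation on the next read without rebuilding the whole graph, which is the property the abstract relies on.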
Files
EJAET-8-5-65-76.pdf (415.2 kB, md5:9f114253caf1ee8fcf9f77e3ade70206)