Published July 31, 2023 | Version v1
Journal article Open

AutoSchema: A Self-Learning Framework for Detecting and Adapting to Schema Drift in Real-Time Data Streams

Authors/Creators

Description

 

Streaming pipelines must handle datasets that change quickly and whose schemas change as well. Traditional schema management relies on static rules or manual updates and typically fails when an unexpected schema change occurs, which can break pipelines or corrupt data. To address this problem, we present AutoSchema, a self-supervised learning framework that detects and adapts to schema drift in real-time data streams without human intervention.

AutoSchema employs a two-part neural approach:

A drift detection system that identifies schema anomalies via contrastive learning, substantially reducing false alarms compared with older rule-based systems.

A dynamic schema adapter that uses graph-based metadata learning to rebuild and validate new schema mappings as changes occur, preventing pipeline failures.
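To make the terms "schema drift" and "schema mapping" concrete, the sketch below shows the detect-then-adapt loop in plain Python. It is not AutoSchema's method: the contrastive detector is replaced by a simple field and type comparison, and the graph-based adapter by string similarity from the standard library. All function names, the sample records, and the `min_similarity` threshold are hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher


def infer_schema(record):
    """Map each field name to the name of its Python type."""
    return {name: type(value).__name__ for name, value in record.items()}


def detect_drift(reference, observed):
    """Report fields that were added, removed, or changed type."""
    shared = set(reference) & set(observed)
    return {
        "added": sorted(set(observed) - set(reference)),
        "removed": sorted(set(reference) - set(observed)),
        "retyped": sorted(n for n in shared if reference[n] != observed[n]),
    }


def propose_renames(reference, observed, drift, min_similarity=0.5):
    """Pair each removed field with the most similar added field of the same type.

    Stand-in for a learned mapping: similarity here is plain string similarity.
    """
    mapping = {}
    candidates = list(drift["added"])
    for old in drift["removed"]:
        best, best_score = None, min_similarity
        for new in candidates:
            if reference[old] != observed[new]:
                continue  # only consider candidates whose type is unchanged
            score = SequenceMatcher(None, old, new).ratio()
            if score >= best_score:
                best, best_score = new, score
        if best is not None:
            mapping[old] = best
            candidates.remove(best)
    return mapping


def adapt(record, mapping):
    """Rewrite a drifted record back to the reference field names."""
    inverse = {new: old for old, new in mapping.items()}
    return {inverse.get(name, name): value for name, value in record.items()}


# Hypothetical IoT records: "temp" is renamed and "ts" changes type.
reference = infer_schema({"device_id": "a1", "temp": 21.5, "ts": 1690000000})
drifted = {"device_id": "a1", "temperature": 22.0, "ts": "2023-07-31T00:00:00Z"}

drift = detect_drift(reference, infer_schema(drifted))
mapping = propose_renames(reference, infer_schema(drifted), drift)
repaired = adapt(drifted, mapping)
```

In this toy run the rename is repaired automatically, while the type change on `ts` is only reported. A rule like this has no notion of semantics, which is the kind of limitation a learned detector and adapter are meant to address.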

We evaluated AutoSchema on a variety of streaming datasets, including IoT sensor data, financial transactions, and log analytics, and observed:

98.3% accuracy in detecting schema drift (22% better than rule-based methods).

Adaptation in less than a second, which keeps the data flowing.

70% faster recovery than the best existing tools for schema evolution.

Our results show that AutoSchema makes pipelines more resilient to schema changes while keeping costs low, making it well suited to large-scale applications. This study addresses a significant gap in autonomous data management by giving businesses a dependable way to handle dynamic streaming environments.

Files

EJAET-10-7-94-100.pdf (354.0 kB)

md5:7dfffc8ebe27e430d7ee5fcdda418287
