Leveraging Large Language Models for Automated Schema Matching and Data Transformation in Enterprise ETL Pipelines
Description
Modern enterprises face significant challenges when integrating heterogeneous data sources: schema variability, inconsistent naming conventions, and noisy real-world datasets traditionally require extensive manual intervention. This work presents a novel generative AI framework that uses Large Language Models (LLMs) augmented with retrieval-based techniques to automate schema matching, column-level mapping, and transformation rule generation within ETL pipelines. The framework combines metadata-aware prompting strategies, domain-specific exemplar retrieval through retrieval-augmented generation (RAG), and iterative self-refinement to produce high-quality mapping suggestions together with executable transformation code in SQL and pandas. Confidence scoring supports human-in-the-loop validation, while adaptive feedback mechanisms enable continuous improvement from user corrections. The system can handle evolving schemas and multilingual datasets across diverse enterprise domains. Experimental evaluation on synthetic datasets and established benchmarks shows substantial improvements in matching accuracy, precision, and recall over traditional name-similarity metrics and classical machine learning classifiers. The framework also addresses critical deployment considerations, including data privacy compliance, operational scalability, and hallucination mitigation. The results indicate significant potential for reliable large-scale enterprise adoption through transparent, auditable automated ETL workflows that maintain data quality standards while reducing manual overhead.
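To make the comparison point concrete, the following is a minimal sketch of the kind of name-similarity baseline the description contrasts the framework with, combined with a confidence threshold that routes low-scoring suggestions to a human reviewer. It is not the framework's implementation: the function names, the use of Python's `difflib.SequenceMatcher` as the similarity metric, and the `review_threshold` value of 0.8 are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase a column name and drop separators such as '_' and spaces."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_columns(source, target, review_threshold=0.8):
    """Suggest, for each source column, the most similar target column.

    review_threshold is a hypothetical cut-off: suggestions scoring below
    it are flagged for human-in-the-loop validation.
    """
    suggestions = []
    for s in source:
        best = max(target, key=lambda t: similarity(s, t))
        score = similarity(s, best)
        suggestions.append(
            {
                "source": s,
                "target": best,
                "confidence": round(score, 2),
                "needs_review": score < review_threshold,
            }
        )
    return suggestions


if __name__ == "__main__":
    for row in match_columns(["Customer_ID", "dob"], ["customer id", "date_of_birth"]):
        print(row)
```

In this sketch, `Customer_ID` matches `customer id` exactly after normalization, while the abbreviation `dob` is paired with `date_of_birth` only at low confidence and is flagged for review. Abbreviations and semantically related but lexically distant names are the cases where a purely lexical baseline of this kind is weakest.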
Files
| Name | Size |
|---|---|
| SJMD-303-2025-98-104.pdf (md5:dc08ca1bf109231e5ba72f840ecdfc06) | 756.3 kB |