Published September 14, 2025 | Version v1
Journal article Open

Leveraging Large Language Models for Automated Schema Matching and Data Transformation in Enterprise ETL Pipelines

Authors/Creators

  • 1. Independent Researcher, USA

Description

Modern enterprises face significant challenges when integrating heterogeneous data sources: schemas vary, naming conventions are inconsistent, and real-world datasets are noisy, so integration has traditionally required extensive manual intervention. This work presents a novel generative AI framework that uses Large Language Models (LLMs) augmented with retrieval-based techniques to automate schema matching, column-level mapping, and transformation rule generation within ETL pipelines. The framework combines metadata-aware prompting strategies, domain-specific exemplar retrieval through retrieval-augmented generation (RAG), and iterative self-refinement to produce high-quality mapping suggestions alongside executable transformation code in SQL and pandas. Confidence scoring supports human-in-the-loop validation, while adaptive feedback mechanisms enable continuous improvement from user corrections. The system can handle evolving schemas and multilingual datasets across diverse enterprise domains. Experimental evaluation on synthetic datasets and established benchmarks shows substantial improvements in matching accuracy, precision, and recall over traditional name-similarity metrics and classical machine learning classifiers. The framework also addresses critical deployment considerations, including data privacy compliance, operational scalability, and hallucination mitigation. The results indicate significant potential for reliable large-scale enterprise adoption through transparent, auditable automated ETL workflows that maintain data quality standards while reducing manual overhead.
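To make the comparison concrete, the following is a minimal sketch of the kind of name-similarity baseline the description mentions, combined with a confidence threshold that routes uncertain matches to a human reviewer. It is not the authors' framework and uses no LLM or retrieval step. The function names, the example column names, and the 0.8 threshold are all hypothetical choices made for illustration, and only the Python standard library is assumed.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase a column name and drop separators, so 'Cust_ID' becomes 'custid'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_columns(source, target, review_threshold=0.8):
    """Pair each source column with its most similar target column.

    The similarity score doubles as a crude confidence value. Pairs scoring
    below review_threshold (a hypothetical cutoff) are flagged for human review.
    Assumes target is non-empty.
    """
    suggestions = []
    for src in source:
        best = max(target, key=lambda tgt: similarity(src, tgt))
        score = similarity(src, best)
        suggestions.append(
            {
                "source": src,
                "target": best,
                "confidence": round(score, 2),
                "needs_review": score < review_threshold,
            }
        )
    return suggestions


if __name__ == "__main__":
    # Hypothetical source and target schemas.
    source_cols = ["cust_id", "dob", "email_addr"]
    target_cols = ["CustomerID", "BirthDate", "EmailAddress"]
    for suggestion in match_columns(source_cols, target_cols):
        print(suggestion)
```

A purely lexical baseline like this cannot relate abbreviations such as "dob" to "BirthDate", which is the kind of gap the description says semantic, metadata-aware matching is meant to close. The review flag shows one simple way a confidence score could drive human-in-the-loop validation.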

Files

SJMD-303-2025-98-104.pdf (756.3 kB)
md5:dc08ca1bf109231e5ba72f840ecdfc06