Leveraging Large Language Models for Automated Schema Matching and Data Transformation in Enterprise ETL Pipelines

Annapurneswar Putrevu

doi:10.5281/zenodo.17117079

Published September 14, 2025 | Version v1

Journal article Open

Annapurneswar Putrevu¹

1. Independent Researcher, USA

Modern enterprises face significant challenges when integrating heterogeneous data sources: schema variability, inconsistent naming conventions, and noisy real-world datasets have traditionally required extensive manual intervention. This work presents a novel generative AI framework that uses Large Language Models (LLMs) augmented with retrieval-based techniques to automate schema matching, column-level mapping, and transformation rule generation within ETL pipelines. The framework combines metadata-aware prompting strategies, domain-specific exemplar retrieval through retrieval-augmented generation (RAG), and iterative self-refinement to produce high-quality mapping suggestions alongside executable transformation code in SQL and pandas. Confidence scoring supports human-in-the-loop validation, while adaptive feedback mechanisms enable continuous improvement from user corrections. The system can handle evolving schemas and multilingual datasets across diverse enterprise domains. Experimental evaluation on synthetic datasets and established benchmarks shows substantial improvements in matching accuracy, precision, and recall over traditional name-similarity metrics and classical machine learning classifiers. The framework also addresses critical deployment considerations, including data privacy compliance, operational scalability, and hallucination mitigation. The results indicate significant potential for reliable large-scale enterprise adoption through transparent, auditable automated ETL workflows that maintain data quality standards while reducing manual overhead.
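
For orientation, the kind of name-similarity baseline the abstract compares against, together with a confidence threshold that routes uncertain matches to human review, might be sketched as follows. This is an illustrative sketch only and not the author's implementation: the function names, the normalization step, the use of Python's `difflib.SequenceMatcher` as the similarity measure, and the `review_threshold` value are all assumptions introduced here.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase a column name and drop separators, e.g. 'Cust_ID' -> 'custid'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def name_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_columns(source, target, review_threshold=0.8):
    """Map each source column to its best-scoring target column.

    Returns a list of (source, target, score, needs_review) tuples.
    review_threshold is a hypothetical cut-off, not a value from the paper:
    matches scoring below it are flagged for human-in-the-loop validation.
    """
    results = []
    for s in source:
        best = max(target, key=lambda t: name_similarity(s, t))
        score = name_similarity(s, best)
        results.append((s, best, round(score, 2), score < review_threshold))
    return results


if __name__ == "__main__":
    # Hypothetical schemas, invented for illustration.
    source_cols = ["Cust_ID", "order_dt"]
    target_cols = ["customer_id", "order_date", "amount"]
    for row in match_columns(source_cols, target_cols):
        print(row)
```

A purely lexical score like this cannot relate columns whose names share no characters, such as synonyms or names in different languages, which is the gap the LLM-based framework described above is meant to close.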

Files

SJMD-303-2025-98-104.pdf
