Published April 1, 2025 | Version v1
Journal article | Open Access

Exploring the Potential of Offline LLMs in Data Science: A Study on Code Generation for Data Analysis

  • 1. School of Electrical and Computer Engineering, Institute of Communication and Computer Systems, National Technical University of Athens, Athens, Greece
  • 2. Research & Innovation Development Department, Netcompany-Intrasoft S.A., Luxembourg, Luxembourg

Description

Large Language Models (LLMs) have recently attracted considerable attention from the scientific community due to their advanced capabilities and their potential to serve as vital tools across various industries and academic fields. An important application domain for LLMs is Data Science, where they could enhance the efficiency of Data Analysis and Profiling tasks. By integrating LLMs into Data Analytics tools, end-users could issue data analysis queries directly in natural language, bypassing the need for specialized user interfaces. However, because some organizations handle sensitive data, relying on established, cloud-based LLMs is often inadvisable. This article explores the feasibility and effectiveness of a standalone, offline LLM in generating code for data analytics from a set of natural language queries. A methodology tailored to a code-specific LLM is presented, evaluating its performance in generating Python Spark (PySpark) code that successfully produces the desired result. The model is assessed on its efficiency and its ability to handle natural language queries of varying complexity, exploring the potential for wider adoption of offline LLMs in future data analysis frameworks and software solutions.
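The workflow the abstract describes — translating a natural-language analytics query into executable PySpark code via an offline, code-specific LLM — can be sketched roughly as below. The prompt template, the `build_prompt`/`query_local_llm` helpers, and the local endpoint URL are illustrative assumptions for a locally hosted model behind an OpenAI-compatible completions API, not the authors' actual implementation.

```python
import json
import urllib.request

# Hypothetical prompt template for an offline, code-specific LLM.
# The wording and schema format are assumptions for illustration.
PROMPT_TEMPLATE = (
    "You are a data analysis assistant. Given the table schema below, "
    "write PySpark code that answers the user's question. "
    "Return only runnable Python code.\n\n"
    "Schema: {schema}\n"
    "Question: {question}\n"
)


def build_prompt(question: str, schema: dict) -> str:
    """Render the natural-language query and table schema into one prompt."""
    schema_text = ", ".join(f"{col}: {dtype}" for col, dtype in schema.items())
    return PROMPT_TEMPLATE.format(schema=schema_text, question=question)


def query_local_llm(prompt: str,
                    url: str = "http://localhost:1234/v1/completions") -> str:
    """Send the prompt to a locally served model; assumes an
    OpenAI-style completions endpoint (an assumption, not a given)."""
    payload = json.dumps({"prompt": prompt, "max_tokens": 512}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]


if __name__ == "__main__":
    # Example: a simple analytics question over a hypothetical titles table.
    schema = {"title": "string", "release_year": "int", "genre": "string"}
    print(build_prompt("How many titles were released per year?", schema))
```

In a setting like the one studied, the returned PySpark snippet would then be executed against the loaded dataset and its output compared with the expected analysis result.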

Files

Exploring_the_Potential_of_Offline_LLMs_in_Data_Science_A_Study_on_Code_Generation_for_Data_Analysis.pdf

Additional details

Identifiers

ISSN
2169-3536

Funding

European Commission
TENSOR - Reliable biomeTric tEchNologies to asSist Police authorities in cOmbating terrorism and oRganized crime (Grant No. 101073920)
