Published April 1, 2025 | Version v1
Journal article | Open Access

Exploring the Potential of Offline LLMs in Data Science: A Study on Code Generation for Data Analysis

  • 1. School of Electrical and Computer Engineering, Institute of Communication and Computer Systems, National Technical University of Athens, Athens, Greece
  • 2. Research & Innovation Development Department, Netcompany-Intrasoft S.A., Luxembourg, Luxembourg

Description

Large Language Models (LLMs) have recently attracted considerable attention from the scientific community due to their advanced capabilities and their potential to serve as vital tools across various industries and academic fields. An important application domain for LLMs is Data Science, where they could enhance the efficiency of Data Analysis and Profiling tasks. By integrating LLMs into Data Analytics tools, end-users could issue data analysis queries directly in natural language, bypassing the need for specialized user interfaces. However, because some organizations handle sensitive data, relying on established, cloud-based LLMs is often inadvisable. This article explores the feasibility and effectiveness of a standalone, offline LLM in generating code for data analytics from a set of natural language queries. A methodology tailored to a code-specific LLM is presented, evaluating its performance in generating Python Spark (PySpark) code that successfully produces the desired result. The model is assessed on its efficiency and its ability to handle natural language queries of varying complexity, exploring the potential for wider adoption of offline LLMs in future data analysis frameworks and software solutions.
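The workflow the abstract describes — translating a natural-language analytics query into executable PySpark code via an offline, code-specific LLM — can be sketched roughly as below. The prompt template, the `build_prompt`/`query_local_llm` helpers, and the local endpoint URL are illustrative assumptions for a locally hosted model behind an OpenAI-compatible completions API, not the authors' actual implementation.

```python
import json
import urllib.request

# Hypothetical prompt template for an offline, code-specific LLM.
# The wording and schema format are assumptions for illustration.
PROMPT_TEMPLATE = (
    "You are a data analysis assistant. Given the table schema below, "
    "write PySpark code that answers the user's question. "
    "Return only runnable Python code.\n\n"
    "Schema: {schema}\n"
    "Question: {question}\n"
)


def build_prompt(question: str, schema: dict) -> str:
    """Render the natural-language query and table schema into one prompt."""
    schema_text = ", ".join(f"{col}: {dtype}" for col, dtype in schema.items())
    return PROMPT_TEMPLATE.format(schema=schema_text, question=question)


def query_local_llm(prompt: str,
                    url: str = "http://localhost:1234/v1/completions") -> str:
    """Send the prompt to a locally served model; assumes an
    OpenAI-style completions endpoint (an assumption, not a given)."""
    payload = json.dumps({"prompt": prompt, "max_tokens": 512}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]


if __name__ == "__main__":
    # Example: a simple analytics question over a hypothetical titles table.
    schema = {"title": "string", "release_year": "int", "genre": "string"}
    print(build_prompt("How many titles were released per year?", schema))
```

In a setting like the one studied, the returned PySpark snippet would then be executed against the loaded dataset and its output compared with the expected analysis result.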

Files

Exploring_the_Potential_of_Offline_LLMs_in_Data_Science_A_Study_on_Code_Generation_for_Data_Analysis.pdf

Additional details

Identifiers

ISSN
2169-3536

Funding

European Commission
TENSOR - Reliable biomeTric tEchNologies to asSist Police authorities in cOmbating terrorism and oRganized crime (Grant No. 101073920)
