Published July 12, 2024 | Version v1
Conference paper | Open Access

FlexEval: a customizable tool for chatbot performance evaluation and dialogue analysis

  • 1. Bielefeld University, Germany
  • 2. University of Alberta, Canada

Description

Large language models (LLMs) are increasingly being deployed in user-facing applications in educational settings. Deployed applications often augment LLMs with fine-tuning, custom system prompts, and moderation layers to achieve particular goals. However, the behaviors of LLM-powered systems are difficult to guarantee, and most existing evaluations focus instead on the performance of unmodified "foundation" models. Tools for evaluating such deployed systems are currently sparse, inflexible, or difficult to use. In this paper, we introduce an open-source tool called FlexEval. FlexEval extends OpenAI Evals to allow developers to construct customized, comprehensive automated evaluations of both pre-production and live conversational systems. FlexEval runs locally and can be easily modified to meet the needs of application developers. Developers can evaluate new LLM applications by creating function-based or machine-graded metrics and obtaining results for chat completions or entire conversations. To illustrate FlexEval's utility, we share two use cases involving content moderation and utterance classification. We built FlexEval to lower the effort required to implement automated testing and evaluation of LLM applications. The code is available on GitHub.
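For readers unfamiliar with the idea, a minimal sketch of what a function-based metric over an entire conversation might look like is shown below. This is an illustrative assumption only: the function name, signature, and conversation format are hypothetical and are not taken from FlexEval's actual API; see the GitHub repository for the real interface.

    # Hypothetical sketch of a function-based conversation metric.
    # The name, signature, and message format are assumptions for illustration,
    # not FlexEval's actual API.
    from typing import Dict, List


    def assistant_turn_length(conversation: List[Dict[str, str]]) -> float:
        """Average word count of assistant turns in a conversation.

        Each turn is assumed to be a dict like
        {"role": "assistant", "content": "..."}.
        """
        assistant_turns = [
            turn["content"]
            for turn in conversation
            if turn.get("role") == "assistant"
        ]
        if not assistant_turns:
            return 0.0
        return sum(len(t.split()) for t in assistant_turns) / len(assistant_turns)


    if __name__ == "__main__":
        demo = [
            {"role": "user", "content": "Can you explain photosynthesis?"},
            {"role": "assistant", "content": "Plants convert light into chemical energy."},
        ]
        print(assistant_turn_length(demo))  # average words per assistant turn

A metric of this shape could be applied either to single chat completions or to full conversation logs, which is the distinction the paragraph above draws between evaluating completions and entire conversations.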

Files

2024.EDM-posters.107.pdf (393.5 kB)
md5:f1b9810ef352af7eb1d5efb421dce32c