FlexEval: a customizable tool for chatbot performance evaluation and dialogue analysis
Contributors
Editors:
- 1. Bielefeld University, Germany
- 2. University of Alberta, Canada
Description
Large language models (LLMs) are increasingly being deployed in user-facing applications in educational settings. Deployed applications often augment LLMs with fine-tuning, custom system prompts, and moderation layers to achieve particular goals. However, the behaviors of LLM-powered systems are difficult to guarantee, and most existing evaluations focus instead on the performance of unmodified "foundation" models. Tools for evaluating such deployed systems are currently sparse, inflexible, or difficult to use. In this paper, we introduce an open-source tool called FlexEval. FlexEval extends OpenAI Evals to allow developers to construct customized, comprehensive automated evaluations of both pre-production and live conversational systems. FlexEval runs locally and can be easily modified to meet the needs of application developers. Developers can evaluate new LLM applications by creating function-based or machine-graded metrics and obtaining results for chat completions or entire conversations. To illustrate FlexEval's utility, we share two use cases involving content moderation and utterance classification. We built FlexEval to lower the effort required to implement automated testing and evaluation of LLM applications. The code is available on GitHub.
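To illustrate the idea of a function-based metric over an entire conversation, the sketch below shows one possible form such a metric could take. The function name, message format, and invocation are assumptions for illustration only, not FlexEval's documented API.

```python
# Hypothetical sketch of a function-based conversation metric.
# FlexEval's actual metric signature and registration mechanism may differ.
from typing import Dict, List


def mean_assistant_response_length(conversation: List[Dict[str, str]]) -> float:
    """Average word count of assistant turns in a conversation.

    `conversation` is assumed to be a list of chat messages in the common
    {"role": ..., "content": ...} format used by chat-completion APIs.
    """
    assistant_turns = [m["content"] for m in conversation if m.get("role") == "assistant"]
    if not assistant_turns:
        return 0.0
    return sum(len(turn.split()) for turn in assistant_turns) / len(assistant_turns)


if __name__ == "__main__":
    example = [
        {"role": "user", "content": "Can you explain photosynthesis?"},
        {"role": "assistant", "content": "Plants convert sunlight, water, and CO2 into glucose and oxygen."},
    ]
    print(mean_assistant_response_length(example))  # 10.0
```

A metric like this returns a single score per conversation; a machine-graded metric would instead delegate the judgment to an LLM grader.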
Files
- 2024.EDM-posters.107.pdf (393.5 kB)