How to Evaluate (and Improve) Your LLM Apps
An overview and example case study
Although large language models (LLMs) seem able to handle any task we throw at them, this flexibility turns into unpredictability when engineering new AI applications. This is where evaluations (i.e., evals) can help. In this article, I’ll discuss three ways of evaluating LLM systems to ensure they actually do what you want. At the end, I’ll share a practical example of using evals to systematically improve the prompt for a YouTube video-to-blog-post converter.
LLMs have made it faster and easier to build AI apps. Instead of collecting mountains of data to train task-specific ML models, we can now adapt a general-purpose LLM to a use case simply by writing a prompt.
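For instance, adapting an LLM to a new task can be as little as one prompted API call. Here’s a minimal sketch using the OpenAI Python client; the model name and the summarization task are placeholders for illustration, not part of the case study later in this article.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Adapt a general-purpose LLM to a specific task purely through the prompt
prompt = """Summarize the following customer review in one sentence, \
then label its sentiment as Positive, Negative, or Neutral.

Review: "The battery life is great, but the screen scratches far too easily."
"""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name; swap in whichever model you use
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```

No training data, no fine-tuning, just instructions in plain English. That convenience, however, is exactly where the trouble starts.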
While this significantly expands the range of solutions we can build with AI, it comes with a core limitation: the relationship between a prompt and an LLM’s performance on a specific task isn’t always straightforward.