
How to Evaluate (and Improve) Your LLM Apps

An overview and example case study

Shaw Talebi
12 min read · Mar 13, 2025

Although large language models (LLMs) seem capable of handling almost any task we throw at them, that flexibility turns into unpredictability when engineering new AI applications. This is where evaluations (i.e., evals) can help. In this article, I’ll discuss 3 ways of evaluating LLM systems to ensure they actually do what you want. At the end, I’ll share a practical example of using evals to systematically improve the prompt for a YouTube video-to-blog-post converter.

Photo by Diana Polekhina on Unsplash

LLMs have made it faster and easier to build AI apps. Instead of collecting mountains of training data to train task-specific ML models, today we can simply adapt a general-purpose LLM to a use case by writing a prompt.
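To make this concrete, here’s a minimal sketch of what “adapting an LLM by writing a prompt” looks like in practice, using the OpenAI Python client. The system prompt, model name, and helper function are illustrative placeholders, not the exact setup from the case study later in the article.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

# All task-specific behavior comes from the prompt, not from model training
system_prompt = (
    "You are a writing assistant that converts YouTube video transcripts "
    "into well-structured blog posts."
)

def transcript_to_blog(transcript: str) -> str:
    """Turn a raw video transcript into a draft blog post with a single prompt."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; swap in whichever model you use
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Write a blog post based on this transcript:\n\n{transcript}"},
        ],
    )
    return response.choices[0].message.content
```

A few lines of prompt now stand in for what used to require a labeled dataset and a custom-trained model.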

While this significantly expands the range of solutions we can build with AI, it comes with a core limitation: the relationship between a prompt and an LLM’s performance on a specific task isn’t always straightforward.

Vibe Checks
