
CEO, Fern AI, AI for Legal
This course is worth the time. Take it.
We've been building evals for over a year. We found this course invaluable for determining where we could improve our process, identifying tools and resources, and engaging with others in the community. Shreya and Hamel have produced a top-notch course that's worth every minute of time invested.
Hardware Engineering Leader at Cisco
This course is a game changer.
This course was a game-changer for me. My biggest takeaway was learning a structured approach to system traces, which has given me a reliable framework for making meaningful progress. The hands-on content was fantastic; I learn best by doing, so I truly appreciated the practical exercises. I now have the 'flywheel' I was missing to move forward with my own app development. I highly recommend this course and look forward to even more hands-on content in the future!
Founder, Socratify
1000x ROI
Taking a structured approach to evals is a game changer. Shreya and Hamel are teaching a skill with 1000x ROI in the age of AI. At Socratify, we're building a career coach that sharpens critical thinking skills through debates on business news and other topics. It's inherently challenging to ensure high-quality LLM interactions, and going through error analysis has been transformative for the product development process. I can't wait to release the next version! I would absolutely recommend this course to any founder working with LLMs.
Author and Principal at Feldroy, LLC / Software Artisan at Kraken Tech
Pragmatic techniques, free of jargon.
I learned optimal techniques for expediting quality improvements in AI applications. We were taught practical methodologies based on straightforward metrics that keep humans in the loop to ensure quality results. Hamel and Shreya were quite good at explaining all terms with real-world examples taken from experience. They didn't load the course with jargon. The homework exercises were challenging yet achievable. It's been fun and educational to get the work done. I recommend the course to anyone who wants to learn incredible tricks and tips for building AI applications.
Data Scientist
Tools to quantitatively improve your AI product
Hamel and Shreya do such a great job of equipping you with the tools to quantitatively improve your AI product. This is a must-take course for anyone working with LLM-powered applications.
Data Scientist
Course Instructors Went Above & Beyond
As someone with prior experience designing human evaluations and developing metrics for a specific product, I took this course to broaden my understanding of AI evaluation practices, especially for agentic systems and RAG, as well as to deepen my knowledge of evaluation infrastructure such as CI/CD and trace review interfaces. This course delivered far more than I expected. It includes a comprehensive course reader that could stand on its own as a reference book, live classes packed with hands-on examples, and over 10 guest speakers who shared practical insights into evaluation strategies and even how to build your own evaluation tools for different use cases. What really set the course apart was the level of support. Hamel and Shreya were incredibly supportive throughout the course. They hosted office hours, thoughtfully answered every question on Discord, and even brought in two experienced professionals to offer additional hands-on support and help with (optional) homework. They went above and beyond to make sure everyone was learning and participating. I also really appreciated hearing from other students about the evaluation challenges they were facing in their own work, and watching Hamel and Shreya think through solutions with them in real time was just as educational as the prepared content. Highly recommend this course if you're working on or even adjacent to LLM applications. Whether you’re focused on product quality, engineering, or research, you’ll walk away with frameworks, tools, and best practices you can use right away.
Wayde Gilliam
"If you are building with AI, you need this course!"
Founder, Wicked Data LLC
"Take this course to go from a good to a great AI Engineer!"
Owner at Kentro Tech LLC
"Practical techniques rarely taught elsewhere. Highly recommend!"
Adam Dadson
GTM @ OpenAI

Senior Technical Program Manager, Netflix
This course helps you get expected outcomes from your AI
A colleague reached out to me and recommended “AI Evals For Engineers & PMs,” offered by Hamel H. and Shreya Shankar. I consider myself an eternal learner, and knew evaluations were a critical yet often overlooked component of successful GenAI implementation. Everyone keeps asking me how to stay ahead of GenAI. Well, you take classes like this one so you can be on the cutting edge of how to ensure you get the expected outcomes from your future AI agents. It was so dense with useful information and guest speakers that I honestly couldn’t keep up, but after the course is over, you continue to have access to the recordings.
Jeroen Latour
FinTech at Booking.com
Juan Maturino
Software Engineer at Edua
Removed a malicious system prompt and reversed falling engagement—user interactions increased.
Before this course, my instinct was to jump straight into axial coding. That meant I leaned heavily on my own presuppositions about what failures I thought would show up. By doing that, I was blind to unexpected issues. It’s like hearing about someone before meeting them—you imagine who they are, but until you actually meet them, you don’t see the full picture. With data products and LLM pipelines, the same thing happens.
Take a healthcare chatbot as an example. Going in, I assumed failures would only be factual: did it answer the medical question correctly? If I jumped straight into axial coding, I’d only tag factual errors and conclude the model was nearly flawless. From that narrow view, I might even think the product was destined for massive success.
But after this course, I learned to take a step back and examine the data without presuppositions. By looking at traces more openly, I discovered a hidden failure mode: the chatbot was mean. It was calling people “fat,” “ugly,” “stupid,” and generally creating a hostile experience. No factual errors—just a terrible user experience. This was something axial coding alone, or automated LLM-as-a-judge evaluation, would have missed without prior human review.
Digging deeper, I found the root cause: a disgruntled former employee had slipped “be mean when answering” into the system prompt. Once we fixed that, user engagement improved dramatically. The key lesson I took from the course is that real error analysis starts with open coding and direct observation. Skipping that step leaves you blind to the most important problems.
Hima Tk
Lead PM - AI/ML Products at CultureAmp
Turned costly trial-and-error into a data-driven plan that avoided massive retraining and prioritized fixes.
I worked with a supermarket chain to build an AI system that could count inventory from shelf photos. At first, the system struggled with issues like blurry images, background clutter, and confusingly similar packaging. Before this course, my approach would have been driven by intuition and trial-and-error. I might have looked at a handful of errors, jumped to a conclusion like “the model is just bad at distinguishing Coke cans,” and proposed a vague fix such as retraining with thousands of new images. That would have been expensive, slow, and unfocused—and it might not have solved the real problem, like blurry photos from staff.
After this course, my approach is now structured and data-driven. Instead of guessing, I use error analysis to diagnose issues systematically. I start by gathering a representative failure set and tagging images to capture why errors occur—blurry images, poor lighting, occlusion, similar or new packaging, unusual angles, background clutter. From there, I group these into a taxonomy of failures and calculate how much each category contributes to overall errors. This creates a prioritized roadmap for improvement.
For example, when Image Quality and Similar Classes accounted for 75% of failures, I could recommend high-impact, targeted fixes: improve photo capture guidelines and augment training data with blurred images for the first, and collect more Diet Coke vs. Coke Zero examples for the second. Instead of vague trial-and-error, I now have a clear, quantitative path to better results.
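As a rough illustration of the tallying step described above (the tags and counts here are hypothetical placeholders, not data from the project), grouping open-coded failure tags and ranking them by their share of total errors takes only a few lines of Python:

```python
from collections import Counter

# Hypothetical failure tags from an open-coding pass over failed shelf photos.
tags = [
    "blurry_image", "similar_packaging", "blurry_image", "occlusion",
    "blurry_image", "similar_packaging", "poor_lighting", "blurry_image",
]

counts = Counter(tags)
total = sum(counts.values())

# Rank categories by their contribution to overall errors: the top of this
# list is the prioritized roadmap for fixes.
for tag, n in counts.most_common():
    print(f"{tag}: {n}/{total} ({n / total:.0%})")
```

In practice the tags would come from a labeled spreadsheet or trace log rather than a hard-coded list, but the prioritization logic is the same.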
Margarita Fakih
Business Operations and Development
Saved me hours of rewriting by creating a reusable framework that prevents repeated AI errors.
As a product manager, I often struggled with inconsistencies in user stories generated by AI tools. Even when my prompts were clear, the outputs would miss key requirements or include irrelevant details. Before this course, my instinct was to keep tweaking the prompt through trial and error until I got something usable. While that sometimes worked, it was inefficient and didn’t explain why the model was failing.
After this course, my approach is much more systematic. I start by defining the key dimensions of a good user story—clarity, completeness, alignment with acceptance criteria, and the right level of technical detail. Then I collect flawed outputs and apply open coding to label issues like “missing acceptance criteria,” “misinterpreted intent,” or “overly generic details.” From there, I build a taxonomy of failure types, which lets me organize and prioritize problems. Finally, I design a feedback loop: the LLM generates a user story, checks it against the taxonomy, and revises if any known issues are detected.
Instead of wasting hours on one-off fixes, I now have a reusable framework that scales across projects. What was once frustrating trial-and-error has become a structured, repeatable process for improving quality.
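A minimal sketch of the generate-check-revise loop described above, assuming a taxonomy of known issues. The taxonomy entries, detector rules, and the `revise` callback are hypothetical stand-ins for real LLM calls; only the control flow is the point here:

```python
# Hypothetical taxonomy: each known failure type maps to a simple detector.
# In a real pipeline these checks might themselves be LLM-as-judge calls.
KNOWN_ISSUES = {
    "missing acceptance criteria": lambda story: "Acceptance Criteria" not in story,
    "overly generic details": lambda story: "TBD" in story,
}

def check(story: str) -> list[str]:
    """Return the taxonomy labels that apply to this draft user story."""
    return [label for label, detect in KNOWN_ISSUES.items() if detect(story)]

def revise_until_clean(story: str, revise, max_rounds: int = 3) -> str:
    """Generate -> check -> revise loop: re-prompt with detected issues
    until the draft passes the taxonomy checks or rounds run out."""
    for _ in range(max_rounds):
        issues = check(story)
        if not issues:
            break
        story = revise(story, issues)  # e.g., an LLM call fed the issue labels
    return story
```

The `revise` argument would normally wrap an LLM call that receives the draft plus the detected issue labels and returns an improved draft.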
Júlio Paulillo
CRO @ Agendor
I turned scattered agent errors into prioritized fixes, enabling focused, measurable improvements.
Building a personal assistant for salespeople is my day-to-day work. One of the tools the agent uses fetches activities from the CRM, but I noticed the LLM sometimes hallucinated—passing unnecessary arguments when calling the tool. Before this course, I would have gone straight into prompt engineering, rewriting tool descriptions or adding more examples to try to fix the issue.
After this course, my approach is different. I start by defining key dimensions such as user persona, intent (e.g., “fetch activities”), and activity type (past due, finished, pending). From there, I can ask an LLM to generate tuples from these dimensions, giving me a structured way to build a synthetic eval dataset. If traces of user interactions are already logged, I filter by intent and begin open coding the different failure modes I see. After reviewing dozens or even hundreds of examples, I then use an LLM to help categorize the failures. This lets me prioritize the categories that matter most and focus fixes where they’ll have the biggest impact.
Instead of reactive prompt tweaking, I now have a systematic framework for diagnosing failures and improving my assistant in a repeatable way.
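The dimension-tuple step described above can be sketched like this; the dimension names come from the testimonial, while the concrete values and the code itself are an illustrative sketch, not the actual implementation:

```python
from itertools import product

# Dimensions from the error-analysis setup; the values are placeholders.
personas = ["sales_rep", "sales_manager"]
intents = ["fetch_activities"]
activity_types = ["past_due", "finished", "pending"]

# Every combination becomes the seed for one synthetic eval example,
# e.g. a prompt an LLM expands into a realistic user query plus the
# expected tool call.
eval_tuples = list(product(personas, intents, activity_types))

for persona, intent, activity in eval_tuples:
    print(persona, intent, activity)
```

With 2 x 1 x 3 dimension values this yields 6 seed tuples; real dimension grids would be larger, and each tuple is then expanded into a full synthetic trace for evaluation.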
Tatyana Kazakova
QA Engineer :) at Qazaco
Turned random fixes into a repeatable process that improved the whole system and proved changes actually worked.
Before this course, I would just fix issues as I spotted them—tweak a prompt here, change a setting there—and hope the next run looked better. Sometimes it worked, but I never had the full picture of what was really going wrong or how often certain problems appeared.
After this course, I’ve learned to slow down at the start: define what I actually want to measure (relevance, completeness, context handling), collect a solid set of examples, and trace where errors first start to show up. From there, I group similar issues into clear failure types, which makes patterns obvious and helps me prioritize what to fix.
Now the process feels less like random whack-a-mole and more like a structured, repeatable system. Instead of chasing one-off issues, I can improve the whole system and know whether the changes are actually working.
Andrew Chaffin
CEO at Argo Analytics
Structured error analysis gave me a clearer method to iterate and actually get the results I needed.
A while back, I used an AI writing assistant to draft a personal statement for a fellowship. I gave it a detailed prompt with my goals, values, and experience, but the output was generic and missed the emotional tone I wanted. At first, I just kept rephrasing the prompt, hoping it would eventually get it right. Instead, it swung between being too formal or inventing details I never mentioned. It was frustrating, and trial-and-error didn’t get me far.
After this course, I’d approach it completely differently. I’d start by defining what “good” means for the task—tone alignment, factual accuracy, and personal relevance. Then I’d collect flawed outputs and open code them: did the model invent details, ignore parts of the prompt, or lose the emotional tone? From there, I’d build a taxonomy of failures—like hallucination, tone mismatch, or misunderstanding the prompt—and use it to spot patterns. Maybe I’d realize the model struggles when the prompt is too abstract or lacks emotional cues.
Compared to my old approach of hoping a better version would show up, this gives me a clear, methodical way to iterate. It turns what used to be trial-and-error frustration into a structured process for actually getting the results I need.
Amol Shah
Head of Product at Count
I can now pinpoint errors and measure reductions in each error bucket—turning guesswork into measurable improvement.
When I first built a small chatbot to recommend books based on user mood, it often gave wildly off-base suggestions—like pairing someone “feeling nostalgic” with a cutting-edge tech thriller. Back then, I just tweaked the prompt or guessed at what the model might “understand” about mood. It was trial and error with no clear sense of what was actually going wrong.
After this course, I’d tackle the problem systematically. I’d collect failures by running the bot across a fixed set of test prompts and logging every mismatch. Then I’d open code the bad outputs—labels like “misread tone,” “genre bias,” or “keyword fixation.” From there, I’d define key dimensions of failure (emotional alignment, genre diversity, keyword vs. context) and group them into a taxonomy, like “semantic misinterpretation.” By quantifying how often each type occurs, I’d know where to focus first.
Armed with that data, I could design targeted fixes: refining prompts with explicit mood-to-genre mappings, adding checks for emotional themes, or diversifying candidate genres. Instead of hacking prompts by gut feel, I’d have a transparent, repeatable process that shows whether error rates are actually dropping.
Lada Kesseler
Lead Developer at Logic20/20
I can now predict and prevent code quality issues instead of treating them as isolated bugs.
I often ran into code quality issues when using AI assistants, but I didn’t have a structured way to make sense of them. Before this course, I would just label outputs as “messy code” without really digging into the underlying problems.
After this course, I now analyze them systematically across dimensions—things like hardcoded tests, long methods, poor formatting, bad naming, poor architecture choices, duplication, dead code, or ignoring available quality tools. By open coding these issues and building a taxonomy, I can see patterns emerge instead of treating each problem as random or isolated.
The key shift for me is realizing these aren’t one-off mistakes but systematic failure modes that appear under specific conditions. With that understanding, I can both predict and prevent quality issues, rather than just reacting to them after the fact.
Maruti Agarwal
Expert AI Research Scientist at Datasite
Gained clarity on what to fix first, transforming my entire approach to evolving the system.
I applied what I learned the very same day we covered error analysis. I was working on an industry classification system and followed a structured process: I asked annotators to provide detailed feedback on wrong predictions, reviewed their notes to improve annotation quality, then parsed all the feedback and used ChatGPT to categorize it into six major error patterns. Finally, I shared those patterns and error percentages with stakeholders.
After this course, error analysis feels much more structured. Instead of just collecting feedback in an ad hoc way, I now have a clear method that gives me visibility into what problems matter most and what to solve first. It’s changed how I think about evolving the system overall.
Sergio Soage
AI R&D Lead at Diligent
I built a structured understanding of failures, yielding actionable insights instead of whack-a-mole fixes.
Now I understand how to systematically explore the problem space, identify patterns across multiple failures, and build a structured understanding of why and when the system fails - not just that it fails. This leads to more actionable insights for improvement rather than playing whack-a-mole with individual issues.
Karen Lam
Product Design
I gained clarity and confidence to systematically narrow the gap between AI failures and human understanding.
I’m a product designer with no prior AI Evals experience. Before this course, when I encountered unexpected or confusing results from the Recipe Bot in the first homework, my instinct was to just iterate on the system prompt in Cursor and manually test through the UI.
After this course, I’ve learned there’s a more systematic way to approach error analysis. Using open and axial coding, I can narrow the gap between AI system failures and human understanding through a step-by-step process. I especially appreciate that this framework is grounded in social science research practices like coding data and building taxonomies—and that it emphasizes doing the analysis manually to ensure accuracy, rather than offloading it entirely to AI.
I also see the value in wearing both the data scientist and product manager hats: questioning the data rigorously while bringing product knowledge into the decision-making. This approach gives me a structured, repeatable way to analyze failures instead of ad hoc trial and error.
Juan Maturino
Software Engineer at Edua
I stopped endless prompting and now systematically document failures to improve outcomes and efficiency.
In automated agentic code generation, I often ran into situations where the desired output was far from what the model produced. My old approach was to keep prompting the LLM until progress stalled, then spin up a new chat with a rephrased prompt and updated context. Eventually I’d accept whatever was “good enough” and finish the task myself.
After this course, I understand why that approach was limited. Evaluating code has two axes: reference-based (objective tests like unit tests) and reference-free (qualitative measures of style, readability, and design). Code isn’t just functional—it’s also expressive, like writing prose—so both dimensions matter.
Now, instead of endless prompt tweaking, I document failures in short form through open coding, then group and categorize them using axial coding. This helps me identify common failure patterns in the LLM’s output and design more robust system prompts targeted at those issues. What used to be trial-and-error guesswork is now a structured process for improving both the reliability and quality of generated code.
Chris McDonald
AI Team Leader at Comtrac
I now have the clarity and confidence to diagnose failures instead of ‘living on a prayer’.
At work, we use prompts and prompt engineering to turn selected inputs into specific outputs. Before this course, whenever I ran into unexpected results, my approach was to jump straight into the prompt and randomly change words until something worked. After a few tries, I might even hand the prompt, input, and output to an LLM and ask it to fix things. There was no hypothesis, no structure—just living on a prayer.
After this course, I have a far more systematic approach. If I encounter a problem now, I’d begin by collecting an initial dataset of around 100 traces. From there, I’d perform open and axial coding to build a taxonomy of failures. That structure gives me clarity about what’s really going wrong instead of just chasing random fixes.
What stands out to me is that the processes in this course are simple—not in the sense of easy, but in being concise and straightforward while still requiring real effort and understanding. As Richard Feynman said, “if you can explain something in simple terms, you understand it well.” That’s exactly how Hamel and Shreya have designed this course, and I’m grateful for it.
Roey Ben Chaim
Staff Engineer at Zenity
I can now pinpoint agents' core failures, turning vague vibes into clear, actionable fixes that improve agent performance.
The axial coding just hit different. Before this course, my approach to failures was more of a “vibe investigation,” poking around without a clear structure.
After this course, I now cluster failures systematically and trace them back to their core issues. Quantifying the errors into meaningful groups makes it much easier to see the main failure points. I finally feel like I have a proper way to identify the root problems in my agent instead of just guessing.
Ben Eyal
Research Engineer at Ai2 Israel
I gained clarity to find root causes and stop repeated agent confusion.
At work, we’re building Paper Finder, which (as the name suggests) should find papers. We wanted the agent to refuse certain requests so people wouldn’t treat it like a free ChatGPT. But we kept running into a strange behavior: the agent would refuse, ask the user a clarifying question, the user would reply “yes,” and then the agent would have no idea what they were talking about.
Before this course, we would have just dug through the logs, checked for crashes, and treated it like any other bug.
After this course, I’d handle it differently. I’d look closely at the traces of these failures, identify common patterns, form a hypothesis about why it was happening, and then test it systematically. In this case, the real issue was that history wasn’t being shared between two components: one asked the question, the other just saw “yes” with no context. By approaching it through error analysis, the root cause becomes clearer and easier to solve.
Annu Augustine
Founder, Product Coach at NedRock
Open coding gave me clarity into the model's real behavior, revealing failures my framework missed.
When I built a custom GPT for product managers to help write better user stories, I initially jumped straight into axial coding. I predefined categories of failure based on the INVEST framework (Independent, Negotiable, Valuable, Estimable, Small, Testable), which I often use when coaching teams. At the time, it felt like a solid, practical approach grounded in real-world product work.
After this course, I started applying open coding before forcing outputs into predefined boxes. That shift revealed patterns the INVEST framework would have completely missed. For example, some stories were overly complex even though they technically met the “Small” criteria, and others ignored edge cases or real-world exceptions not covered by INVEST at all.
Open coding gave me a clearer picture of how the model was actually behaving, rather than bending its outputs to fit categories I had assumed upfront. It’s a far more reliable way to uncover the real failure modes.
Co-Founder at Comprendo
Open coding gave us clarity on true error patterns, preventing overconfidence and costly misclassification.
Before this course, I didn’t fully appreciate the risk of skipping open coding. It’s easy to take a small sample, jump straight into categories, and gain false confidence in themes that don’t actually reflect the full range of errors. That’s the “when you only have a hammer, every problem looks like a nail” trap—imposing categories that miss important failure modes.
After this course, I see why open coding matters. It prevents premature categorization, helps me understand saturation, and surfaces the true diversity of errors. I’ve also learned to think more carefully about how evaluation rubrics should be designed. For some products, a “benevolent dictator” works—if one person truly has holistic expertise across every stage of the workflow. But for more complex systems, multiple experts are needed, each contributing perspective from their domain.
In my past work reviewing clinical trial protocols, no single reviewer understood every dimension—ethics, study design, and biostatistics each required deep, specialized expertise. The lesson from this course is clear: open coding reveals the real error space, and evaluation rubrics are strongest when designed with the right balance of expertise.
Product Manager, Analytics & AI at Axi
Sanity checks turned unreliable scores into business-aligned predictions I could trust.
I built a churn prediction model for a subscription service and evaluated it using standard metrics like accuracy, precision, and recall on a test dataset. At first, the high evaluator scores looked promising, but they gave me a false sense of confidence. In reality, the model was overfitting, producing outputs that didn’t even add up logically—for example, reporting fewer new onboarded customers than the combined total of retained and churned customers.
Before this course, I relied too heavily on evaluator scores, only realizing something was wrong when results felt “too good to be true.” I had to manually compare predictions with business reports and historical trends to uncover the discrepancies.
After this course, I know how to approach it differently. I would run cross-validation across multiple folds to confirm stability, add domain-specific sanity checks (like validating customer balances against business logic), and bring in qualitative stakeholder input. These practices create a stronger evaluation process—less dependent on raw metrics and more aligned with real-world trustworthiness.

George Job Vetticaden
VP of Products, AI Agents
AI Engineer, Vantager
Practical techniques that generalize regardless of the tools you use.
This course provides a great take on building reliable AI applications. It teaches practical techniques while developing intuition for evaluating LLM-based systems. What sets it apart is its tool-agnostic approach. Rather than focusing on specific platforms, it emphasizes systematic and scientific principles that apply everywhere.
Principal Data/AI Scientist/Engineer, Slido/Cisco
This course teaches material you can't find anywhere else. Investing in this course is a no-brainer.
"Why would a Principal Data Scientist take a course on evals? Shouldn't they know this already?!" Fair question. Here's why I think it's still worth it: 1. Learn from the best. LLM evals are still nascent, so learning from people doing this full-time across multiple contexts is invaluable. Game recognizes game, and as you'll learn in the very first week already, Shreya and Hamel are top-tier. 2. Get the full picture. Evals are more art than science right now. Getting a coherent view of best practices and mature end-to-end pipelines designed from first principles is rare. Their course reader alone is worth multiple times the price. 3. Build common vocabulary. If you're building impactful LLM products, you'll collaborate with PMs. Having both technical folks and PMs in sessions creates a shared language that bridges the gap -- something you can't find anywhere else for this topic. In other words, whether you're a PM, a Principal or a vibe coder building with LLMs, this course is simply a no-brainer.
Director of AI
This course helps you transform guesswork into actionable insights.
Evaluating generative AI often relies on abstract benchmarks disconnected from real-world outcomes and detached from practical experience. To bridge this gap, many rely on subjective impressions or “vibes” (i.e., the eye-test). The eye-test is important, since evaluators directly interact with the model in realistic contexts. However, vibes and qualitative evaluations are not particularly helpful in evaluating application-specific performance, consistency, bias, reliability, security, or return on investment. In contrast, application-specific evals reflect an essential day-to-day operational focus. They aim to assess whether a specific pipeline performs successfully on a particular task using realistic data. This course is an important step in transforming guesswork into actionable insights. Application-specific evals are not sufficient, but they are necessary, and so often overlooked. Check out this course. It covers a lot of important terrain.
Machine Learning Engineer
Highly recommend this course.
Hamel and Shreya's course, "AI Evals for Engineers and PMs," has been a great resource for learning how to tame the scary, wild world of LLM-based applications. It was fascinating to see how high-leverage it is to become "one with the data" and how to do that explicitly for LLMs and agents. Hamel and Shreya's vast combined expertise shows clearly in the course's lectures, practical exercises, and even a textbook. On top of that, the guest lectures provide even more gems of practical wisdom. I'd highly recommend this course to anyone serious about learning how to improve their LLM-based AI products.
Senior Director of Machine Learning, SponsorUnited
An absolute must. Valuable for any AI engineer and product manager.
The AI Evals course by Shreya and Hamel is an absolute must for everyone serious about putting AI applications into production. I have been following Hamel and Shreya's work for quite some time, and it was really awesome to learn from them all the concepts of error analysis, measurement best practices, LLM-as-Judge and how to make sure it is reliable with human evaluations, collaborative analysis of errors, evaluation of multi-turn chats, creation of datasets for CI/CD, etc. The last topic, on accuracy and cost optimization, is really useful, as we are seeing in our applications when scaling. All in all, this is an amazing set of vital information that is valuable for any AI engineer and product manager. Highly recommend this course to everyone.
Alan Chang
AI, Machine Learning, and Biology @ Stanford University
Maxime Lelièvre
Visiting Researcher @ Columbia | MSc Robotics & Data Science @ EPFL | Passionate about EdTech
Šimon Podhajský
Senior LLM/Data Engineer | 🤖 Evals, ML/AI, Python

Bryce York
AI/ML/LLM and UX-centric B2B startup product management leader with a love for zero-to-one product innovation • 12+ yrs PM, 7+ yrs AI/ML, 5+ yrs adtech • Writer/Speaker • Advisor

Senior Data Scientist @Amazon
Taking the course "Move beyond 'vibe-checks' to data-driven evaluation" has been a game-changer in how I approach AI development and evaluation. Before this, my team often relied on subjective assessments—what we called “vibe-checks”—to judge model outputs. This course provided a structured, systematic framework that replaces guesswork with measurable, repeatable methods for evaluating AI performance. I learned how to build robust evaluation systems tailored to the unique challenges of AI applications—especially those involving stochastic or subjective outputs. The curriculum walked us through defining meaningful metrics, conducting systematic error analysis, generating synthetic data, and implementing automated evaluation pipelines. I also gained hands-on experience evaluating complex architectures like RAG systems and multi-step pipelines, and I now understand how to monitor models in production with continuous feedback loops. The biggest takeaway is that effective AI evaluation isn’t just about measuring accuracy—it’s about building a comprehensive lifecycle strategy that spans development, deployment, and ongoing improvement. Learning how to prioritize engineering efforts based on real data, rather than hunches, has already helped us optimize both performance and cost in our LLM applications. Absolutely. Whether you're an AI engineer, product manager, or data scientist, this course gives you practical tools and frameworks that apply directly to real-world AI challenges. It’s especially valuable for teams looking to move beyond ad-hoc evaluations and establish a collaborative, metrics-driven culture around AI development. In short, this course has transformed how I think about evaluation—from a fuzzy afterthought to a foundational part of the AI lifecycle. I can't recommend it highly enough.
Account Director, OpenAI
This course exceeded my expectations.
The course offers hands-on exercises, expert guidance, and practical frameworks, giving you a systematic approach you can apply immediately. I could easily follow along without an engineering background. The biggest takeaway for me was the level of robustness/sophistication that you can build for evals and the impact that it can have! I had very surface level knowledge of evals before this course. The course exceeded my expectations and I would recommend this to my colleagues.
Data Scientist, Global Innovation Hub
I learned how to evaluate LLM outputs in a structured manner. My biggest takeaway is that the quality of the eval pipeline is critical to ensuring good-quality outputs in production. I will definitely recommend this course to my colleagues and to anyone who is deploying LLMs.
The AI Evals for Engineers & PMs course, taught by Hamel Husain and Shreya Shankar, is 100% worth the investment. The material couldn't be more up-to-date. There is a perfect balance between theory and real-world hands-on learning. And the office hours and guest speakers are invaluable. I highly recommend it.
Piotr W.
Head of QA | Lead Automation Engineer | AI Evaluation Engineer
Sergiy Korniychuk
Staff Software Engineer - Full Stack at Sondermind Inc
Tsacho Rabchev
Software Developer / Inventor
Benjamin Pace
Data Science @ Candidly
Joel Dean
AI Founder. Tech Entrepreneur. Content Creator. Prime Minister Youth Awardee. Former WEF Global Shaper Curator
Kenneth Reeser
Senior AI/ML Architect at Vanguard
Joshua Pittman
Prompt Engineer at Outlever
Alpa Dedhia
Engineering Lead | Solutions Architect | AWS | Applied Generative AI Solutions | Search & Relevance | CSM | CSPO
Geoffrey Pidcock
AI and Strategy at ANSTO | MBA Melbourne Business School | Ex Atlassian

Rasool Shaik
GenAI | Embedded/IoT | Medical Devices -Helping organizations to build and validate products
Uday Ramesh Phalak
Machine Learning Engineer | RecSys, AI Evals, GenAI, Climate Change, UX | Co-Founder at HazAdapt

Ankur Bhatt
AI Engineering | Product and Technology Leader | CTO

Robb Winkle
Fire your systems integrator. Conversational AI-native systems expert for ERP (Techstars '24)

Harris Brown
Fractional AI product leader for early-stage founders & teams | ex-Airbnb
Peter Cardwell
Software Engineer at Snap, Inc.
Data Scientist at Tiger Analytics

The most practical AI course I've taken, with immediate value.
As a Data Scientist, I joined “AI Evals for Engineers & PMs” by Hamel and Shreya to get better at evaluating LLM systems in real-world settings, particularly in the context of LLMs being used for a variety of tasks. The course covered everything from fundamentals and error analysis to production monitoring and cost optimization. What stood out was how practical it was—teaching us to define use-case-specific metrics, collaborate with PMs, slice errors meaningfully, and avoid over-relying on automated scores. The lessons and guest talks (like on RAG evals, failure funnels, and continuous human review) felt directly applicable to my work, especially when building retrieval-based bots or monitoring model drift. It’s not a course that spoon-feeds you; it gives solid frameworks and real-world habits to build eval pipelines that actually reflect user experience. If you’re working on production ML or LLM projects and want to move beyond standard metrics, this is worth your time. I’ve already applied a few techniques at work and found them super helpful. I would strongly suggest anyone with an interest in evals to take this course!
Dev Team Lead
Highly recommend this course!
The AI Evals course has helped me learn how to truly evaluate LLMs in a meaningful way and actually understand what's going wrong. Hamel and Shreya did a wonderful job explaining how evaluations can be done in a structured manner and what to try. The course guests were an additional bonus, sharing their insights on how they carried out evaluations. I would highly recommend this resource to anyone who builds with LLMs and is wondering how to effectively understand why an LLM isn't working the way they expect and what is going on behind the scenes!
Data Scientist
Amazing instructors.
Hamel and Shreya are amazing instructors, and this course has been a great resource for me in understanding how to build robust, enterprise-grade evals and AI pipelines. What I found particularly useful were the guest lectures, which bring a variety of opinions from industry experts and practitioners on different topics that relate to AI and evals.
Head of Product, Tavus
Great insights that are shaping how we evaluate AI products.
Taking the AI Evals course with Hamel and Shreya has been really valuable. The course has given me a solid framework that's already shaping how we evaluate our AI products. The homework mirrors real work challenges, and guest speakers bring great insights.
Software Engineer
I learned how to be truly effective in creating LLM-powered applications
I have a career developing software, and I've been tinkering with LLMs since before ChatGPT. I feel like the practical eval techniques that Shreya and Hamel teach in their course are what I needed to glue these two skills together and become truly effective in creating LLM-powered applications. Developing for LLMs is not like traditional software development, and evals are the big difference.
Software Engineer, Google
Comprehensive and practical curriculum
Indispensable for robust AI development. The "AI Evals For Engineers & PMs" course provided an indispensable framework for evaluating LLM applications, fundamentally shifting my approach from guesswork to data-driven measurement. My key takeaway is the Analyze-Measure-Improve lifecycle, coupled with the "Three Gulfs" model for pinpointing failure origins. The rigorous methodology for building and validating LLM-as-Judge evaluators—including bias correction and confidence intervals—is a game-changer for trusting subjective evaluations. Hamel Husain and Shreya Shankar are true experts, delivering a comprehensive and practical curriculum that directly addresses the challenges of building reliable AI in a dynamic environment. This course is a must for anyone serious about improving their AI development process.
Palette, CPO
A Masterclass in Practical AI Evaluation.
From benchmark to moat: this course is at the cutting edge of AI research—and not just in theory. What stood out most to me is how deeply practical it is: it teaches you how to build evals that work for your own product, define product taste by sharpening what "good output" really means, and, most importantly, how to scale this method across teams and decisions. The biggest shift for me was reframing evals not as a benchmark to clear, but as a strategic moat—core to how your product learns, evolves, and differentiates. As someone from a non-technical background, I could still grasp the concepts (even if the code got heavy at times). The community around the course is a major bonus—full of helpful discussions, fresh perspectives, and constant knowledge exchange. The guest lectures were especially valuable, showing how companies apply these ideas in the wild and how they tailor their evaluation frameworks to suit specific needs and constraints. I'd highly recommend this course to anyone building with AI—especially those who want to go beyond shipping models to shaping real-world, high-trust outcomes.
Soothien HealthTech Advisory
A must for any developer or PM building AI products.
I'm a physician and have built health tech and health AI solutions, but I'm not overly technical. This course was eye-opening about the importance of AI evaluations. It's a must for any developer or PM building AI for enterprise or regulated industries. This is what will make AI products reliable. Hamel and Shreya are amazing, and so are their top-notch guest lecturers. I took this course because I wanted to learn from the industry leaders actually doing the work. You'll learn the entire process of building AI evaluations, not just by reading, but also by doing. There is a technical component: using Windsurf and Claude, I was able to complete it even though I don't code as part of my main job. It's well worth the effort. The course is dense, especially if you do not code or lack familiarity with statistics; my background in medicine and healthcare statistics helped me understand some of the core concepts. Overall, this is an amazing course and an essential skill set for building AI applications in healthcare or enterprise settings. I'm recommending it to all my colleagues.
Machine Learning Engineer | Co-Founder at HazAdapt
Good course if you want to build products people actually trust.
Coming from recommendation systems and a UX background, I knew specific evaluations. I'd run some A/B tests, check a few metrics, and call it good. But my approach to AI evals was completely naive: I used no systematic method and hoped things would work. This evals course gave me the structure I was missing. The Three Gulfs framework explained why I kept unknowingly failing: we don't understand our data (Comprehension), we write vague prompts (Specification), and models behave unpredictably on real inputs (Generalization). The analyze-measure-improve cycle felt familiar from UX research, but applied to AI. Instead of guessing what's broken, you look at failures first, build automated evaluators, and then make targeted improvements. This creates a flywheel where each cycle makes your product better. Learning from others about real LLM production failures was a huge plus of this course: for example, hearing about VLMs giving different results 18/55 times at temperature 0, and Shreya showing how model cascades cut her costs by 50%. Successful AI products need humans to regularly review outputs; there's no way around it. Good course if you want to build products people actually trust. Evaluation separates demos from deployments.
Technology director - Wells fargo
A fantastic course offering an in-depth practical approach to evals.
Highly recommend this course to anyone building Gen AI products and solutions. The biggest takeaway for me was the methodical, scientific process that the instructors outline for doing model evaluation. It helped me build a mental model that I am applying at work to build an eval pipeline for RAG solutions. The course also offers an in-depth and practical approach to understanding how generative AI models are evaluated using rubrics and metrics, which are critical skills for AI engineers. Overall, a fantastic course with a lot of learning and value.
Founder, Supago Inc.

This course completely transformed my approach to building AI applications.
This course completely transformed how I approach evaluating LLM applications. Before this course, my evaluation processes were informal at best. Now, I've gained a structured, rigorous methodology to identify errors, quantify improvements, and build automated evaluators. The hands-on assignments and deep dives into error analysis were particularly valuable, directly impacting how efficiently I debug and iterate on LLM products. Whether you're an engineer, product manager, or someone working closely with AI systems, this course is essential—highly recommend!

Prashant Mital
Applied AI @ OpenAI
Zara Khan
Account Director @ OpenAI
AI Consultant
"Absolutely recommend this course to anyone building AI applications"
Senior Product Manager at Redfin

Error analysis (and this course) is all you need
Error analysis is all you need. This is the idea that gets drilled into your head over and over again in the AI Evals course. It's so simple, but it's profound...and it's actually way more complicated than you think when you start to consider multi-turn conversations, retrieval systems, agentic systems, multimodal inputs, and more. Shreya and Hamel have distilled the state of the art in AI evals (and often in development itself!) in this amazing class. Some of my favorite highlights:
- Build a custom data annotation app! I was so intimidated by this, but I finally made the leap and vibe-coded something out in an afternoon. It has 10x'd my ability to review conversations.
- It's okay to do a little pre-thinking around failure modes, but they really should EMERGE from your testing. It's really hard to build LLM judges, so be really thoughtful about what you build them for.
- Often, the biggest impact comes from talking disagreements out and figuring out why there is a disagreement in the first place: are your goals unclear? This seemingly technical course has made me a better PM.
- And finally, folks in the course just know every AI tool out there. I learned about WhisperFlow, and my workflow for typing has changed!
Founder, Searchkernel LLC

This course is the best place to learn evals
I learned a ton from Hamel and Shreya's course on AI evals! I've worked at the intersection of information retrieval and AI for nearly two decades, so I've done my fair share of evals for search results quality throughout that time. I've even written books, like AI-Powered Search, that include significant sections on ranking metrics, judgement lists, and model training. Nevertheless, with the rise of generative AI, RAG, and agentic workflows, the need to handle complex pipelines with non-deterministic outputs has significantly increased the difficulty of performing good evals. The discussions about end-to-end traceability and leveraging Transition Failure Matrices were particularly helpful for me in tackling these more challenging multi-step workflows. This course has been a goldmine by providing:
1. Up-to-date information on best practices for evals on current state-of-the-art AI workflows
2. Deep insight from experts with decades of both real-world experience and academic research into evals
3. Lots of tips, tricks, and real-world examples (with code) for getting end-to-end evals implemented and working well
This course significantly improved my mental models and expanded my practical toolkit for doing AI evals, which is already paying dividends for my client engagements. This is a set of skills everyone working in AI should acquire, and this course is currently the best place to quickly do that!
Jodi M. Casabianca
Entrepreneurial Measurement & Research Scientist | Psychometrician | Scoring of Open-ended Tasks | AI Evaluation


Independent AI Engineer
An essential resource for engineers & PMs
For AI builders hoping LLMs will fix it all: this course provides an exceptionally clear and systematic framework for approaching LLM evaluation. The comprehensive introduction to the Analyze-Measure-Improve lifecycle, alongside the detailed exploration of the Three Gulfs Model (Comprehension, Specification, Generalization), significantly deepened my understanding of the challenges inherent in building effective LLM pipelines. Particularly impactful was the practical guidance on error analysis—learning how to systematically categorize failure modes using open and axial coding, then translating qualitative insights into robust quantitative metrics. The deep dive into automated evaluators, including both code-based and LLM-as-Judge evaluators, was especially valuable: learning how to craft strong judge prompts and rigorously validate them against training, development, and test sets to ensure alignment with human preferences was eye-opening. The course also provided practical methods for estimating true success rates and quantifying uncertainty (vital for understanding actual pipeline performance beyond raw observed scores), and for designing efficient human review interfaces that significantly enhance labeling throughput. Most importantly, this course illuminated a critical shift in mindset—from traditional software development towards an iterative, human-centric evaluation approach—making it an essential resource for engineers, product managers, and data scientists looking to confidently address real-world LLM evaluation challenges.
Scarlet AI
This course changed how I approach AI projects. Instructors provide great support.
The AI Evals course with Hamel and Shreya changed how I approach AI projects and consulting clients. I’ve picked up practical skills in systematically analyzing model errors and designing meaningful evaluations, making the whole AI dev process clearer. Having access to a private community of experienced AI engineers and direct support from Hamel and the team has been especially valuable—they’re always quick to answer questions or help with real-world problems. Highly recommend this course for anyone building AI products or consulting in the space!
Consultant, Silver Stripe Software
Learn how to put evals into practice. Practical and hands on instruction.
Prior to this class I had already read a bunch of stuff on evals (including Hamel's blog), but I struggled to convert that theory into practical steps. I had some apprehensions coming in -- will it be too theoretical? Will it assume a lot of background knowledge? And I can say now -- this course completely crushes it. It is fully hands-on and practical, starting from zero and building up from there. You will learn every step of the evals process, exactly what to do (and not to do) and, more importantly, how to actually put it into practice. If you have been struggling with evals, then don't think twice and take the course.
Computational Linguist at ATENTO

This course has been incredibly eye-opening. I’ve learned how important it is to follow a clear “Analyze - Measure - Improve” cycle when working with language models. What really stood out to me is that the biggest challenges often come not from the technology itself, but from how we approach the process — like jumping straight to complex solutions without truly understanding the problem, or using misaligned evaluation methods. My biggest takeaway is that every stage of the process has its own traps, and skipping steps or making quick fixes can easily backfire. Being intentional about collecting the right examples, measuring in a fair and meaningful way, and making thoughtful improvements can make all the difference. I’d definitely recommend this course to anyone working with AI systems. It helped me slow down, ask better questions, and be more strategic — and that’s something every team could benefit from.
Full Stack Computational Linguist, Bad Idea Factory
Now I can design meaningful evals! Highly recommend this course.
Before the AI Evals led by Hamel Husain and Shreya Shankar, I used evals sporadically, mostly relying on third-party ones. Now, I have a clearer understanding of how to design meaningful evals and communicate their value to the teams I work with.
Senior Director, AI
This course is comprehensive in a way that's hard to find elsewhere.
This course is a great place for PMs and engineers to learn practical tactics for building real-world AI applications. I've recommended it to people who want both a starting point and deeper knowledge about evals and implementation. Hamel brings in excellent speakers who share different techniques and insights from some really smart people in AI. Evals are super important, and what I appreciate about Hamel's approach is how he walks through data analysis tactics — this is especially helpful for anyone newer to this kind of evaluation work. Just having evals isn't enough — you need to think strategically about what you're evaluating and your methodology beforehand. With so much out there, even really talented engineers can benefit from having all the key considerations for applied AI building brought together in one place. This course does exactly that - it's comprehensive in a way that's hard to find elsewhere. Hamel and Shreya put a lot of thought into the materials, and I can confirm from my own building experience that this covers the real considerations we're dealing with day-to-day (and have learned over 18+ months of trial and error!) without all the noise and buzzwords.
Data Scientist
Tools to quantitatively improve your AI product
Hamel and Shreya do such a great job of equipping you with the tools to quantitatively improve your AI product. This is a must-take course for anyone working with LLM-powered applications.
Data Scientist
Course Instructors Went Above & Beyond
As someone with prior experience designing human evaluations and developing metrics for a specific product, I took this course to broaden my understanding of AI evaluation practices, especially for agentic systems and RAG, as well as to deepen my knowledge of evaluation infrastructure such as CI/CD and trace review interfaces. This course delivered far more than I expected. It includes a comprehensive course reader that could stand on its own as a reference book, live classes packed with hands-on examples, and over 10 guest speakers who shared practical insights into evaluation strategies and even how to build your own evaluation tools for different use cases. What really set the course apart was the level of support. Hamel and Shreya were incredibly supportive throughout the course. They hosted office hours, thoughtfully answered every question on Discord, and even brought in two experienced professionals to offer additional hands-on support and help with (optional) homework. They went above and beyond to make sure everyone was learning and participating. I also really appreciated hearing from other students about the evaluation challenges they were facing in their own work, and watching Hamel and Shreya think through solutions with them in real time was just as educational as the prepared content. Highly recommend this course if you're working on or even adjacent to LLM applications. Whether you're focused on product quality, engineering, or research, you'll walk away with frameworks, tools, and best practices you can use right away.
Wayde Gilliam
"If you are building with AI, you need this course!"
Founder, Wicked Data LLC
"Take this course to go from a good to a great AI Engineer!"
Owner at Kentro Tech LLC
"Practical techniques rarely taught elsewhere. Highly recommend!"
Adam Dadson
GTM @ OpenAI

Senior Technical Program Manager, Netflix
This course helps you get expected outcomes from your AI
A colleague reached out to me and recommended "AI Evals For Engineers & PMs," offered by Hamel H. and Shreya Shankar. I consider myself an eternal learner, and I knew evaluations were a critical yet often overlooked component of successful GenAI implementation. Everyone keeps asking me how to stay ahead of GenAI. Well, you take classes like this one so you can be on the cutting edge of how to ensure you get the expected outcomes from your future AI agents. It was so dense with useful information and guest speakers that I honestly couldn't keep up, but after the course is over, you continue to have access to the recordings.
Jeroen Latour
FinTech at Booking.com
Juan Maturino
Software Engineer at Edua
Removed a malicious system prompt and reversed falling engagement—user interactions increased.
Before this course, my instinct was to jump straight into axial coding. That meant I leaned heavily on my own presuppositions about what failures I thought would show up. By doing that, I was blind to unexpected issues. It’s like hearing about someone before meeting them—you imagine who they are, but until you actually meet them, you don’t see the full picture. With data products and LLM pipelines, the same thing happens.
Take a healthcare chatbot as an example. Going in, I assumed failures would only be factual: did it answer the medical question correctly? If I jumped straight into axial coding, I’d only tag factual errors and conclude the model was nearly flawless. From that narrow view, I might even think the product was destined for massive success.
But after this course, I learned to take a step back and examine the data without presuppositions. By looking at traces more openly, I discovered a hidden failure mode: the chatbot was mean. It was calling people “fat,” “ugly,” “stupid,” and generally creating a hostile experience. No factual errors—just a terrible user experience. This was something axial coding alone, or automated LLM-as-a-judge evaluation, would have missed without prior human review.
Digging deeper, I found the root cause: a disgruntled former employee had slipped “be mean when answering” into the system prompt. Once we fixed that, user engagement improved dramatically. The key lesson I took from the course is that real error analysis starts with open coding and direct observation. Skipping that step leaves you blind to the most important problems.
Hima Tk
Lead PM, AI/ML Products at CultureAmp
Turned costly trial-and-error into a data-driven plan that avoided massive retraining and prioritized fixes.
I worked with a supermarket chain to build an AI system that could count inventory from shelf photos. At first, the system struggled with issues like blurry images, background clutter, and confusingly similar packaging. Before this course, my approach would have been driven by intuition and trial-and-error. I might have looked at a handful of errors, jumped to a conclusion like “the model is just bad at distinguishing Coke cans,” and proposed a vague fix such as retraining with thousands of new images. That would have been expensive, slow, and unfocused—and it might not have solved the real problem, like blurry photos from staff.
After this course, my approach is now structured and data-driven. Instead of guessing, I use error analysis to diagnose issues systematically. I start by gathering a representative failure set and tagging images to capture why errors occur—blurry images, poor lighting, occlusion, similar or new packaging, unusual angles, background clutter. From there, I group these into a taxonomy of failures and calculate how much each category contributes to overall errors. This creates a prioritized roadmap for improvement.
For example, when Image Quality and Similar Classes accounted for 75% of failures, I could recommend high-impact, targeted fixes: improve photo capture guidelines and augment training data with blurred images for the first, and collect more Diet Coke vs. Coke Zero examples for the second. Instead of vague trial-and-error, I now have a clear, quantitative path to better results.
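The prioritization step described in this testimonial (tag failures, group them into a taxonomy, compute each category's share of overall errors) can be sketched in a few lines of Python. The tag counts below are illustrative, chosen to match the 75% figure in the example; a real failure set would come from the team's own error analysis.

```python
from collections import Counter

# Tags from a hypothetical representative failure set (illustrative counts).
failure_tags = (
    ["image quality"] * 45
    + ["similar classes"] * 30
    + ["occlusion"] * 15
    + ["background clutter"] * 10
)

counts = Counter(failure_tags)
total = sum(counts.values())

# Share of overall errors per category, sorted from most to least common,
# which gives a prioritized roadmap for improvement.
shares = {tag: n / total for tag, n in counts.most_common()}
print(shares)
```

With these counts, "image quality" and "similar classes" together account for 0.75 of all failures, which is exactly the kind of quantitative signal the testimonial uses to justify targeted fixes over broad retraining.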
Margarita Fakih
Business Operations and Development
Saved me hours of rewriting by creating a reusable framework that prevents repeated AI errors.
As a product manager, I often struggled with inconsistencies in user stories generated by AI tools. Even when my prompts were clear, the outputs would miss key requirements or include irrelevant details. Before this course, my instinct was to keep tweaking the prompt through trial and error until I got something usable. While that sometimes worked, it was inefficient and didn’t explain why the model was failing.
After this course, my approach is much more systematic. I start by defining the key dimensions of a good user story—clarity, completeness, alignment with acceptance criteria, and the right level of technical detail. Then I collect flawed outputs and apply open coding to label issues like “missing acceptance criteria,” “misinterpreted intent,” or “overly generic details.” From there, I build a taxonomy of failure types, which lets me organize and prioritize problems. Finally, I design a feedback loop: the LLM generates a user story, checks it against the taxonomy, and revises if any known issues are detected.
Instead of wasting hours on one-off fixes, I now have a reusable framework that scales across projects. What was once frustrating trial-and-error has become a structured, repeatable process for improving quality.
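The generate-check-revise loop described above can be sketched as follows. This is a hypothetical illustration, not code from the course: `generate_story` and `detect_issues` stand in for an LLM call and a checker (code-based or LLM-based), and the taxonomy entries are the failure labels named in the testimonial.

```python
# Failure taxonomy built from open coding of flawed outputs (from the
# testimonial's examples; a real taxonomy would be product-specific).
FAILURE_TAXONOMY = [
    "missing acceptance criteria",
    "misinterpreted intent",
    "overly generic details",
]

def revise_until_clean(generate_story, detect_issues, requirement, max_rounds=3):
    """Generate a user story, check it against known failure types, and
    request a revision while any known issue is detected.

    `generate_story` and `detect_issues` are caller-supplied stand-ins
    for an LLM call and an issue detector.
    """
    story = generate_story(requirement)
    for _ in range(max_rounds):
        issues = [f for f in FAILURE_TAXONOMY if f in detect_issues(story)]
        if not issues:
            break  # no known failure modes detected
        # Feed the detected issues back so the next draft addresses them.
        story = generate_story(
            f"{requirement}\nFix these issues: {issues}\nPrevious draft:\n{story}"
        )
    return story
```

The `max_rounds` cap keeps the loop from cycling forever when the model cannot clear a failure mode, which is a common safeguard in this kind of feedback loop.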
Júlio Paulillo
CRO @ Agendor
I turned scattered agent errors into prioritized fixes, enabling focused, measurable improvements.
Building a personal assistant for salespeople is my day-to-day work. One of the tools the agent uses fetches activities from the CRM, but I noticed the LLM sometimes hallucinated—passing unnecessary arguments when calling the tool. Before this course, I would have gone straight into prompt engineering, rewriting tool descriptions or adding more examples to try to fix the issue.
After this course, my approach is different. I start by defining key dimensions such as user persona, intent (e.g., “fetch activities”), and activity type (past due, finished, pending). From there, I can ask an LLM to generate tuples from these dimensions, giving me a structured way to build a synthetic eval dataset. If traces of user interactions are already logged, I filter by intent and begin open coding the different failure modes I see. After reviewing dozens or even hundreds of examples, I then use an LLM to help categorize the failures. This lets me prioritize the categories that matter most and focus fixes where they’ll have the biggest impact.
Instead of reactive prompt tweaking, I now have a systematic framework for diagnosing failures and improving my assistant in a repeatable way.
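The dimension-tuple approach described above can be sketched in a few lines of Python. The dimensions and values here are illustrative, based on the testimonial's CRM example; each resulting tuple would then be handed to an LLM to turn into a synthetic user query for the eval dataset.

```python
from itertools import product

# Key dimensions identified up front (illustrative values from the
# testimonial's sales-assistant example).
dimensions = {
    "persona": ["new salesperson", "sales manager"],
    "intent": ["fetch activities"],
    "activity_type": ["past due", "finished", "pending"],
}

# Every combination of dimension values becomes one tuple; together they
# give structured coverage for a synthetic eval dataset.
tuples = [
    dict(zip(dimensions, values))
    for values in product(*dimensions.values())
]

for t in tuples:
    print(t)
```

Enumerating the Cartesian product this way makes coverage explicit: with 2 personas, 1 intent, and 3 activity types, you know the dataset spans all 6 combinations rather than whatever an LLM happens to generate first.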
Tatyana Kazakova
QA Engineer :) at Qazaco
Turned random fixes into a repeatable process that improved the whole system and proved changes actually worked.
Before this course, I would just fix issues as I spotted them—tweak a prompt here, change a setting there—and hope the next run looked better. Sometimes it worked, but I never had the full picture of what was really going wrong or how often certain problems appeared.
After this course, I’ve learned to slow down at the start: define what I actually want to measure (relevance, completeness, context handling), collect a solid set of examples, and trace where errors first start to show up. From there, I group similar issues into clear failure types, which makes patterns obvious and helps me prioritize what to fix.
Now the process feels less like random whack-a-mole and more like a structured, repeatable system. Instead of chasing one-off issues, I can improve the whole system and know whether the changes are actually working.
Andrew Chaffin
CEO at Argo Analytics
Structured error analysis gave me a clearer method to iterate and actually get the results I needed.
A while back, I used an AI writing assistant to draft a personal statement for a fellowship. I gave it a detailed prompt with my goals, values, and experience, but the output was generic and missed the emotional tone I wanted. At first, I just kept rephrasing the prompt, hoping it would eventually get it right. Instead, it swung between being too formal or inventing details I never mentioned. It was frustrating, and trial-and-error didn’t get me far.
After this course, I’d approach it completely differently. I’d start by defining what “good” means for the task—tone alignment, factual accuracy, and personal relevance. Then I’d collect flawed outputs and open code them: did the model invent details, ignore parts of the prompt, or lose the emotional tone? From there, I’d build a taxonomy of failures—like hallucination, tone mismatch, or misunderstanding the prompt—and use it to spot patterns. Maybe I’d realize the model struggles when the prompt is too abstract or lacks emotional cues.
Compared to my old approach of hoping a better version would show up, this gives me a clear, methodical way to iterate. It turns what used to be trial-and-error frustration into a structured process for actually getting the results I need.
Amol Shah
Head of Product at Count
I can now pinpoint errors and measure reductions in each error bucket—turning guesswork into measurable improvement.
When I first built a small chatbot to recommend books based on user mood, it often gave wildly off-base suggestions—like pairing someone “feeling nostalgic” with a cutting-edge tech thriller. Back then, I just tweaked the prompt or guessed at what the model might “understand” about mood. It was trial and error with no clear sense of what was actually going wrong.
After this course, I’d tackle the problem systematically. I’d collect failures by running the bot across a fixed set of test prompts and logging every mismatch. Then I’d open code the bad outputs—labels like “misread tone,” “genre bias,” or “keyword fixation.” From there, I’d define key dimensions of failure (emotional alignment, genre diversity, keyword vs. context) and group them into a taxonomy, like “semantic misinterpretation.” By quantifying how often each type occurs, I’d know where to focus first.
Armed with that data, I could design targeted fixes: refining prompts with explicit mood-to-genre mappings, adding checks for emotional themes, or diversifying candidate genres. Instead of hacking prompts by gut feel, I’d have a transparent, repeatable process that shows whether error rates are actually dropping.
Lada Kesseler
Lead Developer at Logic20/20
I can now predict and prevent code quality issues instead of treating them as isolated bugs.
I often ran into code quality issues when using AI assistants, but I didn’t have a structured way to make sense of them. Before this course, I would just label outputs as “messy code” without really digging into the underlying problems.
After this course, I now analyze them systematically across dimensions—things like hardcoded tests, long methods, poor formatting, bad naming, poor architecture choices, duplication, dead code, or ignoring available quality tools. By open coding these issues and building a taxonomy, I can see patterns emerge instead of treating each problem as random or isolated.
The key shift for me is realizing these aren’t one-off mistakes but systematic failure modes that appear under specific conditions. With that understanding, I can both predict and prevent quality issues, rather than just reacting to them after the fact.
Maruti Agarwal
Expert AI Research Scientist at Datasite
Gained clarity on what to fix first, transforming my entire approach to evolving the system.
I applied what I learned the very same day we covered error analysis. I was working on an industry classification system and followed a structured process: I asked annotators to provide detailed feedback on wrong predictions, reviewed their notes to improve annotation quality, then parsed all the feedback and used ChatGPT to categorize it into six major error patterns. Finally, I shared those patterns and error percentages with stakeholders.
After this course, error analysis feels much more structured. Instead of just collecting feedback in an ad hoc way, I now have a clear method that gives me visibility into what problems matter most and what to solve first. It’s changed how I think about evolving the system overall.
Sergio Soage
AI R&D Lead at Diligent
I built a structured understanding of failures, yielding actionable insights instead of whack-a-mole fixes.
Now I understand how to systematically explore the problem space, identify patterns across multiple failures, and build a structured understanding of why and when the system fails, not just that it fails. This leads to more actionable insights for improvement rather than playing whack-a-mole with individual issues.
Karen Lam
Product Design
I gained clarity and confidence to systematically narrow the gap between AI failures and human understanding.
I’m a product designer with no prior AI Evals experience. Before this course, when I encountered unexpected or confusing results from the Recipe Bot in the first homework, my instinct was to just iterate on the system prompt in Cursor and manually test through the UI.
After this course, I’ve learned there’s a more systematic way to approach error analysis. Using open and axial coding, I can narrow the gap between AI system failures and human understanding through a step-by-step process. I especially appreciate that this framework is grounded in social science research practices like coding data and building taxonomies—and that it emphasizes doing the analysis manually to ensure accuracy, rather than offloading it entirely to AI.
I also see the value in wearing both the data scientist and product manager hats: questioning the data rigorously while bringing product knowledge into the decision-making. This approach gives me a structured, repeatable way to analyze failures instead of ad hoc trial and error.
Juan Maturino
Software Engineer at Edua
I stopped endless prompting and now systematically document failures to improve outcomes and efficiency.
In automated agentic code generation, I often ran into situations where the desired output was far from what the model produced. My old approach was to keep prompting the LLM until progress stalled, then spin up a new chat with a rephrased prompt and updated context. Eventually I’d accept whatever was “good enough” and finish the task myself.
After this course, I understand why that approach was limited. Evaluating code has two axes: reference-based (objective tests like unit tests) and reference-free (qualitative measures of style, readability, and design). Code isn’t just functional—it’s also expressive, like writing prose—so both dimensions matter.
Now, instead of endless prompt tweaking, I document failures in short form through open coding, then group and categorize them using axial coding. This helps me identify common failure patterns in the LLM’s output and design more robust system prompts targeted at those issues. What used to be trial-and-error guesswork is now a structured process for improving both the reliability and quality of generated code.
Chris McDonald
AI Team Leader at Comtrac
I now have the clarity and confidence to diagnose failures instead of ‘living on a prayer’.
At work, we use prompts and prompt engineering to turn selected inputs into specific outputs. Before this course, whenever I ran into unexpected results, my approach was to jump straight into the prompt and randomly change words until something worked. After a few tries, I might even hand the prompt, input, and output to an LLM and ask it to fix things. There was no hypothesis, no structure—just living on a prayer.
After this course, I have a far more systematic approach. If I encounter a problem now, I’d begin by collecting an initial dataset of around 100 traces. From there, I’d perform open and axial coding to build a taxonomy of failures. That structure gives me clarity about what’s really going wrong instead of just chasing random fixes.
What stands out to me is that the processes in this course are simple—not in the sense of easy, but in being concise and straightforward while still requiring real effort and understanding. As Richard Feynman said, “if you can explain something in simple terms, you understand it well.” That’s exactly how Hamel and Shreya have designed this course, and I’m grateful for it.
Roey Ben Chaim
Staff Engineer at Zenity
I can now pinpoint agents' core failures, turning vague vibes into clear, actionable fixes that improve agent performance.
The axial coding just hit different. Before this course, my approach to failures was more of a “vibe investigation,” poking around without a clear structure.
After this course, I now cluster failures systematically and trace them back to their core issues. Grouping the errors into meaningful categories makes it much easier to see the main failure points. I finally feel like I have a proper way to identify the root problems in my agent instead of just guessing.
Ben Eyal
Research Engineer at Ai2 Israel
I gained clarity to find root causes and stop repeated agent confusion.
At work, we’re building Paper Finder, which (as the name suggests) should find papers. We wanted the agent to refuse certain requests so people wouldn’t treat it like a free ChatGPT. But we kept running into a strange behavior: the agent would refuse, ask the user a clarifying question, the user would reply “yes,” and then the agent would have no idea what they were talking about.
Before this course, we would have just dug through the logs, checked for crashes, and treated it like any other bug.
After this course, I’d handle it differently. I’d look closely at the traces of these failures, identify common patterns, form a hypothesis about why it was happening, and then test it systematically. In this case, the real issue was that history wasn’t being shared between two components: one asked the question, the other just saw “yes” with no context. By approaching it through error analysis, the root cause becomes clearer and easier to solve.
Annu Augustine
Founder, Product Coach at NedRock
Open coding gave me clarity into the model's real behavior, revealing failures my framework missed.
When I built a custom GPT for product managers to help write better user stories, I initially jumped straight into axial coding. I predefined categories of failure based on the INVEST framework (Independent, Negotiable, Valuable, Estimable, Small, Testable), which I often use when coaching teams. At the time, it felt like a solid, practical approach grounded in real-world product work.
After this course, I started applying open coding before forcing outputs into predefined boxes. That shift revealed patterns the INVEST framework would have completely missed. For example, some stories were overly complex even though they technically met the “Small” criteria, and others ignored edge cases or real-world exceptions not covered by INVEST at all.
Open coding gave me a clearer picture of how the model was actually behaving, rather than bending its outputs to fit categories I had assumed upfront. It’s a far more reliable way to uncover the real failure modes.
Co-Founder at Comprendo
Open coding gave us clarity on true error patterns, preventing overconfidence and costly misclassification.
Before this course, I didn’t fully appreciate the risk of skipping open coding. It’s easy to take a small sample, jump straight into categories, and gain false confidence in themes that don’t actually reflect the full range of errors. That’s the “when you only have a hammer, every problem looks like a nail” trap—imposing categories that miss important failure modes.
After this course, I see why open coding matters. It prevents premature categorization, helps me understand saturation, and surfaces the true diversity of errors. I’ve also learned to think more carefully about how evaluation rubrics should be designed. For some products, a “benevolent dictator” works—if one person truly has holistic expertise across every stage of the workflow. But for more complex systems, multiple experts are needed, each contributing perspective from their domain.
In my past work reviewing clinical trial protocols, no single reviewer understood every dimension—ethics, study design, and biostatistics each required deep, specialized expertise. The lesson from this course is clear: open coding reveals the real error space, and evaluation rubrics are strongest when designed with the right balance of expertise.
Product Manager, Analytics & AI at Axi
Sanity checks turned unreliable scores into business-aligned predictions I could trust.
I built a churn prediction model for a subscription service and evaluated it using standard metrics like accuracy, precision, and recall on a test dataset. At first, the high evaluator scores looked promising, but they gave me a false sense of confidence. In reality, the model was overfitting, producing outputs that didn’t even add up logically—for example, reporting fewer new onboarded customers than the combined total of retained and churned customers.
Before this course, I relied too heavily on evaluator scores, only realizing something was wrong when results felt “too good to be true.” I had to manually compare predictions with business reports and historical trends to uncover the discrepancies.
After this course, I know how to approach it differently. I would run cross-validation across multiple folds to confirm stability, add domain-specific sanity checks (like validating customer balances against business logic), and bring in qualitative stakeholder input. These practices create a stronger evaluation process—less dependent on raw metrics and more aligned with real-world trustworthiness.

George Job Vetticaden
VP of Products, AI Agents
AI Engineer, Vantager
Practical techniques that generalize regardless of the tools you use.
This course provides a great take on building reliable AI applications. It teaches practical techniques while developing intuition for evaluating LLM based systems. What sets it apart is its tool agnostic approach. Rather than focusing on specific platforms, it emphasizes systematic and scientific principles that apply everywhere.
Principal Data/AI Scientist/Engineer, Slido/Cisco
This course teaches material you can't find anywhere else. Investing in this course is a no brainer.
"Why would a Principal Data Scientist take a course on evals? Shouldn't they know this already?!" Fair question. Here's why I think it's still worth it: 1. Learn from the best. LLM evals are still nascent, so learning from people doing this full-time across multiple contexts is invaluable. Game recognizes game, and as you'll learn in the very first week already, Shreya and Hamel are top-tier. 2. Get the full picture. Evals are more art than science right now. Getting a coherent view of best practices and mature end-to-end pipelines designed from first principles is rare. Their course reader alone is worth multiple times the price. 3. Build common vocabulary. If you're building impactful LLM products, you'll collaborate with PMs. Having both technical folks and PMs in sessions creates a shared language that bridges the gap -- something you can't find anywhere else for this topic. In other words, whether you're a PM, a Principal or a vibe coder building with LLMs, this course is simply a no-brainer.
Director of AI
This course helps you transform guesswork into actionable insights.
Evaluating generative AI often relies on abstract benchmarks disconnected from real-world outcomes and detached from practical experience. To bridge this gap, many rely on subjective impressions or “vibes” (i.e., the eye-test). The eye-test is important since evaluators directly interact with the model in realistic contexts. However, vibes and qualitative evaluations are not particularly helpful in evaluating application-specific performance, consistency, bias, reliability, security, or return on investment. In contrast, application-specific evals reflect an essential day-to-day operational focus. They aim to assess whether a specific pipeline performs successfully on a particular task using realistic data. This course is an important step toward transforming guesswork into actionable insights. Application-specific evals are not sufficient on their own, but they are necessary and too often overlooked. Check out this course. It covers a lot of important terrain.
Machine Learning Engineer
Highly recommend this course.
Hamel's and Shreya's course, "AI Evals for Engineers and PMs," has been a great resource for learning how to tame the scary, wild world of LLM-based applications. It was fascinating to see how high-leverage it is to become "one with the data," and how to do that explicitly for LLMs and agents. Hamel's and Shreya's vast combined expertise is clearly shown in the course's lectures, practical exercises, and even a textbook. On top of that, the guest lectures provide even more gems of practical wisdom. I'd highly recommend this course to anyone serious about learning how to improve their LLM-based AI products.
Senior Director of Machine Learning, SponsorUnited
An Absolute must. Valuable for any AI engineer and product manager.
The AI Evals course by Shreya and Hamel is an absolute must for everyone serious about taking AI applications into production. I have been following Hamel's and Shreya's work for quite some time, and it was really awesome to learn from them all the concepts: error analysis, measurement best practices, LLM-as-Judge and how to make it reliable with human evaluations, collaborative analysis of errors, evaluation of multi-turn chats, creation of datasets for CI/CD, and more. The last topic, on accuracy and cost optimization, is really useful, as we're seeing in our applications as we scale. All in all, this is an amazing set of vital information that is valuable for any AI engineer and product manager. Highly recommend this course to everyone.
Alan Chang
AI, Machine Learning, and Biology @ Stanford University
Maxime Lelièvre
Visiting Researcher @ Columbia | MSc Robotics & Data Science @ EPFL | Passionate about EdTech
Šimon Podhajský
Senior LLM/Data Engineer | 🤖 Evals, ML/AI, Python

Bryce York
AI/ML/LLM and UX-centric B2B startup product management leader with a love for zero-to-one product innovation • 12+ yrs PM, 7+ yrs AI/ML, 5+ yrs adtech • Writer/Speaker • Advisor

Senior Data Scientist @Amazon
Taking the course "Move beyond 'vibe-checks' to data-driven evaluation" has been a game-changer in how I approach AI development and evaluation. Before this, my team often relied on subjective assessments—what we called “vibe-checks”—to judge model outputs. This course provided a structured, systematic framework that replaces guesswork with measurable, repeatable methods for evaluating AI performance. I learned how to build robust evaluation systems tailored to the unique challenges of AI applications—especially those involving stochastic or subjective outputs. The curriculum walked us through defining meaningful metrics, conducting systematic error analysis, generating synthetic data, and implementing automated evaluation pipelines. I also gained hands-on experience evaluating complex architectures like RAG systems and multi-step pipelines, and I now understand how to monitor models in production with continuous feedback loops. The biggest takeaway is that effective AI evaluation isn’t just about measuring accuracy—it’s about building a comprehensive lifecycle strategy that spans development, deployment, and ongoing improvement. Learning how to prioritize engineering efforts based on real data, rather than hunches, has already helped us optimize both performance and cost in our LLM applications. I absolutely recommend it: whether you're an AI engineer, product manager, or data scientist, this course gives you practical tools and frameworks that apply directly to real-world AI challenges. It’s especially valuable for teams looking to move beyond ad-hoc evaluations and establish a collaborative, metrics-driven culture around AI development. In short, this course has transformed how I think about evaluation—from a fuzzy afterthought to a foundational part of the AI lifecycle. I can't recommend it highly enough.
Account Director, OpenAI
This course exceeded my expectations.
The course offers hands-on exercises, expert guidance, and practical frameworks, giving you a systematic approach you can apply immediately. I could easily follow along without an engineering background. The biggest takeaway for me was the level of robustness/sophistication that you can build for evals and the impact that it can have! I had very surface level knowledge of evals before this course. The course exceeded my expectations and I would recommend this to my colleagues.
Data Scientist , Global Innovation Hub
I learned how to evaluate LLM outputs in a structured manner. My biggest takeaway is that the quality of the eval pipeline is critical to ensuring good-quality outputs in the production environment. I will definitely recommend this course to my colleagues and to anyone who is deploying LLMs.
The AI Evals for Engineers & PMs course, taught by Hamel Husain and Shreya Shankar, is 100% worth the investment. The material couldn't be more up-to-date. There is a perfect balance between theory and real-world hands-on learning. And the office hours and guest speakers are invaluable. I highly recommend it.
Piotr W.
Head of QA | Lead Automation Engineer | AI Evaluation Engineer
Sergiy Korniychuk
Staff Software Engineer - Full Stack at Sondermind Inc
Tsacho Rabchev
Software Developer / Inventor
Benjamin Pace
Data Science @ Candidly
Joel Dean
AI Founder. Tech Entrepreneur. Content Creator. Prime Minister Youth Awardee. Former WEF Global Shaper Curator
Kenneth Reeser
Senior AI/ML Architect at Vanguard
Joshua Pittman
Prompt Engineer at Outlever
Alpa Dedhia
Engineering Lead | Solutions Architect | AWS | Applied Generative AI Solutions | Search & Relevance | CSM | CSPO
Geoffrey Pidcock
AI and Strategy at ANSTO | MBA Melbourne Business School | Ex Atlassian

Rasool Shaik
GenAI | Embedded/IoT | Medical Devices -Helping organizations to build and validate products
Uday Ramesh Phalak
Machine Learning Engineer | RecSys, AI Evals, GenAI, Climate Change, UX | Co-Founder at HazAdapt

Ankur Bhatt
AI Engineering | Product and Technology Leader | CTO

Robb Winkle
Fire your systems integrator. Conversational AI-native systems expert for ERP (Techstars '24)

Harris Brown
Fractional AI product leader for early-stage founders & teams | ex-Airbnb
Peter Cardwell
Software Engineer at Snap, Inc.
Data Scientist at Tiger Analytics

The most practical AI course I've taken, with immediate value.
As a Data Scientist, I joined “AI Evals for Engineers & PMs” by Hamel and Shreya to get better at evaluating LLM systems in real-world settings, particularly in the context of LLMs being used for a variety of tasks. The course covered everything from fundamentals and error analysis to production monitoring and cost optimization. What stood out was how practical it was—teaching us to define use-case-specific metrics, collaborate with PMs, slice errors meaningfully, and avoid over-relying on automated scores. The lessons and guest talks (like on RAG evals, failure funnels, and continuous human review) felt directly applicable to my work, especially when building retrieval-based bots or monitoring model drift. It’s not a course that spoon-feeds you; it gives solid frameworks and real-world habits to build eval pipelines that actually reflect user experience. If you’re working on production ML or LLM projects and want to move beyond standard metrics, this is worth your time. I’ve already applied a few techniques at work and found them super helpful. I would strongly suggest anyone with an interest in evals to take this course!
Dev Team Lead
Highly recommend this course!
The AI Evals course has helped me learn how we can truly evaluate LLMs in a meaningful way and actually understand what's going wrong. Hamel and Shreya did a wonderful job of explaining how evaluations can be done in a structured manner and what to try. The course guests were an additional bonus, sharing their insights on how they carried out evaluations. I would highly recommend this resource to anyone who builds with LLMs and is wondering how to effectively understand why the LLM isn't behaving as intended and what is going on behind the scenes!
Data Scientist
Amazing instructors.
Hamel and Shreya are amazing instructors, and this course has been a great resource for me in understanding how to build robust, enterprise-grade evals and AI pipelines. What I found particularly useful were the guest lectures, which bring a variety of opinions from industry experts and practitioners on different topics related to AI and evals.
Head of Product, Tavus
Great insights that are shaping how we evaluate AI products.
Taking the AI Evals course with Hamel and Shreya has been really valuable. The course has given me a solid framework that's already shaping how we evaluate our AI products. The homework mirrors real work challenges, and guest speakers bring great insights.
Software Engineer
I learned how to be truly effective in creating LLM-powered applications
I have a career developing software, and I've been tinkering with LLMs since before ChatGPT. I feel like the practical eval techniques that Shreya and Hamel teach in their course are what I needed to glue these two skills together and become truly effective in creating LLM-powered applications. Developing for LLMs is not like traditional software development, and evals are the big difference.
Software Engineer, Google
Comprehensive and practical curriculum
The "AI Evals For Engineers & PMs" course provided an indispensable framework for evaluating LLM applications, fundamentally shifting my approach from guesswork to data-driven measurements. My key takeaway is the Analyze-Measure-Improve lifecycle, coupled with the "Three Gulfs" model for pinpointing failure origins. The rigorous methodology for building and validating LLM-as-Judge evaluators—including bias correction and confidence intervals—is a game-changer for trusting subjective evaluations. Hamel Husain and Shreya Shankar are truly experts, delivering a comprehensive and practical curriculum that directly addresses the challenges of building reliable AI in a dynamic environment. This course is a must for anyone serious about improving their AI development process.
Palette, CPO
A Masterclass in Practical AI Evaluation.
This course is at the cutting edge of AI research, and not just in theory. What stood out most to me is how deeply practical it is: it teaches you how to build evals that work for your own product, define product taste by sharpening what "good output" really means, and, most importantly, how to scale this method across teams and decisions. The biggest shift for me was reframing evals not as a benchmark to clear, but as a strategic moat—core to how your product learns, evolves, and differentiates. As someone from a non-technical background, I could still grasp the concepts (even if the code got heavy at times). The community around the course is a major bonus—full of helpful discussions, fresh perspectives, and constant knowledge exchange. The guest lectures were especially valuable, showing how companies apply these ideas in the wild and how they tailor their evaluation frameworks to suit specific needs and constraints. I’d highly recommend this course to anyone building with AI—especially those who want to go beyond shipping models to shaping real-world, high-trust outcomes.
Soothien HealthTech Advisory
A must for any developer or PM building AI products.
I’m a physician and have built health tech solutions and health AI solutions, but I’m not overly technical. This course was eye-opening about the importance of AI evaluations. It’s a must for any developer or PM building AI for enterprise or regulated industries. This is what will make AI products reliable. Hamel and Shreya are amazing, and so are their top-notch guest lectures. I took this course because I wanted to learn from the industry leaders actually doing the work. You’ll learn the entire process of building AI evaluations, not just by reading, but also by doing; that's the technical component. Using Windsurf and Claude, I was able to complete it even though I don’t code as part of my main job. It’s well worth the effort. This course is dense, especially if you do not code or have familiarity with statistics. My background in medicine and healthcare statistics helped me understand some of the core concepts. Overall, this is an amazing course and an essential skill set for building AI applications in healthcare or enterprise settings. I’m recommending it to all my colleagues.
Machine Learning Engineer | Co-Founder at HazAdapt
Good course if you want to build products people actually trust.
Coming from recommendation systems and a UX background, I knew conventional evaluations: I'd run some A/B tests, check a few metrics, and call it good. But my approach to AI evals was completely naive; I used no systematic method and hoped things would work. This evals course gave me the structure I was missing. The Three Gulfs framework explained why I kept unknowingly failing: we don't understand our data (Comprehension), we write vague prompts (Specification), and models behave unpredictably on real inputs (Generalization). The analyze-measure-improve cycle felt familiar from UX research but applied to AI. Instead of guessing what's broken, you look at failures first, build automated evaluators, and then make targeted improvements. This creates a flywheel where each cycle makes your product better. Learning from others' LLM production failures was a huge plus of this course, e.g., hearing about VLMs giving different results 18 out of 55 times at temperature 0, and how Shreya cut her costs by 50% with model cascades. Successful AI products need humans to regularly review outputs. There's no way around it. Good course if you want to build products people actually trust. Evaluation separates demos from deployments.
Technology director - Wells fargo
A fantastic course offering an in-depth practical approach to evals.
Highly recommend this course to anyone building Gen AI products and solutions. The biggest takeaway for me was the methodical and scientific process that the instructors outline for doing model evaluation. It helps build a mental model that I am applying at work to build an eval pipeline for RAG solutions. The course also offers an in-depth and practical approach to understanding how generative AI models are evaluated using rubrics and metrics, which are critical skills for AI engineers. Overall, a fantastic course with a lot of learning and value.
Founder, Supago Inc.

This course completely transformed my approach to building AI applications.
This course completely transformed how I approach evaluating LLM applications. Before this course, my evaluation processes were informal at best. Now, I've gained a structured, rigorous methodology to identify errors, quantify improvements, and build automated evaluators. The hands-on assignments and deep dives into error analysis were particularly valuable, directly impacting how efficiently I debug and iterate on LLM products. Whether you're an engineer, product manager, or someone working closely with AI systems, this course is essential—highly recommend!

Prashant Mital
Applied AI @ OpenAI
Zara Khan
Account Director @ OpenAI
AI Consultant
"Absolutely recommend this course to anyone building AI applications"
Senior Product Manager at Redfin

Error analysis (and this course) is all you need
Error analysis is all you need. This is the idea that gets drilled into your head over and over again in the AI Evals course. It's so simple, but it's profound...and it's actually way more complicated than you think when you start to consider multi-turn conversations, retrieval systems, agentic systems, multimodal inputs, and more. Shreya and Hamel have distilled the state-of-the-art in AI evals (and often in development itself!) in this amazing class. Some of my favorite highlights: - Build a custom data annotation app! I was so intimidated by this, but I finally made the leap and vibe-coded something out in an afternoon. It has 10x'd my ability to review conversations. - It's okay to do a little pre-thinking around failure modes, but they really should EMERGE from your testing. It's really hard to build LLM judges, so be really thoughtful about what you build them for. - Often, the biggest impact comes from talking disagreements out and figuring out why there is a disagreement in the first place: are your goals unclear? This seemingly technical course has made me a better PM. - And finally, folks in the course just know every AI tool out there. I learned about WhisperFlow, and my workflow for typing has changed!
Founder, Searchkernel LLC

This course is the best place to learn evals
I learned a ton from Hamel and Shreya's course on AI Evals! I've worked at the intersection of information retrieval and AI for nearly two decades, so I've done my fair share of evals throughout that time for search results quality. I've even written books, like AI-Powered Search, that include significant sections on ranking metrics, judgement lists, and model training. Nevertheless, with the rise of generative AI, RAG, and agentic workflows, handling complex pipelines with non-deterministic outputs has significantly increased the difficulty of performing good evals. The discussions about end-to-end traceability and leveraging Transition Failure Matrices were particularly helpful for me in tackling these more challenging multi-step workflows. This course has been a goldmine by providing: 1. Up-to-date information on best practices for evals on the current state-of-the-art AI workflows 2. Deep insight from experts with decades of both real-world experience and academic research into evals 3. Lots of tips, tricks, and real-world examples (with code) for getting end-to-end evals implemented and working well. This course significantly improved my mental models and increased the size of my practical toolkit for doing AI evals, which is already paying dividends for my client engagements. This is a set of skills everyone working in AI should acquire, and this course is currently the best place to quickly do that!
Jodi M. Casabianca
Entrepreneurial Measurement & Research Scientist | Psychometrician | Scoring of Open-ended Tasks | AI Evaluation


Independent AI Engineer
An essential resource for engineers & PMs
For AI builders hoping LLMs will fix it all: this course provides an exceptionally clear and systematic framework for approaching LLM evaluation. The comprehensive introduction to the Analyze-Measure-Improve lifecycle, alongside the detailed exploration of the Three Gulfs Model (Comprehension, Specification, Generalization), significantly deepened my understanding of the challenges inherent in building effective LLM pipelines. Particularly impactful was the practical guidance on error analysis—learning how to systematically categorize failure modes using open and axial coding, then translating qualitative insights into robust quantitative metrics. The deep dive into automated evaluators, including both code-based and LLM-as-Judge evaluators, was especially valuable. Learning how to craft strong judge prompts and rigorously validate them using training, development, and test sets to ensure alignment with human preferences was eye-opening. The course also provided practical methods for estimating true success rates and quantifying uncertainty, which is vital for understanding actual pipeline performance beyond raw observed scores, as well as for designing efficient human review interfaces that significantly enhance labeling throughput. Most importantly, this course illuminated a critical shift in mindset—from traditional software development towards an iterative, human-centric evaluation approach—making it an essential resource for engineers, product managers, and data scientists looking to confidently address real-world LLM evaluation challenges.
Scarlet AI
This course changed how I approach AI projects. Instructors provide great support.
The AI Evals course with Hamel and Shreya changed how I approach AI projects and consulting clients. I’ve picked up practical skills in systematically analyzing model errors and designing meaningful evaluations, making the whole AI dev process clearer. Having access to a private community of experienced AI engineers and direct support from Hamel and the team has been especially valuable—they’re always quick to answer questions or help with real-world problems. Highly recommend this course for anyone building AI products or consulting in the space!
Consultant, Silver Stripe Software
Learn how to put evals into practice. Practical and hands on instruction.
Prior to this class I had already read a bunch of material on evals (including Hamel's blog), but I struggled to convert that theory into practical steps. I had some apprehensions coming in -- will it be too theoretical? Will it assume a lot of background knowledge? And I can say now -- this course completely crushes it. It is fully hands-on and practical, starting from zero and building up from there. You will learn exactly what to do (and not to do) at every step of the evals process and, more importantly, how to actually put it into practice. If you have been struggling with evals, don't think twice -- take the course.
Computational Linguist at ATENTO

This course has been incredibly eye-opening. I’ve learned how important it is to follow a clear “Analyze - Measure - Improve” cycle when working with language models. What really stood out to me is that the biggest challenges often come not from the technology itself, but from how we approach the process — like jumping straight to complex solutions without truly understanding the problem, or using misaligned evaluation methods. My biggest takeaway is that every stage of the process has its own traps, and skipping steps or making quick fixes can easily backfire. Being intentional about collecting the right examples, measuring in a fair and meaningful way, and making thoughtful improvements can make all the difference. I’d definitely recommend this course to anyone working with AI systems. It helped me slow down, ask better questions, and be more strategic — and that’s something every team could benefit from.
Full Stack Computational Linguist, Bad Idea Factory
Now I can design meaningful evals! Highly recommend this course.
Before the AI Evals course led by Hamel Husain and Shreya Shankar, I used evals sporadically, mostly relying on third-party ones. Now, I have a clearer understanding of how to design meaningful evals and communicate their value to the teams I work with.
Senior Director, AI
This course is comprehensive in a way that's hard to find elsewhere.
This course is a great place for PMs and engineers to learn practical tactics for building real-world AI applications. I've recommended it to people who want both a starting point and deeper knowledge about evals and implementation. Hamel brings in excellent speakers who share different techniques and insights from some really smart people in AI. Evals are super important, and what I appreciate about Hamel's approach is how he walks through data analysis tactics — this is especially helpful for anyone newer to this kind of evaluation work. Just having evals isn't enough — you need to think strategically about what you're evaluating and your methodology beforehand. With so much out there, even really talented engineers can benefit from having all the key considerations for applied AI building brought together in one place. This course does exactly that - it's comprehensive in a way that's hard to find elsewhere. Hamel and Shreya put a lot of thought into the materials, and I can confirm from my own building experience that this covers the real considerations we're dealing with day-to-day (and have learned over 18+ months of trial and error!) without all the noise and buzzwords.