
CEO, Fern AI, AI for Legal
This course is worth the time. Take it.
Hardware Engineering Leader at Cisco
This course is a game changer.
Founder, Socratify
1000x ROI
Author and Principal at Feldroy, LLC / Software Artisan at Kraken Tech
Pragmatic techniques, free of jargon.
I learned practical techniques for quickly improving the quality of AI applications. We were taught methodologies based on straightforward metrics that keep humans in the loop to ensure quality results. Hamel and Shreya were excellent at explaining every term with real-world examples drawn from experience. They didn't load the course with jargon. The homework exercises were challenging yet achievable, and getting the work done was both fun and educational. I recommend the course to anyone who wants to learn incredible tips and tricks for building AI applications.
Data Scientist
Tools to quantitatively improve your AI product
Hamel and Shreya do such a great job at equipping you with the tools to quantitatively improve your AI product. This is a must take course for anyone working with LLM powered applications.
Data Scientist
Course Instructors Went Above & Beyond
Wayde Gilliam
"If you are building with AI, you need this course!"
Founder, Wicked Data LLC
"Take this course to go from a good to a great AI Engineer!"
Owner at Kentro Tech LLC
"Practical techniques rarely taught elsewhere. Highly recommend!"
Senior Technical Program Manager, Netflix
This course helps you get expected outcomes from your AI
Software Engineer at Edua
Removed a malicious system prompt and reversed falling engagement—user interactions increased.
Before this course, my instinct was to jump straight into axial coding. That meant I leaned heavily on my own presuppositions about what failures I thought would show up. By doing that, I was blind to unexpected issues. It’s like hearing about someone before meeting them—you imagine who they are, but until you actually meet them, you don’t see the full picture. With data products and LLM pipelines, the same thing happens.
Take a healthcare chatbot as an example. Going in, I assumed failures would only be factual: did it answer the medical question correctly? If I jumped straight into axial coding, I’d only tag factual errors and conclude the model was nearly flawless. From that narrow view, I might even think the product was destined for massive success.
But after this course, I learned to take a step back and examine the data without presuppositions. By looking at traces more openly, I discovered a hidden failure mode: the chatbot was mean. It was calling people “fat,” “ugly,” “stupid,” and generally creating a hostile experience. No factual errors—just a terrible user experience. This was something axial coding alone, or automated LLM-as-a-judge evaluation, would have missed without prior human review.
Digging deeper, I found the root cause: a disgruntled former employee had slipped “be mean when answering” into the system prompt. Once we fixed that, user engagement improved dramatically. The key lesson I took from the course is that real error analysis starts with open coding and direct observation. Skipping that step leaves you blind to the most important problems.
Lead PM - AI / ML Products at CultureAmp
Turned costly trial-and-error into a data-driven plan that avoided massive retraining and prioritized fixes.
I worked with a supermarket chain to build an AI system that could count inventory from shelf photos. At first, the system struggled with issues like blurry images, background clutter, and confusingly similar packaging. Before this course, my approach would have been driven by intuition and trial-and-error. I might have looked at a handful of errors, jumped to a conclusion like “the model is just bad at distinguishing Coke cans,” and proposed a vague fix such as retraining with thousands of new images. That would have been expensive, slow, and unfocused—and it might not have solved the real problem, like blurry photos from staff.
After this course, my approach is now structured and data-driven. Instead of guessing, I use error analysis to diagnose issues systematically. I start by gathering a representative failure set and tagging images to capture why errors occur—blurry images, poor lighting, occlusion, similar or new packaging, unusual angles, background clutter. From there, I group these into a taxonomy of failures and calculate how much each category contributes to overall errors. This creates a prioritized roadmap for improvement.
For example, when Image Quality and Similar Classes accounted for 75% of failures, I could recommend high-impact, targeted fixes: improve photo capture guidelines and augment training data with blurred images for the first, and collect more Diet Coke vs. Coke Zero examples for the second. Instead of vague trial-and-error, I now have a clear, quantitative path to better results.
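The tallying step described above — counting how much each failure category contributes to overall errors — can be sketched in a few lines. The labels and counts here are hypothetical, not from the actual inventory system:

```python
from collections import Counter

# Hypothetical open-coded failure labels, one per failed trace
failure_labels = [
    "blurry_image", "similar_class", "blurry_image", "occlusion",
    "similar_class", "blurry_image", "poor_lighting", "similar_class",
]

counts = Counter(failure_labels)
total = len(failure_labels)

# Rank failure categories by their share of total errors to prioritize fixes
for label, n in counts.most_common():
    print(f"{label}: {n / total:.0%}")
```

Ranking by share rather than raw count is what turns the taxonomy into a prioritized roadmap: the top one or two categories usually dominate, so fixes land where they matter most.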
Business Operations and Development
Saved me hours of rewriting by creating a reusable framework that prevents repeated AI errors.
As a product manager, I often struggled with inconsistencies in user stories generated by AI tools. Even when my prompts were clear, the outputs would miss key requirements or include irrelevant details. Before this course, my instinct was to keep tweaking the prompt through trial and error until I got something usable. While that sometimes worked, it was inefficient and didn’t explain why the model was failing.
After this course, my approach is much more systematic. I start by defining the key dimensions of a good user story—clarity, completeness, alignment with acceptance criteria, and the right level of technical detail. Then I collect flawed outputs and apply open coding to label issues like “missing acceptance criteria,” “misinterpreted intent,” or “overly generic details.” From there, I build a taxonomy of failure types, which lets me organize and prioritize problems. Finally, I design a feedback loop: the LLM generates a user story, checks it against the taxonomy, and revises if any known issues are detected.
Instead of wasting hours on one-off fixes, I now have a reusable framework that scales across projects. What was once frustrating trial-and-error has become a structured, repeatable process for improving quality.
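The taxonomy-check step of the feedback loop above can be sketched as a set of named predicates run against each generated story. The check functions and thresholds here are hypothetical illustrations, not the actual framework:

```python
# Hypothetical taxonomy of known failure checks for generated user stories
def missing_acceptance_criteria(story: str) -> bool:
    return "acceptance criteria" not in story.lower()

def overly_generic(story: str) -> bool:
    # Crude length heuristic standing in for a real specificity check
    return len(story.split()) < 20

TAXONOMY = {
    "missing_acceptance_criteria": missing_acceptance_criteria,
    "overly_generic": overly_generic,
}

def detected_issues(story: str) -> list[str]:
    # Return the names of every taxonomy check the story fails
    return [name for name, check in TAXONOMY.items() if check(story)]

story = "As a user, I want to reset my password."
print(detected_issues(story))
```

Any detected issue names can then be fed back into the revision prompt, so the model fixes known failure modes instead of regenerating blindly.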
CRO at Agendor
I turned scattered agent errors into prioritized fixes, enabling focused, measurable improvements.
Building a personal assistant for salespeople is my day-to-day work. One of the tools the agent uses fetches activities from the CRM, but I noticed the LLM sometimes hallucinated—passing unnecessary arguments when calling the tool. Before this course, I would have gone straight into prompt engineering, rewriting tool descriptions or adding more examples to try to fix the issue.
After this course, my approach is different. I start by defining key dimensions such as user persona, intent (e.g., “fetch activities”), and activity type (past due, finished, pending). From there, I can ask an LLM to generate tuples from these dimensions, giving me a structured way to build a synthetic eval dataset. If traces of user interactions are already logged, I filter by intent and begin open coding the different failure modes I see. After reviewing dozens or even hundreds of examples, I then use an LLM to help categorize the failures. This lets me prioritize the categories that matter most and focus fixes where they’ll have the biggest impact.
Instead of reactive prompt tweaking, I now have a systematic framework for diagnosing failures and improving my assistant in a repeatable way.
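The tuple-generation step above — crossing the defined dimensions to seed a synthetic eval set — can be sketched with a Cartesian product. The dimension values here are illustrative, not from the actual product:

```python
from itertools import product

# Hypothetical dimensions for a sales-assistant eval set
personas = ["account_exec", "sales_manager"]
intents = ["fetch_activities", "create_activity"]
activity_types = ["past_due", "finished", "pending"]

# Each tuple seeds one synthetic test query for the agent
eval_tuples = list(product(personas, intents, activity_types))
print(len(eval_tuples))  # 2 * 2 * 3 = 12 combinations
```

In practice an LLM would then turn each tuple into a realistic user query, which is what makes the coverage of the eval set deliberate rather than accidental.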
QA Engineer at Qazaco
Turned random fixes into a repeatable process that improved the whole system and proved changes actually worked.
Before this course, I would just fix issues as I spotted them—tweak a prompt here, change a setting there—and hope the next run looked better. Sometimes it worked, but I never had the full picture of what was really going wrong or how often certain problems appeared.
After this course, I’ve learned to slow down at the start: define what I actually want to measure (relevance, completeness, context handling), collect a solid set of examples, and trace where errors first start to show up. From there, I group similar issues into clear failure types, which makes patterns obvious and helps me prioritize what to fix.
Now the process feels less like random whack-a-mole and more like a structured, repeatable system. Instead of chasing one-off issues, I can improve the whole system and know whether the changes are actually working.
CEO at Argo Analytics
Structured error analysis gave me a clearer method to iterate and actually get the results I needed.
A while back, I used an AI writing assistant to draft a personal statement for a fellowship. I gave it a detailed prompt with my goals, values, and experience, but the output was generic and missed the emotional tone I wanted. At first, I just kept rephrasing the prompt, hoping it would eventually get it right. Instead, it swung between being too formal or inventing details I never mentioned. It was frustrating, and trial-and-error didn’t get me far.
After this course, I’d approach it completely differently. I’d start by defining what “good” means for the task—tone alignment, factual accuracy, and personal relevance. Then I’d collect flawed outputs and open code them: did the model invent details, ignore parts of the prompt, or lose the emotional tone? From there, I’d build a taxonomy of failures—like hallucination, tone mismatch, or misunderstanding the prompt—and use it to spot patterns. Maybe I’d realize the model struggles when the prompt is too abstract or lacks emotional cues.
Compared to my old approach of hoping a better version would show up, this gives me a clear, methodical way to iterate. It turns what used to be trial-and-error frustration into a structured process for actually getting the results I need.
Head of Product at Count
I can now pinpoint errors and measure reductions in each error bucket—turning guesswork into measurable improvement.
When I first built a small chatbot to recommend books based on user mood, it often gave wildly off-base suggestions—like pairing someone “feeling nostalgic” with a cutting-edge tech thriller. Back then, I just tweaked the prompt or guessed at what the model might “understand” about mood. It was trial and error with no clear sense of what was actually going wrong.
After this course, I’d tackle the problem systematically. I’d collect failures by running the bot across a fixed set of test prompts and logging every mismatch. Then I’d open code the bad outputs—labels like “misread tone,” “genre bias,” or “keyword fixation.” From there, I’d define key dimensions of failure (emotional alignment, genre diversity, keyword vs. context) and group them into a taxonomy, like “semantic misinterpretation.” By quantifying how often each type occurs, I’d know where to focus first.
Armed with that data, I could design targeted fixes: refining prompts with explicit mood-to-genre mappings, adding checks for emotional themes, or diversifying candidate genres. Instead of hacking prompts by gut feel, I’d have a transparent, repeatable process that shows whether error rates are actually dropping.
Lead Developer at Logic20/20
I can now predict and prevent code quality issues instead of treating them as isolated bugs.
I often ran into code quality issues when using AI assistants, but I didn’t have a structured way to make sense of them. Before this course, I would just label outputs as “messy code” without really digging into the underlying problems.
After this course, I now analyze them systematically across dimensions—things like hardcoded tests, long methods, poor formatting, bad naming, poor architecture choices, duplication, dead code, or ignoring available quality tools. By open coding these issues and building a taxonomy, I can see patterns emerge instead of treating each problem as random or isolated.
The key shift for me is realizing these aren’t one-off mistakes but systematic failure modes that appear under specific conditions. With that understanding, I can both predict and prevent quality issues, rather than just reacting to them after the fact.
Expert AI Research Scientist at Datasite
Gained clarity on what to fix first, transforming my entire approach to evolving the system.
I applied what I learned the very same day we covered error analysis. I was working on an industry classification system and followed a structured process: I asked annotators to provide detailed feedback on wrong predictions, reviewed their notes to improve annotation quality, then parsed all the feedback and used ChatGPT to categorize it into six major error patterns. Finally, I shared those patterns and error percentages with stakeholders.
After this course, error analysis feels much more structured. Instead of just collecting feedback in an ad hoc way, I now have a clear method that gives me visibility into what problems matter most and what to solve first. It’s changed how I think about evolving the system overall.
AI R&D Lead at Diligent
I built a structured understanding of failures, yielding actionable insights instead of whack-a-mole fixes.
Now I understand how to systematically explore the problem space, identify patterns across multiple failures, and build a structured understanding of why and when the system fails - not just that it fails. This leads to more actionable insights for improvement rather than playing whack-a-mole with individual issues.
Product Design
I gained clarity and confidence to systematically narrow the gap between AI failures and human understanding.
I’m a product designer with no prior AI Evals experience. Before this course, when I encountered unexpected or confusing results from the Recipe Bot in the first homework, my instinct was to just iterate on the system prompt in Cursor and manually test through the UI.
After this course, I’ve learned there’s a more systematic way to approach error analysis. Using open and axial coding, I can narrow the gap between AI system failures and human understanding through a step-by-step process. I especially appreciate that this framework is grounded in social science research practices like coding data and building taxonomies—and that it emphasizes doing the analysis manually to ensure accuracy, rather than offloading it entirely to AI.
I also see the value in wearing both the data scientist and product manager hats: questioning the data rigorously while bringing product knowledge into the decision-making. This approach gives me a structured, repeatable way to analyze failures instead of ad hoc trial and error.
Software Engineer at Edua
I stopped endless prompting and now systematically document failures to improve outcomes and efficiency.
In automated agentic code generation, I often ran into situations where the desired output was far from what the model produced. My old approach was to keep prompting the LLM until progress stalled, then spin up a new chat with a rephrased prompt and updated context. Eventually I’d accept whatever was “good enough” and finish the task myself.
After this course, I understand why that approach was limited. Evaluating code has two axes: reference-based (objective tests like unit tests) and reference-free (qualitative measures of style, readability, and design). Code isn’t just functional—it’s also expressive, like writing prose—so both dimensions matter.
Now, instead of endless prompt tweaking, I document failures in short form through open coding, then group and categorize them using axial coding. This helps me identify common failure patterns in the LLM’s output and design more robust system prompts targeted at those issues. What used to be trial-and-error guesswork is now a structured process for improving both the reliability and quality of generated code.
AI Team Leader at Comtrac
I now have the clarity and confidence to diagnose failures instead of ‘living on a prayer’.
At work, we use prompts and prompt engineering to turn selected inputs into specific outputs. Before this course, whenever I ran into unexpected results, my approach was to jump straight into the prompt and randomly change words until something worked. After a few tries, I might even hand the prompt, input, and output to an LLM and ask it to fix things. There was no hypothesis, no structure—just living on a prayer.
After this course, I have a far more systematic approach. If I encounter a problem now, I’d begin by collecting an initial dataset of around 100 traces. From there, I’d perform open and axial coding to build a taxonomy of failures. That structure gives me clarity about what’s really going wrong instead of just chasing random fixes.
What stands out to me is that the processes in this course are simple—not in the sense of easy, but in being concise and straightforward while still requiring real effort and understanding. As Richard Feynman said, "If you can explain something in simple terms, you understand it well." That's exactly how Hamel and Shreya have designed this course, and I'm grateful for it.
Staff Engineer at Zenity
I can now pinpoint agents' core failures, turning vague vibes into clear, actionable fixes that improve agent performance.
The axial coding just hit different. Before this course, my approach to failures was more of a “vibe investigation,” poking around without a clear structure.
After this course, I now cluster failures systematically and trace them back to their core issues. Sorting the errors into meaningful groups makes it much easier to see the main failure points. I finally feel like I have a proper way to identify the root problems in my agent instead of just guessing.
Research Engineer at Ai2 Israel
I gained clarity to find root causes and stop repeated agent confusion.
At work, we’re building Paper Finder, which (as the name suggests) should find papers. We wanted the agent to refuse certain requests so people wouldn’t treat it like a free ChatGPT. But we kept running into a strange behavior: the agent would refuse, ask the user a clarifying question, the user would reply “yes,” and then the agent would have no idea what they were talking about.
Before this course, we would have just dug through the logs, checked for crashes, and treated it like any other bug.
After this course, I’d handle it differently. I’d look closely at the traces of these failures, identify common patterns, form a hypothesis about why it was happening, and then test it systematically. In this case, the real issue was that history wasn’t being shared between two components: one asked the question, the other just saw “yes” with no context. By approaching it through error analysis, the root cause becomes clearer and easier to solve.
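The root cause in this story—one component asks a clarifying question, another sees only "yes"—comes down to components not sharing conversation state. A minimal sketch of the fix, with invented component names and messages (this is not Paper Finder's actual code):

```python
# Both components read and write ONE shared history, so a bare "yes"
# arrives with the context of the question that prompted it.
history: list[dict] = []  # single source of truth for the conversation

def guardrail(user_msg: str) -> str:
    history.append({"role": "user", "content": user_msg})
    reply = "This looks out of scope. Did you mean a paper search?"
    history.append({"role": "assistant", "content": reply})
    return reply

def responder(user_msg: str) -> str:
    history.append({"role": "user", "content": user_msg})
    # Because it sees the full history, "yes" is interpretable:
    prior = history[-2]["content"] if len(history) >= 2 else ""
    if user_msg.strip().lower() == "yes" and "paper search" in prior:
        return "Great - what topic should I search for?"
    return "Could you clarify your request?"

guardrail("write my essay for me")
reply = responder("yes")
print(reply)
```

With separate per-component histories, `prior` would be empty and the agent would ask for clarification all over again—exactly the looping confusion described above.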
Founder, Product Coach at NedRock
Open coding gave me clarity into the model's real behavior, revealing failures my framework missed.
When I built a custom GPT for product managers to help write better user stories, I initially jumped straight into axial coding. I predefined categories of failure based on the INVEST framework (Independent, Negotiable, Valuable, Estimable, Small, Testable), which I often use when coaching teams. At the time, it felt like a solid, practical approach grounded in real-world product work.
After this course, I started applying open coding before forcing outputs into predefined boxes. That shift revealed patterns the INVEST framework would have completely missed. For example, some stories were overly complex even though they technically met the "Small" criterion, and others ignored edge cases or real-world exceptions not covered by INVEST at all.
Open coding gave me a clearer picture of how the model was actually behaving, rather than bending its outputs to fit categories I had assumed upfront. It’s a far more reliable way to uncover the real failure modes.
Co-Founder at Comprendo
Open coding gave us clarity on true error patterns, preventing overconfidence and costly misclassification.
Before this course, I didn’t fully appreciate the risk of skipping open coding. It’s easy to take a small sample, jump straight into categories, and gain false confidence in themes that don’t actually reflect the full range of errors. That’s the “when you only have a hammer, every problem looks like a nail” trap—imposing categories that miss important failure modes.
After this course, I see why open coding matters. It prevents premature categorization, helps me understand saturation, and surfaces the true diversity of errors. I’ve also learned to think more carefully about how evaluation rubrics should be designed. For some products, a “benevolent dictator” works—if one person truly has holistic expertise across every stage of the workflow. But for more complex systems, multiple experts are needed, each contributing perspective from their domain.
In my past work reviewing clinical trial protocols, no single reviewer understood every dimension—ethics, study design, and biostatistics each required deep, specialized expertise. The lesson from this course is clear: open coding reveals the real error space, and evaluation rubrics are strongest when designed with the right balance of expertise.
Product Manager, Analytics & AI at Axi
Sanity checks turned unreliable scores into business-aligned predictions I could trust.
I built a churn prediction model for a subscription service and evaluated it using standard metrics like accuracy, precision, and recall on a test dataset. At first, the high evaluator scores looked promising, but they gave me a false sense of confidence. In reality, the model was overfitting, producing outputs that didn’t even add up logically—for example, reporting fewer new onboarded customers than the combined total of retained and churned customers.
Before this course, I relied too heavily on evaluator scores, only realizing something was wrong when results felt “too good to be true.” I had to manually compare predictions with business reports and historical trends to uncover the discrepancies.
After this course, I know how to approach it differently. I would run cross-validation across multiple folds to confirm stability, add domain-specific sanity checks (like validating customer balances against business logic), and bring in qualitative stakeholder input. These practices create a stronger evaluation process—less dependent on raw metrics and more aligned with real-world trustworthiness.
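A domain-specific sanity check of the kind described above can be a short function run before any accuracy metric is trusted. The field names and thresholds here are illustrative, not from the author's actual model:

```python
# Predicted customer counts must satisfy basic accounting identities
# before precision/recall numbers mean anything.
def sanity_check(pred: dict) -> list[str]:
    problems = []
    if pred["retained"] + pred["churned"] != pred["total_customers"]:
        problems.append("retained + churned != total_customers")
    if any(pred[k] < 0 for k in ("retained", "churned", "new_onboarded")):
        problems.append("negative customer count")
    if pred["new_onboarded"] > pred["total_customers"]:
        problems.append("more new customers than total customers")
    return problems

# An overfit model can score well yet fail exactly this kind of check:
bad = {"retained": 700, "churned": 350,
       "total_customers": 1000, "new_onboarded": 120}
print(sanity_check(bad))
```

Checks like these are cheap to write and catch the "too good to be true" cases long before a manual comparison against business reports would.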
AI Engineer, Vantager
Practical techniques that generalize regardless of the tools you use.
Principal Data/AI Scientist/Engineer, Slido/Cisco
This course teaches material you can't find anywhere else. Investing in this course is a no-brainer.
"Why would a Principal Data Scientist take a course on evals? Shouldn't they know this already?!" Fair question. Here's why I think it's still worth it:
1. Learn from the best. LLM evals are still nascent, so learning from people doing this full-time across multiple contexts is invaluable. Game recognizes game, and as you'll learn in the very first week, Shreya and Hamel are top-tier.
2. Get the full picture. Evals are more art than science right now. Getting a coherent view of best practices and mature end-to-end pipelines designed from first principles is rare. Their course reader alone is worth multiple times the price.
3. Build common vocabulary. If you're building impactful LLM products, you'll collaborate with PMs. Having both technical folks and PMs in sessions creates a shared language that bridges the gap, something you can't find anywhere else for this topic.
In other words, whether you're a PM, a Principal, or a vibe coder building with LLMs, this course is simply a no-brainer.
Director of AI
This course helps you transform guesswork into actionable insights.
Machine Learning Engineer
Highly recommend this course.
Senior Director of Machine Learning, SponsorUnited
An absolute must. Valuable for any AI engineer and product manager.
The AI Evals course by Shreya and Hamel is an absolute must for everyone serious about putting AI applications into production. I have been following Hamel's and Shreya's work for quite some time, and it was really awesome to learn all the concepts from them: error analysis, measurement best practices, LLM-as-Judge and how to make it reliable with human evaluation, collaborative analysis of errors, evaluation of multi-turn chats, creating datasets for CI/CD, and more. The last topic, accuracy and cost optimization, is really useful, as we are seeing in our applications when scaling. All in all, this is an amazing set of vital information that is valuable for any AI engineer and product manager. I highly recommend this course to everyone.
Senior Data Scientist @Amazon
Account Director, OpenAI
This course exceeded my expectations.
Data Scientist, Global Innovation Hub
Self
Data Scientist at Tiger Analytics
The most practical AI course I've taken, with immediate value.
Dev Team Lead
Highly recommend this course!
Data Scientist
Amazing instructors.
Head of Product, Tavus
Great insights that are shaping how we evaluate AI products.
Taking the AI Evals course with Hamel and Shreya has been really valuable. The course has given me a solid framework that's already shaping how we evaluate our AI products. The homework mirrors real work challenges, and guest speakers bring great insights.
Software Engineer
I learned how to be truly effective in creating LLM-powered applications
I have a career developing software, and I've been tinkering with LLMs since before ChatGPT. I feel like the practical eval techniques that Shreya and Hamel teach in their course are what I needed to glue these two skills together and become truly effective in creating LLM-powered applications. Developing for LLMs is not like traditional software development, and evals are the big difference.
Software Engineer, Google
Comprehensive and practical curriculum
Indispensable for Robust AI Development

The "AI Evals For Engineers & PMs" course provided an indispensable framework for evaluating LLM applications, fundamentally shifting my approach from guesswork to data-driven measurement. My key takeaway is the Analyze-Measure-Improve lifecycle, coupled with the "Three Gulfs" model for pinpointing failure origins. The rigorous methodology for building and validating LLM-as-Judge evaluators—including bias correction and confidence intervals—is a game-changer for trusting subjective evaluations. Hamel Husain and Shreya Shankar are truly experts, delivering a comprehensive and practical curriculum that directly addresses the challenges of building reliable AI in a dynamic environment. This course is a must for anyone serious about improving their AI development process.
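One concrete instance of the judge bias correction mentioned above is the standard Rogan-Gladen style estimator: invert the judge's observed pass rate using its true-positive and true-negative rates measured on a human-labeled validation set. The course's exact procedure may differ, and the numbers below are made up for illustration:

```python
# Observed pass rate mixes real passes the judge catches (tpr) with
# real failures it wrongly passes (1 - tnr):
#   p_obs = tpr * theta + (1 - tnr) * (1 - theta)
# Solving for theta gives the corrected true success rate.
def corrected_success_rate(p_obs: float, tpr: float, tnr: float) -> float:
    theta = (p_obs + tnr - 1.0) / (tpr + tnr - 1.0)
    return min(1.0, max(0.0, theta))  # clamp to a valid probability

# Judge says 80% pass, but misses 10% of real passes (tpr = 0.9)
# and wrongly passes 20% of real failures (tnr = 0.8):
print(round(corrected_success_rate(0.80, 0.90, 0.80), 3))
```

A confidence interval for the corrected rate can then be obtained by bootstrapping over both the observed scores and the labeled set used to estimate `tpr` and `tnr`.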
Palette, CPO
A Masterclass in Practical AI Evaluation.
From Benchmark to Moat: A Masterclass in Practical AI Evaluation

This course is at the cutting edge of AI research, and not just in theory. What stood out most to me is how deeply practical it is: it teaches you how to build evals that work for your own product, define product taste by sharpening what "good output" really means, and, most importantly, how to scale this method across teams and decisions. The biggest shift for me was reframing evals not as a benchmark to clear but as a strategic moat: core to how your product learns, evolves, and differentiates. As someone from a non-technical background, I could still grasp the concepts (even if the code got heavy at times). The community around the course is a major bonus, full of helpful discussions, fresh perspectives, and constant knowledge exchange. The guest lectures were especially valuable, showing how companies apply these ideas in the wild and how they tailor their evaluation frameworks to suit specific needs and constraints. I'd highly recommend this course to anyone building with AI, especially those who want to go beyond shipping models to shaping real-world, high-trust outcomes.
Soothien HealthTech Advisory
A must for any developer or PM building AI products.
I'm a physician and have built health tech and health AI solutions, but I'm not overly technical. This course was eye-opening about the importance of AI evaluations. It's a must for any developer or PM building AI for enterprise or regulated industries. This is what will make AI products reliable. Hamel and Shreya are amazing, and so are their top-notch guest lecturers. I took this course because I wanted to learn from the industry leaders actually doing the work. You'll learn the entire process of building AI evaluations not just by reading, but also by doing—that is the technical component. Using Windsurf and Claude, I was able to complete it even though I don't code as part of my main job. It's well worth the effort. This course is dense, especially if you do not code or have familiarity with statistics. My background in medicine and healthcare statistics helped me understand some of the core concepts. Overall, this is an amazing course and an essential skill set for building AI applications in healthcare or other enterprise settings. I'm recommending it to all my colleagues.
Machine Learning Engineer | Co-Founder at HazAdapt
Good course if you want to build products people actually trust.
Coming from recommendation systems and a UX background, I knew one narrow kind of evaluation: I'd run some A/B tests, check a few metrics, and call it good. But my approach to AI evals was completely naive. I used no systematic method and hoped things would work. This evals course gave me the structure I was missing. The Three Gulfs framework explained why I kept unknowingly failing: we don't understand our data (Comprehension), we write vague prompts (Specification), and models behave unpredictably on real inputs (Generalization). The analyze-measure-improve cycle felt familiar from UX research, but applied to AI. Instead of guessing what's broken, you look at failures first, build automated evaluators, and then make targeted improvements. This creates a flywheel where each cycle makes your product better. Learning about others' LLM production failures was a huge plus of this course: e.g., hearing about VLMs giving different results 18/55 times at temperature 0, and Shreya showing how model cascades cut her costs by 50%. Successful AI products need humans to regularly review outputs; there's no way around it. Good course if you want to build products people actually trust. Evaluation separates demos from deployments.
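The model-cascade idea mentioned above can be sketched in a few lines: send every input to a cheap model first and escalate to an expensive model only when the cheap one is unsure. The models, costs, and confidence threshold below are invented stand-ins, not the actual setup behind the 50% figure:

```python
# Toy cheap model: pretend short inputs are easy and answered confidently.
def cheap_model(x: str) -> tuple[str, float]:
    return ("ok", 0.95) if len(x) < 20 else ("ok", 0.55)

def expensive_model(x: str) -> str:
    return "ok"

CHEAP_COST, EXPENSIVE_COST = 1, 20  # relative per-call costs

def cascade(x: str, threshold: float = 0.9) -> tuple[str, int]:
    answer, confidence = cheap_model(x)
    if confidence >= threshold:
        return answer, CHEAP_COST          # cheap model handled it
    # Unsure: pay for both calls, keep the expensive model's answer.
    return expensive_model(x), CHEAP_COST + EXPENSIVE_COST

inputs = ["short query"] * 8 + ["a much longer and harder query"] * 2
total = sum(cascade(x)[1] for x in inputs)
print(total)  # vs. 10 * EXPENSIVE_COST if everything went to the big model
```

The savings depend entirely on how often the cheap model is confidently right, which is itself something you measure with evals rather than assume.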
Technology Director - Wells Fargo
A fantastic course offering an in-depth practical approach to evals.
Founder, Supago Inc.
This course completely transformed my approach to building AI applications.
AI Consultant
Absolutely recommend this course to anyone building AI applications.
Senior Product Manager at Redfin
Error analysis (and this course) is all you need
Error analysis is all you need. This is the idea that gets drilled into your head over and over again in the AI Evals course. It's so simple, but it's profound... and it's actually way more complicated than you think once you start to consider multi-turn conversations, retrieval systems, agentic systems, multimodal inputs, and more. Shreya and Hamel have distilled the state of the art in AI evals (and often in development itself!) in this amazing class. Some of my favorite highlights:
- Build a custom data annotation app! I was so intimidated by this, but I finally made the leap and vibe-coded something in an afternoon. It has 10x'd my ability to review conversations.
- It's okay to do a little pre-thinking around failure modes, but they really should EMERGE from your testing. It's really hard to build LLM judges, so be really thoughtful about what you build them for.
- Often, the biggest impact comes from talking disagreements out and figuring out why there is a disagreement in the first place: are your goals unclear? This seemingly technical course has made me a better PM.
- And finally, folks in the course just know every AI tool out there. I learned about WhisperFlow, and my workflow for typing has changed!
Founder, Searchkernel LLC
This course is the best place to learn evals
Independent AI Engineer
An essential resource for engineers & PMs
For AI Builders Hoping LLMs Will Fix It All
This course has provided an exceptionally clear and systematic framework for approaching LLM evaluation. The comprehensive introduction to the Analyze-Measure-Improve lifecycle, alongside the detailed exploration of the Three Gulfs Model (Comprehension, Specification, Generalization), significantly deepened my understanding of the challenges inherent in building effective LLM pipelines. Particularly impactful was the practical guidance on error analysis—learning how to systematically categorize failure modes using open and axial coding, then translating qualitative insights into robust quantitative metrics. The deep dive into Automated Evaluators, including both Code-Based and LLM-as-Judge evaluators, was especially valuable. Learning how to craft strong judge prompts and rigorously validate them using training, development, and test sets to ensure alignment with human preferences was eye-opening. The course also provided practical methods for estimating true success rates and quantifying uncertainty, which is vital for understanding actual pipeline performance beyond raw observed scores, along with advice on designing efficient human review interfaces that significantly increase labeling throughput. Most importantly, this course illuminated a critical shift in mindset—from traditional software development towards an iterative, human-centric evaluation approach—making it an essential resource for engineers, product managers, and data scientists looking to confidently address real-world LLM evaluation challenges.
Scarlet AI
This course changed how I approach AI projects. Instructors provide great support.
The AI Evals course with Hamel and Shreya changed how I approach AI projects and consulting clients. I’ve picked up practical skills in systematically analyzing model errors and designing meaningful evaluations, making the whole AI dev process clearer. Having access to a private community of experienced AI engineers and direct support from Hamel and the team has been especially valuable—they’re always quick to answer questions or help with real-world problems. Highly recommend this course for anyone building AI products or consulting in the space!
Consultant, Silver Stripe Software
Learn how to put evals into practice. Practical and hands on instruction.
Computational Linguist at ATENTO
Full Stack Computational Linguist, Bad Idea Factory
Now I can design meaningful evals! Highly recommend this course.
Senior Director, AI
This course is comprehensive in a way that's hard to find elsewhere.
This course is a great place for PMs and engineers to learn practical tactics for building real-world AI applications. I've recommended it to people who want both a starting point and deeper knowledge about evals and implementation. Hamel brings in excellent speakers who share different techniques and insights from some really smart people in AI. Evals are super important, and what I appreciate about Hamel's approach is how he walks through data analysis tactics — this is especially helpful for anyone newer to this kind of evaluation work. Just having evals isn't enough — you need to think strategically about what you're evaluating and your methodology beforehand. With so much out there, even really talented engineers can benefit from having all the key considerations for applied AI building brought together in one place. This course does exactly that - it's comprehensive in a way that's hard to find elsewhere. Hamel and Shreya put a lot of thought into the materials, and I can confirm from my own building experience that this covers the real considerations we're dealing with day-to-day (and have learned over 18+ months of trial and error!) without all the noise and buzzwords.
CEO, Fern AI, AI for Legal
This course is worth the time. Take it.
Hardware Engineering Leader at Cisco
This course is a game changer.
Founder, Socratify
1000x ROI
Author and Principal at Feldroy, LLC / Software Artisan at Kraken Tech
Pragmatic techniques, free of jargon.
What I learned were effective techniques for expediting quality improvements in AI applications. We were taught practical methodologies based on straightforward metrics that keep humans in the loop to ensure the quality of results. Hamel and Shreya were quite good at explaining all terms with real-world examples taken from experience. They didn't load the course with jargon. The homework exercises were challenging yet achievable. It's been fun and educational to get the work done. I recommend the course to anyone who wants to learn incredible tricks and tips for building AI applications.
Data Scientist
Tools to quantitatively improve your AI product
Hamel and Shreya do such a great job at equipping you with the tools to quantitatively improve your AI product. This is a must take course for anyone working with LLM powered applications.
Data Scientist
Course Instructors Went Above & Beyond
Wayde Gilliam
"If you are building with AI, you need this course!"
Founder, Wicked Data LLC
"Take this course to go from a good to a great AI Engineer!"
Owner at Kentro Tech LLC
"Practical techniques rarely taught elsewhere. Highly recommend!"
Senior Technical Program Manager, Netflix
This course helps you get expected outcomes from your AI
Software Engineer at Edua
Removed a malicious system prompt and reversed falling engagement—user interactions increased.
Before this course, my instinct was to jump straight into axial coding. That meant I leaned heavily on my own presuppositions about what failures I thought would show up. By doing that, I was blind to unexpected issues. It’s like hearing about someone before meeting them—you imagine who they are, but until you actually meet them, you don’t see the full picture. With data products and LLM pipelines, the same thing happens.
Take a healthcare chatbot as an example. Going in, I assumed failures would only be factual: did it answer the medical question correctly? If I jumped straight into axial coding, I’d only tag factual errors and conclude the model was nearly flawless. From that narrow view, I might even think the product was destined for massive success.
But after this course, I learned to take a step back and examine the data without presuppositions. By looking at traces more openly, I discovered a hidden failure mode: the chatbot was mean. It was calling people “fat,” “ugly,” “stupid,” and generally creating a hostile experience. No factual errors—just a terrible user experience. This was something axial coding alone, or automated LLM-as-a-judge evaluation, would have missed without prior human review.
Digging deeper, I found the root cause: a disgruntled former employee had slipped “be mean when answering” into the system prompt. Once we fixed that, user engagement improved dramatically. The key lesson I took from the course is that real error analysis starts with open coding and direct observation. Skipping that step leaves you blind to the most important problems.
Lead PM - AI / ML Products at CultureAmp
Turned costly trial-and-error into a data-driven plan that avoided massive retraining and prioritized fixes.
I worked with a supermarket chain to build an AI system that could count inventory from shelf photos. At first, the system struggled with issues like blurry images, background clutter, and confusingly similar packaging. Before this course, my approach would have been driven by intuition and trial-and-error. I might have looked at a handful of errors, jumped to a conclusion like “the model is just bad at distinguishing Coke cans,” and proposed a vague fix such as retraining with thousands of new images. That would have been expensive, slow, and unfocused—and it might not have solved the real problem, like blurry photos from staff.
After this course, my approach is now structured and data-driven. Instead of guessing, I use error analysis to diagnose issues systematically. I start by gathering a representative failure set and tagging images to capture why errors occur—blurry images, poor lighting, occlusion, similar or new packaging, unusual angles, background clutter. From there, I group these into a taxonomy of failures and calculate how much each category contributes to overall errors. This creates a prioritized roadmap for improvement.
For example, when Image Quality and Similar Classes accounted for 75% of failures, I could recommend high-impact, targeted fixes: improve photo capture guidelines and augment training data with blurred images for the first, and collect more Diet Coke vs. Coke Zero examples for the second. Instead of vague trial-and-error, I now have a clear, quantitative path to better results.
Business Operations and Development
Saved me hours of rewriting by creating a reusable framework that prevents repeated AI errors.
As a product manager, I often struggled with inconsistencies in user stories generated by AI tools. Even when my prompts were clear, the outputs would miss key requirements or include irrelevant details. Before this course, my instinct was to keep tweaking the prompt through trial and error until I got something usable. While that sometimes worked, it was inefficient and didn’t explain why the model was failing.
After this course, my approach is much more systematic. I start by defining the key dimensions of a good user story—clarity, completeness, alignment with acceptance criteria, and the right level of technical detail. Then I collect flawed outputs and apply open coding to label issues like “missing acceptance criteria,” “misinterpreted intent,” or “overly generic details.” From there, I build a taxonomy of failure types, which lets me organize and prioritize problems. Finally, I design a feedback loop: the LLM generates a user story, checks it against the taxonomy, and revises if any known issues are detected.
Instead of wasting hours on one-off fixes, I now have a reusable framework that scales across projects. What was once frustrating trial-and-error has become a structured, repeatable process for improving quality.
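The generate-check-revise loop described in that testimonial could be sketched as below. This is a minimal illustration, not the reviewer's actual tooling: the keyword check in `check_against_taxonomy` and the `call_llm` callable are hypothetical stand-ins.

```python
# Sketch of a taxonomy-checked revision loop: generate a user story,
# detect known failure types, and ask the model to revise if any are found.

TAXONOMY = [
    "missing acceptance criteria",
    "misinterpreted intent",
    "overly generic details",
]

def check_against_taxonomy(story: str) -> list[str]:
    """Return known failure types detected in a draft (naive keyword check)."""
    issues = []
    if "acceptance criteria" not in story.lower():
        issues.append("missing acceptance criteria")
    # Real checks for the other taxonomy entries would go here.
    return issues

def generate_user_story(call_llm, requirements: str, max_revisions: int = 2) -> str:
    """Generate a story, then revise until clean or the revision budget runs out."""
    story = call_llm(f"Write a user story for: {requirements}")
    for _ in range(max_revisions):
        issues = check_against_taxonomy(story)
        if not issues:
            break
        story = call_llm(
            f"Revise this user story to fix: {', '.join(issues)}\n\n{story}"
        )
    return story
```

The key design point is that the check encodes the failure taxonomy once, so every future generation is screened against the same known issues instead of being eyeballed ad hoc.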
CRO at Agendor
I turned scattered agent errors into prioritized fixes, enabling focused, measurable improvements.
Building a personal assistant for salespeople is my day-to-day work. One of the tools the agent uses fetches activities from the CRM, but I noticed the LLM sometimes hallucinated—passing unnecessary arguments when calling the tool. Before this course, I would have gone straight into prompt engineering, rewriting tool descriptions or adding more examples to try to fix the issue.
After this course, my approach is different. I start by defining key dimensions such as user persona, intent (e.g., “fetch activities”), and activity type (past due, finished, pending). From there, I can ask an LLM to generate tuples from these dimensions, giving me a structured way to build a synthetic eval dataset. If traces of user interactions are already logged, I filter by intent and begin open coding the different failure modes I see. After reviewing dozens or even hundreds of examples, I then use an LLM to help categorize the failures. This lets me prioritize the categories that matter most and focus fixes where they’ll have the biggest impact.
Instead of reactive prompt tweaking, I now have a systematic framework for diagnosing failures and improving my assistant in a repeatable way.
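The dimension-tuple technique mentioned above (crossing persona, intent, and activity type to seed a synthetic eval set) can be sketched in a few lines. The dimension names and values here are illustrative placeholders, not the reviewer's actual system:

```python
from itertools import product

# Hypothetical dimensions for a sales-assistant eval set (illustrative values).
dimensions = {
    "persona": ["account_exec", "sales_manager"],
    "intent": ["fetch_activities", "create_activity"],
    "activity_type": ["past_due", "finished", "pending"],
}

# The Cartesian product yields one tuple per unique combination of values;
# each tuple can then be handed to an LLM to turn into a realistic test query.
keys = list(dimensions)
tuples = [dict(zip(keys, combo)) for combo in product(*dimensions.values())]

print(len(tuples))  # 2 * 2 * 3 = 12 combinations
```

Enumerating combinations first ensures the synthetic dataset covers the space evenly instead of over-sampling whatever queries come to mind.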
QA Engineer :) at Qazaco
Turned random fixes into a repeatable process that improved the whole system and proved changes actually worked.
Before this course, I would just fix issues as I spotted them—tweak a prompt here, change a setting there—and hope the next run looked better. Sometimes it worked, but I never had the full picture of what was really going wrong or how often certain problems appeared.
After this course, I’ve learned to slow down at the start: define what I actually want to measure (relevance, completeness, context handling), collect a solid set of examples, and trace where errors first start to show up. From there, I group similar issues into clear failure types, which makes patterns obvious and helps me prioritize what to fix.
Now the process feels less like random whack-a-mole and more like a structured, repeatable system. Instead of chasing one-off issues, I can improve the whole system and know whether the changes are actually working.
CEO at Argo Analytics
Structured error analysis gave me a clearer method to iterate and actually get the results I needed.
A while back, I used an AI writing assistant to draft a personal statement for a fellowship. I gave it a detailed prompt with my goals, values, and experience, but the output was generic and missed the emotional tone I wanted. At first, I just kept rephrasing the prompt, hoping it would eventually get it right. Instead, it swung between being too formal or inventing details I never mentioned. It was frustrating, and trial-and-error didn’t get me far.
After this course, I’d approach it completely differently. I’d start by defining what “good” means for the task—tone alignment, factual accuracy, and personal relevance. Then I’d collect flawed outputs and open code them: did the model invent details, ignore parts of the prompt, or lose the emotional tone? From there, I’d build a taxonomy of failures—like hallucination, tone mismatch, or misunderstanding the prompt—and use it to spot patterns. Maybe I’d realize the model struggles when the prompt is too abstract or lacks emotional cues.
Compared to my old approach of hoping a better version would show up, this gives me a clear, methodical way to iterate. It turns what used to be trial-and-error frustration into a structured process for actually getting the results I need.
Head of Product at Count
I can now pinpoint errors and measure reductions in each error bucket—turning guesswork into measurable improvement.
When I first built a small chatbot to recommend books based on user mood, it often gave wildly off-base suggestions—like pairing someone “feeling nostalgic” with a cutting-edge tech thriller. Back then, I just tweaked the prompt or guessed at what the model might “understand” about mood. It was trial and error with no clear sense of what was actually going wrong.
After this course, I’d tackle the problem systematically. I’d collect failures by running the bot across a fixed set of test prompts and logging every mismatch. Then I’d open code the bad outputs—labels like “misread tone,” “genre bias,” or “keyword fixation.” From there, I’d define key dimensions of failure (emotional alignment, genre diversity, keyword vs. context) and group them into a taxonomy, like “semantic misinterpretation.” By quantifying how often each type occurs, I’d know where to focus first.
Armed with that data, I could design targeted fixes: refining prompts with explicit mood-to-genre mappings, adding checks for emotional themes, or diversifying candidate genres. Instead of hacking prompts by gut feel, I’d have a transparent, repeatable process that shows whether error rates are actually dropping.
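The quantification step that testimonial describes (counting how often each open-coded failure type occurs so the biggest bucket gets fixed first) is simple to sketch. The labels below are invented examples in the spirit of the review, not real annotation data:

```python
from collections import Counter

# Hypothetical open-coded labels from reviewing bad book recommendations.
labels = [
    "misread_tone", "genre_bias", "misread_tone", "keyword_fixation",
    "misread_tone", "genre_bias", "keyword_fixation", "misread_tone",
]

counts = Counter(labels)
total = len(labels)

# Report each failure type's share of all errors, largest first,
# to turn raw annotations into a prioritized fix list.
for label, n in counts.most_common():
    print(f"{label}: {n}/{total} ({n / total:.0%})")
```

Even this trivial tally changes the conversation: instead of "the bot feels off," you can say half the failures are tone misreads and target that first.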
Lead Developer at Logic20/20
I can now predict and prevent code quality issues instead of treating them as isolated bugs.
I often ran into code quality issues when using AI assistants, but I didn’t have a structured way to make sense of them. Before this course, I would just label outputs as “messy code” without really digging into the underlying problems.
After this course, I now analyze them systematically across dimensions—things like hardcoded tests, long methods, poor formatting, bad naming, poor architecture choices, duplication, dead code, or ignoring available quality tools. By open coding these issues and building a taxonomy, I can see patterns emerge instead of treating each problem as random or isolated.
The key shift for me is realizing these aren’t one-off mistakes but systematic failure modes that appear under specific conditions. With that understanding, I can both predict and prevent quality issues, rather than just reacting to them after the fact.
Expert AI Research Scientist at Datasite
Gained clarity on what to fix first, transforming my entire approach to evolving the system.
I applied what I learned the very same day we covered error analysis. I was working on an industry classification system and followed a structured process: I asked annotators to provide detailed feedback on wrong predictions, reviewed their notes to improve annotation quality, then parsed all the feedback and used ChatGPT to categorize it into six major error patterns. Finally, I shared those patterns and error percentages with stakeholders.
After this course, error analysis feels much more structured. Instead of just collecting feedback in an ad hoc way, I now have a clear method that gives me visibility into what problems matter most and what to solve first. It’s changed how I think about evolving the system overall.
AI R&D Lead at Diligent
I built a structured understanding of failures, yielding actionable insights instead of whack-a-mole fixes.
Now I understand how to systematically explore the problem space, identify patterns across multiple failures, and build a structured understanding of why and when the system fails - not just that it fails. This leads to more actionable insights for improvement rather than playing whack-a-mole with individual issues.
Product Design
I gained clarity and confidence to systematically narrow the gap between AI failures and human understanding.
I’m a product designer with no prior AI Evals experience. Before this course, when I encountered unexpected or confusing results from the Recipe Bot in the first homework, my instinct was to just iterate on the system prompt in Cursor and manually test through the UI.
After this course, I’ve learned there’s a more systematic way to approach error analysis. Using open and axial coding, I can narrow the gap between AI system failures and human understanding through a step-by-step process. I especially appreciate that this framework is grounded in social science research practices like coding data and building taxonomies—and that it emphasizes doing the analysis manually to ensure accuracy, rather than offloading it entirely to AI.
I also see the value in wearing both the data scientist and product manager hats: questioning the data rigorously while bringing product knowledge into the decision-making. This approach gives me a structured, repeatable way to analyze failures instead of ad hoc trial and error.
Software Engineer at Edua
I stopped endless prompting and now systematically document failures to improve outcomes and efficiency.
In automated agentic code generation, I often ran into situations where the desired output was far from what the model produced. My old approach was to keep prompting the LLM until progress stalled, then spin up a new chat with a rephrased prompt and updated context. Eventually I’d accept whatever was “good enough” and finish the task myself.
After this course, I understand why that approach was limited. Evaluating code has two axes: reference-based (objective tests like unit tests) and reference-free (qualitative measures of style, readability, and design). Code isn’t just functional—it’s also expressive, like writing prose—so both dimensions matter.
Now, instead of endless prompt tweaking, I document failures in short form through open coding, then group and categorize them using axial coding. This helps me identify common failure patterns in the LLM’s output and design more robust system prompts targeted at those issues. What used to be trial-and-error guesswork is now a structured process for improving both the reliability and quality of generated code.
AI Team Leader at Comtrac
I now have the clarity and confidence to diagnose failures instead of ‘living on a prayer’.
At work, we use prompts and prompt engineering to turn selected inputs into specific outputs. Before this course, whenever I ran into unexpected results, my approach was to jump straight into the prompt and randomly change words until something worked. After a few tries, I might even hand the prompt, input, and output to an LLM and ask it to fix things. There was no hypothesis, no structure—just living on a prayer.
After this course, I have a far more systematic approach. If I encounter a problem now, I’d begin by collecting an initial dataset of around 100 traces. From there, I’d perform open and axial coding to build a taxonomy of failures. That structure gives me clarity about what’s really going wrong instead of just chasing random fixes.
What stands out to me is that the processes in this course are simple—not in the sense of easy, but in being concise and straightforward while still requiring real effort and understanding. As Richard Feynman said, “if you can explain something in simple terms, you understand it well.” That’s exactly how Hamel and Shreya have designed this course, and I’m grateful for it.
Staff Engineer at Zenity
I can now pinpoint agents' core failures, turning vague vibes into clear, actionable fixes that improve agent performance.
The axial coding just hit different. Before this course, my approach to failures was more of a “vibe investigation,” poking around without a clear structure.
After this course, I now cluster failures systematically and trace them back to their core issues. Quantizing the errors into meaningful groups makes it much easier to see the main failure points. I finally feel like I have a proper way to identify the root problems in my agent instead of just guessing.
Research Engineer at Ai2 Israel
I gained clarity to find root causes and stop repeated agent confusion.
At work, we’re building Paper Finder, which (as the name suggests) should find papers. We wanted the agent to refuse certain requests so people wouldn’t treat it like a free ChatGPT. But we kept running into a strange behavior: the agent would refuse, ask the user a clarifying question, the user would reply “yes,” and then the agent would have no idea what they were talking about.
Before this course, we would have just dug through the logs, checked for crashes, and treated it like any other bug.
After this course, I’d handle it differently. I’d look closely at the traces of these failures, identify common patterns, form a hypothesis about why it was happening, and then test it systematically. In this case, the real issue was that history wasn’t being shared between two components: one asked the question, the other just saw “yes” with no context. By approaching it through error analysis, the root cause becomes clearer and easier to solve.
Founder, Product Coach at NedRock
Open coding gave me clarity into the model's real behavior, revealing failures my framework missed.
When I built a custom GPT for product managers to help write better user stories, I initially jumped straight into axial coding. I predefined categories of failure based on the INVEST framework (Independent, Negotiable, Valuable, Estimable, Small, Testable), which I often use when coaching teams. At the time, it felt like a solid, practical approach grounded in real-world product work.
After this course, I started applying open coding before forcing outputs into predefined boxes. That shift revealed patterns the INVEST framework would have completely missed. For example, some stories were overly complex even though they technically met the “Small” criteria, and others ignored edge cases or real-world exceptions not covered by INVEST at all.
Open coding gave me a clearer picture of how the model was actually behaving, rather than bending its outputs to fit categories I had assumed upfront. It’s a far more reliable way to uncover the real failure modes.
Co-Founder at Comprendo
Open coding gave us clarity on true error patterns, preventing overconfidence and costly misclassification.
Before this course, I didn’t fully appreciate the risk of skipping open coding. It’s easy to take a small sample, jump straight into categories, and gain false confidence in themes that don’t actually reflect the full range of errors. That’s the “when you only have a hammer, every problem looks like a nail” trap—imposing categories that miss important failure modes.
After this course, I see why open coding matters. It prevents premature categorization, helps me understand saturation, and surfaces the true diversity of errors. I’ve also learned to think more carefully about how evaluation rubrics should be designed. For some products, a “benevolent dictator” works—if one person truly has holistic expertise across every stage of the workflow. But for more complex systems, multiple experts are needed, each contributing perspective from their domain.
In my past work reviewing clinical trial protocols, no single reviewer understood every dimension—ethics, study design, and biostatistics each required deep, specialized expertise. The lesson from this course is clear: open coding reveals the real error space, and evaluation rubrics are strongest when designed with the right balance of expertise.
Product Manager, Analytics & AI at Axi
Sanity checks turned unreliable scores into business-aligned predictions I could trust.
I built a churn prediction model for a subscription service and evaluated it using standard metrics like accuracy, precision, and recall on a test dataset. At first, the high evaluator scores looked promising, but they gave me a false sense of confidence. In reality, the model was overfitting, producing outputs that didn’t even add up logically—for example, reporting fewer new onboarded customers than the combined total of retained and churned customers.
Before this course, I relied too heavily on evaluator scores, only realizing something was wrong when results felt “too good to be true.” I had to manually compare predictions with business reports and historical trends to uncover the discrepancies.
After this course, I know how to approach it differently. I would run cross-validation across multiple folds to confirm stability, add domain-specific sanity checks (like validating customer balances against business logic), and bring in qualitative stakeholder input. These practices create a stronger evaluation process—less dependent on raw metrics and more aligned with real-world trustworthiness.
AI Engineer, Vantager
Practical techniques that generalize regardless of the tools you use.
Principal Data/AI Scientist/Engineer, Slido/Cisco
This course teaches material you can't find anywhere else. Investing in this course is a no brainer.
"Why would a Principal Data Scientist take a course on evals? Shouldn't they know this already?!" Fair question. Here's why I think it's still worth it:
1. Learn from the best. LLM evals are still nascent, so learning from people doing this full-time across multiple contexts is invaluable. Game recognizes game, and as you'll learn in the very first week already, Shreya and Hamel are top-tier.
2. Get the full picture. Evals are more art than science right now. Getting a coherent view of best practices and mature end-to-end pipelines designed from first principles is rare. Their course reader alone is worth multiple times the price.
3. Build common vocabulary. If you're building impactful LLM products, you'll collaborate with PMs. Having both technical folks and PMs in sessions creates a shared language that bridges the gap -- something you can't find anywhere else for this topic.
In other words, whether you're a PM, a Principal or a vibe coder building with LLMs, this course is simply a no-brainer.
Director of AI
This course helps you transform guesswork into actionable insights.
Machine Learning Engineer
Highly recommend this course.
Senior Director of Machine Learning, SponsorUnited
An Absolute must. Valuable for any AI engineer and product manager.
The AI Evals Course by Shreya and Hamel is an absolute must for everyone serious about building AI applications into production. I have been following Hamel's and Shreya's work for quite some time, and it was really awesome to learn from them all the concepts of error analysis, measurement best practices, LLM-as-Judge and how to make sure it is reliable with human evaluations, collaborative analysis of errors, evaluation of multi-turn chats, creation of datasets for CI/CD, etc. The last topic, on accuracy and cost optimization, is really useful, as we are seeing in our applications when scaling. All in all, this is an amazing set of vital information that is valuable for any AI engineer and product manager. Highly recommend this course to everyone.
Senior Data Scientist @Amazon
Account Director, OpenAI
This course exceeded my expectations.
Data Scientist, Global Innovation Hub
Self
Data Scientist at Tiger Analytics
The most practical AI course I've taken, with immediate value.
Dev Team Lead
Highly recommend this course!
Data Scientist
Amazing instructors.
Head of Product, Tavus
Great insights that are shaping how we evaluate AI products.
Taking the AI Evals course with Hamel and Shreya has been really valuable. The course has given me a solid framework that's already shaping how we evaluate our AI products. The homework mirrors real work challenges, and guest speakers bring great insights.
Software Engineer
I learned how to be truly effective in creating LLM-powered applications
I have a career developing software, and I've been tinkering with LLMs since before ChatGPT. I feel like the practical eval techniques that Shreya and Hamel teach in their course are what I needed to glue these two skills together and become truly effective in creating LLM-powered applications. Developing for LLMs is not like traditional software development, and evals are the big difference.
Software Engineer, Google
Comprehensive and practical curriculum
Indispensable for Robust AI Development. The "AI Evals For Engineers & PMs" course provided an indispensable framework for evaluating LLM applications, fundamentally shifting my approach from guesswork to data-driven measurements. My key takeaway is the Analyze-Measure-Improve lifecycle, coupled with the "Three Gulfs" model for pinpointing failure origins. The rigorous methodology for building and validating LLM-as-Judge evaluators—including bias correction and confidence intervals—is a game-changer for trusting subjective evaluations. Hamel Husain and Shreya Shankar are truly experts, delivering a comprehensive and practical curriculum that directly addresses the challenges of building reliable AI in a dynamic environment. This course is a must for anyone serious about improving their AI development process.
Palette, CPO
A Masterclass in Practical AI Evaluation.
From Benchmark to Moat — A Masterclass in Practical AI Evaluation. This course is at the cutting edge of AI research—and not just in theory. What stood out most to me is how deeply practical it is: it teaches you how to build evals that work for your own product, define product taste by sharpening what "good output" really means, and most importantly, how to scale this method across teams and decisions. The biggest shift for me was reframing evals not as a benchmark to clear, but as a strategic moat—core to how your product learns, evolves, and differentiates. As someone from a non-technical background, I could still grasp the concepts (even if the code got heavy at times). The community around the course is a major bonus—full of helpful discussions, fresh perspectives, and constant knowledge exchange. The guest lectures were especially valuable, showing how companies apply these ideas in the wild, and how they tailor their evaluation frameworks to suit specific needs and constraints. I’d highly recommend this course to anyone building with AI—especially those who want to go beyond shipping models to shaping real-world, high-trust outcomes.
Soothien HealthTech Advisory
A must for any developer or PM building AI products.
I’m a physician and have built health tech and health AI solutions, but I’m not overly technical. This course was eye-opening about the importance of AI evaluations. It’s a must for any developer or PM building AI for enterprise or regulated industries. This is what will make AI products reliable. Hamel and Shreya are amazing, and so are their top-notch guest lectures. I took this course because I wanted to learn from the industry leaders actually doing the work. You’ll learn the entire process of building AI evaluations, not just by reading, but also by doing. This is the technical component. Using Windsurf and Claude, I was able to complete it even though I don’t code as part of my main job. It’s well worth the effort. The course is dense, especially if you do not code or are unfamiliar with statistics; my background in medicine and healthcare statistics helped me understand some of the core concepts. Overall, this is an amazing course and an essential skill set for building AI applications in healthcare or enterprise settings. I’m recommending it to all my colleagues.
Machine Learning Engineer | Co-Founder at HazAdapt
Good course if you want to build products people actually trust.
Coming from recommendation systems and a UX background, I knew a specific kind of evaluation: I'd run some A/B tests, check a few metrics, and call it good. But my approach to AI evals was completely naive. I used no systematic method and hoped things would work. This evals course gave me the structure I was missing. The Three Gulfs framework explained why I kept unknowingly failing: we don't understand our data (Comprehension), we write vague prompts (Specification), and models behave unpredictably on real inputs (Generalization). The analyze-measure-improve cycle felt familiar from UX research but applied to AI. Instead of guessing what's broken, you look at failures first, build automated evaluators, and then make targeted improvements. This creates a flywheel where each cycle makes your product better. Learning about others' LLM production failures was a huge plus of this course, e.g., hearing about VLMs giving different results 18/55 times at temperature 0, and how Shreya's model cascades cut her costs by 50%. Successful AI products need humans to regularly review outputs. There's no way around it. Good course if you want to build products people actually trust. Evaluation separates demos from deployments.
Technology Director, Wells Fargo
A fantastic course offering an in-depth practical approach to evals.
Founder, Supago Inc.
This course completely transformed my approach to building AI applications.
AI Consultant
Absolutely recommend this course to anyone building AI applications
Senior Product Manager at Redfin
Error analysis (and this course) is all you need
Error analysis is all you need. This is the idea that gets drilled into your head over and over again in the AI Evals course. It's so simple, but it's profound... and it's actually way more complicated than you think when you start to consider multi-turn conversations, retrieval systems, agentic systems, multimodal inputs, and more. Shreya and Hamel have distilled the state of the art in AI evals (and often in development itself!) in this amazing class. Some of my favorite highlights:
- Build a custom data annotation app! I was so intimidated by this, but I finally made the leap and vibe-coded something out in an afternoon. It has 10x'd my ability to review conversations.
- It's okay to do a little pre-thinking around failure modes, but they really should EMERGE from your testing. It's really hard to build LLM judges, so be really thoughtful about what you build them for.
- Often, the biggest impact comes from talking disagreements out and figuring out why there is a disagreement in the first place: are your goals unclear? This seemingly technical course has made me a better PM.
- And finally, folks in the course just know every AI tool out there. I learned about WhisperFlow and my workflow for typing has changed!
Founder, Searchkernel LLC
This course is the best place to learn evals
Independent AI Engineer
An essential resource for engineers & PMs
For AI Builders Hoping LLMs Will Fix It All
This course has provided an exceptionally clear and systematic framework for approaching LLM evaluation. The comprehensive introduction to the Analyze-Measure-Improve lifecycle, alongside the detailed exploration of the Three Gulfs Model (Comprehension, Specification, Generalization), significantly deepened my understanding of the challenges inherent in building effective LLM pipelines. Particularly impactful was the practical guidance on error analysis: learning how to systematically categorize failure modes using open and axial coding, then translating qualitative insights into robust quantitative metrics. The deep dive into Automated Evaluators, including both Code-Based and LLM-as-Judge evaluators, was especially valuable. Learning how to craft strong judge prompts and rigorously validate them using training, development, and test sets to ensure alignment with human preferences was eye-opening. The course also provided practical methods for estimating true success rates and quantifying uncertainty, which is vital for understanding actual pipeline performance beyond raw observed scores, along with advice on designing efficient human review interfaces that significantly enhance labeling throughput. Most importantly, this course illuminated a critical shift in mindset—from traditional software development toward an iterative, human-centric evaluation approach—making it an essential resource for engineers, product managers, and data scientists looking to confidently address real-world LLM evaluation challenges.
Scarlet AI
This course changed how I approach AI projects. Instructors provide great support.
The AI Evals course with Hamel and Shreya changed how I approach AI projects and consulting clients. I’ve picked up practical skills in systematically analyzing model errors and designing meaningful evaluations, making the whole AI dev process clearer. Having access to a private community of experienced AI engineers and direct support from Hamel and the team has been especially valuable—they’re always quick to answer questions or help with real-world problems. Highly recommend this course for anyone building AI products or consulting in the space!
Consultant, Silver Stripe Software
Learn how to put evals into practice. Practical, hands-on instruction.
Computational Linguist at ATENTO
Full Stack Computational Linguist, Bad Idea Factory
Now I can design meaningful evals! Highly recommend this course.
Senior Director, AI
This course is comprehensive in a way that's hard to find elsewhere.
This course is a great place for PMs and engineers to learn practical tactics for building real-world AI applications. I've recommended it to people who want both a starting point and deeper knowledge about evals and implementation. Hamel brings in excellent speakers who share different techniques and insights from some really smart people in AI. Evals are super important, and what I appreciate about Hamel's approach is how he walks through data analysis tactics — this is especially helpful for anyone newer to this kind of evaluation work. Just having evals isn't enough — you need to think strategically about what you're evaluating and your methodology beforehand. With so much out there, even really talented engineers can benefit from having all the key considerations for applied AI building brought together in one place. This course does exactly that — it's comprehensive in a way that's hard to find elsewhere. Hamel and Shreya put a lot of thought into the materials, and I can confirm from my own building experience that this covers the real considerations we're dealing with day-to-day (and have learned over 18+ months of trial and error!) without all the noise and buzzwords.


