Teresa Torres
@ttorres
I'm finishing up the "AI Evals for Engineers and Technical PMs" class taught by @sh_reya and @HamelHusain on Maven. In four weeks I went from kind of knowing what evals were to doing in-depth error analysis and implementing my first round of automated evaluations. If you are looking for a structured and in-depth way to evaluate the quality of your LLM apps, I can't recommend this course enough. I have a whole new appreciation for what it means to build a high-quality LLM-based product. And I can see why so many tacked-on AI features are terrible.
Sunita Parbhu

CEO, Fern AI, AI for Legal

This course is worth the time. Take it.

We've been building evals for over a year. We found this course invaluable for determining where we could improve our process, identifying tools and resources, and engaging with others in the community. Shreya and Hamel have produced a top-notch course that's worth every minute of time invested.
Brian Chase

Hardware Engineering Leader at Cisco

This course is a game changer.

This course was a game-changer for me. My biggest takeaway was learning a structured approach to system traces, which has given me a reliable framework for making meaningful progress. The hands-on content was fantastic; I learn best by doing, so I truly appreciated the practical exercises. I now have the 'flywheel' I was missing to move forward with my own app development. I highly recommend this course and look forward to even more hands-on content in the future!
Adi Pradhan

Founder, Socratify

1000x ROI

Taking a structured approach to evals is a game changer. Shreya and Hamel are teaching a skill with 1000x ROI in the age of AI. At Socratify, we're building a career coach that sharpens critical thinking skills through debates on business news and other topics. It's inherently challenging to ensure high-quality LLM interactions, and going through error analysis has been transformative for the product development process. I can't wait to release the next version! I would absolutely recommend this course to any founder working with LLMs.
Daniel Roy Greenfeld

Author and Principal at Feldroy, LLC / Software Artisan at Kraken Tech

Pragmatic techniques, free of jargon.

What I learned were practical techniques for expediting quality improvements in AI applications. We were taught methodologies based on straightforward metrics that keep humans in the loop to ensure the quality of results. Hamel and Shreya were quite good at explaining every term with real-world examples drawn from experience; they didn't load the course with jargon. The homework exercises were challenging yet achievable, and it was fun and educational to get the work done. I recommend the course to anyone who wants to learn incredible tricks and tips for building AI applications.

Forrest McKee

Data Scientist

Tools to quantitatively improve your AI product

Hamel and Shreya do such a great job at equipping you with the tools to quantitatively improve your AI product. This is a must-take course for anyone working with LLM-powered applications.

Constanza Schibber

Data Scientist

Course Instructors Went Above & Beyond

As someone with prior experience designing human evaluations and developing metrics for a specific product, I took this course to broaden my understanding of AI evaluation practices, especially for agentic systems and RAG, as well as to deepen my knowledge of evaluation infrastructure such as CI/CD and trace review interfaces. This course delivered far more than I expected.

It includes a comprehensive course reader that could stand on its own as a reference book, live classes packed with hands-on examples, and over 10 guest speakers who shared practical insights into evaluation strategies and even how to build your own evaluation tools for different use cases.

What really set the course apart was the level of support. Hamel and Shreya were incredibly supportive throughout the course. They hosted office hours, thoughtfully answered every question on Discord, and even brought in two experienced professionals to offer additional hands-on support and help with (optional) homework. They went above and beyond to make sure everyone was learning and participating.

I also really appreciated hearing from other students about the evaluation challenges they were facing in their own work, and watching Hamel and Shreya think through solutions with them in real time was just as educational as the prepared content.

Highly recommend this course if you're working on or even adjacent to LLM applications. Whether you’re focused on product quality, engineering, or research, you’ll walk away with frameworks, tools, and best practices you can use right away.
Wayde Gilliam

"If you are building with AI, you need this course!"

Skylar Payne

Founder, Wicked Data LLC

"Take this course to go from a good to a great AI Engineer!"

Isaac Flath

Owner at Kentro Tech LLC

"Practical techniques rarely taught elsewhere. Highly recommend!"

Adam Dadson

GTM @ OpenAI

I've been spending time learning Evals for AI in the past few weeks, and throughout the process, what I've really started to understand is the impact that systematic evals can have on dramatically improving LLM output. If you're curious to learn more and upskill yourself when it comes to driving better model responses, check out Shreya and Hamel's incredible course on the power of Evals. The next cohort begins July 21st: lnkd.in/gVCJk-WC
Alex Elting
@alexelting
I have been tinkering around with LLMs for a few years now as a software engineer. I realized the potential of LLMs early on, but one worry I had for LLM-powered software was just how are you supposed to test things? You can't just unit test English (usually)
Jasmine Robinson

Senior Technical Program Manager, Netflix

This course helps you get expected outcomes from your AI

A colleague reached out to me and recommended “AI Evals For Engineers & PMs,” offered by Hamel H. and Shreya Shankar. I consider myself an eternal learner, and I knew evaluations were a critical yet often overlooked component of successful GenAI implementation. Everyone keeps asking me how to stay ahead of GenAI. Well, you take classes like this one so you can be on the cutting edge of how to ensure you get the expected outcomes from your future AI agents. It was so dense with useful information and guest speakers that I honestly couldn’t keep up, but after the course is over, you continue to have access to the recordings.
Jeroen Latour

FinTech at Booking.com

I’m currently taking the Maven course AI Evals for Engineers & PMs by Shreya Shankar and Hamel H. My five biggest take-aways so far:

1. *Evals turn chaos into clarity* – LLMs are unpredictable. Evals give you a repeatable way to measure what matters instead of chasing bugs one by one.
2. *Correctness = your product definition* – What counts as “good” depends on your product, not on a generic benchmark.
3. *Fix specs before measuring* – Some errors come from unclear prompts or vague product goals (what the course calls the “Gulf of Specification”). In those cases, sharpen the prompt or definition before investing in evals.
4. *LLM judges need judging* – Using one LLM to evaluate another can work, but only if you validate it against human experts and refine the criteria.
5. *AI evals are the new core skill* – They don’t just measure accuracy; they help shape product roadmaps. This is fast becoming a must-have skill for PMs and builders.

Outside of my day job at Booking, I’m working on a side project to make EU lobbying more transparent with LLMs. The course is already helping me think about how to design evals that keep me honest. I’d recommend this course to anyone building with LLMs—especially PMs, engineers, or anyone responsible for shipping AI products. Next cohort starts Oct 6, with recorded sessions: bit.ly/470obaL


Get 25% off our next cohort!

Enroll Here


Juan Maturino

Software Engineer at Edua

Removed a malicious system prompt and reversed falling engagement—user interactions increased.

Before this course, my instinct was to jump straight into axial coding. That meant I leaned heavily on my own presuppositions about what failures I thought would show up. By doing that, I was blind to unexpected issues. It’s like hearing about someone before meeting them—you imagine who they are, but until you actually meet them, you don’t see the full picture. With data products and LLM pipelines, the same thing happens.

Take a healthcare chatbot as an example. Going in, I assumed failures would only be factual: did it answer the medical question correctly? If I jumped straight into axial coding, I’d only tag factual errors and conclude the model was nearly flawless. From that narrow view, I might even think the product was destined for massive success.

But after this course, I learned to take a step back and examine the data without presuppositions. By looking at traces more openly, I discovered a hidden failure mode: the chatbot was mean. It was calling people “fat,” “ugly,” “stupid,” and generally creating a hostile experience. No factual errors—just a terrible user experience. This was something axial coding alone, or automated LLM-as-a-judge evaluation, would have missed without prior human review.

Digging deeper, I found the root cause: a disgruntled former employee had slipped “be mean when answering” into the system prompt. Once we fixed that, user engagement improved dramatically. The key lesson I took from the course is that real error analysis starts with open coding and direct observation. Skipping that step leaves you blind to the most important problems.

Hima Tk

Lead PM - AI / ML Products at CultureAmp

Turned costly trial-and-error into a data-driven plan that avoided massive retraining and prioritized fixes.

I worked with a supermarket chain to build an AI system that could count inventory from shelf photos. At first, the system struggled with issues like blurry images, background clutter, and confusingly similar packaging. Before this course, my approach would have been driven by intuition and trial-and-error. I might have looked at a handful of errors, jumped to a conclusion like “the model is just bad at distinguishing Coke cans,” and proposed a vague fix such as retraining with thousands of new images. That would have been expensive, slow, and unfocused—and it might not have solved the real problem, like blurry photos from staff.

After this course, my approach is now structured and data-driven. Instead of guessing, I use error analysis to diagnose issues systematically. I start by gathering a representative failure set and tagging images to capture why errors occur—blurry images, poor lighting, occlusion, similar or new packaging, unusual angles, background clutter. From there, I group these into a taxonomy of failures and calculate how much each category contributes to overall errors. This creates a prioritized roadmap for improvement.

For example, when Image Quality and Similar Classes accounted for 75% of failures, I could recommend high-impact, targeted fixes: improve photo capture guidelines and augment training data with blurred images for the first, and collect more Diet Coke vs. Coke Zero examples for the second. Instead of vague trial-and-error, I now have a clear, quantitative path to better results.
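The tally-and-prioritize step described above can be sketched in a few lines of Python. The tags and counts here are hypothetical examples for illustration, not data from the actual project:

```python
from collections import Counter

# Hypothetical open-coding tags for a sample of failed shelf photos.
failure_tags = [
    "blurry_image", "similar_packaging", "blurry_image", "occlusion",
    "similar_packaging", "blurry_image", "background_clutter",
    "similar_packaging", "blurry_image", "poor_lighting",
]

counts = Counter(failure_tags)
total = len(failure_tags)

# Sort categories by how much each contributes to overall errors,
# yielding a prioritized list of what to fix first.
for tag, n in counts.most_common():
    print(f"{tag}: {n}/{total} ({n / total:.0%})")
```

The same pattern scales to hundreds of tagged traces; only the list of tags changes.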

Margarita Fakih

Business Operations and Development

Saved me hours of rewriting by creating a reusable framework that prevents repeated AI errors.

As a product manager, I often struggled with inconsistencies in user stories generated by AI tools. Even when my prompts were clear, the outputs would miss key requirements or include irrelevant details. Before this course, my instinct was to keep tweaking the prompt through trial and error until I got something usable. While that sometimes worked, it was inefficient and didn’t explain why the model was failing.

After this course, my approach is much more systematic. I start by defining the key dimensions of a good user story—clarity, completeness, alignment with acceptance criteria, and the right level of technical detail. Then I collect flawed outputs and apply open coding to label issues like “missing acceptance criteria,” “misinterpreted intent,” or “overly generic details.” From there, I build a taxonomy of failure types, which lets me organize and prioritize problems. Finally, I design a feedback loop: the LLM generates a user story, checks it against the taxonomy, and revises if any known issues are detected.

Instead of wasting hours on one-off fixes, I now have a reusable framework that scales across projects. What was once frustrating trial-and-error has become a structured, repeatable process for improving quality.
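The check-and-revise loop outlined above could be wired up roughly like this. Everything here is a hypothetical stand-in: the taxonomy entries, the marker-based checks, and the revision step, which in a real pipeline would be LLM calls rather than string matching:

```python
# Hypothetical failure taxonomy: each known issue paired with a cheap check.
# A real system would use an LLM judge instead of substring tests.
TAXONOMY = {
    "missing acceptance criteria": lambda s: "Acceptance criteria:" not in s,
    "missing user role": lambda s: "As a" not in s,
}

def detect_issues(story: str) -> list[str]:
    """Return the taxonomy issues detected in a generated user story."""
    return [issue for issue, failed in TAXONOMY.items() if failed(story)]

def revise_until_clean(story: str, max_rounds: int = 3) -> tuple[str, list[str]]:
    """Run the generate -> check -> revise loop until no known issues remain."""
    for _ in range(max_rounds):
        issues = detect_issues(story)
        if not issues:
            break
        # Stand-in revision: a real system would re-prompt the LLM with the
        # detected issues. Here we just append a scaffold section.
        if "missing acceptance criteria" in issues:
            story += "\nAcceptance criteria:\n- TBD"
    return story, detect_issues(story)
```

The point of the sketch is the shape of the loop: detection is driven by the taxonomy, so every newly coded failure mode becomes a reusable check.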

Júlio Paulillo

CRO @ Agendor

I turned scattered agent errors into prioritized fixes, enabling focused, measurable improvements.

Building a personal assistant for salespeople is my day-to-day work. One of the tools the agent uses fetches activities from the CRM, but I noticed the LLM sometimes hallucinated—passing unnecessary arguments when calling the tool. Before this course, I would have gone straight into prompt engineering, rewriting tool descriptions or adding more examples to try to fix the issue.

After this course, my approach is different. I start by defining key dimensions such as user persona, intent (e.g., “fetch activities”), and activity type (past due, finished, pending). From there, I can ask an LLM to generate tuples from these dimensions, giving me a structured way to build a synthetic eval dataset. If traces of user interactions are already logged, I filter by intent and begin open coding the different failure modes I see. After reviewing dozens or even hundreds of examples, I then use an LLM to help categorize the failures. This lets me prioritize the categories that matter most and focus fixes where they’ll have the biggest impact.

Instead of reactive prompt tweaking, I now have a systematic framework for diagnosing failures and improving my assistant in a repeatable way.
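The dimension-tuple idea described above is easy to mechanize. A minimal sketch, with made-up dimension values standing in for a real product taxonomy:

```python
from itertools import product

# Hypothetical dimensions for a CRM sales assistant.
personas = ["account_executive", "sales_manager"]
intents = ["fetch_activities"]
activity_types = ["past_due", "finished", "pending"]

# Each tuple seeds one synthetic eval query, e.g. by prompting an LLM:
# "Write a realistic user message for persona {p} with intent {i} ..."
eval_tuples = list(product(personas, intents, activity_types))
print(len(eval_tuples))  # 2 * 1 * 3 = 6 combinations
```

The Cartesian product guarantees coverage of every dimension combination, which is the property that makes the synthetic dataset structured rather than ad hoc.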

Tatyana Kazakova

QA Engineer :) at Qazaco

Turned random fixes into a repeatable process that improved the whole system and proved changes actually worked.

Before this course, I would just fix issues as I spotted them—tweak a prompt here, change a setting there—and hope the next run looked better. Sometimes it worked, but I never had the full picture of what was really going wrong or how often certain problems appeared.

After this course, I’ve learned to slow down at the start: define what I actually want to measure (relevance, completeness, context handling), collect a solid set of examples, and trace where errors first start to show up. From there, I group similar issues into clear failure types, which makes patterns obvious and helps me prioritize what to fix.

Now the process feels less like random whack-a-mole and more like a structured, repeatable system. Instead of chasing one-off issues, I can improve the whole system and know whether the changes are actually working.

Andrew Chaffin

CEO at Argo Analytics

Structured error analysis gave me a clearer method to iterate and actually get the results I needed.

A while back, I used an AI writing assistant to draft a personal statement for a fellowship. I gave it a detailed prompt with my goals, values, and experience, but the output was generic and missed the emotional tone I wanted. At first, I just kept rephrasing the prompt, hoping it would eventually get it right. Instead, it swung between being too formal or inventing details I never mentioned. It was frustrating, and trial-and-error didn’t get me far.

After this course, I’d approach it completely differently. I’d start by defining what “good” means for the task—tone alignment, factual accuracy, and personal relevance. Then I’d collect flawed outputs and open code them: did the model invent details, ignore parts of the prompt, or lose the emotional tone? From there, I’d build a taxonomy of failures—like hallucination, tone mismatch, or misunderstanding the prompt—and use it to spot patterns. Maybe I’d realize the model struggles when the prompt is too abstract or lacks emotional cues.

Compared to my old approach of hoping a better version would show up, this gives me a clear, methodical way to iterate. It turns what used to be trial-and-error frustration into a structured process for actually getting the results I need.

Amol Shah

Head of Product at Count

I can now pinpoint errors and measure reductions in each error bucket—turning guesswork into measurable improvement.

When I first built a small chatbot to recommend books based on user mood, it often gave wildly off-base suggestions—like pairing someone “feeling nostalgic” with a cutting-edge tech thriller. Back then, I just tweaked the prompt or guessed at what the model might “understand” about mood. It was trial and error with no clear sense of what was actually going wrong.

After this course, I’d tackle the problem systematically. I’d collect failures by running the bot across a fixed set of test prompts and logging every mismatch. Then I’d open code the bad outputs—labels like “misread tone,” “genre bias,” or “keyword fixation.” From there, I’d define key dimensions of failure (emotional alignment, genre diversity, keyword vs. context) and group them into a taxonomy, like “semantic misinterpretation.” By quantifying how often each type occurs, I’d know where to focus first.

Armed with that data, I could design targeted fixes: refining prompts with explicit mood-to-genre mappings, adding checks for emotional themes, or diversifying candidate genres. Instead of hacking prompts by gut feel, I’d have a transparent, repeatable process that shows whether error rates are actually dropping.

Lada Kesseler

Lead Developer at Logic20/20

I can now predict and prevent code quality issues instead of treating them as isolated bugs.

I often ran into code quality issues when using AI assistants, but I didn’t have a structured way to make sense of them. Before this course, I would just label outputs as “messy code” without really digging into the underlying problems.

After this course, I now analyze them systematically across dimensions—things like hardcoded tests, long methods, poor formatting, bad naming, poor architecture choices, duplication, dead code, or ignoring available quality tools. By open coding these issues and building a taxonomy, I can see patterns emerge instead of treating each problem as random or isolated.

The key shift for me is realizing these aren’t one-off mistakes but systematic failure modes that appear under specific conditions. With that understanding, I can both predict and prevent quality issues, rather than just reacting to them after the fact.




Maruti Agarwal

Expert AI Research Scientist at Datasite

Gained clarity on what to fix first, transforming my entire approach to evolving the system.

I applied what I learned the very same day we covered error analysis. I was working on an industry classification system and followed a structured process: I asked annotators to provide detailed feedback on wrong predictions, reviewed their notes to improve annotation quality, then parsed all the feedback and used ChatGPT to categorize it into six major error patterns. Finally, I shared those patterns and error percentages with stakeholders.

After this course, error analysis feels much more structured. Instead of just collecting feedback in an ad hoc way, I now have a clear method that gives me visibility into what problems matter most and what to solve first. It’s changed how I think about evolving the system overall.

Sergio Soage

AI R&D Lead at Diligent

I built a structured understanding of failures, yielding actionable insights instead of whack-a-mole fixes.

Now I understand how to systematically explore the problem space, identify patterns across multiple failures, and build a structured understanding of why and when the system fails - not just that it fails. This leads to more actionable insights for improvement rather than playing whack-a-mole with individual issues.

Karen Lam

Product Design

I gained clarity and confidence to systematically narrow the gap between AI failures and human understanding.

I’m a product designer with no prior AI Evals experience. Before this course, when I encountered unexpected or confusing results from the Recipe Bot in the first homework, my instinct was to just iterate on the system prompt in Cursor and manually test through the UI.

After this course, I’ve learned there’s a more systematic way to approach error analysis. Using open and axial coding, I can narrow the gap between AI system failures and human understanding through a step-by-step process. I especially appreciate that this framework is grounded in social science research practices like coding data and building taxonomies—and that it emphasizes doing the analysis manually to ensure accuracy, rather than offloading it entirely to AI.

I also see the value in wearing both the data scientist and product manager hats: questioning the data rigorously while bringing product knowledge into the decision-making. This approach gives me a structured, repeatable way to analyze failures instead of ad hoc trial and error.

Juan Maturino

Software Engineer at Edua

I stopped endless prompting and now systematically document failures to improve outcomes and efficiency.

In automated agentic code generation, I often ran into situations where the desired output was far from what the model produced. My old approach was to keep prompting the LLM until progress stalled, then spin up a new chat with a rephrased prompt and updated context. Eventually I’d accept whatever was “good enough” and finish the task myself.

After this course, I understand why that approach was limited. Evaluating code has two axes: reference-based (objective tests like unit tests) and reference-free (qualitative measures of style, readability, and design). Code isn’t just functional—it’s also expressive, like writing prose—so both dimensions matter.

Now, instead of endless prompt tweaking, I document failures in short form through open coding, then group and categorize them using axial coding. This helps me identify common failure patterns in the LLM’s output and design more robust system prompts targeted at those issues. What used to be trial-and-error guesswork is now a structured process for improving both the reliability and quality of generated code.

Chris McDonald

AI Team Leader at Comtrac

I now have the clarity and confidence to diagnose failures instead of ‘living on a prayer’.

At work, we use prompts and prompt engineering to turn selected inputs into specific outputs. Before this course, whenever I ran into unexpected results, my approach was to jump straight into the prompt and randomly change words until something worked. After a few tries, I might even hand the prompt, input, and output to an LLM and ask it to fix things. There was no hypothesis, no structure—just living on a prayer.

After this course, I have a far more systematic approach. If I encounter a problem now, I’d begin by collecting an initial dataset of around 100 traces. From there, I’d perform open and axial coding to build a taxonomy of failures. That structure gives me clarity about what’s really going wrong instead of just chasing random fixes.

What stands out to me is that the processes in this course are simple—not in the sense of easy, but in being concise and straightforward while still requiring real effort and understanding. As Richard Feynman said, “If you can explain something in simple terms, you understand it well.” That’s exactly how Hamel and Shreya have designed this course, and I’m grateful for it.

Roey Ben Chaim

Staff Engineer at Zenity

I can now pinpoint agents' core failures, turning vague vibes into clear, actionable fixes that improve agent performance.

The axial coding just hit different. Before this course, my approach to failures was more of a “vibe investigation,” poking around without a clear structure.

After this course, I now cluster failures systematically and trace them back to their core issues. Grouping the errors into meaningful categories and quantifying them makes it much easier to see the main failure points. I finally feel like I have a proper way to identify the root problems in my agent instead of just guessing.

Ben Eyal

Research Engineer at Ai2 Israel

I gained clarity to find root causes and stop repeated agent confusion.

At work, we’re building Paper Finder, which (as the name suggests) should find papers. We wanted the agent to refuse certain requests so people wouldn’t treat it like a free ChatGPT. But we kept running into a strange behavior: the agent would refuse, ask the user a clarifying question, the user would reply “yes,” and then the agent would have no idea what they were talking about.

Before this course, we would have just dug through the logs, checked for crashes, and treated it like any other bug.

After this course, I’d handle it differently. I’d look closely at the traces of these failures, identify common patterns, form a hypothesis about why it was happening, and then test it systematically. In this case, the real issue was that history wasn’t being shared between two components: one asked the question, the other just saw “yes” with no context. By approaching it through error analysis, the root cause becomes clearer and easier to solve.

Annu Augustine

Founder, Product Coach at NedRock

Open coding gave me clarity into the model's real behavior, revealing failures my framework missed.

When I built a custom GPT for product managers to help write better user stories, I initially jumped straight into axial coding. I predefined categories of failure based on the INVEST framework (Independent, Negotiable, Valuable, Estimable, Small, Testable), which I often use when coaching teams. At the time, it felt like a solid, practical approach grounded in real-world product work.

After this course, I started applying open coding before forcing outputs into predefined boxes. That shift revealed patterns the INVEST framework would have completely missed. For example, some stories were overly complex even though they technically met the “Small” criteria, and others ignored edge cases or real-world exceptions not covered by INVEST at all.

Open coding gave me a clearer picture of how the model was actually behaving, rather than bending its outputs to fit categories I had assumed upfront. It’s a far more reliable way to uncover the real failure modes.




Jack Shaw

Co-Founder at Comprendo

Open coding gave us clarity on true error patterns, preventing overconfidence and costly misclassification.

Before this course, I didn’t fully appreciate the risk of skipping open coding. It’s easy to take a small sample, jump straight into categories, and gain false confidence in themes that don’t actually reflect the full range of errors. That’s the “when you only have a hammer, every problem looks like a nail” trap—imposing categories that miss important failure modes.

After this course, I see why open coding matters. It prevents premature categorization, helps me understand saturation, and surfaces the true diversity of errors. I’ve also learned to think more carefully about how evaluation rubrics should be designed. For some products, a “benevolent dictator” works—if one person truly has holistic expertise across every stage of the workflow. But for more complex systems, multiple experts are needed, each contributing perspective from their domain.

In my past work reviewing clinical trial protocols, no single reviewer understood every dimension—ethics, study design, and biostatistics each required deep, specialized expertise. The lesson from this course is clear: open coding reveals the real error space, and evaluation rubrics are strongest when designed with the right balance of expertise.

Richard Ng

Product Manager, Analytics & AI at Axi

Sanity checks turned unreliable scores into business-aligned predictions I could trust.

I built a churn prediction model for a subscription service and evaluated it using standard metrics like accuracy, precision, and recall on a test dataset. At first, the high evaluator scores looked promising, but they gave me a false sense of confidence. In reality, the model was overfitting, producing outputs that didn’t even add up logically—for example, reporting fewer new onboarded customers than the combined total of retained and churned customers.

Before this course, I relied too heavily on evaluator scores, only realizing something was wrong when results felt “too good to be true.” I had to manually compare predictions with business reports and historical trends to uncover the discrepancies.

After this course, I know how to approach it differently. I would run cross-validation across multiple folds to confirm stability, add domain-specific sanity checks (like validating customer balances against business logic), and bring in qualitative stakeholder input. These practices create a stronger evaluation process—less dependent on raw metrics and more aligned with real-world trustworthiness.
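A domain sanity check like the one described above can be as simple as asserting the accounting identities a churn report must satisfy. The field names and figures below are hypothetical:

```python
def churn_report_issues(report: dict) -> list[str]:
    """Return business-logic violations in a predicted churn report.

    Catches nonsense that aggregate metrics (accuracy, precision, recall)
    can happily score well, such as totals that don't reconcile.
    """
    issues = []
    # Every customer present at period start is either retained or churned.
    if report["retained"] + report["churned"] != report["customers_at_start"]:
        issues.append("retained + churned != customers_at_start")
    # End-of-period total must follow from retention plus new sign-ups.
    if report["customers_at_end"] != report["retained"] + report["new_onboarded"]:
        issues.append("customers_at_end != retained + new_onboarded")
    return issues

# Example: a report whose totals don't add up, like the one in the story.
bad = {"customers_at_start": 1000, "retained": 900, "churned": 100,
       "new_onboarded": 50, "customers_at_end": 970}
print(churn_report_issues(bad))
```

Checks like these run in milliseconds and catch exactly the class of "too good to be true" outputs that raw evaluator scores miss.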

Puja Nanda
@pujananda_
Just wrapped up Evaluating AI Systems by @HamelHusain & @sh_reya - practical & production-focused for devs, PMs & product leaders. Key takeaways: start w/ well-scoped prompts, own the annotation process, and keep evals specific + monitored. Course link: maven.com/parlance-labs/evals?promoCode=testimonial-c2-81&ajs_uid=569133
Anja Buckley
@anjabuckley1984
I just finished the Maven course AI Evals For Engineers & PMs. The mix of live sessions, office hours, and recorded material was perfectly balanced. The expertise of @sh_reya and @HamelHusain was truly impressive. x.com/anjabuckley1984/status/1956645831794115003/photo/1
George Job Vetticaden

VP of Products, AI Agents

I enrolled in Hamel H. and Shreya Shankar's course "AI Evals For Engineers & PMs" (lnkd.in/ghh3Yk3e), and two weeks in, it's already changing how I approach building agents.

The timing couldn't be better. Kevin Weil (OpenAI's CPO) recently said: "Writing Evals is going to become a core skill for product managers." He's absolutely right—but here's what I discovered: there is tons of content on agent architectures and basic evals, but almost nothing on evaluating real multi-agent systems. That gap is what drove me to immediately apply the course learnings to my multi-agent health system.

There have been so many insights and aha moments over these last two weeks that I wanted to share them with my community. The course taught me to build a "vocabulary of failure"—but what really clicked was creating a complete loop where everything connects. When you spot an issue, an Evaluation Agent helps you turn it into a test case. LLM-as-Judge doesn't just score it—it diagnoses root causes and prescribes exact fixes. Claude Code takes those prescriptions and automatically implements them. Then the same test runs in your CI/CD pipeline to verify the fix and prevent regressions. Every step feeds the next, turning evaluation from a one-time check into a continuous improvement engine.

I've captured everything—from the methodology and tools to real examples showing how a failed health query becomes an automated fix.

🎥 Watch the technical demo (attached)
📝 Read the deep dive blog: lnkd.in/g5ArvB3V
💻 Explore the code: lnkd.in/gGsNu7pZ

#AIEvals #MultiAgentSystems #AIAgents #ClaudeCode #ProductManagement #Anthropic #DeveloperTools #AgenticDevelopment
Kranthi Kiran

AI Engineer, Vantager

Practical techniques that generalize regardless of the tools you use.

This course provides a great take on building reliable AI applications. It teaches practical techniques while developing intuition for evaluating LLM based systems. What sets it apart is its tool agnostic approach. Rather than focusing on specific platforms, it emphasizes systematic and scientific principles that apply everywhere.
Marek Šuppa

Principal Data/AI Scientist/Engineer, Slido/Cisco

This course teaches material you can't find anywhere else. Investing in this course is a no-brainer.

"Why would a Principal Data Scientist take a course on evals? Shouldn't they know this already?!" Fair question. Here's why I think it's still worth it: 1. Learn from the best. LLM evals are still nascent, so learning from people doing this full-time across multiple contexts is invaluable. Game recognizes game, and as you'll learn in the very first week, Shreya and Hamel are top-tier. 2. Get the full picture. Evals are more art than science right now. Getting a coherent view of best practices and mature end-to-end pipelines designed from first principles is rare. Their course reader alone is worth multiple times the price. 3. Build common vocabulary. If you're building impactful LLM products, you'll collaborate with PMs. Having both technical folks and PMs in sessions creates a shared language that bridges the gap -- something you can't find anywhere else for this topic. In other words, whether you're a PM, a Principal, or a vibe coder building with LLMs, this course is simply a no-brainer.

Rich Heimann

Director of AI

This course helps you transform guesswork into actionable insights.

Evaluating generative AI often relies on abstract benchmarks disconnected from real-world outcomes and detached from practical experience. To bridge this gap, many rely on subjective impressions or “vibes” (i.e., the eye-test). The eye-test is important since evaluators directly interact with the model in realistic contexts. However, vibes and qualitative evaluations are not particularly helpful in evaluating application-specific performance, consistency, bias, reliability, security, or return on investment. In contrast, application-specific evals reflect an essential day-to-day operational focus. They aim to assess whether a specific pipeline performs successfully on a particular task using realistic data. This course is an important step in transforming guesswork into actionable insights. Application-specific evals are not sufficient, but they are necessary and so often overlooked. Check out this course. It covers a lot of important terrain.


Get 25% off our next cohort!

Enroll Here


Jonathan Sarker

Machine Learning Engineer

Highly recommend this course.

Hamel's and Shreya's course, "AI Evals for Engineers and PMs," has been a great resource for learning how to tame the scary wild world of LLM-based applications. It was fascinating to see how high leverage it is to become "one with the data" and how to do that explicitly for LLMs and agents. Hamel's and Shreya's vast combined expertise is clearly shown in the course's lectures, practical exercises, and even a textbook. On top of that, the guest lectures provide even more gems of practical wisdom. I'd highly recommend this course to anyone serious about learning how to improve their LLM-based AI products.
Ajaykumar Rajasekharan

Senior Director of Machine Learning, SponsorUnited

An absolute must. Valuable for any AI engineer and product manager.

The AI Evals Course by Shreya and Hamel is an absolute must for everyone serious about taking AI applications into production. I have been following Hamel's and Shreya's work for quite some time, and it was really awesome to learn from them all the concepts of error analysis, measurement best practices, LLM-as-Judge and how to make sure it is reliable with human evaluations, collaborative analysis of errors, evaluation of multi-turn chats, creation of datasets for CI/CD, and more. The last topic, on accuracy and cost optimization, is really useful, as we are seeing in our applications when scaling. All in all, this is an amazing set of vital information that is valuable for any AI engineer and product manager. Highly recommend this course to everyone.

Alan Chang

AI, Machine Learning, and Biology @ Stanford University

Some quick reflections after wrapping up the quite enjoyable AI Evals for Engineers and Technical PMs course led by Hamel H. and Shreya Shankar. This is the second course co-taught by Hamel that I've taken, following the LLM Finetuning course that became an amazing mini-conference that expanded far beyond finetuning. Course highlights: 1. Immediately actionable advice: the strategy and demonstration of open coding approaches for evals, how to create a custom human eval interface and why, in addition to great case studies of AI evals in action. Shout out to the course reader, which is a great companion to the course and a solid reference moving forward. 2. The material was very tightly focused - Hamel and Shreya did a great job of distilling key concepts and approaches, intentionally introducing only a few frameworks. 3. Evolving homework assignments (even though I regrettably was too busy during the course to engage with the homework as I would have liked) that paralleled the main material in a very complementary manner. I got a lot of inspiration for how to approach a few personal projects that I have in mind and how to approach evals in the many LLM-related projects in the lab that I'm a part of. 4. Guest speakers that covered a broad array of additional topics of interest / specific use cases. I particularly enjoyed the talk on Reasoning Models & LLM-as-a-Judge with Alex Volkov. You'll be able to find the course on Maven at lnkd.in/gSPqf_Bk - you'll probably get more out of it if you are actively working on an LLM-based product or project that already has many generated outputs (or can get them soon).
Maxime Lelièvre

Visiting Researcher @ Columbia | MSc Robotics & Data Science @ EPFL | Passionate about EdTech

Huge thanks to Hamel H. and Shreya Shankar for their "AI Evals for Engineers and PMs" course! I really enjoyed the hands-on approach, particularly the assignments and guest lectures, which provided invaluable real-world application scenarios. A key takeaway for me was the importance of thoughtful evaluation frameworks, especially as AI models become commoditized and specialized for specific domains. I really liked their "Analyze-Measure-Improve" eval loop concept and their emphasis on keeping the human in the loop, especially domain experts. As someone focused on education in Low and Middle-Income Countries (LMICs), I'm genuinely excited by the opportunities AI presents, while also being very aware of the inherent risks. This course truly emphasized how robust evaluations are crucial for navigating both the immense potential and the significant challenges. With models improving in performance and decreasing in cost (as we've seen with our pedagogy benchmark: lnkd.in/dtiUxVWM), I believe that new opportunities are opening up rapidly. I am thinking about how complex pipelines like RAG could be prohibitively expensive to implement and evaluate. I'm currently working on an AI chatbot to support teachers' professional development in Sierra Leone, and learning these evaluation strategies is something that could really make it better and ensure responsible deployment. Highly recommend this course to anyone building or working with AI products! lnkd.in/dgk-Hcv9
Šimon Podhajský

Senior LLM/Data Engineer | 🤖 Evals, ML/AI, Python

In the spirit of learning in public, here's my paean to Maven.com. Two recent sprints really sharpened my evals game: - @Jason Liu's "Systematically Improving RAG Applications". - Shreya Shankar's and Hamel H.'s "AI Evals". Both month-long courses were chock-full of insight, research, and experience. (Not to mention the extra guest lectures, all meticulously chosen to add to the main learning battery, that sometimes came by every. other. day. Sometimes in inconvenient timezones. I'm still in the process of catching up.) What did we learn? The thing is, Jason, Shreya, and Hamel aren't exactly secretive about the key insights: if anything, they shout "error analysis!" and "look at your data!" and "construct eval funnels!" from the rooftops. It's still really rather useful to wrap the slogans into a specific happy path, note the edge cases, and integrate them with insights from other practitioners and prior research. (For example, I'd never heard of grounded theory before, but it's an excellent fallback when theories and taxonomies just aren't jumping out at you from your chaotic annotations.) Oh, and you get an early sneak peek into Shreya's and Hamel's book that ties it all together! All in all: highly recommended.
Bryce York

AI/ML/LLM and UX-centric B2B startup product management leader with a love for zero-to-one product innovation • 12+ yrs PM, 7+ yrs AI/ML, 5+ yrs adtech • Writer/Speaker • Advisor

If you haven't already heard about Shreya Shankar & Hamel Husain's Maven course on evals and you have any plan to build in the LLM space, you're missing out. I'm about to finish the course and I couldn't recommend it more highly. I can't think of a better way for an engineer or technical PM to spend their L&D allowance. It had already paid for itself in the first week and provides a step-by-step playbook to build and continuously improve LLM-powered products. It's given me & my teams a structured process to follow. Even though we were doing a lot of the same things, I had to constantly keep us on track without a framework to point to. Their process is going to become the new de facto process, and what better way to learn and master it than in a live cohort class? Their upcoming book will no doubt be on all of our bookshelves, but if you have the $$ and want to get in early - do it!
ess d

Senior Data Scientist @Amazon

Taking the course "Move beyond 'vibe-checks' to data-driven evaluation" has been a game-changer in how I approach AI development and evaluation. Before this, my team often relied on subjective assessments—what we called “vibe-checks”—to judge model outputs. This course provided a structured, systematic framework that replaces guesswork with measurable, repeatable methods for evaluating AI performance. I learned how to build robust evaluation systems tailored to the unique challenges of AI applications—especially those involving stochastic or subjective outputs. The curriculum walked us through defining meaningful metrics, conducting systematic error analysis, generating synthetic data, and implementing automated evaluation pipelines. I also gained hands-on experience evaluating complex architectures like RAG systems and multi-step pipelines, and I now understand how to monitor models in production with continuous feedback loops. The biggest takeaway is that effective AI evaluation isn’t just about measuring accuracy—it’s about building a comprehensive lifecycle strategy that spans development, deployment, and ongoing improvement. Learning how to prioritize engineering efforts based on real data, rather than hunches, has already helped us optimize both performance and cost in our LLM applications. Would I recommend it? Absolutely. Whether you're an AI engineer, product manager, or data scientist, this course gives you practical tools and frameworks that apply directly to real-world AI challenges. It’s especially valuable for teams looking to move beyond ad-hoc evaluations and establish a collaborative, metrics-driven culture around AI development. In short, this course has transformed how I think about evaluation—from a fuzzy afterthought to a foundational part of the AI lifecycle. I can't recommend it highly enough.




Mark Manolas

Account Director, OpenAI

This course exceeded my expectations.

The course offers hands-on exercises, expert guidance, and practical frameworks, giving you a systematic approach you can apply immediately. I could easily follow along without an engineering background. The biggest takeaway for me was the level of robustness/sophistication that you can build for evals and the impact that it can have! I had very surface level knowledge of evals before this course. The course exceeded my expectations and I would recommend this to my colleagues.
K Vikraman

Data Scientist, Global Innovation Hub

I learned how to evaluate LLM outputs in a structured manner. My biggest takeaway is that the quality of the eval pipeline is critical to ensuring good-quality outputs in the production environment. I will definitely recommend this course to my colleagues and to anyone who is deploying LLMs.
Barbara Graniello Batlle

Self

The AI Evals for Engineers & PMs course, taught by Hamel Husain and Shreya Shankar, is 100% worth the investment. The material couldn't be more up-to-date. There is a perfect balance between theory and real-world hands-on learning. And the office hours and guest speakers are invaluable. I highly recommend it.
Piotr W.

Head of QA | Lead Automation Engineer | AI Evaluation Engineer

I recently took an LLM Evals course led by Hamel H. and Shreya Shankar—and huge thanks to both of them for putting together such a practical experience. 👏 For some time I've been redefining, adjusting, and evolving the role of a QA Engineer to make it more prepared for AI-related projects. It felt natural for us to jump into the role of Evaluation Engineer, as there are many similarities to a QA role. One of the biggest takeaways for me was seeing how to build automated eval setups from scratch without relying on external platforms. That alone was a big eye-opener—turns out, a lot of this can be done in-house if you know what you're doing. Also, the sessions on synthetic data creation were really great. We're already planning to write up internal guidelines based on what we learned and use them in upcoming projects. Maybe even an internal tool to automate the process would be a great idea. Overall, this course connected a lot of dots for me. It showed me how QA skills can evolve into something super valuable in the AI space—especially when it comes to designing evaluation frameworks that actually make sense and bring value. 🚀 lnkd.in/dzrmesqM
Vivek
@feynwarrwen
Learned a ton of useful stuff about the Evals in the course with @HamelHusain and @sh_reya. Before this course, I would try to change the prompt, change the model, or improve the RAG pipeline - all w/o systematic error analysis (my biggest learning). Thx @HamelHusain @sh_reya!
Deep Gandhi
@deepgandhi91
Look at data with different lenses! A very practical approach in the ‘AI Evaluations’ class by @sh_reya & @HamelHusain. Learned hands-on techniques to deeply analyze and improve AI solutions, new ways to spot failure modes I didn’t know before. Putting it all into practice now.
Sergiy Korniychuk

Staff Software Engineer - Full Stack at Sondermind Inc

Recently I took part in a course called "AI Evals For Engineers & PMs" led by Hamel H. and Shreya Shankar. It provided a solid foundation for assessing systems that use AI in various products. My biggest takeaway is the brilliant "three gulfs" mental model: the gulfs of comprehension, specification, and generalization. This model completely reframed how I approach understanding and improving LLM applications. It's incredibly powerful and a practical way to break down where things might be going wrong. First, you see if your LLM even understands what users want. Next, you check if your prompts are clear and specific. Finally, you see if your system can handle new or unexpected cases. It's simple but powerful, and something I plan to apply to every new feature and prompt I work on. Another thing that truly stood out for me is how important trace-based analysis is. Whether you're working with real user data or building synthetic examples, looking at traces is the only way to spot real problems. That's where the real insights come from, and it stops you from wasting time on things that don't matter. This has fundamentally shifted my focus to solving upstream problems. I also really like the idea of starting with code-based evaluators before using LLM-as-Judge - for example, checking if a response is in the correct JSON format is very easy to evaluate in code. This approach advocates saving the more complex, subjective evaluations for later and instead focusing on automating the easy wins first. It's about being pragmatic and effective, not just measuring for measurement's sake. Practically, this course has directly impacted how I operate as an engineer. I now see prompts as carefully crafted specifications and understand that the evaluation process is an iterative, continuous cycle. I'm more convinced now about looking at failure modes first, collaborating with domain experts to define ground truth, and ensuring our evals are simple, durable, and tied directly to customer outcomes. It's no longer 'prompt and forget,' but 'prompt, measure, and refine.' Overall, the course content was exceptional, offering a rare balance of theory and practical application. The instructors were incredibly experienced, and their emphasis on real-world scenarios made everything immediately applicable. They also had a lot of guest speakers with real-life experience - that was really helpful. This has been one of the most beneficial and enriching learning experiences I've had, and I highly recommend it to anyone serious about building robust AI products!




Tsacho Rabchev

Software Developer / Inventor

Just completed "AI Evals for Engineers & PMs" by Hamel H. and Shreya Shankar, and it was one of the most thoughtfully crafted technical courses I’ve taken! Hamel brings a rare blend of depth, clarity, meticulous attention to detail, and patience. His explanations cut through complexity without ever oversimplifying, and his thoughtful approach reflects a deep commitment to helping others truly understand. He’s a natural teacher. What stood out just as much was the uncompromising focus on quality — both Hamel and Shreya clearly care about delivering a meaningful, well-structured learning experience. Their collaboration created a space that was not only technically rich but also engaging and supportive. If you're working with LLMs and want a rigorous, hands-on, reproducible framework for evaluation, this course delivers. I highly recommend checking it out: lnkd.in/g_yGUxFK
Desmond Choy
@Norest
Halfway through @HamelHusain & @sh_reya’s #AIEvals course and it’s already making me rethink how I approach my LLM projects and conduct error analysis. The course has a very high signal-to-noise ratio. Highly recommended!
Benjamin Pace

Data Science @ Candidly

Had a great time taking Hamel H. and Shreya Shankar's course on AI Evals. Out of everything I've seen so far in the space, their approach is among the best and most concrete in terms of actually improving AI products. If you are actively working on AI products / features, I strongly suggest checking it out! AI Evals For Engineers & PMs: lnkd.in/gvgnkYSn
Joel Dean

AI Founder. Tech Entrepreneur. Content Creator. Prime Minister Youth Awardee. Former WEF Global Shaper Curator

🚀 Course Review: “AI Evals for Engineers & PMs” by Shreya & Hamel Just wrapped up this powerhouse training and I’m genuinely impressed. Before the course, LLM pipelines often felt like opaque black boxes, especially once you add multimodal steps. Debugging failures across retrieval, reasoning, and generation was equal parts art and guesswork. What changed for me? 1. Error analysis as a first-class habit Learned a systematic “open coding → axial coding” workflow that surfaces the first upstream failure fast and keeps me focused on the highest-leverage fixes. 2. LLM-as-Judge, done right We went deep on calibrating true-positive / true-negative rates, designing binary metrics that actually move product KPIs, and even building cheap → expensive model cascades to keep eval costs sane. 3. Synthetic data that matters Instead of random tuple generation, we start with failure hypotheses, then let (feature, persona, scenario) dimensions explode exactly where risk is highest. 4. Production monitoring playbook Continuous eval pipelines, concept-drift alerts, and lightweight human-in-the-loop dashboards, critical for enterprise deployments where silent failures mean silent churn. The net result? I can now prove that improvements ship, spot regressions before users do, and sleep better knowing my AI products are measurable, debuggable, and trustworthy. If you’re building or PM-ing production-grade AI systems, especially those serving high-stakes enterprise use cases, do yourself (and your users) a favour and take this course (lnkd.in/ecR5pGmk). Huge thanks to Shreya Shankar and Hamel H. for distilling years of hard-won lessons into an engaging, hands-on curriculum. 🙌
Kenneth Reeser

Senior AI/ML Architect at Vanguard

Just wrapped up an incredible course on AI Evaluations for Engineers and Product Managers taught by Hamel H. and Shreya Shankar, and I can’t recommend it enough. As a Senior AI Architect, I walked away with a deeper appreciation for the art of prompt engineering. This course reminded me that crafting effective prompts isn’t just about syntax—it’s about creativity, experimentation, and nuance. One of the biggest challenges in AI development is building labeled datasets for evaluation. It’s time-consuming, often inconsistent, and can be prohibitively expensive—especially when scaling fine-tuned models. This course introduced practical strategies to start small and scale smart, without the heavy overhead. What really stood out was how the course demonstrated the power of reusability. With the right tools, you can reuse datasets, evaluation prompts, application prompts, and even fine-tuned models—dramatically cutting costs and accelerating development. If you're working in AI/ML and looking to build more efficient, scalable, and cost-effective solutions, this course is a must. AI Evals for Engineers & PMs lnkd.in/eZg39xhD Next cohort begins July 21.
Joshua Pittman

Prompt Engineer at Outlever

Just finished the AI Evals course by Hamel H. and Shreya Shankar - solid training on how to properly test and measure AI applications. They're doing another cohort soon lnkd.in/dM9Ye55H The course does a great job connecting theory with real-world practice. Their Three Gulfs model helped me understand where things typically break in LLM workflows. The biggest takeaway was learning how to build proper failure taxonomies using coding techniques. I was making the common mistake of jumping straight to LLM-as-a-judge without first figuring out what I actually needed to measure. The sections on calibrating LLM judges with TPR/TNR methods were exactly the technical content I was looking for. The programming assignments were really valuable, and the guest speakers were great. If you're working with LLMs in production, this course is worth it.
Alpa Dedhia

Engineering Lead | Solutions Architect | AWS | Applied Generative AI Solutions | Search & Relevance | CSM | CSPO

⚓️ Why So Many Agentic AI Projects Are Getting Abandoned In the past year, I’ve seen a wave of agent-based products and prototypes—many promising, some even well-architected—quietly get shelved. The most common explanation? “It doesn’t work well.” But here’s the truth: it’s not that these systems can’t work. It’s that most teams never defined what “working well” even means. Without a robust evaluation strategy, you’re flying blind. You don’t know if the agent improved, regressed, or just hallucinated in a new and exciting way. Shipping an agent without a way to evaluate it is like launching software with no tests. And yet, that’s what most of us are doing. We need task-level evals, regression testing for LLM outputs, structured error analysis, and a feedback loop grounded in data—not vibes. I recently took the AI Evals for Engineers & PMs course by Shreya Shankar & Hamel H., and it completely changed how I think about building LLM systems. It gave me a concrete framework to move from anecdotal debugging to systematic evaluation. I now treat evals as code, build judgment pipelines I can trust, and iterate with measurable confidence. Highly recommend it for anyone serious about building reliable, production-grade AI systems. ⸻ #AI #GenerativeAI #LLM #AgenticAI #MLOps #AIEngineering #EvaluationFirst #MultiAgentSystems #LLMOps #AIEvaluation #LLMTesting #MavenCourses #StartupEngineering #AIProductDevelopment




Geoffrey Pidcock

AI and Strategy at ANSTO | MBA Melbourne Business School | Ex Atlassian

It's been a blast learning AI evals from Shreya Shankar and Hamel H. these last 4 weeks. Everybody who writes prompts - for API calls, for CoPilot agents, for ChatGPT projects - can benefit from their solid instruction backed by theory and grounded in practice. A new cohort starts soon, with details linked below (along with a 35% discount code). I've summarised my take-homes in the image below, which is the approach I now take to writing and tuning the ~20 prompts I use in my work and personal projects. I won't miss the extremely early starts though 😅 (New York and Sydney have a challenging overlap!) bit.ly/evals-ai
Rasool Shaik

GenAI | Embedded/IoT | Medical Devices -Helping organizations to build and validate products

I am attending the AI Evals course by Hamel H. and Shreya Shankar and can’t recommend it enough! 👉 lnkd.in/gHQusTCG This course gave me a solid foundation in designing evaluation strategies for different types of GenAI applications. Some of the most valuable lessons for me included: how to perform effective error analysis and pinpoint areas where AI systems are falling short; building custom annotation tools—a completely new skill for me, and one that adds a lot of control to the evaluation process; crafting evaluation approaches tailored to different model architectures; and techniques for scaling evaluations efficiently while keeping costs manageable. What made the course stand out was the strong emphasis on hands-on learning—real exercises, live discussions, and opportunities to apply concepts to real-world problems. It was a well-balanced mix of theory and practice. I'm looking forward to digging into the homework next and continuing to apply what I’ve learned. Highly recommend this course to anyone working with GenAI products or systems.
U
Uday Ramesh Phalak

Machine Learning Engineer | RecSys, AI Evals, GenAI, Climate Change, UX | Co-Founder at HazAdapt

linkedin logo
The most dangerous bugs in your AI product aren't the ones that cause it to crash. They're the ones where the AI looks like it's working perfectly, but it's confidently wrong. That's why we need AI Evals. Just wrapped up Shreya Shankar and Hamel H.'s "AI Evals For Engineers & PMs" course. Honestly wasn't sure what to expect going in, but it ended up being way more practical than I thought. The biggest thing I took away was this framework they call the "Three Gulfs" - basically that every AI product struggles with three core problems: - Can we actually understand what our data/pipelines are doing? (Gulf of Comprehension) - Are we being clear enough in our prompts? (Gulf of Specification) - Will this work on new data we haven't seen? (Gulf of Generalization) Most teams try to tackle all three at once and just end up spinning their wheels. The approach presented by Shreya and Hamel is more systematic: they break it down into an "Analyze-Measure-Improve" cycle. The pitfalls they covered hit close to home. We outsourced our error analysis and wondered why our evals were garbage. Turns out you need to actually look at the failures yourself. Also, we kept testing judge models on the same examples we used to build them; obviously they looked great until they hit production. A few reasons you might want to check this out: - If you're shipping AI features and crossing your fingers they work, this gives you actual ways to know what's happening - The cost optimization stuff alone could pay for the course (they showed examples of 60%+ savings) - Honestly, knowing how to do evals properly is becoming pretty essential if you want to work on AI products seriously There is another cohort starting soon if anyone's interested. (Link in comments) Worth it if you're building anything with LLMs in production.
attached image
user avatar
Ankur Bhatt

AI Engineering | Product and Technology Leader | CTO

linkedin logo
Just completed AI Evals for Engineers & PMs — an outstanding experience in applying rigor and clarity to AI systems. My biggest takeaway: evaluation is not a one-time task; it's a continuous learning process. Designing automated evaluators (code-based or LLM-as-judge) isn't just about checks; it's about building a feedback loop that helps teams and systems grow, improve, and deliver impact. That's what true product leadership looks like. The frameworks, real-world examples, and collaborative discussions in this course have inspired me to raise the bar on AI product quality. Don't miss their final live cohort starting July 21st: lnkd.in/g-DY74tC Big thanks to Hamel H. and Shreya Shankar for creating a learning environment that fosters both technical excellence and leadership growth!
attached image
user avatar
Robb Winkle

Fire your systems integrator. Conversational AI-native systems expert for ERP (Techstars '24)

linkedin logo
🚨 Just wrapped the AI Evals for Engineers & PMs course by Hamel H. and Shreya Shankar, and… WOW. 🔥 This wasn’t just another AI course — it was a firehose of practical knowledge, packed with insights I could put to work immediately. Even though it was intense, the format lets you take what you need now and revisit deeper topics as they become relevant. Biggest takeaway: Doing real error analysis and looking at your data is harder and more time-consuming than you’d think — but the payoff is massive. You build intuition about where your product fails, which lets you design systematic, automated evals that outperform off-the-shelf tools. It’s the difference between flying blind and instrument-level visibility into your product quality. Impact on our team: It filled a crucial gap in how we approach evals. One example: the 2-step process for generating synthetic user queries with an LLM — a structured method that’s far better than jumping straight into query generation. Huge quality boost. If you’re building AI-powered products, get in on the next one: 👉 lnkd.in/ebBtGKSE It might be the last live run of the course — don’t miss it. Massive thanks to Hamel and Shreya for sharing a framework that truly shifts how you think about evaluation. 👏👏
attached image
user avatar
Harris Brown

Fractional AI product leader for early-stage founders & teams | ex-Airbnb

linkedin logo
It's the final week of the Maven "AI Evals For Engineers & PMs" course. 🔥🔥🔥 Grateful for the time and effort Hamel and Shreya Shankar have put in to craft an amazing course. Just the right mix of practical and theoretical. 10/10 recommend. So many learnings that I'm excited to share, but one insight that I can't stop thinking about: 𝘆𝗼𝘂 𝗰𝗮𝗻𝗻𝗼𝘁 𝗼𝘂𝘁𝘀𝗼𝘂𝗿𝗰𝗲 𝘆𝗼𝘂𝗿 𝗲𝘃𝗮𝗹 𝘁𝗵𝗶𝗻𝗸𝗶𝗻𝗴 𝗮𝘀 𝗮 𝗰𝗼𝗺𝗽𝗮𝗻𝘆. As tempting as this is to do - just have an LLM do it, right? - it is so clearly an AI product-building failure mode. The companies, and product teams, that take evals seriously (and from the start) will be the ones that succeed. An added bonus to taking this course -- we've been able to implement many of the learnings directly with Native Studios clients, which is pretty cool.
PC
Peter Cardwell

Software Engineer at Snap, Inc.

linkedin logo
Hi all! I wanted to give a quick plug for the AI Evals For Engineers & PMs (lnkd.in/g6QBKJMs) course taught by Hamel H. and Shreya Shankar that I've been taking over the last few weeks. During my career break I’ve been operating as a solo dev on my own projects, wearing many hats I'm not used to—from UX design to mobile development to product management. On the backend, what I've found most challenging about building in the era of LLMs isn't engineering any one component, but putting a process - a product evaluation - in place that enables me to choose the most appropriate techniques and ultimately guide development in the right direction. At a high level, the process is straightforward: 1. Gather data (ideally real production data, or synthesize it if you don't have it) 2. Analyze the data carefully 3. Build a hierarchy of the errors you observe 4. Create automated evaluators for each error type: simple programmatic evaluators when possible, and LLM-as-judge for more nuanced errors, aligned with human evaluators 5. Operationalize everything to guide development, catch regressions, and monitor production Of course, the devil is in the details - it all gets complicated quickly when you start layering in a retrieval pipeline, multi-turn conversations, memory, tool-calls, MCP and so on. I’ve found Hamel's blog posts on this topic (lnkd.in/gqnH6Exa, lnkd.in/gQHH3dJS, lnkd.in/g596NAHF) to be an incredibly useful roadmap for putting this into practice, so I signed up for this course on a whim—and I'm very happy I did. What I particularly loved: Office hours were goldmines of useful discussion. Hearing how other practitioners across the industry are tackling their challenges was incredibly valuable. The discord chat + lecture combo worked perfectly for me. In a traditional in-person lecture, I would be far too introverted to stand up in front of a few hundred other people to ask my confused questions, but somehow I had no such reservations doing the same thing in a chat.
Hands-on homework assignments made everything click. It's hard to apply these concepts while building new things, so having well-scoped projects designed to walk you through the process end-to-end was illuminating. Responsive instructors who adjusted course material based on Discord discussions, plus a clearly written course reader and excellent guest speakers. I’m excited to put everything I’ve learned into practice and I highly recommend this course to anyone building LLM-based products in 2025.
user avatar
Joshua Pittman

Prompt Engineer at Outlever

linkedin logo
Just finished the AI Evals course by Hamel H. and Shreya Shankar - solid training on how to properly test and measure AI applications. They're doing another cohort soon lnkd.in/dM9Ye55H The course does a great job connecting theory with real-world practice. Their Three Gulfs model helped me understand where things typically break in LLM workflows. The biggest takeaway was learning how to build proper failure taxonomies using coding techniques. I was making the common mistake of jumping straight to LLM-as-a-judge without first figuring out what I actually needed to measure. The sections on calibrating LLM judges with TPR/TNR methods were exactly the technical content I was looking for. The programming assignments were really valuable, and the guest speakers were great. If you're working with LLMs in production, this course is worth it.




Gurunath Parasaram
Gurunath Parasaram

Data Scientist at Tiger Analytics

attached

The most practical AI course I've taken, with immediate value.

As a Data Scientist, I joined “AI Evals for Engineers & PMs” by Hamel and Shreya to get better at evaluating LLM systems in real-world settings, particularly in the context of LLMs being used for a variety of tasks. The course covered everything from fundamentals and error analysis to production monitoring and cost optimization. What stood out was how practical it was—teaching us to define use-case-specific metrics, collaborate with PMs, slice errors meaningfully, and avoid over-relying on automated scores. The lessons and guest talks (like on RAG evals, failure funnels, and continuous human review) felt directly applicable to my work, especially when building retrieval-based bots or monitoring model drift. It’s not a course that spoon-feeds you; it gives solid frameworks and real-world habits to build eval pipelines that actually reflect user experience. If you’re working on production ML or LLM projects and want to move beyond standard metrics, this is worth your time. I’ve already applied a few techniques at work and found them super helpful. I would strongly suggest anyone with an interest in evals to take this course!
Omar Irfan Khan
Omar Irfan Khan

Dev Team Lead

Highly recommend this course!

The AI Evals course has helped me learn how to truly evaluate LLMs in a meaningful way and actually understand what's going wrong. Hamel and Shreya did a wonderful job explaining how evaluations can be done in a structured manner and what to try. The course guests were an additional bonus, sharing their insights on how they carried out evaluations. I would highly recommend this resource to anyone who builds with LLMs and is wondering how to effectively understand why the LLM isn't behaving the way it was trained to and what is going on behind the scenes!
Muhammad Jarir Kanji
Muhammad Jarir Kanji

Data Scientist

Amazing instructors.

Hamel and Shreya are amazing instructors, and this course has been a great resource for me in understanding how to build robust, enterprise-grade evals and AI pipelines. What I found particularly useful were the guest lectures, which bring a variety of opinions from industry experts and practitioners on different topics related to AI and evals.
effie goenawan
effie goenawan

Head of Product, Tavus

Great insights that are shaping how we evaluate AI products.

Taking the AI Evals course with Hamel and Shreya has been really valuable. The course has given me a solid framework that's already shaping how we evaluate our AI products. The homework mirrors real work challenges, and guest speakers bring great insights.

Alex Elting
Alex Elting

Software Engineer

I learned how to be truly effective in creating LLM-powered applications

I have a career developing software, and I've been tinkering with LLMs since before ChatGPT. I feel like the practical eval techniques that Shreya and Hamel teach in their course are what I needed to glue these two skills together and become truly effective in creating LLM-powered applications. Developing for LLMs is not like traditional software development, and evals are the big difference.

Yusong Shen
Yusong Shen

Software Engineer, Google

Comprehensive and practical curriculum

Indispensable for robust AI development. The "AI Evals For Engineers & PMs" course provided an indispensable framework for evaluating LLM applications, fundamentally shifting my approach from guesswork to data-driven measurement. My key takeaway is the Analyze-Measure-Improve lifecycle, coupled with the "Three Gulfs" model for pinpointing failure origins. The rigorous methodology for building and validating LLM-as-Judge evaluators—including bias correction and confidence intervals—is a game-changer for trusting subjective evaluations. Hamel Husain and Shreya Shankar are truly experts, delivering a comprehensive and practical curriculum that directly addresses the challenges of building reliable AI in a dynamic environment. This course is a must for anyone serious about improving their AI development process.

Trina Sen
Trina Sen

Palette, CPO

A Masterclass in Practical AI Evaluation.

From Benchmark to Moat — A Masterclass in Practical AI Evaluation. This course is at the cutting edge of AI research—and not just in theory. What stood out most to me is how deeply practical it is: it teaches you how to build evals that work for your own product, define product taste by sharpening what "good output" really means, and, most importantly, how to scale this method across teams and decisions. The biggest shift for me was reframing evals not as a benchmark to clear, but as a strategic moat—core to how your product learns, evolves, and differentiates. As someone from a non-technical background, I could still grasp the concepts (even if the code got heavy at times). The community around the course is a major bonus—full of helpful discussions, fresh perspectives, and constant knowledge exchange. The guest lectures were especially valuable, showing how companies apply these ideas in the wild, and how they tailor their evaluation frameworks to suit specific needs and constraints. I’d highly recommend this course to anyone building with AI—especially those who want to go beyond shipping models to shaping real-world, high-trust outcomes.

Lukasz Kowalczyk
Lukasz Kowalczyk

Soothien HealthTech Advisory

A must for any developer or PM building AI products.

I’m a physician and have built health tech and health AI solutions, but I’m not overly technical. This course was eye-opening about the importance of AI evaluations. It’s a must for any developer or PM building AI for enterprise or regulated industries. This is what will make AI products reliable. Hamel and Shreya are amazing, and so are their top-notch guest lecturers. I took this course because I wanted to learn from the industry leaders actually doing the work. You’ll learn the entire process of building AI evaluations, not just by reading, but also by doing. This is the technical component. Using Windsurf and Claude, I was able to complete it even though I don’t code as part of my main job. It’s well worth the effort. This course is dense, especially if you do not code or lack familiarity with statistics. My background in medicine and healthcare statistics helped me understand some of the core concepts. Overall, this is an amazing course and an essential skill set for building AI applications in healthcare or enterprise settings. I’m recommending it to all my colleagues.




Uday Phalak
Uday Phalak

Machine Learning Engineer | Co-Founder at HazAdapt

Good course if you want to build products people actually trust.

Coming from recommendation systems and a UX background, I was familiar with certain kinds of evaluation. I'd run some A/B tests, check a few metrics, and call it good. But my approach to AI evals was completely naive. I used no systematic method and hoped things would work. This evals course gave me the structure I was missing. The Three Gulfs framework explained why I kept unknowingly failing: we don't understand our data (Comprehension), we write vague prompts (Specification), and models behave unpredictably on real inputs (Generalization). The analyze-measure-improve cycle felt familiar from UX research but applied to AI. Instead of guessing what's broken, you look at failures first, build automated evaluators, and then make targeted improvements. This creates a flywheel where each cycle makes your product better. Learning from others' LLM production failures was a huge plus of this course, e.g., hearing about VLMs giving different results 18/55 times at temperature 0, and Shreya showing how model cascades cut her costs by 50%. Successful AI products need humans to regularly review outputs. There's no way around it. Good course if you want to build products people actually trust. Evaluation separates demos from deployments.

Srinivasan Krishnamurthy
Srinivasan Krishnamurthy

Technology director - Wells fargo

A fantastic course offering an in-depth practical approach to evals.

Highly recommend this course to anyone building Gen AI products and solutions. The biggest takeaway for me was the methodical and scientific process that the instructors outline for doing model evaluation. It helps build a mental model that I am applying at work to build an eval pipeline for RAG solutions. The course also offers an in-depth and practical approach to understanding how generative AI models are evaluated using rubrics and metrics, which are critical skills for AI engineers. Overall, a fantastic course with a lot of learning and value.
KF
Kaname Favier

Founder, Supago Inc.

attached

This course completely transformed my approach to building AI applications.

This course completely transformed how I approach evaluating LLM applications. Before this course, my evaluation processes were informal at best. Now, I've gained a structured, rigorous methodology to identify errors, quantify improvements, and build automated evaluators. The hands-on assignments and deep dives into error analysis were particularly valuable, directly impacting how efficiently I debug and iterate on LLM products. Whether you're an engineer, product manager, or someone working closely with AI systems, this course is essential—highly recommend!
user avatar
Raymond Weitekamp
@raw_works
i have never been more excited to "look at my data"! @HamelHusain and @sh_reya could have easily charged 10x for their "AI Evals" course. amazing education all star guest lectures (ie @altryne & @charles_irl & @kwindla & @BEBischof ) not to mention the @modal_labs credits...
13
user avatar
Sandeep Pawar
@PawarBI
I am finishing one of the best courses I have taken recently - "AI Evals for Engineers and PMs" by @HamelHusain and @sh_reya . I was in two minds before signing up, but given the practical tips shared by industry leaders and practitioners, it's been totally worth it. Building AI x.com/PawarBI/status/1933940567073075526/photo/1
attached image
17
user avatar
Eleanor Berger
@intellectronica
When you hear about @HamelHusain and @sh_reya's AI Evals Course, you probably think: great, a few lectures from two of the best experts in the field. What you really get? - Lectures from two of the best experts in the field. - The best, no, the only textbook on AI evals that twitter.com/825766640/status/1933695903266943234
39
user avatar
Prashant Mital

Applied AI @ OpenAI

linkedin logo
🤖 I've been taking the AI Evals For Engineers & PMs course on Maven over the last few weeks (because OpenAI wasn’t intense enough 😅). While I had picked up bits and pieces of the techniques Hamel and Shreya delve into in this course during the past year, it has been great to understand the full lifecycle of building & scaling evals. If you think your job is likely to be automated by AI, there's probably a new job being created to evaluate how well that AI replacement performs. Next (and last!) live cohort starts July 21 — enroll at lnkd.in/g7BYtQiH
user avatar
Zara Khan

Account Director @ OpenAI

linkedin logo
Does your AI product actually work? Without evals, it's hard to tell. I just completed the 4-week Maven course "AI Evals For Engineers & PMs" taught by Shreya Shankar and Hamel Husain and it truly exceeded my expectations. Next cohort starts July 21st (registration link in comments). The course offers hands-on exercises, expert guidance, and practical frameworks, giving you a systematic approach you can apply immediately. I could easily follow along without an engineering background. My top customers don't just measure performance with evals. They use them to shape their entire roadmap. Without evals, you're shooting in the dark.




John Berryman

AI Consultant

"Absolutely recommend this course to anyone building AI applications"

Sebastian Lozano
Sebastian Lozano

Senior Product Manager at Redfin

attached

Error analysis (and this course) is all you need

Error analysis is all you need. This is the idea that gets drilled into your head over and over again in the AI Evals course. It's so simple, but it's profound...and it's actually way more complicated than you think when you start to consider multi-turn conversations, retrieval systems, agentic systems, multimodal inputs and more. Shreya and Hamel have distilled the state-of-the-art in AI Evals (and often in development itself!) in this amazing class. Some of my favorite highlights: - Build a custom data annotation app! I was so intimidated by this, but I finally made the leap and vibe-coded something out in an afternoon. It has 10x'd my ability to review conversations. - It's okay to do a little pre-thinking around failure modes, but they really should EMERGE from your testing. It's really hard to build LLM judges, so be really thoughtful about what you build them for. - Often, the biggest impact comes from talking disagreements out and figuring out why there is a disagreement in the first place: are your goals unclear? This seemingly technical course has made me a better PM. - And finally, folks in the course just know every AI tool out there. I learned about WhisperFlow and my workflow for typing has changed!

Trey Grainger
Trey Grainger

Founder, Searchkernel LLC

attached

This course is the best place to learn evals

I learned a ton from Hamel and Shreya's course on AI Evals! I've worked at the intersection of information retrieval and AI for nearly two decades, so I've done my fair share of evals throughout that time for search results quality. I've even written books, like AI-Powered Search, that include significant sections on ranking metrics, judgement lists, and model training. Nevertheless, with the rise of generative AI, RAG, and agentic workflows, the need to handle complex pipelines with non-deterministic outputs has significantly increased the complexity of performing good evals. The discussions about end-to-end traceability and leveraging Transition Failure Matrices were particularly helpful for me in tackling these more challenging multi-step workflows. This course has been a goldmine by providing: 1. Up-to-date information on best practices for evals on the current state-of-the-art AI workflows 2. Deep insight from experts with decades of both real-world experience and academic research into evals 3. Lots of tips, tricks, and real-world examples (with code) for getting end-to-end evals implemented and working well. This course significantly improved my mental models and increased the size of my practical toolkit for doing AI evals, which is already paying dividends for my client engagements. This is a set of skills everyone working in AI should acquire, and this course is currently the best place to quickly do that!
user avatar
Andrei Bocan 🐐
@monsieur_pickle
Ok so looking back on the Evals Course by @HamelHusain and @sh_reya it’s been one of the best uses of my time in years. Very focused and practical, amazing guest lectures, stellar course reader, and overall just a great experience. 10/10 would recommend.
12
user avatar
Batu
@BatuAytemiz
Highly recommend @HamelHusain and @sh_reya's llm evals course! Clear high-level ideas + grounded low-level worked out examples. Loved the focus on reproducible, iterative workflows.
7
user avatar
Jodi M. Casabianca

Entrepreneurial Measurement & Research Scientist | Psychometrician | Scoring of Open-ended Tasks | AI Evaluation

linkedin logo
ATTN: Anyone interested in LLM evals... I recently decided to enroll in Hamel H. and Shreya Shankar's LLM evaluation course.... All I can say is....WOW. The past 4 weeks have been full of seminars, guest lectures, homework assignments, Discord chats, and office hours. I enrolled so that I could learn more about the psychometric aspects of LLM evals (think: rating scales, rubrics, LLM-as-a-Judge, etc.) and I've learned that and so much more. And don't get me started on the course reader! 🤓 This course has really helped me get where I wanted to go with respect to understanding the full pipeline. My biggest takeaway: be diligent about looking at your data, with your own eyes. Not everything can be automated or farmed out. This is helpful to me as a consultant helping others. Thank you Hamel and Shreya for offering this course and answering EVERY. SINGLE. QUESTION. If you are considering the course, just do it.
attached image
user avatar
Frazer Dourado
@FrazerDourado
If you've been thinking about building an AI application but aren't sure how to go about evaluation, then @HamelHusain and @sh_reya's evals course on Maven is everything you need. I'm about to finish the course, and I highly recommend it. Since most of my work is in enterprises x.com/FrazerDourado/status/1933266864534401167/photo/1
attached image
18
Jiho Bak
Jiho Bak

Independent AI Engineer

An essential resource for engineers & PMs

For AI Builders Hoping LLMs Will Fix It All. This course has provided an exceptionally clear and systematic framework for approaching LLM evaluation. The comprehensive introduction to the Analyze-Measure-Improve lifecycle, alongside the detailed exploration of the Three Gulfs Model (Comprehension, Specification, Generalization), significantly deepened my understanding of the challenges inherent in building effective LLM pipelines. Particularly impactful was the practical guidance on error analysis—learning how to systematically categorize failure modes using open and axial coding, then translating qualitative insights into robust quantitative metrics. The deep dive into automated evaluators, including both code-based and LLM-as-Judge evaluators, was particularly valuable. Learning how to craft strong judge prompts and rigorously validate them using training, development, and test sets to ensure alignment with human preferences was eye-opening. The course also provided practical methods for estimating true success rates and quantifying uncertainty, which is vital for understanding actual pipeline performance beyond raw observed scores, along with guidance on designing efficient human review interfaces that significantly enhance labeling throughput. Most importantly, this course illuminated a critical shift in mindset—from traditional software development towards an iterative, human-centric evaluation approach—making it an essential resource for engineers, product managers, and data scientists looking to confidently address real-world LLM evaluation challenges.




user avatar
Reza Yousefzadeh
@reza_yz
Taking the AI evals course by @HamelHusain and @sh_reya has been like drinking from a firehose. So much valuable information. I'll have to come back and rewatch the recordings from time to time and refer to the course materials. Thanks so much to you both.
Tiago Freitas
Tiago Freitas

Scarlet AI

This course changed how I approach AI projects. Instructors provide great support.

The AI Evals course with Hamel and Shreya changed how I approach AI projects and consulting clients. I’ve picked up practical skills in systematically analyzing model errors and designing meaningful evaluations, making the whole AI dev process clearer. Having access to a private community of experienced AI engineers and direct support from Hamel and the team has been especially valuable—they’re always quick to answer questions or help with real-world problems. Highly recommend this course for anyone building AI products or consulting in the space!

Siddharta Govindaraj
Siddharta Govindaraj

Consultant, Silver Stripe Software

Learn how to put evals into practice. Practical and hands on instruction.

Prior to this class I had already read a bunch of stuff on evals (including Hamel's blog). But I struggled to convert that theory into practical steps. I had some apprehensions coming in -- will it be too theoretical? Will it assume a lot of background knowledge? And I can say now -- this course completely crushes it. It is fully hands on and practical, starting from zero and building up from there. You will learn every step of the evals process on what exactly to do (and not to do) and more importantly -- how to actually put it into practice. If you have been struggling with evals, then don't think twice and take the course.
VANESSA MARQUIAFAVEL SERRANI
VANESSA MARQUIAFAVEL SERRANI

Computational Linguist at ATENTO

attached
This course has been incredibly eye-opening. I’ve learned how important it is to follow a clear “Analyze - Measure - Improve” cycle when working with language models. What really stood out to me is that the biggest challenges often come not from the technology itself, but from how we approach the process — like jumping straight to complex solutions without truly understanding the problem, or using misaligned evaluation methods. My biggest takeaway is that every stage of the process has its own traps, and skipping steps or making quick fixes can easily backfire. Being intentional about collecting the right examples, measuring in a fair and meaningful way, and making thoughtful improvements can make all the difference. I’d definitely recommend this course to anyone working with AI systems. It helped me slow down, ask better questions, and be more strategic — and that’s something every team could benefit from.
Laurian Gridinoc
Laurian Gridinoc

Full Stack Computational Linguist, Bad Idea Factory

Now I can design meaningful evals! Highly recommend this course.

Before the AI Evals led by Hamel Husain and Shreya Shankar, I used evals sporadically, mostly relying on third-party ones. Now, I have a clearer understanding of how to design meaningful evals and communicate their value to the teams I work with.
Sydney Sarachek
Sydney Sarachek

Senior Director, AI

This course is comprehensive in a way that's hard to find elsewhere.

This course is a great place for PMs and engineers to learn practical tactics for building real-world AI applications. I've recommended it to people who want both a starting point and deeper knowledge about evals and implementation. Hamel brings in excellent speakers who share different techniques and insights from some really smart people in AI. Evals are super important, and what I appreciate about Hamel's approach is how he walks through data analysis tactics — this is especially helpful for anyone newer to this kind of evaluation work. Just having evals isn't enough — you need to think strategically about what you're evaluating and your methodology beforehand. With so much out there, even really talented engineers can benefit from having all the key considerations for applied AI building brought together in one place. This course does exactly that - it's comprehensive in a way that's hard to find elsewhere. Hamel and Shreya put a lot of thought into the materials, and I can confirm from my own building experience that this covers the real considerations we're dealing with day-to-day (and have learned over 18+ months of trial and error!) without all the noise and buzzwords.

Daniel Roy Greenfeld

Author and Principal at Feldroy, LLC / Software Artisan at Kraken Tech

Pragmatic techniques, free of jargon.

I learned optimal techniques for expediting quality improvements in AI applications. We were taught practical methodologies based on straightforward metrics that keep humans in the loop to ensure the quality of results. Hamel and Shreya were quite good at explaining every term with real-world examples taken from experience. They didn't load the course with jargon. The homework exercises were challenging yet achievable. It's been fun and educational to get the work done. I recommend the course to anyone who wants to learn incredible tricks and tips for building AI applications.

Forrest McKee

Data Scientist

Tools to quantitatively improve your AI product

Hamel and Shreya do such a great job at equipping you with the tools to quantitatively improve your AI product. This is a must take course for anyone working with LLM powered applications.

Constanza Schibber

Data Scientist

Course Instructors Went Above & Beyond

As someone with prior experience designing human evaluations and developing metrics for a specific product, I took this course to broaden my understanding of AI evaluation practices, especially for agentic systems and RAG, as well as to deepen my knowledge of evaluation infrastructure such as CI/CD and trace review interfaces. This course delivered far more than I expected. It includes a comprehensive course reader that could stand on its own as a reference book, live classes packed with hands-on examples, and over 10 guest speakers who shared practical insights into evaluation strategies and even how to build your own evaluation tools for different use cases. What really set the course apart was the level of support. Hamel and Shreya were incredibly supportive throughout the course. They hosted office hours, thoughtfully answered every question on Discord, and even brought in two experienced professionals to offer additional hands-on support and help with (optional) homework. They went above and beyond to make sure everyone was learning and participating. I also really appreciated hearing from other students about the evaluation challenges they were facing in their own work, and watching Hamel and Shreya think through solutions with them in real time was just as educational as the prepared content. Highly recommend this course if you're working on or even adjacent to LLM applications. Whether you’re focused on product quality, engineering, or research, you’ll walk away with frameworks, tools, and best practices you can use right away.
Wayde Gilliam

"If you are building with AI, you need this course!"

Skylar Payne

Founder, Wicked Data LLC

"Take this course to go from a good to a great AI Engineer!"

Isaac Flath

Owner at Kentro Tech LLC

"Practical techniques rarely taught elsewhere. Highly recommend!"

Adam Dadson

GTM @ OpenAI

I've been spending time learning Evals for AI in the past few weeks, and throughout the process, what I've really started to understand is the impact that systematic evals can have on dramatically improving LLM output. If you're curious to learn more and upskill yourself when it comes to driving better model responses, check out Shreya and Hamel's incredible course on the power of Evals. The next cohort begins July 21st: lnkd.in/gVCJk-WC
Alex Elting
@alexelting
I have been tinkering around with LLMs for a few years now as a software engineer. I realized the potential of LLMs early on, but one worry I had for LLM-powered software was just how are you supposed to test things? You can't just unit test English (usually)
Jasmine Robinson

Senior Technical Program Manager, Netflix

This course helps you get expected outcomes from your AI

A colleague reached out to me and recommended “AI Evals For Engineers & PMs” being offered by Hamel H. and Shreya Shankar. I consider myself an eternal learner, and knew evaluations were a critical yet often overlooked component of successful GenAI implementation. Everyone keeps asking me how to stay ahead of GenAI. Well, you take classes like this one so you can be on the cutting edge of how to ensure you get the expected outcomes from your future AI agents. It was so dense with useful information and guest speakers that I honestly couldn’t keep up, but after the course is over, you continue to have access to the recordings.
Jeroen Latour

FinTech at Booking.com

I’m currently taking the Maven course AI Evals for Engineers & PMs by Shreya Shankar and Hamel H. My five biggest take-aways so far: 1. *Evals turn chaos into clarity* – LLMs are unpredictable. Evals give you a repeatable way to measure what matters instead of chasing bugs one by one. 2. *Correctness = your product definition* – What counts as “good” depends on your product, not on a generic benchmark. 3. *Fix specs before measuring* – Some errors come from unclear prompts or vague product goals (what the course calls the “Gulf of Specification”). In those cases, sharpen the prompt or definition before investing in evals. 4. *LLM judges need judging* – Using one LLM to evaluate another can work, but only if you validate it against human experts and refine the criteria. 5. *AI evals are the new core skill* – They don’t just measure accuracy; they help shape product roadmaps. This is fast becoming a must-have skill for PMs and builders. Outside of my day job at Booking, I’m working on a side project to make EU lobbying more transparent with LLMs. The course is already helping me think about how to design evals that keep me honest. I’d recommend this course to anyone building with LLMs—especially PMs, engineers, or anyone responsible for shipping AI products. Next cohort starts Oct 6, with recorded sessions: bit.ly/470obaL
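Take-away 4 above — validating an LLM judge against human labels — can start as simply as measuring raw agreement on a shared sample. A minimal sketch, with entirely made-up labels:

```python
# Hypothetical pass/fail labels from a human expert and an LLM judge
# on the same ten traces (illustrative data, not from the course).
human = ["pass", "fail", "pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
judge = ["pass", "fail", "pass", "fail", "fail", "pass", "pass", "pass", "pass", "fail"]

# Fraction of traces where the judge matches the human label
agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
print(f"judge-human agreement: {agreement:.0%}")  # prints "judge-human agreement: 80%"
```

In practice you would also look at where the disagreements cluster and refine the judge's criteria before trusting its scores at scale.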


Get 25% off our next cohort!

Enroll Here


Juan Maturino

Software Engineer at Edua

Removed a malicious system prompt and reversed falling engagement—user interactions increased.

Before this course, my instinct was to jump straight into axial coding. That meant I leaned heavily on my own presuppositions about what failures I thought would show up. By doing that, I was blind to unexpected issues. It’s like hearing about someone before meeting them—you imagine who they are, but until you actually meet them, you don’t see the full picture. With data products and LLM pipelines, the same thing happens.

Take a healthcare chatbot as an example. Going in, I assumed failures would only be factual: did it answer the medical question correctly? If I jumped straight into axial coding, I’d only tag factual errors and conclude the model was nearly flawless. From that narrow view, I might even think the product was destined for massive success.

But after this course, I learned to take a step back and examine the data without presuppositions. By looking at traces more openly, I discovered a hidden failure mode: the chatbot was mean. It was calling people “fat,” “ugly,” “stupid,” and generally creating a hostile experience. No factual errors—just a terrible user experience. This was something axial coding alone, or automated LLM-as-a-judge evaluation, would have missed without prior human review.

Digging deeper, I found the root cause: a disgruntled former employee had slipped “be mean when answering” into the system prompt. Once we fixed that, user engagement improved dramatically. The key lesson I took from the course is that real error analysis starts with open coding and direct observation. Skipping that step leaves you blind to the most important problems.

Hima Tk

Lead PM - AI / ML Products at CultureAmp

Turned costly trial-and-error into a data-driven plan that avoided massive retraining and prioritized fixes.

I worked with a supermarket chain to build an AI system that could count inventory from shelf photos. At first, the system struggled with issues like blurry images, background clutter, and confusingly similar packaging. Before this course, my approach would have been driven by intuition and trial-and-error. I might have looked at a handful of errors, jumped to a conclusion like “the model is just bad at distinguishing Coke cans,” and proposed a vague fix such as retraining with thousands of new images. That would have been expensive, slow, and unfocused—and it might not have solved the real problem, like blurry photos from staff.

After this course, my approach is now structured and data-driven. Instead of guessing, I use error analysis to diagnose issues systematically. I start by gathering a representative failure set and tagging images to capture why errors occur—blurry images, poor lighting, occlusion, similar or new packaging, unusual angles, background clutter. From there, I group these into a taxonomy of failures and calculate how much each category contributes to overall errors. This creates a prioritized roadmap for improvement.

For example, when Image Quality and Similar Classes accounted for 75% of failures, I could recommend high-impact, targeted fixes: improve photo capture guidelines and augment training data with blurred images for the first, and collect more Diet Coke vs. Coke Zero examples for the second. Instead of vague trial-and-error, I now have a clear, quantitative path to better results.
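The tag-count-prioritize step described above can be sketched in a few lines; the failure tags and counts here are purely illustrative:

```python
from collections import Counter

# Hypothetical tags from a hand-labeled failure set
tagged_failures = [
    "blurry_image", "similar_packaging", "blurry_image", "occlusion",
    "blurry_image", "similar_packaging", "background_clutter", "blurry_image",
]

counts = Counter(tagged_failures)
total = sum(counts.values())

# Rank categories by their share of overall errors to build a fix roadmap
for tag, n in counts.most_common():
    print(f"{tag}: {n}/{total} ({n / total:.0%})")
```

The output orders categories by frequency, so the top lines are the highest-impact fixes.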

Margarita Fakih

Business Operations and Development

Saved me hours of rewriting by creating a reusable framework that prevents repeated AI errors.

As a product manager, I often struggled with inconsistencies in user stories generated by AI tools. Even when my prompts were clear, the outputs would miss key requirements or include irrelevant details. Before this course, my instinct was to keep tweaking the prompt through trial and error until I got something usable. While that sometimes worked, it was inefficient and didn’t explain why the model was failing.

After this course, my approach is much more systematic. I start by defining the key dimensions of a good user story—clarity, completeness, alignment with acceptance criteria, and the right level of technical detail. Then I collect flawed outputs and apply open coding to label issues like “missing acceptance criteria,” “misinterpreted intent,” or “overly generic details.” From there, I build a taxonomy of failure types, which lets me organize and prioritize problems. Finally, I design a feedback loop: the LLM generates a user story, checks it against the taxonomy, and revises if any known issues are detected.

Instead of wasting hours on one-off fixes, I now have a reusable framework that scales across projects. What was once frustrating trial-and-error has become a structured, repeatable process for improving quality.

Júlio Paulillo

CRO @ Agendor

I turned scattered agent errors into prioritized fixes, enabling focused, measurable improvements.

Building a personal assistant for salespeople is my day-to-day work. One of the tools the agent uses fetches activities from the CRM, but I noticed the LLM sometimes hallucinated—passing unnecessary arguments when calling the tool. Before this course, I would have gone straight into prompt engineering, rewriting tool descriptions or adding more examples to try to fix the issue.

After this course, my approach is different. I start by defining key dimensions such as user persona, intent (e.g., “fetch activities”), and activity type (past due, finished, pending). From there, I can ask an LLM to generate tuples from these dimensions, giving me a structured way to build a synthetic eval dataset. If traces of user interactions are already logged, I filter by intent and begin open coding the different failure modes I see. After reviewing dozens or even hundreds of examples, I then use an LLM to help categorize the failures. This lets me prioritize the categories that matter most and focus fixes where they’ll have the biggest impact.
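The dimension-tuple idea can be sketched like this; the dimension names and values are hypothetical, and each tuple would seed one synthetic test query (e.g., by asking an LLM to write a realistic user message for it):

```python
from itertools import product

# Hypothetical dimensions for a sales-assistant eval set
personas = ["account_exec", "sales_manager"]
intents = ["fetch_activities"]
activity_types = ["past_due", "finished", "pending"]

# Cartesian product gives structured coverage of the scenario space
tuples = list(product(personas, intents, activity_types))
print(len(tuples))  # 2 * 1 * 3 = 6 combinations
```

Enumerating tuples first keeps the synthetic dataset balanced across dimensions instead of skewed toward whatever examples come to mind.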

Instead of reactive prompt tweaking, I now have a systematic framework for diagnosing failures and improving my assistant in a repeatable way.

Tatyana Kazakova

QA Engineer :) at Qazaco

Turned random fixes into a repeatable process that improved the whole system and proved changes actually worked.

Before this course, I would just fix issues as I spotted them—tweak a prompt here, change a setting there—and hope the next run looked better. Sometimes it worked, but I never had the full picture of what was really going wrong or how often certain problems appeared.

After this course, I’ve learned to slow down at the start: define what I actually want to measure (relevance, completeness, context handling), collect a solid set of examples, and trace where errors first start to show up. From there, I group similar issues into clear failure types, which makes patterns obvious and helps me prioritize what to fix.

Now the process feels less like random whack-a-mole and more like a structured, repeatable system. Instead of chasing one-off issues, I can improve the whole system and know whether the changes are actually working.

Andrew Chaffin

CEO at Argo Analytics

Structured error analysis gave me a clearer method to iterate and actually get the results I needed.

A while back, I used an AI writing assistant to draft a personal statement for a fellowship. I gave it a detailed prompt with my goals, values, and experience, but the output was generic and missed the emotional tone I wanted. At first, I just kept rephrasing the prompt, hoping it would eventually get it right. Instead, it swung between being too formal or inventing details I never mentioned. It was frustrating, and trial-and-error didn’t get me far.

After this course, I’d approach it completely differently. I’d start by defining what “good” means for the task—tone alignment, factual accuracy, and personal relevance. Then I’d collect flawed outputs and open code them: did the model invent details, ignore parts of the prompt, or lose the emotional tone? From there, I’d build a taxonomy of failures—like hallucination, tone mismatch, or misunderstanding the prompt—and use it to spot patterns. Maybe I’d realize the model struggles when the prompt is too abstract or lacks emotional cues.

Compared to my old approach of hoping a better version would show up, this gives me a clear, methodical way to iterate. It turns what used to be trial-and-error frustration into a structured process for actually getting the results I need.

Amol Shah

Head of Product at Count

I can now pinpoint errors and measure reductions in each error bucket—turning guesswork into measurable improvement.

When I first built a small chatbot to recommend books based on user mood, it often gave wildly off-base suggestions—like pairing someone “feeling nostalgic” with a cutting-edge tech thriller. Back then, I just tweaked the prompt or guessed at what the model might “understand” about mood. It was trial and error with no clear sense of what was actually going wrong.

After this course, I’d tackle the problem systematically. I’d collect failures by running the bot across a fixed set of test prompts and logging every mismatch. Then I’d open code the bad outputs—labels like “misread tone,” “genre bias,” or “keyword fixation.” From there, I’d define key dimensions of failure (emotional alignment, genre diversity, keyword vs. context) and group them into a taxonomy, like “semantic misinterpretation.” By quantifying how often each type occurs, I’d know where to focus first.

Armed with that data, I could design targeted fixes: refining prompts with explicit mood-to-genre mappings, adding checks for emotional themes, or diversifying candidate genres. Instead of hacking prompts by gut feel, I’d have a transparent, repeatable process that shows whether error rates are actually dropping.

Lada Kesseler

Lead Developer at Logic20/20

I can now predict and prevent code quality issues instead of treating them as isolated bugs.

I often ran into code quality issues when using AI assistants, but I didn’t have a structured way to make sense of them. Before this course, I would just label outputs as “messy code” without really digging into the underlying problems.

After this course, I now analyze them systematically across dimensions—things like hardcoded tests, long methods, poor formatting, bad naming, poor architecture choices, duplication, dead code, or ignoring available quality tools. By open coding these issues and building a taxonomy, I can see patterns emerge instead of treating each problem as random or isolated.

The key shift for me is realizing these aren’t one-off mistakes but systematic failure modes that appear under specific conditions. With that understanding, I can both predict and prevent quality issues, rather than just reacting to them after the fact.


Get 25% off our next cohort!

Enroll Here


Maruti Agarwal

Expert AI Research Scientist at Datasite

Gained clarity on what to fix first, transforming my entire approach to evolving the system.

I applied what I learned the very same day we covered error analysis. I was working on an industry classification system and followed a structured process: I asked annotators to provide detailed feedback on wrong predictions, reviewed their notes to improve annotation quality, then parsed all the feedback and used ChatGPT to categorize it into six major error patterns. Finally, I shared those patterns and error percentages with stakeholders.

After this course, error analysis feels much more structured. Instead of just collecting feedback in an ad hoc way, I now have a clear method that gives me visibility into what problems matter most and what to solve first. It’s changed how I think about evolving the system overall.

Sergio Soage

AI R&D Lead at Diligent

I built a structured understanding of failures, yielding actionable insights instead of whack-a-mole fixes.

Now I understand how to systematically explore the problem space, identify patterns across multiple failures, and build a structured understanding of why and when the system fails - not just that it fails. This leads to more actionable insights for improvement rather than playing whack-a-mole with individual issues.

Karen Lam

Product Design

I gained clarity and confidence to systematically narrow the gap between AI failures and human understanding.

I’m a product designer with no prior AI Evals experience. Before this course, when I encountered unexpected or confusing results from the Recipe Bot in the first homework, my instinct was to just iterate on the system prompt in Cursor and manually test through the UI.

After this course, I’ve learned there’s a more systematic way to approach error analysis. Using open and axial coding, I can narrow the gap between AI system failures and human understanding through a step-by-step process. I especially appreciate that this framework is grounded in social science research practices like coding data and building taxonomies—and that it emphasizes doing the analysis manually to ensure accuracy, rather than offloading it entirely to AI.

I also see the value in wearing both the data scientist and product manager hats: questioning the data rigorously while bringing product knowledge into the decision-making. This approach gives me a structured, repeatable way to analyze failures instead of ad hoc trial and error.

Juan Maturino

Software Engineer at Edua

I stopped endless prompting and now systematically document failures to improve outcomes and efficiency.

In automated agentic code generation, I often ran into situations where the desired output was far from what the model produced. My old approach was to keep prompting the LLM until progress stalled, then spin up a new chat with a rephrased prompt and updated context. Eventually I’d accept whatever was “good enough” and finish the task myself.

After this course, I understand why that approach was limited. Evaluating code has two axes: reference-based (objective tests like unit tests) and reference-free (qualitative measures of style, readability, and design). Code isn’t just functional—it’s also expressive, like writing prose—so both dimensions matter.

Now, instead of endless prompt tweaking, I document failures in short form through open coding, then group and categorize them using axial coding. This helps me identify common failure patterns in the LLM’s output and design more robust system prompts targeted at those issues. What used to be trial-and-error guesswork is now a structured process for improving both the reliability and quality of generated code.

Chris McDonald

AI Team Leader at Comtrac

I now have the clarity and confidence to diagnose failures instead of ‘living on a prayer’.

At work, we use prompts and prompt engineering to turn selected inputs into specific outputs. Before this course, whenever I ran into unexpected results, my approach was to jump straight into the prompt and randomly change words until something worked. After a few tries, I might even hand the prompt, input, and output to an LLM and ask it to fix things. There was no hypothesis, no structure—just living on a prayer.

After this course, I have a far more systematic approach. If I encounter a problem now, I’d begin by collecting an initial dataset of around 100 traces. From there, I’d perform open and axial coding to build a taxonomy of failures. That structure gives me clarity about what’s really going wrong instead of just chasing random fixes.

What stands out to me is that the processes in this course are simple—not in the sense of easy, but in being concise and straightforward while still requiring real effort and understanding. As Richard Feynman said, “if you can explain something in simple terms, you understand it well.” That’s exactly how Hamel and Shreya have designed this course, and I’m grateful for it.

Roey Ben Chaim

Staff Engineer at Zenity

I can now pinpoint agents' core failures, turning vague vibes into clear, actionable fixes that improve agent performance.

The axial coding just hit different. Before this course, my approach to failures was more of a “vibe investigation,” poking around without a clear structure.

After this course, I now cluster failures systematically and trace them back to their core issues. Quantifying the errors into meaningful groups makes it much easier to see the main failure points. I finally feel like I have a proper way to identify the root problems in my agent instead of just guessing.

Ben Eyal

Research Engineer at Ai2 Israel

I gained clarity to find root causes and stop repeated agent confusion.

At work, we’re building Paper Finder, which (as the name suggests) should find papers. We wanted the agent to refuse certain requests so people wouldn’t treat it like a free ChatGPT. But we kept running into a strange behavior: the agent would refuse, ask the user a clarifying question, the user would reply “yes,” and then the agent would have no idea what they were talking about.

Before this course, we would have just dug through the logs, checked for crashes, and treated it like any other bug.

After this course, I’d handle it differently. I’d look closely at the traces of these failures, identify common patterns, form a hypothesis about why it was happening, and then test it systematically. In this case, the real issue was that history wasn’t being shared between two components: one asked the question, the other just saw “yes” with no context. By approaching it through error analysis, the root cause becomes clearer and easier to solve.

Annu Augustine

Founder, Product Coach at NedRock

Open coding gave me clarity into the model's real behavior, revealing failures my framework missed.

When I built a custom GPT for product managers to help write better user stories, I initially jumped straight into axial coding. I predefined categories of failure based on the INVEST framework (Independent, Negotiable, Valuable, Estimable, Small, Testable), which I often use when coaching teams. At the time, it felt like a solid, practical approach grounded in real-world product work.

After this course, I started applying open coding before forcing outputs into predefined boxes. That shift revealed patterns the INVEST framework would have completely missed. For example, some stories were overly complex even though they technically met the “Small” criteria, and others ignored edge cases or real-world exceptions not covered by INVEST at all.

Open coding gave me a clearer picture of how the model was actually behaving, rather than bending its outputs to fit categories I had assumed upfront. It’s a far more reliable way to uncover the real failure modes.


Get 25% off our next cohort!

Enroll Here


Jack Shaw

Co-Founder at Comprendo

Open coding gave us clarity on true error patterns, preventing overconfidence and costly misclassification.

Before this course, I didn’t fully appreciate the risk of skipping open coding. It’s easy to take a small sample, jump straight into categories, and gain false confidence in themes that don’t actually reflect the full range of errors. That’s the “when you only have a hammer, every problem looks like a nail” trap—imposing categories that miss important failure modes.

After this course, I see why open coding matters. It prevents premature categorization, helps me understand saturation, and surfaces the true diversity of errors. I’ve also learned to think more carefully about how evaluation rubrics should be designed. For some products, a “benevolent dictator” works—if one person truly has holistic expertise across every stage of the workflow. But for more complex systems, multiple experts are needed, each contributing perspective from their domain.

In my past work reviewing clinical trial protocols, no single reviewer understood every dimension—ethics, study design, and biostatistics each required deep, specialized expertise. The lesson from this course is clear: open coding reveals the real error space, and evaluation rubrics are strongest when designed with the right balance of expertise.

Richard Ng

Product Manager, Analytics & AI at Axi

Sanity checks turned unreliable scores into business-aligned predictions I could trust.

I built a churn prediction model for a subscription service and evaluated it using standard metrics like accuracy, precision, and recall on a test dataset. At first, the high evaluator scores looked promising, but they gave me a false sense of confidence. In reality, the model was overfitting, producing outputs that didn’t even add up logically—for example, reporting fewer new onboarded customers than the combined total of retained and churned customers.

Before this course, I relied too heavily on evaluator scores, only realizing something was wrong when results felt “too good to be true.” I had to manually compare predictions with business reports and historical trends to uncover the discrepancies.

After this course, I know how to approach it differently. I would run cross-validation across multiple folds to confirm stability, add domain-specific sanity checks (like validating customer balances against business logic), and bring in qualitative stakeholder input. These practices create a stronger evaluation process—less dependent on raw metrics and more aligned with real-world trustworthiness.

Puja Nanda
@pujananda_
Just wrapped up Evaluating AI Systems by @HamelHusain & @sh_reya - practical & production-focused for devs, PMs & product leaders. Key takeaways: start w/ well-scoped prompts, own the annotation process, and keep evals specific + monitored. Course link: maven.com/parlance-labs/evals?promoCode=testimonial-c2-81&ajs_uid=569133
Anja Buckley
@anjabuckley1984
I just finished the Maven course AI Evals For Engineers & PMs. The mix of live sessions, office hours, and recorded material was perfectly balanced. The expertise of @sh_reya and @HamelHusain was truly impressive. x.com/anjabuckley1984/status/1956645831794115003/photo/1
George Job Vetticaden

VP of Products, AI Agents

I enrolled in Hamel H. and Shreya Shankar's course "AI Evals For Engineers & PMs" (lnkd.in/ghh3Yk3e), and two weeks in, it's already changing how I approach building agents. The timing couldn't be better. Kevin Weil (OpenAI's CPO) recently said: "Writing Evals is going to become a core skill for product managers." He's absolutely right—but here's what I discovered: there are tons of content on agent architectures and basic evals, but almost nothing on evaluating real multi-agent systems. That gap is what drove me to immediately apply the course learnings to my multi-agent health system. There have been so many insights and aha moments over these last two weeks that I wanted to share them with my community. The course taught me to build a "vocabulary of failure"—but what really clicked was creating a complete loop where everything connects. When you spot an issue, an Evaluation Agent helps you turn it into a test case. LLM-as-Judge doesn't just score it—it diagnoses root causes and prescribes exact fixes. Claude Code takes those prescriptions and automatically implements them. Then the same test runs in your CI/CD pipeline to verify the fix and prevent regressions. Every step feeds the next, turning evaluation from a one-time check into a continuous improvement engine. I've captured everything—from the methodology and tools to real examples showing how a failed health query becomes an automated fix. 🎥 Watch the technical demo (attached) 📝 Read the deep dive blog: lnkd.in/g5ArvB3V 💻 Explore the code: lnkd.in/gGsNu7pZ #AIEvals #MultiAgentSystems #AIAgents #ClaudeCode #ProductManagement #Anthropic #DeveloperTools #AgenticDevelopment
Kranthi Kiran

AI Engineer, Vantager

Practical techniques that generalize regardless of the tools you use.

This course provides a great take on building reliable AI applications. It teaches practical techniques while developing intuition for evaluating LLM-based systems. What sets it apart is its tool-agnostic approach. Rather than focusing on specific platforms, it emphasizes systematic and scientific principles that apply everywhere.
Marek Šuppa

Principal Data/AI Scientist/Engineer, Slido/Cisco

This course teaches material you can't find anywhere else. Investing in this course is a no brainer.

"Why would a Principal Data Scientist take a course on evals? Shouldn't they know this already?!" Fair question. Here's why I think it's still worth it: 1. Learn from the best. LLM evals are still nascent, so learning from people doing this full-time across multiple contexts is invaluable. Game recognizes game, and as you'll learn in the very first week, Shreya and Hamel are top-tier. 2. Get the full picture. Evals are more art than science right now. Getting a coherent view of best practices and mature end-to-end pipelines designed from first principles is rare. Their course reader alone is worth multiple times the price. 3. Build common vocabulary. If you're building impactful LLM products, you'll collaborate with PMs. Having both technical folks and PMs in sessions creates a shared language that bridges the gap -- something you can't find anywhere else for this topic. In other words, whether you're a PM, a Principal or a vibe coder building with LLMs, this course is simply a no-brainer.

Rich Heimann

Director of AI

This course helps you transform guesswork into actionable insights.

Evaluating generative AI often relies on abstract benchmarks disconnected from real-world outcomes and detached from practical experience. To bridge this gap, many rely on subjective impressions or "vibes" (i.e., the eye-test). The eye-test is important since evaluators directly interact with the model in realistic contexts. However, vibes and qualitative evaluations are not particularly helpful in evaluating application-specific performance, consistency, bias, reliability, security, or return on investment. In contrast, application-specific evals reflect an essential day-to-day operational focus. They aim to assess whether a specific pipeline performs a particular task successfully using realistic data. This course is an important step toward transforming guesswork into actionable insights. Application-specific evals are not sufficient, but they are necessary, and so often overlooked. Check out this course. It covers a lot of important terrain.


Get 25% off our next cohort!

Enroll Here


Jonathan Sarker

Machine Learning Engineer

Highly recommend this course.

Hamel's and Shreya's course, "AI Evals for Engineers and PMs," has been a great resource for learning how to tame the scary wild world of LLM-based applications. It was fascinating to see how high-leverage it is to become "one with the data" and how to do that explicitly for LLMs and agents. Hamel's and Shreya's vast combined expertise is clearly shown in the course's lectures, practical exercises and even a textbook. On top of that, the guest lectures provide even more gems of practical wisdom. I'd highly recommend this course to anyone serious about learning how to improve their LLM-based AI products.
Ajaykumar Rajasekharan

Senior Director of Machine Learning, SponsorUnited

An Absolute must. Valuable for any AI engineer and product manager.

The AI Evals course by Shreya and Hamel is an absolute must for everyone serious about taking AI applications into production. I have been following Hamel's and Shreya's work for quite some time, and it was really awesome to learn from them all the key concepts: error analysis, measurement best practices, LLM-as-Judge and how to make sure it is reliable with human evaluations, collaborative analysis of errors, evaluation of multi-turn chats, creation of datasets for CI/CD, etc. The last topic, on accuracy and cost optimization, is really useful, as we are seeing in our applications when scaling. All in all, this is an amazing set of vital information that is valuable for any AI engineer and product manager. Highly recommend this course to everyone.

Alan Chang

AI, Machine Learning, and Biology @ Stanford University

Some quick reflections after wrapping up the quite enjoyable AI Evals for Engineers and Technical PMs course led by Hamel H. and Shreya Shankar. This is the second course co-taught by Hamel that I've taken, following the LLM Finetuning course that became an amazing mini-conference which expanded far beyond finetuning. Course highlights: 1. Immediately actionable advice: the strategy and demonstration of open coding approaches for evals, how to create a custom human eval interface and why, in addition to great case studies of AI evals in action. Shout out to the course reader, which is a great companion to the course and a solid reference moving forward. 2. The material was very tightly focused - Hamel and Shreya did a great job of distilling key concepts and approaches, intentionally introducing only a few frameworks. 3. Evolving homework assignments (even though I regrettably was too busy during the course to engage with the homework as I would have liked) that paralleled the main material in a very complementary manner. I did get a lot of inspiration for how to approach a few personal projects that I have in mind, and for how to approach evals in the many LLM-related projects in the lab that I'm a part of. 4. Guest speakers that covered a broad array of additional topics of interest / specific use cases. I particularly enjoyed the talk on Reasoning Models & LLM-as-a-Judge with Alex Volkov. You'll be able to find the course on Maven at lnkd.in/gSPqf_Bk - you'll probably get more out of it if you are actively working on an LLM-based product or project that already has many generated outputs (or can get them soon).
Maxime Lelièvre

Visiting Researcher @ Columbia | MSc Robotics & Data Science @ EPFL | Passionate about EdTech

Huge thanks to Hamel H. and Shreya Shankar for their "AI Evals for Engineers and PMs" course! I really enjoyed the hands-on approach, particularly the assignments and guest lectures, which provided invaluable real-world application scenarios. A key takeaway for me was the importance of thoughtful evaluation frameworks, especially as AI models become commoditized and specialized for specific domains. I really liked their "Analyze-Measure-Improve" eval loop concept and their emphasis on keeping the human in the loop, especially domain experts. As someone focused on education in Low and Middle-Income Countries (LMICs), I'm genuinely excited by the opportunities AI presents, while also being very aware of the inherent risks. This course truly emphasized how robust evaluations are crucial for navigating both the immense potential and the significant challenges. With models improving in performance and decreasing in cost (as we've seen with our pedagogy benchmark: lnkd.in/dtiUxVWM), I believe that new opportunities are opening up rapidly. I am thinking about how complex pipelines like RAG could be prohibitively expensive to implement and evaluate. I'm currently working on an AI chatbot to support teachers' professional development in Sierra Leone, and learning these evaluation strategies is something that could really make it better and ensure responsible deployment. Highly recommend this course to anyone building or working with AI products! lnkd.in/dgk-Hcv9
Šimon Podhajský

Senior LLM/Data Engineer | 🤖 Evals, ML/AI, Python

In the spirit of learning in public, here's my paean to Maven.com. Two recent sprints really sharpened my evals game: - @Jason Liu's "Systematically Improving RAG Applications". - Shreya Shankar's and Hamel H.'s "AI Evals". Both month-long courses were chock-full of insight, research and experience. (Not to mention the extra guest lectures, all meticulously chosen to add to the main learning battery, that sometimes came by every. other. day. Sometimes in inconvenient timezones. I'm still in the process of catching up.) What did we learn? The thing is, Jason, Shreya and Hamel aren't exactly secretive about the key insights: if anything, they shout "error analysis!" and "look at your data!" and "construct eval funnels!" from the rooftops. It's still really rather useful to wrap the slogans into a specific happy path, note the edge cases, and integrate them with insights from other practitioners and prior research. (For example, I'd never heard of grounded theory before, but it's an excellent fallback when theories and taxonomies just aren't jumping out at you from your chaotic annotations.) Oh, and you get an early sneak peek at Shreya's and Hamel's book that ties it all together! All in all: highly recommended.
Bryce York

AI/ML/LLM and UX-centric B2B startup product management leader with a love for zero-to-one product innovation • 12+ yrs PM, 7+ yrs AI/ML, 5+ yrs adtech • Writer/Speaker • Advisor

If you haven't already heard about Shreya Shankar & Hamel Husain's Maven course on evals and you have any plan to build in the LLM space, you're missing out. I'm about to finish the course and I couldn't recommend it more highly. I can't think of a better way for an engineer or technical PM to spend their L&D allowance. It had already paid for itself in the first week and provides a step-by-step playbook to build and continuously improve LLM-powered products. It's given me and my teams a structured process to follow. Even though we were doing a lot of the same things, I had to constantly keep us on track without a framework to point to. Their process is going to become the de facto standard, and what better way to learn and master it than in a live cohort class? Their upcoming book will no doubt be on all of our bookshelves, but if you have the $$ and want to get in early - do it!
ess d

Senior Data Scientist @Amazon

Taking the course "Move beyond 'vibe-checks' to data-driven evaluation" has been a game-changer in how I approach AI development and evaluation. Before this, my team often relied on subjective assessments—what we called "vibe-checks"—to judge model outputs. This course provided a structured, systematic framework that replaces guesswork with measurable, repeatable methods for evaluating AI performance. I learned how to build robust evaluation systems tailored to the unique challenges of AI applications—especially those involving stochastic or subjective outputs. The curriculum walked us through defining meaningful metrics, conducting systematic error analysis, generating synthetic data, and implementing automated evaluation pipelines. I also gained hands-on experience evaluating complex architectures like RAG systems and multi-step pipelines, and I now understand how to monitor models in production with continuous feedback loops. The biggest takeaway is that effective AI evaluation isn't just about measuring accuracy—it's about building a comprehensive lifecycle strategy that spans development, deployment, and ongoing improvement. Learning how to prioritize engineering efforts based on real data, rather than hunches, has already helped us optimize both performance and cost in our LLM applications. Would I recommend it? Absolutely. Whether you're an AI engineer, product manager, or data scientist, this course gives you practical tools and frameworks that apply directly to real-world AI challenges. It's especially valuable for teams looking to move beyond ad-hoc evaluations and establish a collaborative, metrics-driven culture around AI development. In short, this course has transformed how I think about evaluation—from a fuzzy afterthought to a foundational part of the AI lifecycle. I can't recommend it highly enough.




Mark Manolas

Account Director, OpenAI

This course exceeded my expectations.

The course offers hands-on exercises, expert guidance, and practical frameworks, giving you a systematic approach you can apply immediately. I could easily follow along without an engineering background. The biggest takeaway for me was the level of robustness/sophistication that you can build for evals and the impact that it can have! I had very surface level knowledge of evals before this course. The course exceeded my expectations and I would recommend this to my colleagues.
K Vikraman

Data Scientist , Global Innovation Hub

I learned how to evaluate LLM outputs in a structured manner. My biggest takeaway is that the quality of the eval pipeline is critical to ensuring good-quality outputs in the production environment. I will definitely recommend this course to my colleagues and to anyone who is deploying LLMs.
Barbara Graniello Batlle

Self

The AI Evals for Engineers & PMs course, taught by Hamel Husain and Shreya Shankar, is 100% worth the investment. The material couldn't be more up-to-date. There is a perfect balance between theory and real-world hands-on learning. And the office hours and guest speakers are invaluable. I highly recommend it.
Piotr W.

Head of QA | Lead Automation Engineer | AI Evaluation Engineer

I recently took an LLM Evals course led by Hamel H. and Shreya Shankar—and huge thanks to both of them for putting together such a practical experience. 👏 For some time I've been redefining, adjusting, and evolving the role of the QA Engineer to make it better prepared for AI-related projects. It felt natural for us to step into the role of Evaluation Engineer, as it has many similarities to a QA role. One of the biggest takeaways for me was seeing how to build automated eval setups from scratch without relying on external platforms. That alone was a big eye-opener—it turns out a lot of this can be done in-house if you know what you're doing. The sessions on synthetic data creation were also really great. We're already planning to write up internal guidelines based on what we learned and use them in upcoming projects. An internal tool to automate the process might even be a good idea. Overall, this course connected a lot of dots for me. It showed me how QA skills can evolve into something super valuable in the AI space—especially when it comes to designing evaluation frameworks that actually make sense and bring value. 🚀 lnkd.in/dzrmesqM
user avatar
Vivek
@feynwarrwen
Learned a ton of useful stuff about the Evals in the course with @HamelHusain and @sh_reya. Before this course, I would try to change the prompt, change the model, or improve the RAG pipeline - all w/o systematic error analysis (my biggest learning). Thx @HamelHusain @sh_reya!
1
Deep Gandhi
@deepgandhi91
Look at data with different lenses! A very practical approach in the ‘AI Evaluations’ class by @sh_reya & @HamelHusain. Learned hands-on techniques to deeply analyze and improve AI solutions, new ways to spot failure modes I didn’t know before. Putting it all into practice now.
1
Sergiy Korniychuk

Staff Software Engineer - Full Stack at Sondermind Inc

Recently I took part in a course called "AI Evals For Engineers & PMs" led by Hamel H. and Shreya Shankar. It provided a solid foundation for assessing systems that use AI in various products. My biggest takeaway is the brilliant "three gulfs" mental model: the gulfs of comprehension, specification, and generalization. This model completely reframed how I approach understanding and improving LLM applications. It's an incredibly powerful and practical way to break down where things might be going wrong. First, you see if your LLM even understands what users want. Next, you check if your prompts are clear and specific. Finally, you see if your system can handle new or unexpected cases. It's simple but powerful, and something I plan to apply to every new feature and prompt I work on. Another thing that truly stood out for me is how important trace-based analysis is. Whether you're working with real user data or building synthetic examples, looking at traces is the only way to spot real problems. That's where the real insights come from, and it stops you from wasting time on things that don't matter. This has fundamentally shifted my focus to solving upstream problems. I also really like the idea of starting with code-based evaluators before using LLM-as-Judge - for example, checking whether a response is in the correct JSON format is easy to evaluate in code. This approach saves the more complex, subjective evaluations for later and instead focuses on automating the easy wins first. It's about being pragmatic and effective, not just measuring for measurement's sake. Practically, this course has directly impacted how I operate as an engineer. I now see prompts as carefully crafted specifications and understand that the evaluation process is an iterative, continuous cycle.
I'm more convinced now about looking at failure modes first, collaborating with domain experts to define ground truth, and ensuring our evals are simple, durable, and tied directly to customer outcomes. It's no longer 'prompt and forget,' but 'prompt, measure, and refine.' Overall, the course content was exceptional, offering a rare balance of theory and practical application. The instructors were incredibly experienced, and their emphasis on real-world scenarios made everything immediately applicable. They also had a lot of guest speakers with real-life experience - that was really helpful. This has been one of the most beneficial and enriching learning experiences I've had, and I highly recommend it to anyone serious about building robust AI products!
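The code-first check described in the review above, validating that a response is well-formed JSON before any LLM-as-Judge runs, can be sketched in a few lines of Python. The function name and the required keys below are illustrative assumptions, not material from the course:

```python
import json

def json_format_evaluator(response: str, required_keys=("answer", "sources")) -> bool:
    """Cheap code-based evaluator: passes only if the response is valid JSON
    with every required top-level key present."""
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(key in parsed for key in required_keys)

# Checks like this can run on every trace, reserving the slower and more
# subjective LLM-as-Judge evaluations for outputs that already pass them.
print(json_format_evaluator('{"answer": "42", "sources": []}'))  # True
print(json_format_evaluator('answer: 42'))                       # False
```

The point of starting here is that such evaluators are deterministic and essentially free, so they can gate every trace before any judge model is invoked.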




Tsacho Rabchev

Software Developer / Inventor

Just completed "AI Evals for Engineers & PMs" by Hamel H. and Shreya Shankar, and it was one of the most thoughtfully crafted technical courses I’ve taken! Hamel brings a rare blend of depth, clarity, meticulous attention to detail, and patience. His explanations cut through complexity without ever oversimplifying, and his thoughtful approach reflects a deep commitment to helping others truly understand. He’s a natural teacher. What stood out just as much was the uncompromising focus on quality — both Hamel and Shreya clearly care about delivering a meaningful, well-structured learning experience. Their collaboration created a space that was not only technically rich but also engaging and supportive. If you’re working with LLMs and want a solid, reproducible framework for evaluation — this course delivers. If you're working with LLMs and want a rigorous, hands-on approach to evaluation, I highly recommend checking out the course: lnkd.in/g_yGUxFK
Desmond Choy
@Norest
Halfway through @HamelHusain & @sh_reya’s #AIEvals course and it’s already making me rethink how I approach my LLM projects and conduct error analysis. The course has a very high signal-to-noise ratio. Highly recommended!
Benjamin Pace

Data Science @ Candidly

Had a great time taking Hamel H. and Shreya Shankar's course on AI Evals. Out of everything I've seen so far in the space, their approach is among the best and most concrete in terms of actually improving AI products. If you are actively working on AI products / features, I strongly suggest checking it out! AI Evals For Engineers & PMs: lnkd.in/gvgnkYSn
Joel Dean

AI Founder. Tech Entrepreneur. Content Creator. Prime Minister Youth Awardee. Former WEF Global Shaper Curator

🚀 Course Review: “AI Evals for Engineers & PMs” by Shreya & Hamel Just wrapped up this powerhouse training and I’m genuinely impressed. Before the course, LLM pipelines often felt like opaque black boxes especially once you add multimodal steps. Debugging failures across retrieval, reasoning, and generation was equal parts art and guess-work. What changed for me? 1. Error analysis as a first-class habit Learned a systematic “open coding → axial coding” workflow that surfaces the first upstream failure fast and keeps me focused on the highest-leverage fixes. 2. LLM-as-Judge, done right We went deep on calibrating true-positive / true-negative rates, designing binary metrics that actually move product KPIs, and even building cheap → expensive model cascades to keep eval costs sane. 3. Synthetic data that matters Instead of random tuple generation, we start with failure hypotheses, then let (feature, persona, scenario) dimensions explode exactly where risk is highest. 4. Production monitoring playbook Continuous eval pipelines, concept-drift alerts, and lightweight human in the loop dashboards critical for enterprise deployments where silent failures mean silent churn. The net result? I can now prove that improvements ship, spot regressions before users do, and sleep better knowing my AI products are measurable, debuggable, and trustworthy. If you’re building or PM-ing production-grade AI systems especially those serving high-stakes, enterprise use cases do yourself (and your users) a favour and take this course (lnkd.in/ecR5pGmk). Huge thanks to Shreya Shankar and Hamel H. for distilling years of hard-won lessons into an engaging, hands-on curriculum. 🙌
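The (feature, persona, scenario) expansion mentioned in the review above can be illustrated with a short sketch. The dimension values here are invented examples; in practice each would come from a failure hypothesis, and each resulting tuple would seed an LLM-generated test query:

```python
from itertools import product

# Invented dimension values for illustration only.
features = ["order lookup", "refund request"]
personas = ["new customer", "frustrated power user"]
scenarios = ["ambiguous phrasing", "missing account details", "multiple requests at once"]

def expand_test_tuples(features, personas, scenarios):
    """Cross the dimensions so synthetic coverage lands exactly where the
    hypothesized risk is, instead of sampling random tuples."""
    return [
        {"feature": f, "persona": p, "scenario": s}
        for f, p, s in product(features, personas, scenarios)
    ]

tuples = expand_test_tuples(features, personas, scenarios)
print(len(tuples))  # 2 * 2 * 3 = 12 test-case seeds
```

Growing any one dimension multiplies coverage only along the axes you believe are risky, which is what makes this cheaper than collecting an equivalent number of hand-written test cases.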
Kenneth Reeser

Senior AI/ML Architect at Vanguard

Just wrapped up an incredible course on AI Evaluations for Engineers and Product Managers taught by Hamel H. and Shreya Shankar, and I can’t recommend it enough. As a Senior AI Architect, I walked away with a deeper appreciation for the art of prompt engineering. This course reminded me that crafting effective prompts isn’t just about syntax—it’s about creativity, experimentation, and nuance. One of the biggest challenges in AI development is building labeled datasets for evaluation. It’s time-consuming, often inconsistent, and can be prohibitively expensive—especially when scaling fine-tuned models. This course introduced practical strategies to start small and scale smart, without the heavy overhead. What really stood out was how the course demonstrated the power of reusability. With the right tools, you can reuse datasets, evaluation prompts, application prompts, and even fine-tuned models—dramatically cutting costs and accelerating development. If you're working in AI/ML and looking to build more efficient, scalable, and cost-effective solutions, this course is a must. AI Evals for Engineers & PMs lnkd.in/eZg39xhD Next cohort begins July 21.
Joshua Pittman

Prompt Engineer at Outlever

Just finished the AI Evals course by Hamel H. and Shreya Shankar - solid training on how to properly test and measure AI applications. They're doing another cohort soon: lnkd.in/dM9Ye55H The course does a great job connecting theory with real-world practice. Their Three Gulfs model helped me understand where things typically break in LLM workflows. The biggest takeaway was learning how to build proper failure taxonomies using coding techniques. I was making the common mistake of jumping straight to LLM-as-a-judge without first figuring out what I actually needed to measure. The sections on calibrating LLM judges with TPR/TNR methods were exactly the technical content I was looking for. The programming assignments were really valuable, and the guest speakers were great. If you're working with LLMs in production, this course is worth it.
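The TPR/TNR calibration referred to above boils down to comparing a judge's verdicts against human labels on the same traces. A minimal sketch, with made-up labels and a hypothetical helper name:

```python
def judge_agreement(judge_labels, human_labels):
    """True-positive rate: how often the judge passes traces humans passed.
    True-negative rate: how often it fails traces humans failed."""
    tp = sum(1 for j, h in zip(judge_labels, human_labels) if j and h)
    tn = sum(1 for j, h in zip(judge_labels, human_labels) if not j and not h)
    positives = sum(human_labels)
    negatives = len(human_labels) - positives
    return {
        "tpr": tp / positives if positives else None,
        "tnr": tn / negatives if negatives else None,
    }

human = [1, 1, 1, 0, 0, 1, 0, 1]  # 1 = human annotator says "pass"
judge = [1, 1, 0, 0, 1, 1, 0, 1]  # LLM judge's verdict on the same traces
print(judge_agreement(judge, human))  # tpr 0.8, tnr roughly 0.67
```

A judge whose TPR or TNR is low on held-out human labels needs its prompt revised before its scores can be trusted as a product metric.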
Alpa Dedhia

Engineering Lead | Solutions Architect | AWS | Applied Generative AI Solutions | Search & Relevance | CSM | CSPO

⚓️ Why So Many Agentic AI Projects Are Getting Abandoned In the past year, I've seen a wave of agent-based products and prototypes—many promising, some even well-architected—quietly get shelved. The most common explanation? "It doesn't work well." But here's the truth: it's not that these systems can't work. It's that most teams never defined what "working well" even means. Without a robust evaluation strategy, you're flying blind. You don't know if the agent improved, regressed, or just hallucinated in a new and exciting way. Shipping an agent without a way to evaluate it is like launching software with no tests. And yet, that's what most of us are doing. We need task-level evals, regression testing for LLM outputs, structured error analysis, and a feedback loop grounded in data—not vibes. I recently took the AI Evals for Engineers & PMs course by Shreya Shankar & Hamel H. and it completely changed how I think about building LLM systems. It gave me a concrete framework to move from anecdotal debugging to systematic evaluation. I now treat evals as code, build judgment pipelines I can trust, and iterate with measurable confidence. Highly recommend it for anyone serious about building reliable, production-grade AI systems. #AI #GenerativeAI #LLM #AgenticAI #MLOps #AIEngineering #EvaluationFirst #MultiAgentSystems #LLMOps #AIEvaluation #LLMTesting #MavenCourses #StartupEngineering #AIProductDevelopment




Geoffrey Pidcock

AI and Strategy at ANSTO | MBA Melbourne Business School | Ex Atlassian

It's been a blast learning AI evals from Shreya Shankar and Hamel H. these last 4 weeks. Everybody who writes prompts - for API calls, for CoPilot agents, for ChatGPT projects - can benefit from their solid instruction backed by theory and grounded in practice. A new cohort starts soon, with details linked below (along with a 35% discount code). I've summarised my take-homes in the image below, which is the approach I now take to writing and tuning the ~20 prompts I use in my work and personal projects. I won't miss the extremely early starts though 😅 (New York and Sydney have a challenging overlap!) bit.ly/evals-ai
Rasool Shaik

GenAI | Embedded/IoT | Medical Devices -Helping organizations to build and validate products

I am attending the AI Evals course by Hamel H. and Shreya Shankar and can't recommend it enough! 👉 lnkd.in/gHQusTCG This course gave me a solid foundation in designing evaluation strategies for different types of GenAI applications. Some of the most valuable lessons for me: how to perform effective error analysis and pinpoint areas where AI systems are falling short; building custom annotation tools, a completely new skill for me and one that adds a lot of control to the evaluation process; crafting evaluation approaches tailored to different model architectures; and techniques for scaling evaluations efficiently while keeping costs manageable. What made the course stand out was the strong emphasis on hands-on learning: real exercises, live discussions, and opportunities to apply concepts to real-world problems. It was a well-balanced mix of theory and practice. I'm looking forward to digging into the homework next and continuing to apply what I've learned. Highly recommend this course to anyone working with GenAI products or systems.
Uday Ramesh Phalak

Machine Learning Engineer | RecSys, AI Evals, GenAI, Climate Change, UX | Co-Founder at HazAdapt
The most dangerous bugs in your AI product aren't the ones that cause it to crash. They're the ones where the AI looks like it's working perfectly, but it's confidently wrong. That's why we need AI evals. Just wrapped up Shreya Shankar and Hamel H.'s "AI Evals For Engineers & PMs" course. Honestly wasn't sure what to expect going in, but it ended up being way more practical than I thought. The biggest thing I took away was this framework they call the "Three Gulfs" - basically that every AI product struggles with three core problems:
- Can we actually understand what our data/pipelines are doing? (Gulf of Comprehension)
- Are we being clear enough in our prompts? (Gulf of Specification)
- Will this work on new data we haven't seen? (Gulf of Generalization)
Most teams try to tackle all three at once and just end up spinning their wheels. The approach presented by Shreya and Hamel is more systematic: they break it down into an "Analyze-Measure-Improve" cycle. The pitfalls they covered hit close to home. We outsourced our error analysis and wondered why our evals were garbage - turns out you need to actually look at the failures yourself. Also, we kept testing judge models on the same examples we used to build them; obviously they looked great until they hit production. A few reasons you might want to check this out:
- If you're shipping AI features and crossing your fingers they work, this gives you actual ways to know what's happening
- The cost optimization stuff alone could pay for the course (they showed examples of 60%+ savings)
- Honestly, knowing how to do evals properly is becoming pretty essential if you want to work on AI products seriously
There is another cohort starting soon if anyone's interested (link in comments). Worth it if you're building anything with LLMs in production.
attached image
Ankur Bhatt

AI Engineering | Product and Technology Leader | CTO
Just completed AI Evals for Engineers & PMs — an outstanding experience in applying rigor and clarity to AI systems. My biggest takeaway: evaluation is not a one-time task — it’s a continuous learning process. Designing automated evaluators (code-based or LLM-as-judge) isn’t just about checks; it’s about building a feedback loop that helps teams and systems grow, improve, and deliver impact. That’s what true product leadership looks like. The frameworks, real-world examples, and collaborative discussions in this course have inspired me to raise the bar on AI product quality. Don’t miss their final live cohort starting July 21st: lnkd.in/g-DY74tC Big thanks to Hamel H. and Shreya Shankar for creating a learning environment that fosters both technical excellence and leadership growth!
attached image
Robb Winkle

Fire your systems integrator. Conversational AI-native systems expert for ERP (Techstars '24)
🚨 Just wrapped the AI Evals for Engineers & PMs course by Hamel H. and Shreya Shankar, and… WOW. 🔥 This wasn’t just another AI course — it was a firehose of practical knowledge, packed with insights I could put to work immediately. Even though it was intense, the format lets you take what you need now and revisit deeper topics as they become relevant. Biggest takeaway: Doing real error analysis and looking at your data is harder and more time-consuming than you’d think — but the payoff is massive. You build intuition about where your product fails, which lets you design systematic, automated evals that outperform off-the-shelf tools. It’s the difference between flying blind and instrument-level visibility into your product quality. Impact on our team: It filled a crucial gap in how we approach evals. One example: the 2-step process for generating synthetic user queries with an LLM — a structured method that’s far better than jumping straight into query generation. Huge quality boost. If you’re building AI-powered products, get in on the next one: 👉 lnkd.in/ebBtGKSE It might be the last live run of the course — don’t miss it. Massive thanks to Hamel and Shreya for sharing a framework that truly shifts how you think about evaluation. 👏👏
attached image
Harris Brown

Fractional AI product leader for early-stage founders & teams | ex-Airbnb
It's the final week of the Maven "AI Evals For Engineers & PMs" course. 🔥🔥🔥 Grateful for the time and effort Hamel and Shreya Shankar have put in to craft an amazing course. Just the right mix of practical and theoretical. 10/10 recommend. So many learnings that I'm excited to share, but one insight that I can't stop thinking about: 𝘆𝗼𝘂 𝗰𝗮𝗻𝗻𝗼𝘁 𝗼𝘂𝘁𝘀𝗼𝘂𝗿𝗰𝗲 𝘆𝗼𝘂𝗿 𝗲𝘃𝗮𝗹 𝘁𝗵𝗶𝗻𝗸𝗶𝗻𝗴 𝗮𝘀 𝗮 𝗰𝗼𝗺𝗽𝗮𝗻𝘆. As tempting as this is to do - just have an LLM do it, right? - it is so clearly an AI product-building failure mode. The companies, and product teams, that take evals seriously (and from the start) will be the ones that succeed. An added bonus to taking this course -- we've been able to implement many of the learnings directly with Native Studios clients, which is pretty cool.
Peter Cardwell

Software Engineer at Snap, Inc.
Hi all! I wanted to give a quick plug for the AI Evals For Engineers & PMs (lnkd.in/g6QBKJMs) course taught by Hamel H. and Shreya Shankar that I've been taking over the last few weeks. During my career break I’ve been operating as a solo dev on my own projects, wearing many hats I'm not used to—from UX design to mobile development to product management. On the backend, what I've found most challenging about building in the era of LLMs isn't engineering any one component, but putting a process - a product evaluation - in place that enables me to choose the most appropriate techniques and ultimately guide development in the right direction. At a high level, the process is straightforward:
1. Gather data (ideally real production data, or synthesize it if you don't have it)
2. Analyze the data carefully
3. Build a hierarchy of the errors you observe
4. Create automated evaluators for each error type: simple programmatic evaluators when possible, and LLM-as-judge evaluators - aligned to human evaluators - for more nuanced errors
5. Operationalize everything to guide development, catch regressions, and monitor production
Of course, the devil is in the details - it all gets complicated quickly when you start layering in a retrieval pipeline, multi-turn conversations, memory, tool-calls, MCP and so on. I’ve found Hamel's blog posts on this topic (lnkd.in/gqnH6Exa, lnkd.in/gQHH3dJS, lnkd.in/g596NAHF) an incredibly useful roadmap for putting this into practice, so I signed up for this course on a whim—and I'm very happy I did. What I particularly loved:
- Office hours were goldmines of useful discussion. Hearing how other practitioners across the industry are tackling their challenges was incredibly valuable.
- The Discord chat + lecture combo worked perfectly for me. In a traditional in-person lecture, I would be far too introverted to stand up in front of a few hundred other people to ask my confused questions, but somehow I had no such reservations doing the same thing in a chat.
- Hands-on homework assignments made everything click. It's hard to apply these concepts while building new things, so having well-scoped projects designed to walk you through the process end-to-end was illuminating.
- Responsive instructors who adjusted course material based on Discord discussions, plus a clearly written course reader and excellent guest speakers.
I’m excited to put everything I’ve learned into practice, and I highly recommend this course to anyone building LLM-based products in 2025.
Joshua Pittman

Prompt Engineer at Outlever
Just finished the AI Evals course by Hamel H. and Shreya Shankar - solid training on how to properly test and measure AI applications. They're doing another cohort soon: lnkd.in/dM9Ye55H The course does a great job connecting theory with real-world practice. Their Three Gulfs model helped me understand where things typically break in LLM workflows. The biggest takeaway was learning how to build proper failure taxonomies using coding techniques. I was making the common mistake of jumping straight to LLM-as-a-judge without first figuring out what I actually needed to measure. The sections on calibrating LLM judges with TPR/TNR methods were exactly the technical content I was looking for. The programming assignments were really valuable - and the guest speakers were great. If you're working with LLMs in production, this course is worth it.




Gurunath Parasaram

Data Scientist at Tiger Analytics

The most practical AI course I've taken, with immediate value.

As a Data Scientist, I joined “AI Evals for Engineers & PMs” by Hamel and Shreya to get better at evaluating LLM systems in real-world settings, particularly in the context of LLMs being used for a variety of tasks. The course covered everything from fundamentals and error analysis to production monitoring and cost optimization. What stood out was how practical it was—teaching us to define use-case-specific metrics, collaborate with PMs, slice errors meaningfully, and avoid over-relying on automated scores. The lessons and guest talks (like on RAG evals, failure funnels, and continuous human review) felt directly applicable to my work, especially when building retrieval-based bots or monitoring model drift. It’s not a course that spoon-feeds you; it gives solid frameworks and real-world habits to build eval pipelines that actually reflect user experience. If you’re working on production ML or LLM projects and want to move beyond standard metrics, this is worth your time. I’ve already applied a few techniques at work and found them super helpful. I would strongly suggest anyone with an interest in evals to take this course!
Omar Irfan Khan

Dev Team Lead

Highly recommend this course!

The AI Evals course has helped me gain knowledge of how we can truly evaluate LLMs in a meaningful way and actually understand what's going wrong. Hamel and Shreya did a wonderful job explaining how evaluations can be done in a structured manner and what to try. The course guests were an additional bonus, giving their insights on how they carried out evaluations. I would highly recommend this resource to anyone who builds with LLMs and is wondering how to effectively understand why the LLM isn't working the way it should and what is going on behind the scenes!
Muhammad Jarir Kanji

Data Scientist

Amazing instructors.

Hamel and Shreya are amazing instructors, and this course has been a great resource for me in understanding how to build robust, enterprise-grade evals and AI pipelines. What I found particularly useful were the guest lectures, which bring a variety of opinions from industry experts and practitioners on different topics that relate to AI and evals.
effie goenawan

Head of Product, Tavus

Great insights that are shaping how we evaluate AI products.

Taking the AI Evals course with Hamel and Shreya has been really valuable. The course has given me a solid framework that's already shaping how we evaluate our AI products. The homework mirrors real work challenges, and guest speakers bring great insights.

Alex Elting

Software Engineer

I learned how to be truly effective in creating LLM-powered applications

I have a career developing software, and I've been tinkering with LLMs since before ChatGPT. I feel like the practical eval techniques that Shreya and Hamel teach in their course are what I needed to glue these two skills together and become truly effective in creating LLM-powered applications. Developing for LLMs is not like traditional software development, and evals are the big difference.

Yusong Shen

Software Engineer, Google

Comprehensive and practical curriculum

Indispensable for Robust AI Development

The "AI Evals For Engineers & PMs" course provided an indispensable framework for evaluating LLM applications, fundamentally shifting my approach from guesswork to data-driven measurement. My key takeaway is the Analyze-Measure-Improve lifecycle, coupled with the "Three Gulfs" model for pinpointing failure origins. The rigorous methodology for building and validating LLM-as-Judge evaluators—including bias correction and confidence intervals—is a game-changer for trusting subjective evaluations. Hamel Husain and Shreya Shankar are truly experts, delivering a comprehensive and practical curriculum that directly addresses the challenges of building reliable AI in a dynamic environment. This course is a must for anyone serious about improving their AI development process.

Trina Sen

Palette, CPO

A Masterclass in Practical AI Evaluation.

From Benchmark to Moat — A Masterclass in Practical AI Evaluation

This course is at the cutting edge of AI research — and not just in theory. What stood out most to me is how deeply practical it is: it teaches you how to build evals that work for your own product, define product taste by sharpening what "good output" really means, and most importantly, how to scale this method across teams and decisions. The biggest shift for me was reframing evals not as a benchmark to clear, but as a strategic moat—core to how your product learns, evolves, and differentiates. As someone from a non-technical background, I could still grasp the concepts (even if the code got heavy at times). The community around the course is a major bonus—full of helpful discussions, fresh perspectives, and constant knowledge exchange. The guest lectures were especially valuable, showing how companies apply these ideas in the wild, and how they tailor their evaluation frameworks to suit specific needs and constraints. I’d highly recommend this course to anyone building with AI—especially those who want to go beyond shipping models to shaping real-world, high-trust outcomes.

Lukasz Kowalczyk

Soothien HealthTech Advisory

A must for any developer or PM building AI products.

I’m a physician and have built health tech and health AI solutions, but I’m not overly technical. This course was eye-opening about the importance of AI evaluations. It’s a must for any developer or PM building AI for enterprise or regulated industries. This is what will make AI products reliable. Hamel and Shreya are amazing, and so are their top-notch guest lecturers. I took this course because I wanted to learn from the industry leaders actually doing the work. You’ll learn the entire process of building AI evaluations, not just by reading, but also by doing. This is the technical component. Using Windsurf and Claude, I was able to complete it even though I don’t code as part of my main job. It’s well worth the effort. This course is dense, especially if you do not code or have a familiarity with statistics. My background in medicine and healthcare statistics helped me understand some of the core concepts. Overall, this is an amazing course and an essential skill set for building AI healthcare applications or in enterprise settings. I’m recommending it to all my colleagues.




Uday Phalak

Machine Learning Engineer | Co-Founder at HazAdapt

Good course if you want to build products people actually trust.

Coming from recommendation systems and a UX background, I knew certain kinds of evaluation. I'd run some A/B tests, check a few metrics, and call it good. But my approach to AI evals was completely naive. I used no systematic method and hoped things would work. This evals course gave me the structure I was missing. The Three Gulfs framework explained why I kept unknowingly failing: we don't understand our data (Comprehension), we write vague prompts (Specification), and models behave unpredictably on real inputs (Generalization). The analyze-measure-improve cycle felt familiar from UX research but applied to AI. Instead of guessing what's broken, you look at failures first, build automated evaluators, and then make targeted improvements. This creates a flywheel where each cycle makes your product better. Learning from others' LLM production failures was a huge plus of this course - e.g., hearing about VLMs giving different results 18 out of 55 times at temperature 0, and Shreya showing how model cascades cut her costs by 50%. Successful AI products need humans to regularly review outputs. There's no way around it. Good course if you want to build products people actually trust. Evaluation separates demos from deployments.

Srinivasan Krishnamurthy

Technology director - Wells fargo

A fantastic course offering an in-depth practical approach to evals.

Highly recommend this course to anyone building Gen AI products and solutions. The biggest takeaway for me was the methodical and scientific process that the instructors outline for doing model evaluation. It helps build a mental model that I am applying at work to build an eval pipeline for RAG solutions. The course also offers an in-depth and practical approach to understanding how generative AI models are evaluated using rubrics and metrics, which are critical skills for AI engineers. Overall a fantastic course with a lot of learning and value.
Kaname Favier

Founder, Supago Inc.

This course completely transformed my approach to building AI applications.

This course completely transformed how I approach evaluating LLM applications. Before this course, my evaluation processes were informal at best. Now, I've gained a structured, rigorous methodology to identify errors, quantify improvements, and build automated evaluators. The hands-on assignments and deep dives into error analysis were particularly valuable, directly impacting how efficiently I debug and iterate on LLM products. Whether you're an engineer, product manager, or someone working closely with AI systems, this course is essential—highly recommend!
Raymond Weitekamp
@raw_works
i have never been more excited to "look at my data"! @HamelHusain and @sh_reya could have easily charged 10x for their "AI Evals" course. amazing education, all-star guest lectures (i.e. @altryne & @charles_irl & @kwindla & @BEBischof), not to mention the @modal_labs credits...
Sandeep Pawar
@PawarBI
I am finishing one of the best courses I have taken recently - "AI Evals for Engineers and PMs" by @HamelHusain and @sh_reya . I was in two minds before signing up, but given the practical tips shared by industry leaders and practitioners, it's been totally worth it. Building AI x.com/PawarBI/status/1933940567073075526/photo/1
attached image
Eleanor Berger
@intellectronica
When you hear about @HamelHusain and @sh_reya's AI Evals Course, you probably think: great, a few lectures from two of the best experts in the field. What you really get? - Lectures from two of the best experts in the field. - The best, no, the only textbook on AI evals that twitter.com/825766640/status/1933695903266943234
Prashant Mital

Applied AI @ OpenAI
🤖 I've been taking the AI Evals For Engineers & PMs course on Maven over the last few weeks (because OpenAI wasn’t intense enough 😅). While I had picked up bits and pieces of the techniques Hamel and Shreya delve into in this course during the past year, it has been great to understand the full lifecycle of building & scaling evals. If you think your job is likely to be automated by AI, there's probably a new job being created to evaluate how well that AI replacement performs. Next (and last!) live cohort starts July 21 — enroll at lnkd.in/g7BYtQiH
Zara Khan

Account Director @ OpenAI
Does your AI product actually work? Without evals, it's hard to tell. I just completed the 4-week Maven course "AI Evals For Engineers & PMs" taught by Shreya Shankar and Hamel Husain and it truly exceeded my expectations. Next cohort starts July 21st (registration link in comments). The course offers hands-on exercises, expert guidance, and practical frameworks, giving you a systematic approach you can apply immediately. I could easily follow along without an engineering background. My top customers don't just measure performance with evals. They use them to shape their entire roadmap. Without evals, you're shooting in the dark.




John Berryman

AI Consultant

"Absolutely recommend this course to anyone building AI applications"

Sebastian Lozano

Senior Product Manager at Redfin

Error analysis (and this course) is all you need

Error analysis is all you need. This is the idea that gets drilled into your head over and over again in the AI Evals course. It's so simple, but it's profound...and it's actually way more complicated than you think when you start to consider multi-turn conversations, retrieval systems, agentic systems, multimodal inputs and more. Shreya and Hamel have distilled the state-of-the-art in AI Evals (and often in development itself!) in this amazing class. Some of my favorite highlights:
- Build a custom data annotation app! I was so intimidated by this, but I finally made the leap and vibe-coded something out in an afternoon. It has 10x'd my ability to review conversations.
- It's okay to do a little pre-thinking around failure modes, but they really should EMERGE from your testing. It's really hard to build LLM judges, so be really thoughtful about what you build them for.
- Often, the biggest impact comes from talking disagreements out and figuring out why there is a disagreement in the first place: are your goals unclear? This seemingly technical course has made me a better PM.
- And finally, folks in the course just know every AI tool out there. I learned about WhisperFlow and my workflow for typing has changed!

Trey Grainger

Founder, Searchkernel LLC

This course is the best place to learn evals

I learned a ton from Hamel and Shreya's course on AI Evals! I've worked at the intersection of information retrieval and AI for nearly two decades, so I've done my fair share of evals throughout that time for search results quality. I've even written books, like AI-Powered Search, that include significant sections on ranking metrics, judgement lists, and model training. Nevertheless, with the rise of generative AI, RAG, and agentic workflows, the need to handle complex pipelines with non-deterministic outputs has significantly increased the complexity of performing good evals. The discussions about end-to-end traceability and leveraging Transition Failure Matrices were particularly helpful for me in tackling these more challenging multi-step workflows. This course has been a goldmine by providing:
1. Up-to-date information on best practices for evals on the current state-of-the-art AI workflows
2. Deep insight from experts with decades of both real-world experience and academic research into evals
3. Lots of tips, tricks, and real-world examples (with code) for getting end-to-end evals implemented and working well
This course significantly improved my mental models and increased the size of my practical toolkit for doing AI evals, which is already paying dividends for my client engagements. This is a set of skills everyone working in AI should acquire, and this course is currently the best place to quickly do that!
Andrei Bocan 🐐
@monsieur_pickle
Ok so looking back on the Evals Course by @HamelHusain and @sh_reya it’s been one of the best uses of my time in years. Very focused and practical, amazing guest lectures, stellar course reader, and overall just a great experience. 10/10 would recommend.
Batu
@BatuAytemiz
Highly recommend @HamelHusain and @sh_reya's llm evals course! Clear high-level ideas + grounded low-level worked out examples. Loved the focus on reproducible, iterative workflows.
Jodi M. Casabianca

Entrepreneurial Measurement & Research Scientist | Psychometrician | Scoring of Open-ended Tasks | AI Evaluation
ATTN: Anyone interested in LLM evals... I recently decided to enroll in Hamel H. and Shreya Shankar's LLM evaluation course.... All I can say is....WOW. The past 4 weeks have been full of seminars, guest lectures, homework assignments, Discord chats, and office hours. I enrolled so that I could learn more about the psychometric aspects of LLM evals (think: rating scales, rubrics, LLM-as-a-Judge, etc.) and I've learned that and so much more. And don't get me started on the course reader! 🤓 This course has really helped me get where I wanted to go with respect to understanding the full pipeline. My biggest takeaway: be diligent about looking at your data, with your own eyes. Not everything can be automated or farmed out. This is helpful to me as a consultant helping others. Thank you Hamel and Shreya for offering this course and answering EVERY. SINGLE. QUESTION. If you are considering the course, just do it.
attached image
Frazer Dourado
@FrazerDourado
If you've been thinking about building an AI application but aren't sure how to go about evaluation, then @HamelHusain and @sh_reya's evals course on Maven is everything you need. I'm about to finish the course, and I highly recommend it. Since most of my work is in enterprises x.com/FrazerDourado/status/1933266864534401167/photo/1
attached image
Jiho Bak

Independent AI Engineer

An essential resource for engineers & PMs

For AI Builders Hoping LLMs Will Fix It All

This course has provided an exceptionally clear and systematic framework for approaching LLM evaluation. The comprehensive introduction to the Analyze-Measure-Improve lifecycle, alongside the detailed exploration of the Three Gulfs Model (Comprehension, Specification, Generalization), significantly deepened my understanding of the challenges inherent in building effective LLM pipelines. Particularly impactful was the practical guidance on error analysis—learning how to systematically categorize failure modes using open and axial coding, then translating qualitative insights into robust quantitative metrics. The deep dive into automated evaluators, including both code-based and LLM-as-Judge evaluators, was also valuable. Learning how to craft strong judge prompts and rigorously validate them using training, development, and test sets to ensure alignment with human preferences was eye-opening. The course also provided practical methods for estimating true success rates and quantifying uncertainty, which is vital for understanding actual pipeline performance beyond raw observed scores, along with guidance on designing efficient human review interfaces that significantly enhance labeling throughput. Most importantly, this course illuminated a critical shift in mindset—from traditional software development towards an iterative, human-centric evaluation approach—making it an essential resource for engineers, product managers, and data scientists looking to confidently address real-world LLM evaluation challenges.




Reza Yousefzadeh
@reza_yz
Taking the AI evals course by @HamelHusain and @sh_reya has been like drinking from a firehose. So much valuable information. I'll have to come back and rewatch the recordings from time to time and refer to the course materials. Thanks so much to you both.
Tiago Freitas

Scarlet AI

This course changed how I approach AI projects. Instructors provide great support.

The AI Evals course with Hamel and Shreya changed how I approach AI projects and consulting clients. I’ve picked up practical skills in systematically analyzing model errors and designing meaningful evaluations, making the whole AI dev process clearer. Having access to a private community of experienced AI engineers and direct support from Hamel and the team has been especially valuable—they’re always quick to answer questions or help with real-world problems. Highly recommend this course for anyone building AI products or consulting in the space!

Siddharta Govindaraj

Consultant, Silver Stripe Software

Learn how to put evals into practice. Practical and hands on instruction.

Prior to this class I had already read a bunch of stuff on evals (including Hamel's blog). But I struggled to convert that theory into practical steps. I had some apprehensions coming in -- will it be too theoretical? Will it assume a lot of background knowledge? And I can say now -- this course completely crushes it. It is fully hands on and practical, starting from zero and building up from there. You will learn every step of the evals process on what exactly to do (and not to do) and more importantly -- how to actually put it into practice. If you have been struggling with evals, then don't think twice and take the course.
VANESSA MARQUIAFAVEL SERRANI

Computational Linguist at ATENTO
This course has been incredibly eye-opening. I’ve learned how important it is to follow a clear “Analyze - Measure - Improve” cycle when working with language models. What really stood out to me is that the biggest challenges often come not from the technology itself, but from how we approach the process — like jumping straight to complex solutions without truly understanding the problem, or using misaligned evaluation methods. My biggest takeaway is that every stage of the process has its own traps, and skipping steps or making quick fixes can easily backfire. Being intentional about collecting the right examples, measuring in a fair and meaningful way, and making thoughtful improvements can make all the difference. I’d definitely recommend this course to anyone working with AI systems. It helped me slow down, ask better questions, and be more strategic — and that’s something every team could benefit from.
Laurian Gridinoc

Full Stack Computational Linguist, Bad Idea Factory

Now I can design meaningful evals! Highly recommend this course.

Before the AI Evals led by Hamel Husain and Shreya Shankar, I used evals sporadically, mostly relying on third-party ones. Now, I have a clearer understanding of how to design meaningful evals and communicate their value to the teams I work with.
Sydney Sarachek

Senior Director, AI

This course is comprehensive in a way that's hard to find elsewhere.

This course is a great place for PMs and engineers to learn practical tactics for building real-world AI applications. I've recommended it to people who want both a starting point and deeper knowledge about evals and implementation. Hamel brings in excellent speakers who share different techniques and insights from some really smart people in AI. Evals are super important, and what I appreciate about Hamel's approach is how he walks through data analysis tactics — this is especially helpful for anyone newer to this kind of evaluation work. Just having evals isn't enough — you need to think strategically about what you're evaluating and your methodology beforehand. With so much out there, even really talented engineers can benefit from having all the key considerations for applied AI building brought together in one place. This course does exactly that - it's comprehensive in a way that's hard to find elsewhere. Hamel and Shreya put a lot of thought into the materials, and I can confirm from my own building experience that this covers the real considerations we're dealing with day-to-day (and have learned over 18+ months of trial and error!) without all the noise and buzzwords.