Yesterday, we showed you Harvard's research proving AI makes work harder. More tasks, blurrier boundaries, constant context-switching.
Today: the toolkit that makes that agentic engineering practical. Each tool below maps to a compound engineering principle from yesterday's story, and most are free.
Cowork (Anthropic, included for $20/mo Pro users)
Point it at a folder. Describe what you need. Walk away. Cowork plans and executes autonomously: sorting files, building expense reports from receipt screenshots, drafting documents from scattered notes. It just added plug-ins for marketing, legal, sales, and data workflows. Anthropic built the whole thing in a week and a half... using Claude Code. The tool that makes tools made this tool.
Real workflow: Drop a folder of meeting notes into Cowork. Tell it: "Summarize action items from these notes, organize by team member, and draft follow-up emails for each." It reads every file, cross-references names, and outputs the drafts. What used to take 45 minutes takes about 90 seconds.
Codex App (OpenAI, temporarily free for all ChatGPT users)
OpenAI's new macOS app manages multiple AI agents in parallel. The killer feature: automations that run on a schedule in the background, with results queued for review. A 4-person team built Sora for Android in 28 days with it.
Real workflow: Set up three agents: one monitors your competitor's blog for updates, another summarizes your team's GitHub activity overnight, a third drafts your weekly status report every Friday at 8 AM. You review the outputs over coffee. Safety nets, not checkpoints.
Claude in Chrome (Anthropic, all paid subscribers)
A browser agent that sees what you see. Fills forms, manages calendars, drafts emails, handles expenses. The "record a workflow" feature lets you teach Claude your routine once. Half your time doing work, half teaching the system your work.
Real workflow: Record yourself submitting an expense report: open the portal, fill fields, attach receipt, submit. Next time, say "submit this receipt as a meal expense" and Claude replays the workflow. Boring tasks should only be done by a human once.
Tasklet (Shortwave team, free beta)
Describe any business process in plain English and Tasklet automates it. Connects to any API, MCP server, or cloud VM with scheduled triggers. Think Zapier, but with actual reasoning instead of rigid if-this-then-that.
Real workflow: "Every morning at 6 AM, check my support inbox, summarize any urgent tickets, and post a digest to Slack with priority tags." No flowcharts. No drag-and-drop. One sentence, and it runs forever.
MCP (Anthropic, open standard)
The plumbing underneath all of it. MCP, the Model Context Protocol, connects AI to your actual tools (Slack, Drive, CRMs). Without it, every AI interaction starts from scratch. With it, your tools compound.
Build Your Own Agent Without Writing Code
Both OpenAI and Anthropic just released tools that let non-developers create custom agents.
OpenAI's Agent Builder (part of AgentKit) is a visual drag-and-drop canvas. Sam Altman described it as "Canva for building agents." You connect nodes, add tools (web search, file search, code interpreter), set guardrails, and hit preview. Ramp built a procurement agent in hours instead of months. An OpenAI engineer built two agents live onstage in under eight minutes.
On Anthropic's side, Cowork's plug-in system is basically the same idea, but file-first instead of canvas-first. Install a plug-in (marketing, legal, data analysis), point it at your files, and tell it what to do. Companies can even build custom plug-ins for their teams.
And Anthropic's Claude Agent SDK (recently renamed from the Claude Code SDK) is the engine underneath Claude Code and Cowork. It's more technical, but the key insight is: everything Claude Code does (reading files, running commands, editing documents, executing workflows) is now available as building blocks. Anthropic even published demo projects including a multi-agent research system and spreadsheet automation.
OpenAI's Responses API follows the same philosophy: it's "agentic by default," meaning the model can call multiple tools (web search, file search, code interpreter, MCP servers) within a single request. No manually wiring tool calls together. You describe what you want, the AI figures out which tools to use and in what order.
The pattern across both platforms: describe the outcome, not the steps. That's compound engineering in action.
Read more about why agentic workflows are replacing standalone apps.
The Death of Drag-and-Drop: Why Agentic Workflows Are Replacing Node-Based Automation
Here's where it gets really interesting. The tools above are polished consumer products. But under the hood, a bigger shift is happening in how automations are built, period.
Nate Herk (522K subscribers on YouTube) just dropped a tutorial that crystallizes the change. His analogy (0:27): traditional automation tools like n8n, Zapier, and Make are like cooking dinner from a recipe. You drag nodes onto a canvas, connect them, map data flows, configure every API call. Miss a step? The whole thing breaks.
Agentic workflows are like walking into a restaurant and telling the waiter "I want a delicious steak dinner." You describe the outcome. The system figures out the rest. Maybe it asks how you want it cooked. But you never touch a pan.
The secret sauce? Markdown files. Not code. Not flowcharts. Plain text documents with headers, bullet points, and natural language instructions that the AI reads and executes. Here's his framework, which he calls WAT (2:31):
- Workflows = Markdown (.md) files describing what to accomplish, what tools to use, and what "done" looks like
- Agent = Claude Code (the AI brain that reads the workflows and builds the tools)
- Tools = Python (.py) files that handle execution; the "ugly stuff" you never need to look at
Why markdown? Because it's human-readable (anyone on your team can understand the logic), version-controllable (works with Git like any other file), and LLM-native (these models were trained on markdown; they understand it instinctively).
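To make the "human-readable and machine-readable" point concrete, here is a minimal sketch in Python. The workflow text, its section names (Goal, Tools, Done when), and the file paths are all hypothetical; the tutorial does not prescribe this layout. It only illustrates that a markdown workflow is plain text that a person can skim and a short script can split into parts.

```python
import re

# Hypothetical workflow file contents. The section names and tool paths
# are illustrative assumptions, not a format taken from the tutorial.
WORKFLOW_MD = """\
# Scrape dentist leads

## Goal
Collect dentist leads for one city.

## Tools
- tools/scrape_directory.py
- tools/export_csv.py

## Done when
- 100 unique leads are saved to leads.csv
"""

def parse_sections(markdown_text: str) -> dict[str, str]:
    """Split a markdown document into {second-level heading: body} pairs."""
    sections: dict[str, str] = {}
    current = None
    for line in markdown_text.splitlines():
        match = re.match(r"^##\s+(.*)$", line)
        if match:
            # A new "## Heading" starts a new section.
            current = match.group(1).strip()
            sections[current] = ""
        elif current is not None:
            sections[current] += line + "\n"
    return {name: body.strip() for name, body in sections.items()}

sections = parse_sections(WORKFLOW_MD)
print(list(sections))         # section names in file order
print(sections["Done when"])  # the finish line, as plain text
```

Because the file is ordinary text, the same document works in a Git diff, in a code review, and as the agent's instructions.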
Self-Healing in Action
The most jaw-dropping part of Nate's demo is watching what happens when things break (17:57).
He asks Claude Code to scrape dentist leads from the web. Claude tries a static scrape on the American Dental Association site; it doesn't work because the site uses JavaScript. So Claude pivots to Yellow Pages. Gets results, but the parsing is wrong (only 2 dentists found). Claude identifies the regex pattern issue, fixes its own tool automatically, re-runs, and outputs 120 unique dentist leads with phone numbers, addresses, websites, and specialties.
No error logs. No debugging. No Stack Overflow rabbit holes. The agent hit a wall, thought about what went wrong, fixed it, and kept going. And here's the kicker: it updated the underlying workflow and tool files so the same mistake won't happen next time (23:08).
One viewer comment summed it up: "Nothing says 'n8n is over' louder than this guy not talking about it anymore." (54 upvotes.)
Two Mistakes That Kill Agentic Workflows
Nate flags two common pitfalls (23:44):
1. Not being clear about your goal. "I need a lead scraper for LinkedIn" is too vague. Instead, put the agent in plan mode and say: "Here's my rough idea; help me turn this into a solid requirements doc." Let the agent ask the right questions before it starts building.
2. Not defining what "done" looks like. Agents without a finish line over-complicate things, break stuff, and loop forever. Instead of "Search for LinkedIn profiles of CEOs at tech companies" (open-ended), try: "I need exactly 75 LinkedIn profiles of CEOs at tech companies. Put them in a spreadsheet with name, company, email, and profile link. Once you have 75, you're done."
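A finish line like the one in the second example is easy to express as a check a tool can run. The sketch below is an illustration only: the field names, the `is_done` helper, and the example.com data are assumptions, not something shown in the tutorial.

```python
# Hypothetical "done" check for the 75-profile example above.
REQUIRED_FIELDS = ("name", "company", "email", "profile_link")
TARGET_COUNT = 75  # the finish line stated in the prompt

def count_complete_profiles(rows: list[dict]) -> int:
    """Count unique profiles that have every required field filled in."""
    seen_links = set()
    for row in rows:
        if all(row.get(field) for field in REQUIRED_FIELDS):
            seen_links.add(row["profile_link"])
    return len(seen_links)

def is_done(rows: list[dict]) -> bool:
    """True once the target number of complete, unique profiles exists."""
    return count_complete_profiles(rows) >= TARGET_COUNT

def make_row(i: int) -> dict:
    """Build one placeholder profile row for the demonstration."""
    return {
        "name": f"CEO {i}",
        "company": f"Company {i}",
        "email": f"ceo{i}@example.com",
        "profile_link": f"https://example.com/in/ceo-{i}",
    }

rows = [make_row(i) for i in range(74)]
print(is_done(rows))   # 74 complete profiles so far
rows.append(make_row(74))
print(is_done(rows))   # 75 complete profiles
```

Duplicates and rows with missing fields do not count toward the target, which is one way to keep an agent from declaring victory on padded data.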
The full tutorial is worth a watch: How I'd Teach a 10 Year Old to Build Agentic Workflows (Claude Code).
But How Do You Know Your Agent Is Actually Doing a Good Job?
This is the question nobody asks until something goes wrong.
You built an agent. It runs every morning. Outputs look fine... until one day your support digest misses a critical ticket, your lead scraper returns garbage data, or your weekly report invents a metric that doesn't exist. Congrats, your agent just hallucinated, and nobody caught it because nobody was watching.
Enter two concepts that sound technical but are actually pretty intuitive: LLM-as-a-Judge and LLM-as-a-Verifier.
LLM-as-a-Judge: Your Agent's Performance Review
Think of it like this. You hire a new employee. They do good work... probably. But you don't just trust the work blindly. You have a manager review it. That's what LLM-as-a-Judge does, except the "manager" is a separate AI model evaluating the first AI's output.
Here's how it works in practice:
- Define what "good" looks like. You write evaluation criteria: Was the response accurate? Did it address the user's actual question? Was it concise? Did it hallucinate any facts?
- Feed the output to a judge model. A separate AI (often a more powerful model, or even the same model with a different prompt) reviews the output against your criteria.
- Get a score and explanation. The judge returns a rating (say, 1-5) with reasoning. "This response scored a 4/5 on accuracy. The revenue figure cited is correct, but the date of the announcement is off by one day."
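The three steps above can be sketched in a few lines of Python. This is a rough illustration, not any vendor's API: `call_model` stands in for whatever function sends a prompt to your judge model, the JSON reply format is an assumption, and `fake_judge_model` is a canned stub so the example runs without network access.

```python
import json
from typing import Callable

# Step 1: the evaluation criteria, taken from the list above.
CRITERIA = [
    "Is the response accurate?",
    "Does it address the user's actual question?",
    "Is it concise?",
    "Is it free of hallucinated facts?",
]

def build_judge_prompt(task: str, output: str) -> str:
    """Assemble the prompt that is sent to the judge model."""
    criteria = "\n".join(f"- {c}" for c in CRITERIA)
    return (
        "You are reviewing another model's output.\n"
        f"Task: {task}\n"
        f"Output: {output}\n"
        f"Criteria:\n{criteria}\n"
        'Reply with JSON: {"score": <1-5>, "reasoning": "<one sentence>"}'
    )

def judge(task: str, output: str, call_model: Callable[[str], str]) -> dict:
    """Steps 2 and 3: ask the judge model, then parse its score and reasoning."""
    raw = call_model(build_judge_prompt(task, output))
    verdict = json.loads(raw)
    score = int(verdict["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return {"score": score, "reasoning": str(verdict["reasoning"])}

def fake_judge_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; always returns the same verdict."""
    return json.dumps({
        "score": 4,
        "reasoning": "Revenue figure is correct, but the announcement date is off by one day.",
    })

verdict = judge("Summarize the earnings announcement", "Revenue rose last quarter.", fake_judge_model)
print(verdict["score"], verdict["reasoning"])
```

Swapping `fake_judge_model` for a real client call is the only change needed to try this against an actual model; validating the score range guards against a judge that ignores the requested format.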
The results are surprisingly reliable. Research shows LLM judges achieve about 80% agreement with human evaluators, roughly the same rate at which humans agree with each other. Turns out we're not great at consistency either.
Common things you'd judge an agent on:
- Answer relevancy: Did the output actually address what was asked?
- Faithfulness / hallucination detection: Did it stick to facts, or wander into fiction?
- Correctness: Are the numbers, dates, and claims accurate?
- Safety: Did it produce anything toxic, biased, or policy-violating?
- Task completion: For agentic workflows specifically, did each step make sense?
Where this gets really powerful is in production monitoring. Your agent runs overnight. In the morning, instead of manually reviewing every output, a judge model has already scored each one. Anything below your threshold (say, 3/5) gets flagged for human review. Everything else passes through. You're not reviewing every email your assistant writes; you're reviewing the ones that need a second look.
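As a minimal sketch of that morning triage, assuming each overnight output has already been scored by a judge and stored as a dictionary (the `id` and `score` keys and the sample data are hypothetical):

```python
THRESHOLD = 3  # outputs scoring below this go to a human, per the example above

def triage(scored_outputs: list[dict], threshold: int = THRESHOLD) -> tuple[list[dict], list[dict]]:
    """Split judged outputs into (flagged for human review, passed)."""
    flagged = [o for o in scored_outputs if o["score"] < threshold]
    passed = [o for o in scored_outputs if o["score"] >= threshold]
    return flagged, passed

overnight = [
    {"id": "digest-mon", "score": 5},
    {"id": "digest-tue", "score": 2},
    {"id": "digest-wed", "score": 3},
]
flagged, passed = triage(overnight)
print([o["id"] for o in flagged])  # items a person should look at
print([o["id"] for o in passed])   # items that flow through
```

Note that a score exactly at the threshold passes here, matching "anything below your threshold"; whether the boundary should pass or flag is a choice to make deliberately.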
Arize AI, Evidently, and DeepEval (100% open-source) all offer frameworks to set this up. The tools aren't new. What's new is that agentic workflows make them necessary.
LLM-as-a-Verifier: The Math Teacher Who Shows Their Work
A judge gives you a quality score. A verifier tells you if something is provably right or wrong.
Sebastian Raschka's breakdown of the four main LLM evaluation approaches puts it clearly: judges handle subjective quality ("Is this well-written? Is it helpful?"), while verifiers handle objective correctness ("Is 2+2 actually 4? Does this code compile? Does this SQL query return the right results?").
Verifiers work best in domains where there's a clear right answer:
- Math and logic: Does the calculation check out? (This is how Google's Deep Think won its math olympiad medal; verifiers checked each reasoning step.)
- Code: Does the function produce the expected output? Run it and see.
- Data accuracy: Does this number match the source database?
- Compliance: Does this document include all required legal disclosures?
The key difference: a judge says "this looks right." A verifier says "this IS right, and here's the proof." Verifiers can use external tools (calculators, code interpreters, database queries, even web search) to check claims against ground truth, rather than just relying on vibes.
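Two of the simplest verifiers can be sketched directly: recompute a number from the source data, and run a function against known cases. The helper names and sample data below are hypothetical, chosen only to illustrate checking against ground truth instead of asking a model for an opinion.

```python
def verify_total(claimed_total_cents: int, source_rows: list[dict]) -> tuple[bool, int]:
    """Data accuracy: recompute the total from source rows and compare to the claim."""
    actual = sum(row["amount_cents"] for row in source_rows)
    return actual == claimed_total_cents, actual

def verify_function(func, cases: list[tuple[tuple, object]]) -> list[tuple]:
    """Code: run the function on each case and collect any mismatches."""
    failures = []
    for args, expected in cases:
        actual = func(*args)
        if actual != expected:
            failures.append((args, expected, actual))
    return failures

# The agent's report claims invoices total $25.00.
invoices = [{"amount_cents": 1999}, {"amount_cents": 501}]
ok, actual_total = verify_total(2500, invoices)
print(ok, actual_total)

# The agent wrote an add() helper; check it against known answers.
def add(a, b):
    return a + b

failures = verify_function(add, [((2, 2), 4), ((-1, 1), 0)])
print(failures)  # an empty list means every case matched
```

Amounts are kept in integer cents so the comparison is exact; with floating-point dollars you would compare within a small tolerance instead.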
Process verifiers take it a step further. Instead of only checking the final answer, they examine each intermediate step. Did the agent's reasoning make sense at step 3, even if the final answer was correct? This matters because an agent can get lucky, reaching the right answer through a wrong process, and then fail catastrophically the next time the inputs change slightly.
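Here is a toy illustration of that "right answer, wrong process" case, assuming the agent's reasoning has been logged as simple arithmetic steps. The trace format and the `first_bad_step` helper are hypothetical; real process verifiers work on much richer reasoning traces.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def first_bad_step(steps: list[dict]):
    """Recompute every step; return the 1-based number of the first wrong one, or None."""
    for number, step in enumerate(steps, start=1):
        recomputed = OPS[step["op"]](step["left"], step["right"])
        if recomputed != step["claimed"]:
            return number
    return None

# Target problem: 6 * 4 - 4 + 2, which equals 22.
trace = [
    {"left": 6, "op": "*", "right": 4, "claimed": 24},   # correct
    {"left": 24, "op": "-", "right": 4, "claimed": 18},  # wrong: 24 - 4 is 20
    {"left": 18, "op": "+", "right": 2, "claimed": 22},  # wrong again: 18 + 2 is 20
]

final_answer_correct = trace[-1]["claimed"] == 6 * 4 - 4 + 2
bad_step = first_bad_step(trace)
print(final_answer_correct, bad_step)
```

An outcome-only check accepts this trace because the two mistakes cancel out; the step-by-step check catches the problem at step 2.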
The Pro Move: Failure-Mode Grading
Here's a trick from Veris AI's research (February 2026) that's too good not to share.
Most people set up their judges with broad questions: "Did the agent handle errors well?" This sounds reasonable but produces wildly inconsistent scores. Their research found these success-oriented checks hit only 66% consistency across evaluations. Roughly one evaluation in three disagreed with the rest.
The fix? Flip it. Instead of asking what went right, check for specific failure modes:
- ❌ "Did the agent hallucinate a customer phone number?"
- ❌ "Did the agent skip account verification before taking action?"
- ❌ "Did the agent provide pricing from an outdated rate card?"
These atomic, specific failure checks hit 94% consistency. That's a 28 percentage point jump just by changing how you frame the question. The biggest gains came in the trickiest categories: error handling jumped from 43% to 98% consistency.
Why? Because "did this specific bad thing happen?" is a much easier question to answer than "was this generally good?" It's the difference between asking a food critic "was dinner good?" versus "was the chicken undercooked?" One is subjective. The other is a fact.
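A rough sketch of failure-mode grading in Python follows. The three questions come from the list above; everything else is an assumption for illustration: `ask_judge` stands in for a yes/no call to a judge model, `fake_judge` is a keyword stub so the code runs offline, and the `consistency` function is one plausible way to measure agreement across repeated runs, not necessarily how the research measured it.

```python
from collections import Counter

FAILURE_CHECKS = [
    "Did the agent hallucinate a customer phone number?",
    "Did the agent skip account verification before taking action?",
    "Did the agent provide pricing from an outdated rate card?",
]

def grade(transcript: str, ask_judge) -> list[str]:
    """Return every failure check the judge answered 'yes' to."""
    return [q for q in FAILURE_CHECKS if ask_judge(q, transcript)]

def consistency(verdicts: list[bool]) -> float:
    """Share of repeated verdicts that agree with the most common verdict."""
    most_common_count = Counter(verdicts).most_common(1)[0][1]
    return most_common_count / len(verdicts)

def fake_judge(question: str, transcript: str) -> bool:
    """Hypothetical stand-in for a judge model: a crude keyword match."""
    return "verification" in question and "without verifying" in transcript

transcript = "Agent issued a refund without verifying the account."
print(grade(transcript, fake_judge))

# Four repeated runs of the same check on the same transcript.
print(consistency([True, True, False, True]))
```

Each check returns a plain yes or no, so an empty list means no known failure mode was detected, and each question can be tracked and tuned separately.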
Putting It All Together: The Quality Stack
For any agent you build (whether with Cowork, Tasklet, Claude Code, or Agent Builder), here's the quality control layer:
1. Verifiers for hard facts. If your agent pulls numbers, dates, prices, or anything with a definitive right answer, use a verifier. Run the code. Query the database. Check the math. This is your first line of defense.
2. Judges for everything else. Tone, relevance, completeness, helpfulness; all the squishy stuff that doesn't have a binary right/wrong. Set up failure-mode checks, not holistic vibes.
3. Human review for the flags. Anything that scores below your threshold gets routed to a person. Everything else flows through. The goal isn't to review every output; it's to build confidence that the 95% you don't review is solid.
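Wired together, the three layers amount to a short routing function. The sketch below is one possible arrangement under stated assumptions: the `route` helper, the shape of `output`, and the stub verifier and judge are all hypothetical.

```python
from typing import Callable

def route(
    output: dict,
    verifiers: list[Callable[[dict], bool]],
    judge_score: Callable[[dict], int],
    threshold: int = 3,
) -> str:
    """Run verifiers first, then the judge; anything that fails goes to a person."""
    for verify in verifiers:
        if not verify(output):
            return "human_review: failed verifier"
    if judge_score(output) < threshold:
        return "human_review: low judge score"
    return "auto_pass"

def totals_match(output: dict) -> bool:
    """Verifier: the reported total must equal the sum of the line items."""
    return sum(output["line_items"]) == output["total"]

def stub_judge(output: dict) -> int:
    """Hypothetical stand-in for a judge model's 1-5 score."""
    return 4

report = {"line_items": [1999, 501], "total": 2500, "summary": "Two invoices paid."}
print(route(report, [totals_match], stub_judge))

bad_report = dict(report, total=2600)
print(route(bad_report, [totals_match], stub_judge))
```

Verifiers run first because they are cheap and definitive; there is little point asking a judge about tone when the numbers are already wrong.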
IBM researcher Pin-Yu Chen put it best: "Use LLM-as-a-judge to improve your judgment, not replace your judgment."
This quality stack is what separates a toy demo from a system you'd actually trust with your business. The agents are here. The build tools are here. Now the verification layer is catching up.
From Prototype to Production: The Deployment Path
One more thing Nate Herk calls out that most tutorials skip (26:50): there's a critical difference between human-triggered and scheduled agent workflows.
When you're building with Claude Code in VS Code, the agent is right there with you. You can watch it work, talk to it, redirect it. That's what makes the self-healing magic possible; the agent is live and adaptive.
But when you want that workflow to run on a schedule (every morning at 6 AM, every Friday at 8 AM), you're deploying the code and workflows the agent created, not the agent itself. The agent is the builder. The artifacts it produces are what runs in production.
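To illustrate "the artifact runs, not the agent": a scheduled job is ordinary code with no model in the loop. The sketch below computes the next "every Friday at 8 AM" run using only the standard library. The `next_friday_8am` helper is hypothetical, and in practice a scheduler such as cron would typically handle the timing.

```python
from datetime import datetime, timedelta

def next_friday_8am(now: datetime) -> datetime:
    """Return the next Friday 08:00 strictly after `now`."""
    days_ahead = (4 - now.weekday()) % 7  # Monday is 0, Friday is 4
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=8, minute=0, second=0, microsecond=0
    )
    if candidate <= now:
        # It is already Friday at or past 08:00, so use next week.
        candidate += timedelta(days=7)
    return candidate

wednesday_morning = datetime(2026, 2, 11, 9, 0)
print(next_friday_8am(wednesday_morning))
```

If the deployed tool breaks at 8 AM, nothing self-heals on the spot; the fix happens back in the interactive session, and the updated artifact is redeployed.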
This is where frameworks like Laravel come in. Laravel's Boost package provides an MCP server that gives Claude Code direct access to your app's structure, routes, and database. You can use Claude Code to build the automation, then expose the resulting Python tools behind Laravel API endpoints, schedule them with Laravel's task scheduler, and wrap everything in proper authentication.
The Claude PHP SDK makes this even more direct. You can call Claude's API from within your Laravel app, complete with tool use, streaming, and extended thinking. A Geocodio engineer wrote up their full Claude Code + Laravel workflow, including how they persist context across sessions for multi-day features using a simple file-based system.
The production path: build with the agent → test with judges and verifiers → deploy the artifacts → monitor with automated evaluation.
Pick Your Starting Point
- Drowning in files? Cowork.
- Want background automation? Codex App or Tasklet.
- Browser tasks eating your day? Claude in Chrome.
- Ready to build something custom? Agent Builder or Cowork plug-ins.
- Want self-healing workflows? Claude Code + markdown.
- Need quality control? DeepEval (open-source) or Evidently (we'll try to pull together insights on more tools for this later).
Yesterday's compound engineering framework was theory. These tools are the practice. And now you have the quality control layer to make sure they actually work.