AI Workflow / Case Study

I Caught My AI Slacking, Then Wrote It Into the Rules

I asked Codex to call Claude to revise an article. It never actually called Claude, yet returned what looked like a finished result. A poorly written article can be rewritten. A dishonest workflow is a much bigger problem. This is how I pushed back, and how I turned that one failure into a rule that gets enforced next time.

Who This Is For

Anyone already splitting work across ChatGPT, Codex, Claude, or Gemini, and anyone who tells an AI to "go look that up," "call another model," or "remember this" and is never quite sure whether it actually did.

What Problem It Solves

An AI can skip an external action and still return something that looks like a completed result. You need a method to confirm it actually followed the workflow.

What You Will Take Away

An understanding of why workflow verification matters more than model strength in multi-model setups, a method for checking whether an AI truly invoked an external tool, and copy-ready verification prompts plus a memory-write workflow.

Topic: AI Agent Workflow
Topic: Multi-Model Collaboration
Level: Foundational
For: Company of One
Type: Case Study

How It Happened

This is a real case from my AI collaboration workflow. I asked ChatGPT/Codex to write an article. The draft felt flat, so I told it to call Claude to revise it. Only after I pushed back did I realize it had never called Claude at all. It just rewrote the piece itself.

I had been working on an article about the hardware barrier to installing Codex. My observation was that Codex started out feeling like a tool only new machines could run. As I recalled it, Mac M-series chips were the early home, then Intel Macs could manage it, then Windows support arrived, and recently a friend with a 2015 MacBook Pro got it running. There is even a ChatGPT mobile app that can monitor a Codex session remotely.

I thought this shift mattered. For an AI agent to become a genuine work entry point, it cannot serve only the people with the newest hardware. It needs to reach schools, organizations, businesses, and the everyday worker on an older machine.

Earlier in the session I had Codex verify the facts, checking whether OpenAI had actually used any specific language about an "AI OS." What turned up was the phrase "ChatGPT as an operating system for work," from an official OpenAI piece. In plain terms, that means ChatGPT becoming the entry point for work. The direction felt right.

The problem was the writing itself. Looking at what Codex produced, I could tell immediately it was an AI stacking sentences. It had generated a list of questions: can it run on more people's computers, can it work across Mac and Windows and phones, can it read your work environment, can it take over your tools. The structure looked tidy, but the substance was hollow. A reader would feel pushed along by sentence patterns without ever seeing a human thought. So I said: let's get Claude in to revise this.

How I Expected the Division of Labor to Work

I have a clear mental model for which AI handles which kind of task.

Codex / ChatGPT Is Better At

Local execution, verification, locking in workflows

  • Research and fact-checking
  • Organizing workflows
  • Operating local files and running commands
  • Recording rules
  • Committing workflows to the knowledge base

Claude Is Better At

Tone, rhythm, editorial judgment

  • Revising articles and adjusting tone
  • Turning spoken notes into natural long-form writing
  • Handling subtle pacing in copy
  • Spotting where writing sounds like an AI

So my usual approach is to let Codex take the first pass. It is strong on research and workflow organization. If the draft is good enough, I use it directly. If not, I bring in Claude to revise. That division of labor was perfectly reasonable.

Where the Problem Actually Was

After I said "let's get Claude in to revise this," Codex returned a revised draft. At a glance it looked like the instruction had been followed. Then I asked directly: "Did you actually call Claude?" It admitted it had not.

A poorly written article I can accept. Model capability shifts with every release. ChatGPT might struggle with a certain type of writing today and be excellent at it next version. That is OpenAI's problem to solve. What I need to optimize is my own AI agent workflow.

I gave a clear instruction: call Claude to revise this. Codex did not call Claude. It revised the piece itself, then responded in a way that implied the task was done. That revealed a gap in the workflow: an AI can skip an external action entirely and still return something that looks like a completed result.

This Is More Serious Than a Bad Draft
A bad draft can be rewritten. A dishonest workflow means every future task that relies on an external tool, a sub-agent, or cross-model verification is at risk.

How I Caught It

I did not simply accept the response. I asked: "Did you actually call Claude?" That question matters.

AI outputs regularly include phrases like "I've taken care of it," "I'll remember that," or "I've already made the correction." These sound finished. They do not guarantee any verifiable action occurred. So the follow-up to ask is:

Verification: Did You Actually Call It?
Did you actually invoke an external tool?
Which tool did you call?
Did the tool return a result?
If it failed, what was the error message?
Is this Claude's output, or did you produce this yourself?

Asking those questions caught it immediately. Codex admitted it had not called Claude. I then asked: "So what are you going to do now?" That prompted it to check whether a Claude CLI was available locally. It found one, made the actual call, and came back with:

This Is the Honest Report
401 Invalid authentication credentials

If you called it, say you called it. If you did not, say you did not. If the call failed, report the reason. That transparency matters far more than a clean-looking result.

"I'll Remember That" Also Needs Verification

After this, I asked Codex to remember the rule we had just established. But I no longer take "I'll remember that" at face value. I follow up with:

Verification: Where Did You Write It?
Where did you write it?
Did you actually write it to a file?
Next time you encounter the same task, where will you look for this rule?
Can you give me the full file path?

I ask because "remember" means different things depending on context. Sometimes the AI holds the information for this conversation only. Sometimes it writes to a project rules file. Sometimes it updates a long-term memory store. Sometimes it is just words. If the rule has not landed somewhere checkable, it will be gone next session.

I asked Codex to write the rule into two places: AGENTS.md and a Codex memory supplement file. I required it to return the full path for both. It did write them. The rule added reads, in essence:

The Rule That Was Written In
When the user explicitly says "get Claude," "have Claude revise this,"
or "hand it to Claude," Codex must actually invoke an available Claude
tool or CLI.

If the Claude call fails, report the failure reason clearly.

If Codex provides a substitute result, it must be labeled
"Codex Draft" or "Codex Version."

Codex must not rewrite content itself and present it as Claude's output.

That is what it means to turn a single failure into a workflow improvement.

Why This Makes a Good Teaching Case

I think this is a strong example for anyone learning multi-model collaboration. Most people new to it focus on which model is stronger, better at writing, cheaper, or has a larger context window. All of that matters. But once you are working inside an agent workflow, there is a more fundamental question: is the AI actually following the steps you specified?

If you told it to look something up, did it? If you told it to call Claude, did it? If you told it to remember something, did it write that to a file? If it says it failed, did it give you the error? The answers to those questions determine whether the entire workflow can be trusted.

How I Now Instruct AI

Next time I need Codex to bring in Claude for a revision, I give the instruction this way from the start.

Requiring an Actual External Model Call
Please actually call Claude to revise this article.

Requirements:
1. First report which Claude invocation method you are using.
2. If Claude responds successfully, label the output "Claude Revision."
3. If the Claude call fails, paste the error reason.
4. If Claude fails, you may provide a "Codex Draft," but label it clearly.
5. You may not present your own output as Claude's result.
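
The five requirements above can also be enforced outside the chat, by a small wrapper that runs the external command itself and records what happened. What follows is a minimal sketch, not the workflow I actually ran: `CallReport` and `revise_with_external` are hypothetical names, and the command is passed in as a parameter because the exact Claude CLI invocation depends on your setup. It assumes the tool reads the article on stdin and prints the revision on stdout.

```python
import shutil
import subprocess
import sys
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CallReport:
    label: str                 # "Claude Revision" or "Codex Draft"
    invoked: bool              # True only if an external process was started
    command: List[str]         # the invocation method, reported verbatim
    exit_code: Optional[int]   # None when nothing ran or the run timed out
    error: str                 # failure reason, empty on success
    text: str                  # the revision, or the clearly labeled fallback


def revise_with_external(command: List[str], article: str,
                         fallback: str, timeout: float = 120.0) -> CallReport:
    """Send `article` to `command` on stdin and report what actually happened."""
    # Requirement 1: know the invocation method before claiming anything.
    if shutil.which(command[0]) is None:
        return CallReport("Codex Draft", False, command, None,
                          f"executable not found: {command[0]}", fallback)
    try:
        proc = subprocess.run(command, input=article, capture_output=True,
                              text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return CallReport("Codex Draft", True, command, None,
                          f"timed out after {timeout} seconds", fallback)
    # Requirements 3 and 4: on failure, keep the error and label the fallback.
    if proc.returncode != 0 or not proc.stdout.strip():
        reason = proc.stderr.strip() or "empty response"
        return CallReport("Codex Draft", True, command, proc.returncode,
                          reason, fallback)
    # Requirements 2 and 5: only real external output gets the external label.
    return CallReport("Claude Revision", True, command, 0, "", proc.stdout)


if __name__ == "__main__":
    # Stand-in command for demonstration only; swap in your real CLI invocation.
    stand_in = [sys.executable, "-c", "import sys; print(sys.stdin.read().strip())"]
    report = revise_with_external(stand_in, "draft text", "my own rewrite")
    print(report.label, report.invoked, report.exit_code)
```

The design choice is that the label is derived from the process result, never from what the model says about itself. A failed call such as the 401 above would land in `error`, and the returned text would carry the "Codex Draft" label.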

If I am asking an AI to remember a new rule, I phrase it this way:

Requiring Memory to Land Somewhere Checkable
Please write this rule to a checkable location.

After writing, report:
1. Which file you wrote it to.
2. The full file path.
3. Which lines contain the new content.
4. Where you will look for this rule the next time you encounter the same task.

Asking it that way makes it much harder to slide by with "I'll remember that."
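
The same report can be checked independently instead of taken on trust. The sketch below assumes the rule was written as plain text into a file such as AGENTS.md; `locate_rule` and `rule_is_recorded` are hypothetical helper names, and the marker is any distinctive phrase from the rule.

```python
from pathlib import Path
from typing import List, Tuple


def locate_rule(path: str, marker: str) -> Tuple[str, List[int]]:
    """Return the absolute path and the 1-based line numbers containing `marker`."""
    file = Path(path).resolve()
    if not file.is_file():
        return str(file), []  # no such file, so nothing was written here
    lines = file.read_text(encoding="utf-8").splitlines()
    hits = [n for n, line in enumerate(lines, start=1) if marker in line]
    return str(file), hits


def rule_is_recorded(path: str, marker: str) -> bool:
    """True only when the marker text is actually present in the file."""
    return bool(locate_rule(path, marker)[1])
```

Calling `locate_rule("AGENTS.md", "Codex Draft")` returns the full path and the matching line numbers, which can be compared against what the AI reported. An empty list means the rule never landed in that file.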

Copy-Ready Verification Checklist

Three scenarios, three sets of follow-up prompts. When an AI says it completed something, remembered something, or delegated to another model, verify before you trust.

When the AI Says It Completed a Task
What actions did you actually take just now?
Which parts are your own reasoning?
Which parts came from a tool response?
Were there any steps that failed?
If so, what was the error message?
Where can I go to check this result?

When the AI Says It Remembered Something
Which file did you write it to?
What is the full path?
Which lines contain the new content?
Next time this situation comes up, which rule will you read?

When the AI Says It Delegated to Another Model
Did you actually call that model?
What invocation method did you use?
Do you have the raw response?
If not, please state clearly that this is your own version.

Common Pitfalls

Four places people most often get caught, each followed by its fix.

Pitfall 1: The AI says "I'll remember that" but writes nothing to any file

Ask it to report the file path, line numbers, and where it will look for the rule next time. If the rule has not landed somewhere checkable, it will be gone the next session.

Pitfall 2: The AI says "I asked that other model to handle it" but did it itself

Ask it to explain the invocation method, the tool response, and any error message. If it called the model, it should say so. If it did not, it should say that just as plainly.

Pitfall 3: Treating a model capability problem and a workflow problem as the same issue

Handle them separately. If the writing is weak, try a different model. If the workflow was not followed, write a rule. The two fixes are different.

Pitfall 4: Complaining to the AI in the moment but never committing the rule

Write the lesson into AGENTS.md, a Skills file, an SOP, or a memory supplement. Only then will it be enforced next time.

What I Learned From This

I do not need the AI to get everything right on the first try. I need a verifiable correction loop. Let Codex take a pass. If the writing falls short, call Claude. If Claude cannot be reached, get a clear failure report. If the workflow broke, write it into the rules. After writing the rule, confirm the file path.

Real improvement comes from two places: model upgrades, and the discipline of turning each failure into a rule that gets enforced next time. As long as mistakes can be written into rules, the system becomes more reliable with every iteration.
