My experience of using AI Agents

AI Questions
What is the llms.txt file for?
llms.txt is a site-level instruction file for AI/LLM agents, like robots.txt but for models. It tells models what the site is about and which sources to prioritize. Example: a site could point models to its official docs and API reference; the spec and examples are described at https://llmstxt.org/ (which you can follow to publish your own).
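As a sketch, a minimal llms.txt following the format described at llmstxt.org looks like this (the project name and URLs below are placeholders, not a real site):

```markdown
# Acme Widgets

> Acme Widgets is a TypeScript library for building embeddable dashboard widgets.

## Docs

- [Quickstart](https://example.com/docs/quickstart.md): installation and first widget
- [API Reference](https://example.com/docs/api.md): full public API

## Optional

- [Changelog](https://example.com/changelog.md): release history
```

The H1 title and blockquote summary give models the one-paragraph context; the link sections tell them which sources to prioritize.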
What even is the difference? They all seem like "AI configuration stuff."
Yep, fair. Think of it this way:
Instructions = Passive rules that are always applied in the background
Agents = Custom AI personas you manually switch into
Skills = Step-by-step playbooks you explicitly invoke for a specific task
So instructions just... always run?
Yes — automatically. The applyTo frontmatter pattern controls when they're injected. For example, context-engineering.instructions.md has applyTo: '**' so it's always active. The nextjs.instructions.md has applyTo: '**/*.tsx, **/*.ts, ...' so it only kicks in when you're editing TypeScript/JSX files.
You don't call them. Copilot just silently reads them and adjusts its suggestions accordingly. They're like a standing memo on your desk — always there.
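As a sketch, an instruction file with applyTo frontmatter looks like this (the file name, description, and rule text here are made up for illustration; only the applyTo mechanism is the point):

```markdown
---
description: 'React component conventions'
applyTo: '**/*.tsx, **/*.ts'
---

# React conventions

- Prefer function components and hooks.
- Co-locate tests next to the component file.
```

Because applyTo only matches TypeScript/JSX files, these rules are injected silently whenever you edit one and stay out of the way otherwise.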
When would I actually switch to a custom agent mode?
When you want a fundamentally different persona and toolset for a session. Look at debug.agent.md — it locks in a specific tool list (terminal, tests, problems panel) and a structured debugging workflow. Or expert-nextjs-developer.agent.md — it declares expertise, forces certain behaviors, and even pins a model (GPT-4.1).
Basically: use an agent when "default Copilot" isn't the right mindset for the job. Debugging, writing a PRD, doing a technical spike — those all benefit from a mode switch.
What about skills — do those auto-load like instructions?
Nope. Skills are NOT automatically loaded. They're on-demand playbooks. You invoke them explicitly by referencing them in a prompt, or by having an agent that knows to use them.
For example, conventional-commit won't do anything unless you tell Copilot "use the conventional-commit skill" or you build it into an agent's workflow. Think of skills as reusable procedures stored in a drawer — they only run when someone pulls them out.
What's the loading order/hierarchy when Copilot runs?
Roughly:
copilot-instructions.md — loaded first, always, the main entry point
.github/instructions/*.instructions.md - auto-applied based on applyTo matching your open file
.github/agents/*.agent.md - only loaded when you explicitly switch to that agent mode in the Copilot chat panel
.github/skills/*/SKILL.md - only loaded when explicitly invoked
When do I use each?
Instructions: Set-and-forget rules (tech constraints, code style, a11y rules, security guidelines)
Custom Agent: When you need a different "mode" — debugging, planning, writing specs
Skill: When you have a repeatable multi-step workflow you want to reuse without re-explaining every time
What is AGENTS.md, and who is it for?
AGENTS.md is a universal guide following the agents.md open spec. It targets any AI coding agent (Claude, Gemini, OpenAI Codex, etc.) and covers setup commands, project structure, testing patterns, and workflow — anything tool-agnostic.
What is copilot-instructions.md, and who is it for?
It's a GitHub Copilot-specific file loaded automatically by VS Code. It focuses on IDE-level code generation rules: version constraints, ❌ NEVER / ✅ ALWAYS generation constraints, and pointers to the granular instruction files that use applyTo glob patterns.
What content belongs in each file?
| Content | AGENTS.md | copilot-instructions.md |
|---|---|---|
| Setup / install commands | ✅ | ❌ (link to AGENTS.md) |
| Project structure | ✅ | ❌ (link to AGENTS.md) |
| Testing instructions | ✅ | ❌ (link to AGENTS.md) |
| Code generation constraints | ❌ | ✅ |
| applyTo instruction files | ❌ | ✅ |
| Version compatibility rules | ❌ | ✅ |
Should they duplicate content?
No. copilot-instructions.md should delegate to AGENTS.md for shared content to avoid drift. Your current setup already does this correctly.
What is the single most important design principle?
AGENTS.md is the single source of truth for project facts. copilot-instructions.md is a Copilot-specific lens on top of it — not a copy.
Why shouldn't we let AI write the AGENTS.md?
This is the big one. Research has shown (Gloaguen et al., ETH Zurich) that LLM-generated AGENTS.md files offer no benefit and can marginally reduce success rates (~3% on average) while increasing inference costs by over 20%. Developer-written context files, by contrast, provide a modest ~4% improvement. Never let an agent write to AGENTS.md directly. The lead must approve every line. Keep it short, with clear sections:
https://addyosmani.com/blog/agents-md/
What are the common issues that lead to AI failures?
The AI breaks in five specific ways:
Hallucination from missing context: The agent doesn’t have the information it needs, so it fills the gap with confident fiction. You asked for a login page and it invented an auth library that doesn’t exist.
State loss in long workflows: At 10K tokens, the agent is sharp. At 150K, it forgets you wanted dark mode and rebuilds the entire UI in blinding white. Three times.
Tool misuse: Wrong tool, bad parameters, or just ignoring the output entirely. The agent has access to 15 tools and picks the worst one for the job.
Unproductive loops: Same failed approach, over and over. Burning tokens with zero progress. The agent version of pulling a door that says “push.”
Silent failure at scale: Here’s the math nobody mentions: if your agent is 95% accurate per decision and makes 20 decisions per task, it fails more often than it succeeds. Small inaccuracies compound. At production scale, 95% isn’t good enough.
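The compounding-error claim is easy to verify; this sketch is pure arithmetic using only the numbers from the text above:

```python
# Probability that a 20-decision task succeeds end-to-end
# when each individual decision is 95% accurate.
per_decision_accuracy = 0.95
decisions_per_task = 20

task_success_rate = per_decision_accuracy ** decisions_per_task
print(f"{task_success_rate:.1%}")  # ~35.8%: the task fails roughly 2 out of 3 times
```

A 5% per-step error rate compounds to a ~64% per-task failure rate, which is why small inaccuracies dominate at production scale.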
Why do our AI agents get worse over time as the context window fills up?
LLM output quality degrades as context fills. This isn’t a bug. It’s a property of how transformer attention works. Sharp at 10K tokens. Noticeably worse at 150K on the same tasks.
Think of context management like packing for a backpacking trip. You could bring your entire wardrobe. You’d move slowly and nothing would be easy to find. Pack light instead. Know where to resupply.
In context (what you carry): system prompt, instruction files, skill headers, the current task.
On disk (available when needed): full file contents, web search results, codebase, git history. The agent loads these on demand through tools.
In practice, you’ll find that system prompts, MCP tool definitions, and memory files can eat 8-9K tokens before you’ve typed a word. That’s context space not available for reasoning. Audit regularly. Trim what you can.
Compaction happens when the context hits its limit. The platform summarizes your conversation to free space. It works, but details get lost. Early decisions compress. Tool outputs disappear. Don’t let it surprise you — manage context proactively so the agent doesn’t lose important state at the worst moment.
What do agents actually cost?
Nobody publishes this part. Agents burn tokens fast.
The 60/30/10 Rule
Not every subtask needs the most expensive model:
60% of tokens → cheap models (Haiku, Flash). Scraping, classification, templated work.
30% → mid-tier (Sonnet). Writing, reasoning, code generation.
10% → frontier (Opus). Routing decisions, complex orchestration, high-stakes judgment.
Approximate math on 100 million tokens per month:
Same work. Roughly 60% cheaper. Token prices shift quarterly — check current rates — but the principle holds: most subtasks don’t need frontier-level reasoning.
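To make the approximate math concrete, here is a sketch with illustrative per-million-token prices (the rates below are placeholders, not current list prices; plug in real rates before budgeting):

```python
# Hypothetical blended $/1M-token rates per tier -- replace with current prices.
price_per_m = {"cheap": 5.0, "mid": 15.0, "frontier": 25.0}
mix = {"cheap": 0.60, "mid": 0.30, "frontier": 0.10}  # the 60/30/10 rule
total_tokens_m = 100  # 100M tokens per month

blended_cost = sum(total_tokens_m * mix[t] * price_per_m[t] for t in mix)
all_frontier_cost = total_tokens_m * price_per_m["frontier"]
savings = 1 - blended_cost / all_frontier_cost

print(blended_cost, all_frontier_cost, f"{savings:.0%}")
```

With these made-up rates the blend costs $1,000 versus $2,500 for running everything on the frontier model, a 60% saving; the exact percentage depends entirely on the rates you plug in.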
How many levels are there to applying AI agents in coding work?
Steve Yegge recently described eight stages of how developers grow with AI tools. It's a helpful way to see where you are and where you're going.
Most developers are at Level 3-4. Level 6 begins the orchestration stage, needing different skills than Level 5.
What is the shift from conductor to orchestrator?
In the conductor setup, you work with one AI agent in real-time, kind of like pair programming. It's all about guiding that single agent, and everything happens one step at a time within a fixed context. Tools like Claude Code CLI and Cursor's in-editor mode fit this style.
Switching to the orchestrator model is like managing a whole team. You've got multiple agents, each doing their thing with their own context, and they're working independently. You plan everything out, assign tasks, and check in now and then. Tools like Agent Teams, Conductor, Codex, and Copilot Coding Agent are perfect for this approach.
What are the single-agent limits?
Every developer eventually runs into three main issues with a single agent: context overload, lack of specialization, and zero coordination. Subagents tackle the first two, while Agent Teams handle all three.
Why can't one agent handle everything? There are three main reasons.
First, context overload. A single agent can only process so much info. Large codebases can overwhelm it, causing you to lose important details as the conversation drags on.
Second, lack of specialization. When one agent tries to do it all—data layer, API, UI, tests—it becomes a jack of all trades, master of none. An agent focused solely on the data layer, for instance, will write much better database code than a generalist trying to juggle everything.
Lastly, no coordination. Even if you bring in helpers, they can't communicate, share tasks, or manage dependencies. Adding more agents without coordination just makes things messier.
Why do we need multi-agents?
Parallelism, specialization, isolation, and compound learning don't just add up—they multiply. Three focused agents consistently do better than one generalist working three times as long.
Four reasons to use multiple agents:
Parallelism (3x speed) - Three agents work on frontend, backend, and tests at the same time.
Specialization (focused work) - Each agent handles only its own files. An agent focused on db.js writes better database code than one managing everything.
Isolation (safe work) - Git worktrees give each agent its own space. No merge conflicts while they work.
Compound learning - An AGENTS.md file collects patterns and tips, improving each session.
What is the Subagents pattern?
Subagents are the simplest multi-agent pattern and the one you should try first.
Subagents use the Task tool to spawn specialized child agents from a parent orchestrator. The parent decomposes a task into pieces, spawns subagents for each piece, and manages the dependency graph manually.
What subagents solve: context isolation per agent, specialization, parallel execution for independent tasks, and it's cost-neutral at roughly 220k tokens total.
What is still missing: the parent must manually manage the dependency graph. There's no peer messaging between agents. There's no shared task list. And if you're sloppy about file scoping, two agents could write to the same file.
Bottom line: subagents give you parallel execution with manual coordination. That is great for simple decomposition. But when coordination becomes the bottleneck, you need Agent Teams.
The pro tip of working with hierarchical subagents?
Don't just delegate once. Create feature leads that have their own specialists. This allows for deeper task breakdown without overwhelming anyone.
Instead of the orchestrator creating six subagents, it should create two feature leads. Each lead then creates two or three specialists.
The main orchestrator only communicates with two agents, keeping things simple. Feature Lead A, for example, gets the task "Build the search feature" and breaks it down into Data, Logic, and API subagents. The main orchestrator doesn't need to know these details.
This approach is similar to real engineering teams, where tasks are assigned through layers of tech leads, not directly from the VP of Engineering to individual engineers.
What is the Agent Team pattern?
Agent Teams add the coordination primitives that subagents lack: a shared task list with dependency tracking, peer-to-peer messaging between teammates, and file locking to prevent conflicts.
Agent Teams are Claude Code's experimental feature for true parallel execution. Enable it with:
export CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1
The architecture has three layers:
Team Lead at the top - decomposes work, creates the task list, synthesizes results
Shared Task List in the middle - tasks with statuses (pending, in_progress, completed, blocked), dependency tracking, and file locking
Teammates at the bottom - each an independent Claude Code instance with its own context window, running in tmux split panes
Teammates self-claim tasks from the shared list. They message each other directly - peer-to-peer, not through the lead. When a teammate finishes and marks a task complete, any blocked tasks that depended on it automatically unblock. Press Ctrl+T at any time to toggle a visual overlay of the task list.
How do Agent Teams work?
Two mechanisms make Agent Teams work: a shared task list with automatic dependency resolution, and peer-to-peer messaging that prevents the lead from becoming a bottleneck.
The shared task list gives each task a status: pending, in_progress, completed, or blocked. Blocked tasks have explicit dependencies. When the backend teammate marks the search API as completed, the blocked test-writing task automatically flips to pending and a teammate picks it up. File locking prevents two teammates from editing the same file simultaneously.
Peer messaging is the other critical piece. The backend agent tells the frontend agent the API contract directly: "GET /search?q= returns [{id,title,url}]." This doesn't go through the lead. When a teammate goes idle, the lead is automatically notified. This peer-to-peer approach prevents the lead from becoming a coordination bottleneck.
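The dependency-unblocking mechanics can be sketched in a few lines. This is an illustration of the concept, not Claude Code's actual implementation; the task names and statuses are made up:

```python
# A shared task list: each task has a status and a list of dependencies.
tasks = {
    "search-api":  {"status": "in_progress", "deps": []},
    "write-tests": {"status": "blocked",     "deps": ["search-api"]},
    "frontend":    {"status": "in_progress", "deps": []},
}

def complete(name):
    """Mark a task completed, then unblock any task whose deps are all done."""
    tasks[name]["status"] = "completed"
    for task in tasks.values():
        if task["status"] == "blocked" and all(
            tasks[dep]["status"] == "completed" for dep in task["deps"]
        ):
            task["status"] = "pending"  # a teammate can now self-claim it

complete("search-api")
print(tasks["write-tests"]["status"])  # pending
```

When the backend teammate completes the search API, the blocked test-writing task flips to pending automatically, with no lead involvement.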
How can you manage with 5, 10, or 20+ agents across multiple repos and features?
When you need to manage 5, 10, or 20+ agents across multiple repos and features, you need purpose-built orchestration tools. Every tool in 2026 fits one of three tiers - pick the right tier for the job.
The 2026 tool landscape breaks into three tiers:
Tier 1: In-process subagents and teams
Claude Code subagents and Agent Teams. Single terminal session, no extra tooling needed. Start here.
Tier 2: Local orchestrators
Your machine spawns multiple agents in isolated worktrees. You stay in the loop with dashboards, diff review, and merge control. Best for 3-10 agents on known codebases. Tools include Conductor, Vibe Kanban, Gastown, OpenClaw + Antfarm, Claude Squad, Antigravity, and Cursor Background Agents.
Tier 3: Cloud async agents
Assign a task, close your laptop, return to a pull request. Agents run in cloud VMs. No terminal, no local setup. Tools include Claude Code Web, GitHub Copilot Coding Agent, Jules by Google, and Codex Web by OpenAI.
Most developers in 2026 will use all three tiers - Tier 1 for interactive work, Tier 2 for parallel sprints, Tier 3 to drain the backlog overnight.
How do we keep shipping high-quality code when working with multiple agents?
Three quality gates make agent output trustworthy: plan approval catches bad architecture before code exists, hooks enforce automated checks on every lifecycle event, and AGENTS.md compounds learning across sessions.
Plan approval. Require teammates to write a plan before they start coding. The lead reviews the approach and approves or rejects. It's far cheaper to fix a bad plan than to fix bad code. The flow looks like:
teammate >>> writes plan >>> lead review >>> approve/reject >>> implement
Hooks. Automated checks on lifecycle events. A TeammateIdle hook verifies all tests pass before allowing an agent to stop working. A TaskCompleted hook runs lint and tests before marking a task as done. If the hook fails, the agent keeps working until it passes:
task done >>> hook runs npm test >>> pass? allow | fail? keep working
AGENTS.md for compound learning. This file captures discovered patterns, gotchas, and style preferences. Every agent reads it at the start of a session, and every session adds to it. Session one learns about a testing pattern, AGENTS.md is updated, session two avoids the same mistake.
What is the Ralph Loop?
The Ralph Loop pattern breaks development into small atomic tasks and runs an agent in a stateless-but-iterative loop. Each iteration: pick task, implement, validate, commit if pass, reset context and repeat. This avoids context overflow while maintaining continuity through external memory.
The five-step cycle:
Pick - select the next task from tasks.json
Implement - make the change
Validate - run tests, types, lint
Commit - if checks pass, commit and update task status
Reset - clear the agent context and start fresh with the next task
The key insight is stateless-but-iterative. By resetting each iteration, the agent avoids accumulating confusion. Small bounded tasks produce cleaner code with fewer hallucinations than one enormous prompt.
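The five-step cycle can be sketched as a driver loop. The tasks.json shape and the implement/validate callables are stand-ins for whatever your agent and test runner actually do:

```python
import json

def ralph_loop(tasks_path, implement, validate):
    """Stateless-but-iterative: one task per iteration, fresh context each time."""
    with open(tasks_path) as f:
        tasks = json.load(f)
    for task in tasks:
        if task["status"] != "pending":
            continue  # Pick: only pending tasks are eligible
        result = implement(task)     # Implement: agent runs with a fresh context
        if validate(result):         # Validate: tests, types, lint
            task["status"] = "done"  # Commit: a real loop would git-commit here
        # Reset: nothing carries over; the next iteration starts clean
    with open(tasks_path, "w") as f:
        json.dump(tasks, f, indent=2)
```

The external tasks.json file is the agent's only memory between iterations, which is exactly what keeps each iteration's context small and clean.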
AI Tips
Establish AI guidelines for the project
=> Review the guidelines on instructions, agents, and skills from the Awesome Copilot GitHub repository, identify the useful ones, and download them into our source code. To simplify the download process, install the Awesome Copilot extension from the VS Code Marketplace. After downloading the guidelines, carefully review them to make sure they fit our codebase.
There are some useful GitHub repos that we can refer to for useful AI Agent guidelines: https://github.com/anthropics/skills/tree/main, https://github.com/obra/superpowers, https://github.com/addyosmani/agent-skills
Below are some AI guidelines from the GitHub Copilot repository that I use the most:
Copilot Instructions
code-review-generic.instructions.md: Generic, language-agnostic guidelines for conducting effective code reviews.
context-engineering.instructions.md: Principles for helping GitHub Copilot understand your codebase and provide better suggestions.
context7.instructions.md: Use Context7 proactively whenever the task depends on authoritative, current, version-specific external documentation that is not present in the workspace context.
markdown-accessibility.instructions.md: Markdown accessibility guidelines based on GitHub's 5 best practices for inclusive documentation
markdown.instructions.md: Markdown formatting aligned to the CommonMark specification (0.31.2).
memory-bank.instructions.md: Coding standards, domain knowledge, and preferences that AI should follow.
nextjs-tailwind.instructions.md: Instructions for high-quality Next.js applications with Tailwind CSS styling and TypeScript.
nextjs.instructions.md: Best practices for building Next.js (App Router) apps with modern caching, tooling, and server/client boundaries (aligned with Next.js 16.1.1).
performance-optimization.instructions.md: The most comprehensive, practical, and engineer-authored performance optimization instructions for all languages, frameworks, and stacks.
security-and-owasp.instructions.md: Comprehensive secure coding instructions for all languages and frameworks, based on OWASP Top 10 and industry best practices.
Custom Agents
4.1-Beast.agent.md: It is a custom “agent profile” that tells Copilot how to behave: keep working until the task is fully solved, use lots of research (including web fetches), plan with to-do lists, test rigorously, and follow specific workflow rules. In short, it is a strict, autonomous mode configuration for long, research-heavy tasks.
Thinking-Beast-Mode.agent.md: A transcendent coding agent with quantum cognitive architecture, adversarial intelligence, and unrestricted creative freedom.
aem-frontend-specialist.agent.md: Expert assistant for developing AEM components using HTL, Tailwind CSS, and Figma-to-code workflows with design system integration
blueprint-mode.agent.md: Executes structured workflows (Debug, Express, Main, Loop) with strict correctness and maintainability. Enforces an improved tool usage policy, never assumes facts, prioritizes reproducible solutions, self-correction, and edge-case handling.
critical-thinking.agent.md: Challenge assumptions and encourage critical thinking to ensure the best possible solution and outcomes.
debug.agent.md: Debug your application to find and fix a bug.
demonstrate-understanding.agent.md: Validate user understanding of code, design patterns, and implementation details through guided questioning.
devils-advocate.agent.md: I play the devil's advocate to challenge and stress-test your ideas by finding flaws, risks, and edge cases
expert-nextjs-developer.agent.md: Expert Next.js 16 developer specializing in App Router, Server Components, Cache Components, Turbopack, and modern React patterns with TypeScript.
expert-react-frontend-engineer.agent.md: Expert React 19.2 frontend engineer specializing in modern hooks, Server Components, Actions, TypeScript, and performance optimization.
principal-software-engineer.agent.md: Provide principal-level software engineering guidance with focus on engineering excellence, technical leadership, and pragmatic implementation.
refine-issue.agent.md: Refine the requirement or issue with Acceptance Criteria, Technical Considerations, Edge Cases, and NFRs
Agent Skills
agentic-eval: Patterns and techniques for evaluating and improving AI agent outputs.
architecture-blueprint-generator: Comprehensive project architecture blueprint generator that analyzes codebases to create detailed architectural documentation
autoresearch: Autonomous iterative experimentation loop for any programming task.
breakdown-epic-arch: Prompt for creating the high-level technical architecture for an Epic, based on a Product Requirements Document.
chrome-devtools: Expert-level browser automation, debugging, and performance analysis using Chrome DevTools MCP. Use for interacting with web pages, capturing screenshots, analyzing network traffic, and profiling performance.
context-map: Generate a map for all relevant files to a task before making changes
conventional-commit: Prompt and workflow for generating conventional commit messages using a structured XML format
copilot-instructions-blueprint-generator: Technology-agnostic blueprint generator for creating comprehensive copilot-instructions.md files that guide GitHub Copilot to produce code consistent with project standards, architecture patterns, and exact technology versions by analyzing existing codebase patterns and avoiding assumptions.
create-agentsmd: Prompt for generating an AGENTS.md file for a repository
create-implementation-plan: Create a new implementation plan file for new features, refactoring existing code or upgrading packages, design, architecture or infrastructure.
create-specification: Create a new specification file for the solution, optimized for Generative AI consumption.
create-technical-spike: Create time-boxed technical spike documents for researching and resolving critical development decisions before implementation
documentation-writer: Diátaxis Documentation Expert. An expert technical writer specializing in creating high-quality software documentation, guided by the principles and structure of the Diátaxis technical documentation authoring framework
doublecheck: Three-layer verification pipeline for AI output. Extracts verifiable claims, finds supporting or contradicting sources via web search, runs adversarial review for hallucination patterns, and produces a structured verification report with source links for human review
skill-creator: Create new skills, modify, and improve existing skills, and measure skill performance. Use when the user wants to create a skill from scratch, edit, or optimize an existing skill, run evals to test a skill, benchmark skill performance with variance analysis, or optimize a skill's description for better triggering accuracy.
model-recommendation: Analyze chatmode or prompt files and recommend optimal AI models based on task complexity, required capabilities, and cost-efficiency
prd: Generate high-quality Product Requirements Documents (PRDs) for software systems and AI-powered features. Includes executive summaries, user stories, technical specifications, and risk analysis.
project-workflow-analysis-blueprint-generator: Comprehensive technology-agnostic prompt generator for documenting end-to-end application workflows.
refactor: Surgical code refactoring to improve maintainability without changing behavior.
remember: Transforms lessons learned into domain-organized memory instructions (global or workspace)
dispatching-parallel-agents: Use when facing 2+ independent tasks that can be worked on without shared state or sequential dependencies
executing-plans: Use when you have a written implementation plan to execute in a separate session with review checkpoints
systematic-debugging: Use when encountering any bug, test failure, or unexpected behavior, before proposing fixes
using-agent-skills: Discovers and invokes agent skills. Use when starting a session or when you need to discover which skill applies to the current task. This is the meta-skill that governs how all other skills are discovered and invoked
code-review-and-quality: Conducts multi-axis code review. Use before merging any change. Use when reviewing code written by yourself, another agent, or a human. Use when you need to assess code quality across multiple dimensions before it enters the main branch.
context-engineering: Optimizes agent context setup. Use when starting a new session, when agent output quality degrades, when switching between tasks, or when you need to configure rules files and context for a project.
MCP
Chrome DevTools MCP: chrome-devtools-mcp lets your coding agents (such as Gemini, Claude, Cursor, Copilot) control and inspect a live Chrome browser. It acts as a Model Context Protocol (MCP) server, giving your AI coding assistant access to the full power of Chrome DevTools for reliable automation, in-depth debugging, and performance analysis.
Context7 MCP: Without Context7, LLMs rely on generic, outdated information about the libraries we use. To resolve that, Context7 pulls up-to-date, version-specific documentation and code examples straight from the source and places them directly into your prompt.
Playwright MCP: This MCP provides browser automation capabilities using Playwright. The server enables LLMs to interact with webpages through structured accessibility snapshots, bypassing the need for screenshots and visually tuned models.
Atlassian MCP Server: The Atlassian Rovo MCP server is a cloud-based bridge between your Atlassian cloud site and compatible tools. Once configured, it enables those tools to interact with Jira, Confluence, and Compass data in real time. This functionality is powered by secure authentication using OAuth 2.1 or API tokens, which ensures all actions respect the user's existing access controls.
Figma MCP Server: The Figma MCP Server brings Figma directly into your workflow by providing important design information and context to AI agents, enabling them to generate code from Figma design files.
shadcn/ui MCP: The Shadcn Server allows the AI Assistant to interact with items from registries. We can browse available components, search for specific ones, and install them directly into our project with natural language.
GitHub MCP Server: The GitHub MCP Server connects AI tools directly to GitHub's platform. This gives AI agents, assistants, and chatbots the ability to read repositories and code files, manage issues and PRs, analyze code, and automate workflows, all through natural language interactions.
Permit's MCP-gateway: Permit MCP Gateway is a drop-in zero-trust proxy that adds authentication, fine-grained authorization, consent, and audit logging to any MCP server. Swap one URL. No code changes. Works with Salesforce, GitHub, Slack, Jira, and any MCP server you already use.
Hooks
secret-scanner: Scans files modified during a Copilot coding agent session for leaked secrets, credentials, and sensitive data
session-auto-commit: Automatically commits and pushes changes when a Copilot coding agent session ends
session-logger: Logs all Copilot coding agent session activity for audit and analysis
tool-guardian: Blocks dangerous tool operations (destructive file ops, force pushes, DB drops) before the Copilot coding agent executes them
=> Let's set up a rule for the AI so that whenever project guideline files are updated or there's a big change in our source code, like updating package versions, the AI automatically checks other documents to ensure everything stays consistent.
=> When searching for specific agents, instructions, or skills, run the suggest-awesome-copilot-agents, suggest-awesome-copilot-instructions, or suggest-awesome-copilot-skills tools. They will help you find exactly what you need from the Awesome Copilot repository.
Experience of Working with AI
Spec-driven development
Plan your work before you write code. But rather than writing plans for humans, we're writing plans and tasks for agents.
What is spec-driven development?
Instead of prompting first and figuring it out as you go, you start with a spec. It's a short document that defines what you're building, the constraints, and the key decisions—a contract for how your code should behave. The spec becomes the source of truth the agent uses to generate code. Less guesswork. Fewer surprises.
I see people conflating PRDs, design docs, and specs all the time. They serve completely different purposes, and mixing them up — especially when working with AI agents — causes a lot of confusion down the line.
Here's how I separate them:
Product Requirements Document (PRD) — For humans. Product managers, stakeholders. Covers what we're building and why — user stories, success metrics, business value. This is your debate document.
Technical Design Document — For engineers. Covers how we're building it — architecture decisions, scalability trade-offs, security implications. Also reviewed and debated with the team.
AI Spec — For agents. This is a pure execution document. No debates, no open questions. It translates the finalized decisions from the PRD and design doc into something an agent can act on directly.
Traditional software development has always followed this pattern:
The process is the same. The audience is different.
My Workflow:
Generate the PRD
I use spec-driven-instructions-workflow.md combined with skills like prd/skill.md, prd.agent.md, or breakdown-epic-pm to generate the initial PRD. For large epics, I first use breakdown-feature-prd to break them down into feature-level PRDs before any implementation starts. Important: always use your most capable (most expensive) model for these root documents. And never trust the first generated version. My verification pipeline:
Run the doublecheck skill — surfaces inconsistencies and gaps => Run agentic-eval — makes updates based on the doublecheck report => Manual review using critical-thinking — go through every single point and make sure you genuinely understand it => Use agents like principal-software-engineer, devils-advocate, demonstrate-understanding, and refine-issue to stress-test assumptions further.
Generate the Technical Architecture
With a solid prd.md, use the breakdown-epic-arch skill to produce the high-level technical architecture for the epic. This becomes your design.md.
Generate the Spec
With both prd.md and design.md ready, generate a detailed spec using the create-specification skill combined with the spec-driven-workflow instructions. This spec.md is what agents will actually work from.
Generate the Implementation Plan
From spec.md, use create-implementation-plan/SKILL.md to produce a task-by-task plan.md.
At this point you have all four core documents:
| File | Purpose |
|---|---|
| prd.md | What & why |
| design.md | How (architecture) |
| spec.md | Execution details for agents |
| plan.md | Task-by-task implementation steps |
Review all four before touching any code.
During Implementation
Even with a solid plan, I don't just hand tasks over and walk away. My rules:
Run doublecheck before every task — pull in external resources to verify the approach and understand the full impact of each change before it's made.
Understand everything the AI produces — not "looks fine." Actually understand each decision and each line. If something's unclear, ask the agent to explain it with demonstrate-understanding.
One task = one commit — every task on the plan gets its own commit. Keeps history clean and makes it easy to trace what changed and why.
One big task = a new chat session — implement every big task on the plan in a fresh chat session. It keeps the context window clean, so it isn't poisoned by irrelevant information from previous responses.
Mark tasks as Done — always tell the agent to update the Complete column in plan.md after finishing each task. Small habit, makes tracking much easier.
Split tasks that are too big or too complex into multiple smaller tasks. If a big task can't be split, ask the AI to apply rules or instructions like those in copilot-thought-logging.instructions.md, noting its thoughts to a dedicated memory file to avoid losing context.
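To illustrate the Complete column, a plan.md task table might look like the following (task names are hypothetical):

```markdown
| # | Task                        | Complete |
|---|-----------------------------|----------|
| 1 | Add JWT middleware          | Yes      |
| 2 | Add refresh-token endpoint  |          |
```

The agent updates the last column after each finished task, so progress is visible at a glance.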
We can build a skill named executing-plans with two modes:
Mode 1 — phase review: load the phase, review it with the agentic-eval skill, review it again with the doublecheck skill, update relevant docs for consistency, then report.
Mode 2 — task implementation: load the task, implement it, verify with the verification-before-completion skill, review with the code-review skill, then mark it complete in plan.md.
For skills like verification-before-completion and code-review, there are examples in the obra/superpowers GitHub repo (remember to customize them to suit our project, tools, and AI agent).
So whenever we need to review a phase before implementation, or implement a task within it, we just attach the executing-plans skill to the AI agent.
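A minimal sketch of what that executing-plans skill file could look like. The frontmatter fields and wording below are assumptions, to be adapted to whatever skill format your agent expects:

```markdown
---
name: executing-plans
description: Review a phase or implement a single task from plan.md
---
## Mode 1 — Phase review
1. Load the phase from plan.md
2. Review it with the agentic-eval skill, then the doublecheck skill
3. Update related docs for consistency and report findings

## Mode 2 — Task implementation
1. Load the task and implement it
2. Verify with verification-before-completion, review with code-review
3. Mark the task complete in plan.md
```

Encoding both modes in one skill means a single attachment covers the whole phase-then-task loop.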
After completing all tasks in the plan
At this stage, there are several things I consistently follow up on:
Ask the AI to review the current AI documents, such as AGENTS.md, copilot-instructions.md, and other guidelines, and ensure they are consistent with the updates made in plan.md.
Completing all tasks in plan.md doesn't mean the feature is completely finished. Carefully verify the AI's work for any remaining issues and ask the AI to fix them. Remember to ask the AI to note these fixes in the spec folder, keeping those documents up to date with the code changes.
To fix the issues we find, it helps to develop a debugging skill (containing a verification health-check skill, a code-review skill, a commit-convention skill, and so on) for verifying and resolving them. A debugging skill template is available at https://github.com/obra/superpowers. With this in place, we simply attach the debugging skill, include the spec folder with all the necessary context for the AI agent, and provide an issue description. The AI can then quickly help resolve the issue.
Keeping Docs in Sync
Docs live in multiple places — the specs/ folder in the repo, Jira tickets, Confluence pages — and they drift apart fast.
My rule: whenever a doc changes in one place, all related docs must be updated too. I explicitly tell the AI agent: if you modify a file in the specs/ folder, identify and update all related documents across Jira, Confluence, and any other linked sources. It's not a perfect system, but it cuts down on the drift problem significantly.
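One lightweight way to make that rule checkable is a script that flags spec files modified since the last sync. This is a hedged sketch; the specs/ path and the .docs-synced marker file are assumptions for illustration:

```shell
#!/bin/bash
# Demo setup so the sketch is self-contained; in a real repo these files already exist.
mkdir -p specs && touch specs/auth.md
touch -t 202001010000 .docs-synced   # marker normally rewritten after each sync

# Any spec file newer than the marker still needs its Jira/Confluence twins updated.
find specs -name '*.md' -newer .docs-synced   # → specs/auth.md
```

Running this before closing out a feature gives the agent (or you) an explicit list of documents whose external copies may have drifted.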
This is more upfront work than just prompting and hoping for the best — but once all four documents are locked in and the verification habit is in place, implementation gets surprisingly smooth. And more importantly, you actually understand what was built when it's done.
Spec-Driven Workflow with Linear
Use Linear MCP to give Claude Code direct access to your project management. The agent reads tickets, updates status, creates sub-issues, and follows your workflow — without you copy-pasting descriptions.
For larger features, combine specs with Linear tickets:
Write the spec
Read my spec template and generate a spec for: user authentication with JWT tokens. Save it to specs/auth.md
Create tickets from the spec
Read specs/auth.md and create a parent Linear issue titled "JWT Authentication" in the Auth project, team Engineering. Then create one sub-issue per task from the spec, linked to the parent. Each sub-issue should reference the spec: "Spec: specs/auth.md"
Works on tickets
/implement PROJ-43
The agent reads the ticket, finds the spec reference, reads the spec, and implements with full context.
Track Progress
Your Linear board now shows:
Parent issue with progress bar (auto-calculated from sub-issues)
Each sub-issue moving through: Todo → In Progress → In Review → Done
Comments with PR links on each ticket
You get visibility without doing any project management manually.
Review AI Code
https://newsletter.owainlewis.com/p/how-i-use-ai-to-review-ai-code
Here’s the four-layer setup I use. Each layer filters out a category of problems so the next layer sees less noise.
Automated checks run your linter, tests, and security scanner before the agent can finish.
Local AI review gets a second agent to review the code before you push.
CI review runs AI code review automatically on every PR. It's the safety net for when you skip step two (it happens).
Human review handles what’s left: architecture, business logic, and “should we even build this?”
By the time a human looks at the code, the only things remaining are the things only a human can judge.
Layer 1: Automate The Obvious
Claude Code has a feature called hooks. A hook is a shell script that runs automatically at certain points in the agent lifecycle (like when the agent finishes a task). If the script fails, the agent is blocked from completing and has to fix the issues first.
I use a Stop hook that runs my linter and scanner every time Claude finishes work.
The config goes in your Claude Code settings:
{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": ".claude/hooks/stop-checks.sh"
          }
        ]
      }
    ]
  }
}
The script itself is just whatever checks you already run:
#!/bin/bash
set -e                          # stop at the first failing check
rubocop .                       # Ruby linter
brakeman -q                     # security scanner (quiet mode)
bundle exec rspec --fail-fast   # test suite, stop at the first failure
Swap those for whatever your project uses. Ruff and pytest for Python. ESLint for JavaScript. The point is the same: the agent can’t say “done” until these pass.
This alone catches a surprising amount. Formatting issues, unused imports, type errors, broken tests. None of that makes it into a review.
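The same Stop hook can also be written generically, so the list of checks is the only thing you edit per project. A sketch, where the echo commands are placeholders standing in for your real linter and test commands:

```shell
#!/bin/bash
# Generic stop-hook sketch: run every configured check, fail on the first error.
set -e
CHECKS=("echo lint-ok" "echo tests-ok")   # e.g. "ruff check .", "pytest --maxfail=1"
for cmd in "${CHECKS[@]}"; do
  echo "Running: $cmd"
  $cmd                                    # a non-zero exit blocks the agent from finishing
done
echo "All checks passed"
```

Because `set -e` aborts the script on the first failing command, the agent stays blocked until every check in the list passes.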
Layer 2: Agent Review
After automated checks pass, review the code yourself and get an AI second opinion before you push.
Two things matter here. First, actually run the code. This sounds obvious but it catches the most embarrassing bugs in two minutes. Second, read the diff. You don’t need to understand every line; understand the shape of the change. What files were touched? Does the scope match what you asked for? Did the agent silently change something you didn’t ask it to?
For the AI review, the key is a fresh context window. Don’t ask the same agent that wrote the code to review it. It has sunk-cost bias and is less likely to challenge its own decisions.
There are a few ways to do this:
Custom Claude Code command. A review prompt in .claude/commands/review.md, paired with a REVIEW file at the project root that encodes your project-specific rules. Portable across tools, fully customisable. Claude Code also ships with some built-in plugins.
Codex /review. Four presets covering every scenario (base branch, uncommitted changes, specific commit, custom instructions). Priority-ranked findings. The best local review UX I’ve seen. Bonus: writing with Claude and reviewing with Codex means cross-model review built into your workflow. Different models have different blind spots.
CodeRabbit. /coderabbit:review locally. 40+ linters and scanners running behind the scenes, purpose-built for code review. There are many other great code review tools like Greptile to explore also.
I use a custom review command that reads a REVIEW file at the project root. This file has project-specific rules, things I always want checked.
# REVIEW.md
## Project Patterns
- Repository pattern for data access. Direct DB queries in handlers are a flag.
- New API routes need an integration test. Flag if missing.
The general review catches general problems. The project-specific rules catch the things that are unique to your codebase.
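For reference, a review command along these lines can be very short. The wording below is a hypothetical sketch of a .claude/commands/review.md prompt, not the author's actual file:

```markdown
Review the diff of the current branch against main.

1. Check the code against every rule in REVIEW.md at the project root.
2. Then do a general review: correctness, error handling, security, tests.
3. Report findings ranked by severity, with file and line references.
```

Putting the project-specific rules in REVIEW.md rather than in the command itself keeps the prompt portable across projects.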
Layer 3: External Review
Sometimes I forget to run the local review. Sometimes I’m in a rush. So I have an automated check on GitHub that reviews every PR before a human sees it.
There are a few options for this. Codex has a GitHub integration that reviews PRs automatically. CodeRabbit has a GitHub App that does the same thing. Anthropic has an open source GitHub Action for security-focused review.
I like having this as a separate layer because it catches things even when I skip the local step. Set it up once, runs on every PR for free.
Layer 4: Human Review
By the time a teammate opens the PR, the linter has passed, tests are green, and an AI has already flagged obvious issues. The human reviewer doesn’t need to catch formatting problems or unused variables.
What’s left is the stuff only a human can judge. Is this the right approach? Does it solve the actual business problem? Will this cause issues in three months? Five minutes of focused review on those questions is more valuable than thirty minutes of line-by-line reading.
TL;DR
I spent a long time trying to find the perfect code review setup. The experience was frustrating. There are hundreds of tools, plugins, and approaches, many of them doing the same thing in slightly different ways.
Don’t get lost looking for the perfect solution or perfect prompt. Start with Layer 1. Set up your linter and your hooks; that alone eliminates an entire category of review noise. Then find one way to get an AI review locally that you trust. Add CI when you’re ready.
Start simple and never accept the first output from an agent.
References
https://github.com/owainlewis/youtube-tutorials/tree/main
https://github.com/obra/superpowers
https://ainative.to/p/ai-agents-full-course-2026