My experience of using AI Agents

AI Questions
What is the llms.txt file for?
llms.txt is a site-level instruction file for AI/LLM agents, like robots.txt but for models. It tells models what the site is about and which sources to prioritize. Example: a site could point models to its official docs and API reference; the spec and examples are described at https://llmstxt.org/ (which you can follow to publish your own).
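As a sketch, a minimal llms.txt following the format described at llmstxt.org looks like this (the project name and URLs below are placeholders, not a real site):

```markdown
# Acme Widgets

> Acme Widgets is a TypeScript library for building embeddable dashboard widgets.

## Docs

- [Quickstart](https://example.com/docs/quickstart.md): installation and first widget
- [API Reference](https://example.com/docs/api.md): full public API

## Optional

- [Changelog](https://example.com/changelog.md): release history
```

The H1 title and blockquote summary give models the one-paragraph context; the link sections tell them which sources to prioritize.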
What even is the difference? They all seem like "AI configuration stuff."
Yep, fair. Think of it this way:
Instructions = Passive rules that are always applied in the background
Agents = Custom AI personas you manually switch into
Skills = Step-by-step playbooks you explicitly invoke for a specific task
So instructions just... always run?
Yes — automatically. The applyTo frontmatter pattern controls when they're injected. For example, context-engineering.instructions.md has applyTo: '**' so it's always active. The nextjs.instructions.md has applyTo: '**/*.tsx, **/*.ts, ...' so it only kicks in when you're editing TypeScript/JSX files.
You don't call them. Copilot just silently reads them and adjusts its suggestions accordingly. They're like a standing memo on your desk — always there.
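As a sketch, an instruction file with applyTo frontmatter looks like this (the file name, description, and rule text here are made up for illustration; only the applyTo mechanism is the point):

```markdown
---
description: 'React component conventions'
applyTo: '**/*.tsx, **/*.ts'
---

# React conventions

- Prefer function components and hooks.
- Co-locate tests next to the component file.
```

Because applyTo only matches TypeScript/JSX files, these rules are injected silently whenever you edit one and stay out of the way otherwise.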
When would I actually switch to a custom agent mode?
When you want a fundamentally different persona and toolset for a session. Look at debug.agent.md — it locks in a specific tool list (terminal, tests, problems panel) and a structured debugging workflow. Or expert-nextjs-developer.agent.md — it declares expertise, forces certain behaviors, and even pins a model (GPT-4.1).
Basically: use an agent when "default Copilot" isn't the right mindset for the job. Debugging, writing a PRD, doing a technical spike — those all benefit from a mode switch.
What about skills — do those auto-load like instructions?
Nope. Skills are NOT automatically loaded. They're on-demand playbooks. You invoke them explicitly by referencing them in a prompt, or by having an agent that knows to use them.
For example, conventional-commit won't do anything unless you tell Copilot "use the conventional-commit skill" or you build it into an agent's workflow. Think of skills as reusable procedures stored in a drawer — they only run when someone pulls them out.
What's the loading order/hierarchy when Copilot runs?
Roughly:
copilot-instructions.md — loaded first, always, the main entry point
.github/instructions/*.instructions.md - auto-applied based on applyTo matching your open file
.github/agents/*.agent.md - only loaded when you explicitly switch to that agent mode in the Copilot chat panel
.github/skills/*/SKILL.md - only loaded when explicitly invoked
When do I use each?
Instructions: Set-and-forget rules (tech constraints, code style, a11y rules, security guidelines)
Custom Agent: When you need a different "mode" — debugging, planning, writing specs
Skill: When you have a repeatable multi-step workflow you want to reuse without re-explaining every time
What is AGENTS.md, and who is it for?
AGENTS.md is a universal guide following the agents.md open spec. It targets any AI coding agent (Claude, Gemini, OpenAI Codex, etc.) and covers setup commands, project structure, testing patterns, and workflow — anything tool-agnostic.
What is copilot-instructions.md, and who is it for?
It's a GitHub Copilot-specific file loaded automatically by VS Code. It focuses on IDE-level code generation rules: version constraints, ❌ NEVER / ✅ ALWAYS generation constraints, and pointers to the granular instruction files that use applyTo glob patterns.
What content belongs in each file?
| Content | AGENTS.md | copilot-instructions.md |
|---|---|---|
| Setup / install commands | ✅ | ❌ (link to AGENTS.md) |
| Project structure | ✅ | ❌ (link to AGENTS.md) |
| Testing instructions | ✅ | ❌ (link to AGENTS.md) |
| Code generation constraints | ❌ | ✅ |
| applyTo instruction files | ❌ | ✅ |
| Version compatibility rules | ❌ | ✅ |
Should they duplicate content?
No. copilot-instructions.md should delegate to AGENTS.md for shared content to avoid drift. Your current setup already does this correctly.
What is the single most important design principle?
AGENTS.md is the single source of truth for project facts. copilot-instructions.md is a Copilot-specific lens on top of it — not a copy.
Why shouldn't we let AI write the AGENTS.md?
This is the big one. Research has shown (Gloaguen et al., ETH Zurich) that LLM-generated AGENTS.md files offer no benefit and can marginally reduce success rates (~3% on average) while increasing inference costs by over 20%. Developer-written context files, by contrast, provide a modest ~4% improvement. Never let an agent write to AGENTS.md directly. The lead must approve every line. Keep it short, with clear sections:
https://addyosmani.com/blog/agents-md/
What are the common issues that lead to AI failures?
The AI breaks in five specific ways:
Hallucination from missing context: The agent doesn’t have the information it needs, so it fills the gap with confident fiction. You asked for a login page and it invented an auth library that doesn’t exist.
State loss in long workflows: At 10K tokens, the agent is sharp. At 150K, it forgets you wanted dark mode and rebuilds the entire UI in blinding white. Three times.
Tool misuse: Wrong tool, bad parameters, or just ignoring the output entirely. The agent has access to 15 tools and picks the worst one for the job.
Unproductive loops: Same failed approach, over and over. Burning tokens with zero progress. The agent version of pulling a door that says “push.”
Silent failure at scale: Here’s the math nobody mentions: if your agent is 95% accurate per decision and makes 20 decisions per task, it fails more often than it succeeds. Small inaccuracies compound. At production scale, 95% isn’t good enough.
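The compounding-error claim is easy to verify; this sketch is pure arithmetic using only the numbers from the text above:

```python
# Probability that a 20-decision task succeeds end-to-end
# when each individual decision is 95% accurate.
per_decision_accuracy = 0.95
decisions_per_task = 20

task_success_rate = per_decision_accuracy ** decisions_per_task
print(f"{task_success_rate:.1%}")  # ~35.8%: the task fails roughly 2 out of 3 times
```

A 5% per-step error rate compounds to a ~64% per-task failure rate, which is why small inaccuracies dominate at production scale.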
Why do our AI agents get worse over time as the context window fills up?
LLM output quality degrades as context fills. This isn’t a bug. It’s a property of how transformer attention works. Sharp at 10K tokens. Noticeably worse at 150K on the same tasks.
Think of context management like packing for a backpacking trip. You could bring your entire wardrobe. You’d move slowly and nothing would be easy to find. Pack light instead. Know where to resupply.
In context (what you carry): system prompt, instruction files, skill headers, the current task.
On disk (available when needed): full file contents, web search results, codebase, git history. The agent loads these on demand through tools.
In practice, you’ll find that system prompts, MCP tool definitions, and memory files can eat 8-9K tokens before you’ve typed a word. That’s context space not available for reasoning. Audit regularly. Trim what you can.
Compaction happens when the context hits its limit. The platform summarizes your conversation to free space. It works, but details get lost. Early decisions compress. Tool outputs disappear. Don’t let it surprise you — manage context proactively so the agent doesn’t lose important state at the worst moment.
What do agents actually cost?
Nobody publishes this part. Agents burn tokens fast.
The 60/30/10 Rule
Not every subtask needs the most expensive model:
60% of tokens → cheap models (Haiku, Flash). Scraping, classification, templated work.
30% → mid-tier (Sonnet). Writing, reasoning, code generation.
10% → frontier (Opus). Routing decisions, complex orchestration, high-stakes judgment.
Approximate math on 100 million tokens per month:
Same work. Roughly 60% cheaper. Token prices shift quarterly — check current rates — but the principle holds: most subtasks don’t need frontier-level reasoning.
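To make the approximate math concrete, here is a sketch with illustrative per-million-token prices (the rates below are placeholders, not current list prices; plug in real rates before budgeting):

```python
# Hypothetical blended $/1M-token rates per tier -- replace with current prices.
price_per_m = {"cheap": 5.0, "mid": 15.0, "frontier": 25.0}
mix = {"cheap": 0.60, "mid": 0.30, "frontier": 0.10}  # the 60/30/10 rule
total_tokens_m = 100  # 100M tokens per month

blended_cost = sum(total_tokens_m * mix[t] * price_per_m[t] for t in mix)
all_frontier_cost = total_tokens_m * price_per_m["frontier"]
savings = 1 - blended_cost / all_frontier_cost

print(blended_cost, all_frontier_cost, f"{savings:.0%}")
```

With these made-up rates the blend costs $1,000 versus $2,500 for running everything on the frontier model, a 60% saving; the exact percentage depends entirely on the rates you plug in.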
How many levels are there to applying AI agents in coding work?
Steve Yegge recently described eight stages of how developers grow with AI tools. It's a helpful way to see where you are and where you're going.
Most developers are at Level 3-4. Level 6 begins the orchestration stage, needing different skills than Level 5.
What is the shift from conductor to orchestrator?
In the conductor setup, you work with one AI agent in real-time, kind of like pair programming. It's all about guiding that single agent, and everything happens one step at a time within a fixed context. Tools like Claude Code CLI and Cursor's in-editor mode fit this style.
Switching to the orchestrator model is like managing a whole team. You've got multiple agents, each doing their thing with their own context, and they're working independently. You plan everything out, assign tasks, and check in now and then. Tools like Agent Teams, Conductor, Codex, and Copilot Coding Agent are perfect for this approach.
What are the single-agent limits?
Every developer eventually runs into three main issues with a single agent: context overload, lack of specialization, and zero coordination. Subagents tackle the first two, while Agent Teams handle all three.
Why can't one agent handle everything? There are three main reasons.
First, context overload. A single agent can only process so much info. Large codebases can overwhelm it, causing you to lose important details as the conversation drags on.
Second, lack of specialization. When one agent tries to do it all—data layer, API, UI, tests—it becomes a jack of all trades, master of none. An agent focused solely on the data layer, for instance, will write much better database code than a generalist trying to juggle everything.
Lastly, no coordination. Even if you bring in helpers, they can't communicate, share tasks, or manage dependencies. Adding more agents without coordination just makes things messier.
Why do we need multi-agents?
Parallelism, specialization, isolation, and compound learning don't just add up—they multiply. Three focused agents consistently do better than one generalist working three times as long.
Four reasons to use multiple agents:
Parallelism (3x speed) - Three agents work on frontend, backend, and tests at the same time.
Specialization (focused work) - Each agent handles only its own files. An agent focused on db.js writes better database code than one managing everything.
Isolation (safe work) - Git worktrees give each agent its own space. No merge conflicts while they work.
Compound learning - An AGENTS.md file collects patterns and tips, improving each session.
What is the Subagents pattern?
Subagents are the simplest multi-agent pattern and the one you should try first.
Subagents use the Task tool to spawn specialized child agents from a parent orchestrator. The parent decomposes a task into pieces, spawns subagents for each piece, and manages the dependency graph manually.
What subagents solve: context isolation per agent, specialization, parallel execution for independent tasks, and it's cost-neutral at roughly 220k tokens total.
What is still missing: the parent must manually manage the dependency graph. There's no peer messaging between agents. There's no shared task list. And if you're sloppy about file scoping, two agents could write to the same file.
Bottom line: subagents give you parallel execution with manual coordination. That is great for simple decomposition. But when coordination becomes the bottleneck, you need Agent Teams.
The pro tip of working with hierarchical subagents?
Don't just delegate once. Create feature leads that have their own specialists. This allows for deeper task breakdown without overwhelming anyone.
Instead of the orchestrator creating six subagents, it should create two feature leads. Each lead then creates two or three specialists.
The main orchestrator only communicates with two agents, keeping things simple. Feature Lead A, for example, gets the task "Build the search feature" and breaks it down into Data, Logic, and API subagents. The main orchestrator doesn't need to know these details.
This approach is similar to real engineering teams, where tasks are assigned through layers of tech leads, not directly from the VP of Engineering to individual engineers.
What is the Agent Team pattern?
Agent Teams add the coordination primitives that subagents lack: a shared task list with dependency tracking, peer-to-peer messaging between teammates, and file locking to prevent conflicts.
Agent Teams are Claude Code's experimental feature for true parallel execution. Enable it with:
export CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1
The architecture has three layers:
Team Lead at the top - decomposes work, creates the task list, synthesizes results
Shared Task List in the middle - tasks with statuses (pending, in_progress, completed, blocked), dependency tracking, and file locking
Teammates at the bottom - each an independent Claude Code instance with its own context window, running in tmux split panes
Teammates self-claim tasks from the shared list. They message each other directly - peer-to-peer, not through the lead. When a teammate finishes and marks a task complete, any blocked tasks that depended on it automatically unblock. Press Ctrl+T at any time to toggle a visual overlay of the task list.
How do Agent Teams work?
Two mechanisms make Agent Teams work: a shared task list with automatic dependency resolution, and peer-to-peer messaging that prevents the lead from becoming a bottleneck.
The shared task list gives each task a status: pending, in_progress, completed, or blocked. Blocked tasks have explicit dependencies. When the backend teammate marks the search API as completed, the blocked test-writing task automatically flips to pending and a teammate picks it up. File locking prevents two teammates from editing the same file simultaneously.
Peer messaging is the other critical piece. The backend agent tells the frontend agent the API contract directly: "GET /search?q= returns [{id,title,url}]." This doesn't go through the lead. When a teammate goes idle, the lead is automatically notified. This peer-to-peer approach prevents the lead from becoming a coordination bottleneck.
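The dependency-unblocking mechanics can be sketched in a few lines. This is an illustration of the concept, not Claude Code's actual implementation; the task names and statuses are made up:

```python
# A shared task list: each task has a status and a list of dependencies.
tasks = {
    "search-api":  {"status": "in_progress", "deps": []},
    "write-tests": {"status": "blocked",     "deps": ["search-api"]},
    "frontend":    {"status": "in_progress", "deps": []},
}

def complete(name):
    """Mark a task completed, then unblock any task whose deps are all done."""
    tasks[name]["status"] = "completed"
    for task in tasks.values():
        if task["status"] == "blocked" and all(
            tasks[dep]["status"] == "completed" for dep in task["deps"]
        ):
            task["status"] = "pending"  # a teammate can now self-claim it

complete("search-api")
print(tasks["write-tests"]["status"])  # pending
```

When the backend teammate completes the search API, the blocked test-writing task flips to pending automatically, with no lead involvement.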
How can you manage with 5, 10, or 20+ agents across multiple repos and features?
When you need to manage 5, 10, or 20+ agents across multiple repos and features, you need purpose-built orchestration tools. Every tool in 2026 fits one of three tiers - pick the right tier for the job.
The 2026 tool landscape breaks into three tiers:
Tier 1: In-process subagents and teams
Claude Code subagents and Agent Teams. Single terminal session, no extra tooling needed. Start here.
Tier 2: Local orchestrators
Your machine spawns multiple agents in isolated worktrees. You stay in the loop with dashboards, diff review, and merge control. Best for 3-10 agents on known codebases. Tools include Conductor, Vibe Kanban, Gastown, OpenClaw + Antfarm, Claude Squad, Antigravity, and Cursor Background Agents.
Tier 3: Cloud async agents
Assign a task, close your laptop, return to a pull request. Agents run in cloud VMs. No terminal, no local setup. Tools include Claude Code Web, GitHub Copilot Coding Agent, Jules by Google, and Codex Web by OpenAI.
Most developers in 2026 will use all three tiers - Tier 1 for interactive work, Tier 2 for parallel sprints, Tier 3 to drain the backlog overnight.
How do we keep shipping high-quality code when working with multiple agents?
Three quality gates make agent output trustworthy: plan approval catches bad architecture before code exists, hooks enforce automated checks on every lifecycle event, and AGENTS.md compounds learning across sessions.
Plan approval. Require teammates to write a plan before they start coding. The lead reviews the approach and approves or rejects. It's far cheaper to fix a bad plan than to fix bad code. The flow looks like:
teammate >>> writes plan >>> lead review >>> approve/reject >>> implement
Hooks. Automated checks on lifecycle events. A TeammateIdle hook verifies all tests pass before allowing an agent to stop working. A TaskCompleted hook runs lint and tests before marking a task as done. If the hook fails, the agent keeps working until it passes:
task done >>> hook runs npm test >>> pass? allow | fail? keep working
AGENTS.md for compound learning. This file captures discovered patterns, gotchas, and style preferences. Every agent reads it at the start of a session, and every session adds to it. Session one learns about a testing pattern, AGENTS.md is updated, session two avoids the same mistake.
What is the Ralph Loop?
The Ralph Loop pattern breaks development into small atomic tasks and runs an agent in a stateless-but-iterative loop. Each iteration: pick task, implement, validate, commit if pass, reset context and repeat. This avoids context overflow while maintaining continuity through external memory.
The five-step cycle:
Pick - select the next task from tasks.json
Implement - make the change
Validate - run tests, types, lint
Commit - if checks pass, commit and update task status
Reset - clear the agent context and start fresh with the next task
The key insight is stateless-but-iterative. By resetting each iteration, the agent avoids accumulating confusion. Small bounded tasks produce cleaner code with fewer hallucinations than one enormous prompt.
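The five-step cycle can be sketched as a driver loop. The tasks.json shape and the implement/validate callables are stand-ins for whatever your agent and test runner actually do:

```python
import json

def ralph_loop(tasks_path, implement, validate):
    """Stateless-but-iterative: one task per iteration, fresh context each time."""
    with open(tasks_path) as f:
        tasks = json.load(f)
    for task in tasks:
        if task["status"] != "pending":
            continue  # Pick: only pending tasks are eligible
        result = implement(task)     # Implement: agent runs with a fresh context
        if validate(result):         # Validate: tests, types, lint
            task["status"] = "done"  # Commit: a real loop would git-commit here
        # Reset: nothing carries over; the next iteration starts clean
    with open(tasks_path, "w") as f:
        json.dump(tasks, f, indent=2)
```

The external tasks.json file is the agent's only memory between iterations, which is exactly what keeps each iteration's context small and clean.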
AI Tips
Establish AI guidelines for the project
=> Review the guidelines on instructions, agents, and skills from the Awesome Copilot GitHub repository, identify the useful ones, and download them into our source code. To simplify the download process, install the Awesome Copilot extension from the VS Code Marketplace. After downloading the guidelines, carefully review them to make sure they fit our codebase.
There are some useful GitHub repos that we can refer to for useful AI Agent guidelines: https://github.com/anthropics/skills/tree/main, https://github.com/obra/superpowers, https://github.com/addyosmani/agent-skills
Below are some AI guidelines from the GitHub Copilot repository that I use the most:
Copilot Instructions
code-review-generic.instructions.md: Generic, language-agnostic guidelines for conducting effective code reviews.
context-engineering.instructions.md: Principles for helping GitHub Copilot understand your codebase and provide better suggestions.
context7.instructions.md: Use Context7 proactively whenever the task depends on authoritative, current, version-specific external documentation that is not present in the workspace context.
markdown-accessibility.instructions.md: Markdown accessibility guidelines based on GitHub's 5 best practices for inclusive documentation
markdown.instructions.md: Markdown formatting aligned to the CommonMark specification (0.31.2).
memory-bank.instructions.md: Coding standards, domain knowledge, and preferences that AI should follow.
nextjs-tailwind.instructions.md: Instructions for high-quality Next.js applications with Tailwind CSS styling and TypeScript.
nextjs.instructions.md: Best practices for building Next.js (App Router) apps with modern caching, tooling, and server/client boundaries (aligned with Next.js 16.1.1).
performance-optimization.instructions.md: The most comprehensive, practical, and engineer-authored performance optimization instructions for all languages, frameworks, and stacks.
security-and-owasp.instructions.md: Comprehensive secure coding instructions for all languages and frameworks, based on OWASP Top 10 and industry best practices.
Custom Agents
4.1-Beast.agent.md: It is a custom “agent profile” that tells Copilot how to behave: keep working until the task is fully solved, use lots of research (including web fetches), plan with to-do lists, test rigorously, and follow specific workflow rules. In short, it is a strict, autonomous mode configuration for long, research-heavy tasks.
Thinking-Beast-Mode.agent.md: A transcendent coding agent with quantum cognitive architecture, adversarial intelligence, and unrestricted creative freedom.
aem-frontend-specialist.agent.md: Expert assistant for developing AEM components using HTL, Tailwind CSS, and Figma-to-code workflows with design system integration
blueprint-mode.agent.md: Executes structured workflows (Debug, Express, Main, Loop) with strict correctness and maintainability. Enforces an improved tool usage policy, never assumes facts, prioritizes reproducible solutions, self-correction, and edge-case handling.
critical-thinking.agent.md: Challenge assumptions and encourage critical thinking to ensure the best possible solution and outcomes.
debug.agent.md: Debug your application to find and fix a bug.
demonstrate-understanding.agent.md: Validate user understanding of code, design patterns, and implementation details through guided questioning.
devils-advocate.agent.md: I play the devil's advocate to challenge and stress-test your ideas by finding flaws, risks, and edge cases
expert-nextjs-developer.agent.md: Expert Next.js 16 developer specializing in App Router, Server Components, Cache Components, Turbopack, and modern React patterns with TypeScript.
expert-react-frontend-engineer.agent.md: Expert React 19.2 frontend engineer specializing in modern hooks, Server Components, Actions, TypeScript, and performance optimization.
principal-software-engineer.agent.md: Provide principal-level software engineering guidance with focus on engineering excellence, technical leadership, and pragmatic implementation.
refine-issue.agent.md: Refine the requirement or issue with Acceptance Criteria, Technical Considerations, Edge Cases, and NFRs
Agent Skills
agentic-eval: Patterns and techniques for evaluating and improving AI agent outputs.
architecture-blueprint-generator: Comprehensive project architecture blueprint generator that analyzes codebases to create detailed architectural documentation
autoresearch: Autonomous iterative experimentation loop for any programming task.
breakdown-epic-arch: Prompt for creating the high-level technical architecture for an Epic, based on a Product Requirements Document.
chrome-devtools: Expert-level browser automation, debugging, and performance analysis using Chrome DevTools MCP. Use for interacting with web pages, capturing screenshots, analyzing network traffic, and profiling performance.
context-map: Generate a map for all relevant files to a task before making changes
conventional-commit: Prompt and workflow for generating conventional commit messages using a structured XML format
copilot-instructions-blueprint-generator: Technology-agnostic blueprint generator for creating comprehensive copilot-instructions.md files that guide GitHub Copilot to produce code consistent with project standards, architecture patterns, and exact technology versions by analyzing existing codebase patterns and avoiding assumptions.
create-agentsmd: Prompt for generating an AGENTS.md file for a repository
create-implementation-plan: Create a new implementation plan file for new features, refactoring existing code or upgrading packages, design, architecture or infrastructure.
create-specification: Create a new specification file for the solution, optimized for Generative AI consumption.
create-technical-spike: Create time-boxed technical spike documents for researching and resolving critical development decisions before implementation
documentation-writer: Diátaxis Documentation Expert. An expert technical writer specializing in creating high-quality software documentation, guided by the principles and structure of the Diátaxis technical documentation authoring framework
doublecheck: Three-layer verification pipeline for AI output. Extracts verifiable claims, finds supporting or contradicting sources via web search, runs adversarial review for hallucination patterns, and produces a structured verification report with source links for human review
skill-creator: Create new skills, modify, and improve existing skills, and measure skill performance. Use when the user wants to create a skill from scratch, edit, or optimize an existing skill, run evals to test a skill, benchmark skill performance with variance analysis, or optimize a skill's description for better triggering accuracy.
model-recommendation: Analyze chatmode or prompt files and recommend optimal AI models based on task complexity, required capabilities, and cost-efficiency
prd: Generate high-quality Product Requirements Documents (PRDs) for software systems and AI-powered features. Includes executive summaries, user stories, technical specifications, and risk analysis.
project-workflow-analysis-blueprint-generator: Comprehensive technology-agnostic prompt generator for documenting end-to-end application workflows.
refactor: Surgical code refactoring to improve maintainability without changing behavior.
remember: Transforms lessons learned into domain-organized memory instructions (global or workspace)
dispatching-parallel-agents: Use when facing 2+ independent tasks that can be worked on without shared state or sequential dependencies
executing-plans: Use when you have a written implementation plan to execute in a separate session with review checkpoints
systematic-debugging: Use when encountering any bug, test failure, or unexpected behavior, before proposing fixes
using-agent-skills: Discovers and invokes agent skills. Use when starting a session or when you need to discover which skill applies to the current task. This is the meta-skill that governs how all other skills are discovered and invoked
code-review-and-quality: Conducts multi-axis code review. Use before merging any change. Use when reviewing code written by yourself, another agent, or a human. Use when you need to assess code quality across multiple dimensions before it enters the main branch.
context-engineering: Optimizes agent context setup. Use when starting a new session, when agent output quality degrades, when switching between tasks, or when you need to configure rules files and context for a project.
MCP
Chrome DevTools MCP: chrome-devtools-mcp lets your coding agents (such as Gemini, Claude, Cursor, Copilot) control and inspect a live Chrome browser. It acts as a Model Context Protocol (MCP) server, giving your AI coding assistant access to the full power of Chrome DevTools for reliable automation, in-depth debugging, and performance analysis.
Context7 MCP: Without Context7, LLMs rely on generic, outdated information about the libraries we use. To resolve that, Context7 pulls up-to-date, version-specific documentation and code examples straight from the source and places them directly into your prompt.
Playwright MCP: This MCP provides browser automation capabilities using Playwright. The server enables LLMs to interact with webpages through structured accessibility snapshots, bypassing the need for screenshots and visually tuned models.
Atlassian MCP Server: The Atlassian Rovo MCP server is a cloud-based bridge between your Atlassian cloud site and compatible tools. Once configured, it enables those tools to interact with Jira, Confluence, and Compass data in real time. This functionality is powered by secure authentication using OAuth 2.1 or API tokens, which ensures all actions respect the user's existing access controls.
Figma MCP Server: The Figma MCP Server brings Figma directly into your workflow by providing important design information and context to AI agents, enabling them to generate code from Figma design files.
shadcn/ui MCP: The Shadcn Server allows the AI Assistant to interact with items from registries. We can browse available components, search for specific ones, and install them directly into our project with natural language.
GitHub MCP Server: The GitHub MCP Server connects AI tools directly to GitHub's platform. This gives AI agents, assistants, and chatbots the ability to read repositories and code files, manage issues and PRs, analyze code, and automate workflows, all through natural language interactions.
Permit's MCP-gateway: Permit MCP Gateway is a drop-in zero-trust proxy that adds authentication, fine-grained authorization, consent, and audit logging to any MCP server. Swap one URL. No code changes. Works with Salesforce, GitHub, Slack, Jira, and any MCP server you already use.
Hooks
secret-scanner: Scans files modified during a Copilot coding agent session for leaked secrets, credentials, and sensitive data
session-auto-commit: Automatically commits and pushes changes when a Copilot coding agent session ends
session-logger: Logs all Copilot coding agent session activity for audit and analysis
tool-guardian: Blocks dangerous tool operations (destructive file ops, force pushes, DB drops) before the Copilot coding agent executes them
=> Let's set up a rule for the AI so that whenever project guideline files are updated or there's a big change in our source code, like updating package versions, the AI automatically checks other documents to ensure everything stays consistent.
=> When searching for specific agents, instructions, or skills, run the suggest-awesome-copilot-agents, suggest-awesome-copilot-instructions, or suggest-awesome-copilot-skills tools. They will help you find exactly what you need from the Awesome Copilot repository.
Experience of Working with AI
Spec-driven development
Plan your work before you write code. But rather than writing plans for humans, we're writing plans and tasks for agents.
What is spec-driven development?
Instead of prompting first and figuring it out as you go, you start with a spec. It's a short document that defines what you're building, the constraints, and the key decisions—a contract for how your code should behave. The spec becomes the source of truth the agent uses to generate code. Less guesswork. Fewer surprises.
I see people conflating PRDs, design docs, and specs all the time. They serve completely different purposes, and mixing them up — especially when working with AI agents — causes a lot of confusion down the line.
Here's how I separate them:
Product Requirements Document (PRD) — For humans. Product managers, stakeholders. Covers what we're building and why — user stories, success metrics, business value. This is your debate document.
Technical Design Document — For engineers. Covers how we're building it — architecture decisions, scalability trade-offs, security implications. Also reviewed and debated with the team.
AI Spec — For agents. This is a pure execution document. No debates, no open questions. It translates the finalized decisions from the PRD and design doc into something an agent can act on directly.
Traditional software development has always followed this pattern:
The process is the same. The audience is different.
My Workflow:
Generate the PRD
I use spec-driven-instructions-workflow.md combined with skills like prd/skill.md, prd.agent.md, or breakdown-epic-pm to generate the initial PRD. For large epics, I first use breakdown-feature-prd to break them down into feature-level PRDs before any implementation starts. Important: always use your most capable (most expensive) model for these root documents. And never trust the first generated version. My verification pipeline:
Run the doublecheck skill — surfaces inconsistencies and gaps => Run agentic-eval — makes updates based on the doublecheck report => Manual review using critical-thinking — go through every single point and make sure you genuinely understand it => Use agents like principal-software-engineer, devils-advocate, demonstrate-understanding, and refine-issue to stress-test assumptions further.
Generate the Technical Architecture
With a solid prd.md, use the breakdown-epic-arch skill to produce the high-level technical architecture for the epic. This becomes your design.md.
Generate the Spec
With both prd.md and design.md ready, generate a detailed spec using the create-specification skill combined with the spec-driven-workflow instructions. This spec.md is what agents will actually work from.
Generate the Implementation Plan
From spec.md, use create-implementation-plan/SKILL.md to produce a task-by-task plan.md.
At this point you have all four core documents:
| File | Purpose |
|---|---|
| prd.md | What & why |
| design.md | How (architecture) |
| spec.md | Execution details for agents |
| plan.md | Task-by-task implementation steps |
Review all four before touching any code.
During Implementation
Even with a solid plan, I don't just hand tasks over and walk away. My rules:
Run doublecheck before every task — pull in external resources to verify the approach and understand the full impact of each change before it's made.
Understand everything the AI produces — not "looks fine." Actually understand each decision and each line. If something's unclear, ask the agent to explain it with demonstrate-understanding.
One task = one commit — every task on the plan gets its own commit. Keeps history clean and makes it easy to trace what changed and why.
One big task = a new chat session — implement every big task on the plan in a fresh chat session. It keeps the context window clean, so it isn't poisoned by irrelevant information from previous responses.
Mark tasks as Done — always tell the agent to update the Complete column in plan.md after finishing each task. Small habit, makes tracking much easier.
Split tasks that are too big or too complex into multiple smaller tasks. If a big task can't be split, ask the AI to apply rules or instructions like those in copilot-thought-logging.instructions.md, noting its thoughts to a dedicated memory file to avoid losing context.
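To illustrate the Complete column, a plan.md task table might look like the following (task names are hypothetical):

```markdown
| # | Task                        | Complete |
|---|-----------------------------|----------|
| 1 | Add JWT middleware          | Yes      |
| 2 | Add refresh-token endpoint  |          |
```

The agent updates the last column after each finished task, so progress is visible at a glance.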
We can build a skill named executing-plans with two modes:
Mode 1 — phase review: load the phase, review it with the agentic-eval skill, review it again with the doublecheck skill, update relevant docs for consistency, then report.
Mode 2 — task implementation: load the task, implement it, verify with the verification-before-completion skill, review with the code-review skill, then mark it complete in plan.md.
For skills like verification-before-completion and code-review, there are examples in the obra/superpowers GitHub repo (remember to customize them to suit our project, tools, and AI agent).
So whenever we need to review a phase before implementation, or implement a task within it, we just attach the executing-plans skill to the AI agent.
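A minimal sketch of what that executing-plans skill file could look like. The frontmatter fields and wording below are assumptions, to be adapted to whatever skill format your agent expects:

```markdown
---
name: executing-plans
description: Review a phase or implement a single task from plan.md
---
## Mode 1 — Phase review
1. Load the phase from plan.md
2. Review it with the agentic-eval skill, then the doublecheck skill
3. Update related docs for consistency and report findings

## Mode 2 — Task implementation
1. Load the task and implement it
2. Verify with verification-before-completion, review with code-review
3. Mark the task complete in plan.md
```

Encoding both modes in one skill means a single attachment covers the whole phase-then-task loop.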
After completing all tasks in the plan
At this stage, there are several things I consistently follow up on:
Ask the AI to review the current AI documents, such as AGENTS.md, copilot-instructions.md, and other guidelines, and ensure they are consistent with the updates made in plan.md.
Completing all tasks in plan.md doesn't mean the feature is completely finished. Carefully verify the AI's work for any remaining issues and ask the AI to fix them. Remember to ask the AI to note these fixes in the spec folder, keeping those documents up to date with the code changes.
To fix the issues we find, it helps to develop a debugging skill (containing a verification health-check skill, a code-review skill, a commit-convention skill, and so on) for verifying and resolving them. A debugging skill template is available at https://github.com/obra/superpowers. With this in place, we simply attach the debugging skill, include the spec folder with all the necessary context for the AI agent, and provide an issue description. The AI can then quickly help resolve the issue.
Keeping Docs in Sync
Docs live in multiple places — the specs/ folder in the repo, Jira tickets, Confluence pages — and they drift apart fast.
My rule: whenever a doc changes in one place, all related docs must be updated too. I explicitly tell the AI agent: if you modify a file in the specs/ folder, identify and update all related documents across Jira, Confluence, and any other linked sources. It's not a perfect system, but it cuts down on the drift problem significantly.
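One lightweight way to make that rule checkable is a script that flags spec files modified since the last sync. This is a hedged sketch; the specs/ path and the .docs-synced marker file are assumptions for illustration:

```shell
#!/bin/bash
# Demo setup so the sketch is self-contained; in a real repo these files already exist.
mkdir -p specs && touch specs/auth.md
touch -t 202001010000 .docs-synced   # marker normally rewritten after each sync

# Any spec file newer than the marker still needs its Jira/Confluence twins updated.
find specs -name '*.md' -newer .docs-synced   # → specs/auth.md
```

Running this before closing out a feature gives the agent (or you) an explicit list of documents whose external copies may have drifted.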
This is more upfront work than just prompting and hoping for the best — but once all four documents are locked in and the verification habit is in place, implementation gets surprisingly smooth. And more importantly, you actually understand what was built when it's done.
Spec-Driven Workflow with Linear
Use Linear MCP to give Claude Code direct access to your project management. The agent reads tickets, updates status, creates sub-issues, and follows your workflow — without you copy-pasting descriptions.
For larger features, combine specs with Linear tickets:
Write the spec
Read my spec template and generate a spec for: user authentication with JWT tokens. Save it to specs/auth.md
Create tickets from the spec
Read specs/auth.md and create a parent Linear issue titled "JWT Authentication" in the Auth project, team Engineering. Then create one sub-issue per task from the spec, linked to the parent. Each sub-issue should reference the spec: "Spec: specs/auth.md"
Works on tickets
/implement PROJ-43
The agent reads the ticket, finds the spec reference, reads the spec, and implements with full context.
Track Progress
Your Linear board now shows:
Parent issue with progress bar (auto-calculated from sub-issues)
Each sub-issue moving through: Todo → In Progress → In Review → Done
Comments with PR links on each ticket
You get visibility without doing any project management manually.
Review AI Code
https://newsletter.owainlewis.com/p/how-i-use-ai-to-review-ai-code
Here’s the four-layer setup I use. Each layer filters out a category of problems so the next layer sees less noise.
Automated checks run your linter, tests, and security scanner before the agent can finish.
Local AI review gets a second agent to review the code before you push.
CI review runs AI code review automatically on every PR. It's the safety net for when you skip step two (it happens).
Human review handles what’s left: architecture, business logic, and “should we even build this?”
By the time a human looks at the code, the only things remaining are the things only a human can judge.
Layer 1: Automate The Obvious
Claude Code has a feature called hooks. A hook is a shell script that runs automatically at certain points in the agent lifecycle (like when the agent finishes a task). If the script fails, the agent is blocked from completing and has to fix the issues first.
I use a Stop hook that runs my linter and scanner every time Claude finishes work.
The config goes in your Claude Code settings:
{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": ".claude/hooks/stop-checks.sh"
          }
        ]
      }
    ]
  }
}
The script itself is just whatever checks you already run:
#!/bin/bash
set -e                          # stop at the first failing check
rubocop .                       # Ruby linter
brakeman -q                     # security scanner (quiet mode)
bundle exec rspec --fail-fast   # test suite, stop at the first failure
Swap those for whatever your project uses. Ruff and pytest for Python. ESLint for JavaScript. The point is the same: the agent can’t say “done” until these pass.
This alone catches a surprising amount. Formatting issues, unused imports, type errors, broken tests. None of that makes it into a review.
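The same Stop hook can also be written generically, so the list of checks is the only thing you edit per project. A sketch, where the echo commands are placeholders standing in for your real linter and test commands:

```shell
#!/bin/bash
# Generic stop-hook sketch: run every configured check, fail on the first error.
set -e
CHECKS=("echo lint-ok" "echo tests-ok")   # e.g. "ruff check .", "pytest --maxfail=1"
for cmd in "${CHECKS[@]}"; do
  echo "Running: $cmd"
  $cmd                                    # a non-zero exit blocks the agent from finishing
done
echo "All checks passed"
```

Because `set -e` aborts the script on the first failing command, the agent stays blocked until every check in the list passes.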
Layer 2: Agent Review
After automated checks pass, review the code yourself and get an AI second opinion before you push.
Two things matter here. First, actually run the code. This sounds obvious but it catches the most embarrassing bugs in two minutes. Second, read the diff. You don’t need to understand every line; understand the shape of the change. What files were touched? Does the scope match what you asked for? Did the agent silently change something you didn’t ask it to?
For the AI review, the key is a fresh context window. Don’t ask the same agent that wrote the code to review it. It has sunk-cost bias and is less likely to challenge its own decisions.
There are a few ways to do this:
Custom Claude Code command. A review prompt in .claude/commands/review.md, paired with a REVIEW file at the project root that encodes your project-specific rules. Portable across tools, fully customisable. Claude Code also ships with some built-in plugins.
Codex /review. Four presets covering every scenario (base branch, uncommitted changes, specific commit, custom instructions). Priority-ranked findings. The best local review UX I’ve seen. Bonus: writing with Claude and reviewing with Codex means cross-model review built into your workflow. Different models have different blind spots.
CodeRabbit. /coderabbit:review locally. 40+ linters and scanners running behind the scenes, purpose-built for code review. There are many other great code review tools like Greptile to explore also.
I use a custom review command that reads a REVIEW file at the project root. This file has project-specific rules, things I always want checked.
# REVIEW.md
## Project Patterns
- Repository pattern for data access. Direct DB queries in handlers are a flag.
- New API routes need an integration test. Flag if missing.
The general review catches general problems. The project-specific rules catch the things that are unique to your codebase.
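For reference, a review command along these lines can be very short. The wording below is a hypothetical sketch of a .claude/commands/review.md prompt, not the author's actual file:

```markdown
Review the diff of the current branch against main.

1. Check the code against every rule in REVIEW.md at the project root.
2. Then do a general review: correctness, error handling, security, tests.
3. Report findings ranked by severity, with file and line references.
```

Putting the project-specific rules in REVIEW.md rather than in the command itself keeps the prompt portable across projects.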
Layer 3: External Review
Sometimes I forget to run the local review. Sometimes I’m in a rush. So I have an automated check on GitHub that reviews every PR before a human sees it.
There are a few options for this. Codex has a GitHub integration that reviews PRs automatically. CodeRabbit has a GitHub App that does the same thing. Anthropic has an open source GitHub Action for security-focused review.
I like having this as a separate layer because it catches things even when I skip the local step. Set it up once, runs on every PR for free.
Layer 4: Human Review
By the time a teammate opens the PR, the linter has passed, tests are green, and an AI has already flagged obvious issues. The human reviewer doesn’t need to catch formatting problems or unused variables.
What’s left is the stuff only a human can judge. Is this the right approach? Does it solve the actual business problem? Will this cause issues in three months? Five minutes of focused review on those questions is more valuable than thirty minutes of line-by-line reading.
TL;DR
I spent a long time trying to find the perfect code review setup. The experience was frustrating. There are hundreds of tools, plugins, and approaches, many of them doing the same thing in slightly different ways.
Don’t get lost looking for the perfect solution or perfect prompt. Start with Layer 1. Set up your linter and your hooks; that alone eliminates an entire category of review noise. Then find one way to get an AI review locally that you trust. Add CI when you’re ready.
Start simple and never accept the first output from an agent.
References
https://github.com/owainlewis/youtube-tutorials/tree/main
https://github.com/obra/superpowers
https://ainative.to/p/ai-agents-full-course-2026