Does AI Code Analysis Actually Change What Your Agent Writes? We Ran the Test.

Jun 10, 2026 ai-development vibe-coding code-quality developer-tools ai-agents

The Question Everyone Skips

You've seen the dashboards. Your AI coding assistant flags drift, surfaces inconsistencies, and tells you exactly where your codebase diverges from its own conventions. Looks useful. But here's the uncomfortable question nobody wants to answer with hand-waving: Does any of this actually change what the AI produces?

We were curious too. So we ran an experiment designed to disprove our own tool.

How We Set Up the Test

VibeDrift now runs as an MCP server, which means a coding agent can query it in real-time while it's writing code. That's convenient, but convenience isn't proof. To get real answers, we built a controlled evaluation:

Control group: An AI agent (Opus 4.8) writes a new file for a repository completely on its own.

Treatment group: The same agent tackles the same task, but receives VibeDrift's distilled view of the repository's dominant conventions before writing.

Then—critically—we scored both outputs using a separate drift analysis, not the signal that generated the guidance. This prevents the test from marking its own homework. Lower drift score means the new file conformed better to the codebase's actual patterns.

We also tested something we call sampleCap—how many raw files from the repository the agent gets to see as context. At cap=0, the agent sees nothing except VibeDrift's signal. Higher caps let it read sibling files and infer conventions directly.

We crossed this against two scenarios: repos where the convention matches the model's default, and repos where the convention actively fights it. Ten trials per cell on adversarial repos, eight on controls, with 95% confidence intervals on every result.

Result 1: The Null That Actually Matters

On a repository where conventions already align with what Opus writes by default—async/await, named exports, throw-on-error patterns—the signal changed absolutely nothing.

The delta sat around 0.03 to 0.05. The 95% confidence interval straddled zero at every context level. Why? Because the agent was already conforming for free. A hint to do what you were going to do anyway is, by definition, redundant.

This null is load-bearing. If our evaluation couldn't produce a clean zero here, you shouldn't trust any number it produces elsewhere. We needed a result that confirmed our tool isn't just expensive theater.

Result 2: When Conventions Fight the Model

Here's where it gets interesting. We built a repository where the house style uses .then() chains—the exact opposite of what a modern AI model reaches for by default. We also set sampleCap=0, so the agent's only signal about conventions was VibeDrift.

The control agent (writing solo) produced:

// The agent's default - which IS drift in THIS repo
export async function getSubscription(id) {
  const row = await db.query("SELECT * FROM subscriptions WHERE id = $1", [id])
  return row ? enrich(row) : null
}

The treatment agent (with VibeDrift guidance) produced:

// With signal - conforming to THIS repo's actual convention
export function getSubscription(id) {
  return db.subscriptions.findById(id).then((row) => enrich(row))
}

Drift dropped from 1.84 to 1.00. That's a delta of 0.84 with a 95% CI of [0.57, 1.11]. The interval clears zero, so this isn't luck—it replicated across fifty runs. The signal successfully pulled the agent off its own prior and onto the repository's convention. This is exactly the thing a stateless model cannot do for itself.

Result 3: The Twist Everyone Forgets

Now here's the part that turned a clean narrative into an honest finding. We kept the adversarial .then() repo but let the agents see two to four real sibling files (sampleCap=2 and sampleCap=4).

The delta collapsed to zero.

Why? Because the control agent, now able to read the neighbors, already wrote perfect .then() chains by copying them directly. VibeDrift's distilled hint had nothing left to add, because the convention was inferable from context—and the agent inferred it.

The residual drift both arms share at that point is a different problem: structural duplication of sibling patterns. That's not what a pattern hint was ever designed to fix.

The Full Matrix

One significant cell. Four tight nulls. And the significant cell is exactly where theory predicts it—where the convention fights the default and isn't visible in context.

The effect didn't show up anywhere it shouldn't have. That's the strongest possible evidence that the one place it did show up is real.

What This Actually Means for Your AI Workflow

The takeaway isn't a headline number. It's a function:

When the convention is non-default and not inferable from context: The signal is decisive. The agent can't guess it, wouldn't default to it, and needs external guidance. This is where tools like VibeDrift earn their keep.

When the convention is inferable from context: The signal is redundant. The agent reads the neighbors and conforms on its own. Your tool isn't doing nothing—it's just not doing anything additional.

When the convention is the model's own default: The signal is theater. You're paying for confirmation of something that would have happened anyway.

The Practical Implication

If you're building AI-assisted development workflows, the question isn't "does my drift checker work?" It's "does my drift checker work in the specific conditions where I actually need it?"

For repositories with strong, idiosyncratic conventions that diverge from model defaults—and where agents might not have immediate access to context—real-time drift signals can meaningfully shape output. For well-documented, context-rich codebases where agents can read and learn from existing patterns, you're probably just adding latency.

Know which situation you're in. That knowledge is worth more than any benchmark.

Read in other languages: