~6 months of "agentic" coding

So I was a complete skeptic of the whole “agentic” / “vibe” coding phenomenon before last year’s NeurIPS. While I was in California anyway, I decided to visit a bunch of friends in San Francisco. I knew SF was “big” on vibe coding, but I wasn’t aware of just how much seemingly everyone believed in it. Even some of my colleagues told me that the internal agentic coding harnesses had improved significantly and that they didn’t expect to be writing much code “by hand” in 2026.

I was still skeptical that agentic coding could be useful or relevant for my particular class of work. But I decided to try my friend’s Claude Code to forward-port a Minecraft performance optimization mod I had written ages ago and was disproportionately proud of. Not only did Claude oneshot the port, it also informed me that the mod was no longer necessary because the functionality had long since been incorporated into PaperMC.

Suddenly I felt very behind. Not in the “everyone is getting rich except me” sense, but in the sense that I might be dismissing a tool that had already become genuinely useful. I decided to give agentic coding a serious shot.

This blog is a reflection on roughly 6.5 months of using coding agents professionally and for hobby projects.

Background

Before getting into the details, a quick “who am I and what am I using this for?”

I work as a Research Engineer at Google DeepMind. Outside of work, I primarily write code for fun: game modifications, videogame tinkering, hobbyist ML projects and the occasional cursed side project.

During these six months, I used agentic coding primarily to iterate on and implement research ideas professionally. Outside of work, I also started playing Elden Ring (my first serious soulslike playthrough). I quickly decided that some parts of the experience could be improved.

In particular, I wanted a bespoke save mechanism. I got tired of applying exactly the same sequence of buffs outside a boss arena and repeatedly fighting through phase one just to get a few attempts at phase two.

“bUt tHis iSn’T wHaT mIyaZAki iNteNdeD”

Yes, I’m aware. I should probably experiment with different builds and embrace the suffering. I don’t have time for that. Even with my shortcuts, I still suffered plenty and had to spend an endless amount of time optimizing my build before finally beating Elden Beast a few weeks ago.

Since I was completely new to Elden Ring modding, I had no idea where to start. So I used coding agents to both brainstorm and implement the save system.

Most of my professional usage was in Python. Personal projects were a mix of Python, Rust and C++, though the final Elden Ring mod ended up being written in Rust.

First month

The first month was painful.

My expectations had been set sky high by Claude Code oneshotting the Minecraft mod. At work, I had access to an internal version of Antigravity (Google’s externally known agentic coding tool) that could use early Gemini release candidates.

I have a suite of internal tools that automate repetitive tasks. Think dumping results into spreadsheets, uploading logs to visualization UIs, generating reports, and so on.

At the time, I was investigating a relatively new problem and wanted to try several low-hanging interventions. The task was technically fairly simple: the agent just needed to plumb together five of my existing tools and read out the results.

Armed with a LOT of excitement, I pointed Antigravity at my tooling, described what I wanted, and pressed Enter.

The anticipation went through the roof and immediately into the floor.

The agent did something unbelievably stupid.

Okay, fine. Bad prompt. Let me be more explicit.

Same failure.

After several more attempts, I explicitly described the exact five steps I would have performed manually. The agent still failed the final step.

After that experience, I dramatically recalibrated my expectations.

I reduced agentic coding to a glorified implementation engine:

I’ve already written the design doc. Please turn it into code.

But then the design docs became extremely verbose and detailed. Sometimes the exploratory utilities I was building had so little reuse value that writing the design doc took longer than implementing them myself.

Still, it was a relatively quiet period. People were on vacation, priorities for the next year were still being finalized, and I had the luxury of being inefficient.

The generated code was usually functional. It was also often terrible.

The models loved inventing bespoke serialization structures that should never have existed. They would create file-specific abstractions, unnecessary helper classes, and layers of indirection that solved problems nobody actually had.

This wasn’t unique to Gemini. Claude frequently behaved the same way in my personal projects. The agents also had a particularly funny habit. Asking them to fix a bug in a newly-created file often resulted in one of three outcomes:

  1. Rewrite the entire file from scratch and break several unrelated things.
  2. Add increasingly specific fixes until the code contained fifty layers of nested conditionals.
  3. Delete the file entirely.

I am not exaggerating. I truly wish I was. How is any of this real? At times I genuinely wondered whether San Francisco had access to a secret Claude checkpoint that existed solely to keep the vibe-coding economy alive.

How was “agentic” coding a REAL product category?

Second month

After an incredibly frustrating first month, and after talking to friends and colleagues, I realized that a large part of the problem was my configuration.

I had been using the default setup because I somehow believed that system instructions, GEMINI.md / CLAUDE.md / AGENTS.md files, MCP servers and agent skills couldn’t possibly have a statistically significant effect on outcomes. (My research centers around problems where such “context management” fixes are generally frowned upon.)

However, after talking to colleagues and seeing how much effort people were putting into their setups, I decided to take configuration seriously. It turns out there is an enormous ecosystem – both within Google and outside – of skills, prompts, MCP servers and workflow conventions that dramatically change agent behavior.

I even built different agent profiles for different tasks. Some were optimized for implementation, others for exploration, others for debugging. This turned out to be essential because indiscriminately enabling every skill and MCP server just creates context bloat. More tools do not monotonically improve outcomes. I also discovered subagents, which allowed the orchestrator agent to delegate small, well-scoped tasks to other agents.

With all of these improvements, the agents mostly stopped deleting files to fix bugs.

Progress. (Bare minimum???)

Unfortunately, the code was still often unpleasant.

For example, when asked to scrape an internal Gemini endpoint across a collection of prompts, the model hallucinated endpoint names, hardcoded filenames, and still managed to invent bespoke serialization classes with names like:

AshutoshGeminiScraper(
    prompt: str,
    response: str,
    time_to_first_token: time_t,
    model: str
)
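
For contrast, here is a minimal sketch of a plainer shape for the same data, assuming the four fields in the snippet above are all that needs to be stored. The names `Sample`, `write_jsonl` and `read_jsonl` are hypothetical, and treating `time_to_first_token` as seconds in a float is an assumption. It uses only the Python standard library and does not touch any real endpoint.

```python
import io
import json
from dataclasses import asdict, dataclass


@dataclass
class Sample:
    """One prompt/response pair; field names mirror the snippet above."""
    prompt: str
    response: str
    time_to_first_token: float  # assumed to be seconds
    model: str


def write_jsonl(samples, fh):
    """Write one JSON object per line to an open text file handle."""
    for sample in samples:
        fh.write(json.dumps(asdict(sample)) + "\n")


def read_jsonl(fh):
    """Read samples back from an open text file handle, skipping blank lines."""
    return [Sample(**json.loads(line)) for line in fh if line.strip()]


# Round trip through an in-memory buffer instead of a hardcoded filename.
buf = io.StringIO()
write_jsonl([Sample("hi", "hello", 0.12, "some-model")], buf)
buf.seek(0)
print(read_jsonl(buf))
```

The point of the sketch is only that a generic record plus a generic line-oriented format covers this case without a task-specific class.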

The same pattern appeared in personal projects. Models would invent inheritance hierarchies because two classes happened to reference the same game entity.

The agents also always wanted to do something that sounded sophisticated:

  • “Let’s switch to a data-oriented architecture.”
  • “Let’s avoid reinitializing this struct.”
  • “Let’s introduce a generic abstraction layer.”

What surprised me even more was the complete lack of pushback.

At one point, purely as an experiment, I asked multiple agents to merge ~50 C++ files from my Elden Ring mod into a single giant file because “readability improves when there’s only one file to read.”

EVERY. SINGLE. CODING. AGENT. AGREED. Codex / Claude / Antigravity ALL OF THEM.

All of them happily generated a giant monolithic file without questioning the premise for even a second. 2/3 of those files actually built and were even functional!

Honestly, that’s an impressive feat.

If a human coworker asked me to combine fifty files into a single 15k LOC source file, it would take a considerable amount of willpower for me to comply.

After this experience, I heavily modified my agent instructions.

  • Remember that all code is a liability.
  • More code does not imply more intelligence or competence.
  • Prefer KISS principles.
  • You MUST push back on unnecessary complexity.
  • ALWAYS discuss both short-term hackability and long-term scalability of your solution.
  • NEVER choose between those tradeoffs on my behalf.
  • ALWAYS prefer tasteful solutions over clever solutions.
  • ALWAYS align before writing large amounts (>200 LOC) of code.
  • Break changes into small reviewable chunks. Always assume that a human will review this.
  • Preserve plans and design documents alongside implementation.

These aren’t the exact instructions, but they capture the spirit. As a slight tangent: if you don’t know what I mean by “tasteful” solutions or “tasteful” code, refer to this TED interview with Linus Torvalds. It’s honestly a great interview if you haven’t watched it yet.

Looking back, most of them are really attempts to compensate for a missing prior: the models understand software architecture, but they do not naturally share my preferences about tradeoffs or my preferred way to resolve those tradeoffs. The biggest improvement came from forcing the agent to surface tradeoffs instead of silently choosing them.

I no longer received 7000-line changelists containing surprise architectural rewrites. I received smaller, understandable changes that I could actually review and take responsibility for.

Does this reduce the “vibes” in vibe coding?

Probably.

I prefer it anyway. Many of the “agentic coding evangelists” would hate that I am taking away the agent’s autonomy by making sure it generates code that I (a human) understand, and that I am slowing down the agent’s iteration loop. “It generates code only it understands; humans are probably too stupid to understand the genius code that my agent wrote. We would get to the singularity much faster if humans stopped acting like middle managers for these agents.” I see this argument all the time, and yet I constantly find that “acting like a middle manager” actually results in better code that is readable and maintainable by a human.

Full send vibe coding

After the first two months, my configuration has been mostly stable and I’ve been using roughly the same setup ever since. However, the experience is still full of extremes. The systems are exceptionally good at some things and frustratingly bad at others.

But they have become “predictably useful”.

I can understand the “liftoff” argument now. I can feel the acceleration. My research iteration loop is significantly faster. I have visual prototypes for ideas that previously would have remained half-implemented notes in a document somewhere. My velocity at work has genuinely increased by roughly 4-5x. For personal projects, the gain is probably closer to 2-3x.

Importantly, this is not because the agents are solving research problems for me. They’re accelerating implementation and experimentation once I already know roughly what I want to try. I now have a fairly good mental model of where current coding agents fail, and I structure my workflows accordingly.

That said, the same pathologies still show up:

  • unnecessary abstractions,
  • over-engineering,
  • poor alignment,
  • reimplementing existing libraries,
  • overconfident architectural decisions.

I don’t think the long-term solution is simply better system prompts. The underlying models are clearly capable of producing tasteful code because they do so when prompted appropriately. The problem is that these preferences are not naturally embedded in the model. Maybe “just some more RLxF bro” ought to fix this? Defining “good taste” in software engineering is probably difficult to do in a “general” way.

Annoying behaviors

These aren’t dealbreakers per se, but they remain irritating.

One recurring issue is that agents constantly need to rediscover a codebase. Every new conversation starts with another tour of the repository. Even stranger is how they navigate. If I want a function definition, I hit gd in my pre-agentic-coding Neovim setup and jump directly to it through the LSP. Agents often seem to grep through files manually, repeatedly searching for symbols they could theoretically navigate to directly. Maybe tokens are cheap enough that it doesn’t matter. Maybe I’m missing something. Either way, it feels unnatural.
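
To make the difference concrete, here is a minimal sketch contrasting a text search with a syntax-aware lookup, using Python's standard `ast` module as a stand-in for what an LSP "go to definition" does. The helper names `grep_lines` and `definition_line` and the toy `SOURCE` string are hypothetical. This is not how any particular agent or language server is implemented.

```python
import ast

# Toy module: one definition of load_results and one call site.
SOURCE = '''\
def load_results(path):
    return path

def report(path):
    data = load_results(path)  # a call, not the definition
    return data
'''


def grep_lines(source, symbol):
    """Text search: every 1-indexed line number that mentions the symbol."""
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if symbol in line
    ]


def definition_line(source, symbol):
    """Syntax-aware lookup: line where a function or class is defined, else None."""
    for node in ast.walk(ast.parse(source)):
        is_def = isinstance(
            node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
        )
        if is_def and node.name == symbol:
            return node.lineno
    return None


print(grep_lines(SOURCE, "load_results"))       # every mention, calls included
print(definition_line(SOURCE, "load_results"))  # only the definition
```

The text search returns every mention and leaves the caller to sort definitions from uses, while the syntax-aware lookup answers the actual question in one step.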

Another failure mode I rarely see discussed is that these systems can quietly make you less scientific. I once had an intervention that clearly failed to produce the scaling behavior I wanted. I asked the model whether there was another metric that might reveal the trend (scaling ladder curve) more clearly. The model happily invented one.

It combined several quantities using arbitrary coefficients, sprinkled in a few magical constants, and produced a beautiful curve that supported my hypothesis. The plot looked convincing. The explanation looked convincing. But the metric was complete nonsense. There was no scientific justification for the formula whatsoever. The scary part is that nothing looked obviously wrong at first glance. The output had equations, plots, reasoning and “apparent” rigor. It simply wasn’t rigorous.
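
A minimal sketch of why this is so easy to do, under toy assumptions: with only a handful of data points and several free coefficients, a search over weighted sums of unrelated quantities will usually find a "metric" that tracks scale, even when every input is pure noise. Everything here is hypothetical (the function names, the five scale points, the six noise series). It is not the metric the model produced.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def cherry_pick_metric(quantities, target, n_trials=2000, seed=0):
    """Random-search coefficients so that sum_k c_k * q_k correlates with target."""
    rng = random.Random(seed)
    best_r, best_coeffs = -1.0, None
    for _ in range(n_trials):
        coeffs = [rng.uniform(-1.0, 1.0) for _ in quantities]
        metric = [
            sum(c * q[i] for c, q in zip(coeffs, quantities))
            for i in range(len(target))
        ]
        r = pearson(metric, target)
        if r > best_r:
            best_r, best_coeffs = r, coeffs
    return best_r, best_coeffs


# Five "model scales" and six quantities that are pure Gaussian noise.
noise_rng = random.Random(1)
scale = [1.0, 2.0, 3.0, 4.0, 5.0]
noise = [[noise_rng.gauss(0.0, 1.0) for _ in scale] for _ in range(6)]

best_r, best_coeffs = cherry_pick_metric(noise, scale)
print(f"best correlation with scale, built from noise: {best_r:.2f}")
```

A high correlation here says nothing about the inputs. It only reflects how many knobs the search was allowed to turn relative to the number of data points.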

Summary

Six months ago I was completely skeptical of agentic coding. Today I’m not.

The productivity gains are real. I can feel them in my day-to-day work. I prototype faster, explore more ideas, and spend less time writing boilerplate. The custom save mod for Elden Ring would definitely not exist without “agentic” coding.

At the same time, I trust the agents far less than many of the “agentic coding evangelists”. Left unchecked, they’ll happily generate unnecessary abstractions, implement dubious designs, and agree with almost any architectural decision you suggest – including terrible ones.

My current view is that “agentic” coding works best when the human already has a reasonably clear destination. If you know what you’re trying to build, these systems can dramatically accelerate the journey. If you don’t, they can generate an impressive amount of motion with surprisingly little progress.

The future probably is “agentic”. The present still needs adult supervision.