Measuring the Real Impact of AI Coding Tools on Developer Productivity

Inspired by DX’s webinar: “Measuring AI Code Assistants and Agents with the AI Measurement Framework”

I’ve been following the growth of the developer experience metrics platform DX since I met a representative of the company at the 2022 LeadDev conference in Oakland. Their CEO, Abi Noda, is an invaluable resource for summaries of in-depth DevEx and DORA metrics studies, as well as practical advice for choosing the right metrics for my team and using the data we collect to support devs in their day-to-day work.

So when I saw Abi and DX CTO Laura Tacho were hosting a webinar on measuring AI code assistants, I signed up right away. They shared a practical framework for evaluating AI tools: what to measure, how to use system data, and how to factor in developer experience.

Here’s what stood out to me from their webinar and how we’re thinking about applying the framework at Planet Argon.

Takeaway #1: Developers feel faster… but sometimes are actually slower.

A recent study by METR found that senior developers were 19% slower on familiar tasks with AI than without it. Why? They spent more time cleaning up and adjusting AI-generated code, even though they felt like it was helping.

This wasn’t shocking to hear. The adoption of any new tool can slow a team down until they get over the learning curve, and anyone who’s worked with an AI coding agent can attest to what a crapshoot the suggested code can be. What really stood out to me in this section wasn’t the findings themselves, but Abi and Laura’s application of them. They stressed the importance of measuring AI’s real impact and recommended looking at metrics like PR cycle time, change failure rate, and developer sentiment about maintainability. These indicators help differentiate between perceived speed and actual improvement. They also emphasized the value of comparing current metrics against pre-AI baselines to identify whether performance is trending in the right direction.

This point also set the tone for the rest of the webinar: it’s not enough to trust our intuition. We need data that reflects how AI tools are affecting real outcomes, not just how they’re influencing developer feelings of productivity.
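
To make that baseline comparison a bit more concrete, here’s a minimal Ruby sketch of the kind of before-and-after check they’re describing. It assumes you’ve already exported merged PRs with opened and merged timestamps from whatever system you use; the field names and the rollout date are hypothetical placeholders.

    require "time"

    # Hypothetical export of merged PRs: timestamps plus a flag we set ourselves
    # (e.g. a PR label) marking whether AI-generated code was involved.
    AI_ROLLOUT_DATE = Time.parse("2025-01-01")

    def cycle_time_hours(pr)
      (Time.parse(pr[:merged_at]) - Time.parse(pr[:opened_at])) / 3600.0
    end

    def median(values)
      return nil if values.empty?
      sorted = values.sort
      mid = sorted.length / 2
      sorted.length.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
    end

    def baseline_vs_current(prs)
      baseline, current = prs.partition { |pr| Time.parse(pr[:merged_at]) < AI_ROLLOUT_DATE }
      {
        baseline_median_hours: median(baseline.map { |pr| cycle_time_hours(pr) }),
        current_median_hours:  median(current.map { |pr| cycle_time_hours(pr) })
      }
    end

Using the median rather than the mean keeps a handful of unusually long-lived PRs from swamping the signal, and the same shape works for change failure rate or any other metric you can bucket by date.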

Takeaway #2: We need to measure more than just adoption and usage.

Abi and Laura introduced a three-part framework: utilization, impact, and cost. Utilization refers to whether people are using the tools and how frequently they’re using them, but that data in isolation doesn’t tell you much. A high usage rate might look like success at first, but without any correlation to output quality or long-term maintainability, it can be misleading.

In the discussion, they explained that many engineering leaders stop at measuring utilization and miss the deeper picture. Just because a developer is using an AI assistant every day doesn’t mean it’s improving the quality or efficiency of their work. Usage patterns need to be linked with delivery outcomes and cost data in order to draw useful conclusions.
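
As a rough illustration of what linking those data sources might look like, here’s a small Ruby sketch that joins weekly assistant-usage rollups with delivery outcomes for the same team and week. Both datasets and all of the field names are hypothetical; the point is simply that neither table says much on its own.

    # Hypothetical weekly rollups, keyed by [team, week]. In practice, usage might
    # come from an assistant's admin export and delivery from our own metrics.
    usage = {
      ["platform-team", "2025-W30"] => { active_users: 6, suggestions_accepted: 410 }
    }

    delivery = {
      ["platform-team", "2025-W30"] => { prs_merged: 23, change_failure_rate: 0.09 }
    }

    # Join on [team, week] so utilization is always read alongside outcomes.
    combined = usage.map do |key, usage_row|
      team, week = key
      { team: team, week: week }.merge(usage_row).merge(delivery.fetch(key, {}))
    end

    combined.each { |row| puts row.inspect }

Once that join exists, a high adoption number that isn’t accompanied by any movement in the outcome columns becomes much easier to spot.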

Takeaway #3: AI’s impact on team productivity depends on what you measure.

DX shared their “Core 4” metrics framework, which includes throughput, change failure rate, code maintainability, and developer experience. They claim that these four metrics provide a structured way to evaluate whether a new tool or workflow is improving engineering performance.

Since Planet Argon works with a lot of outdated legacy codebases, I appreciated Abi and Laura’s emphasis on maintainability. Using AI tools may increase developer productivity, but if that comes at the cost of long-term maintainability – if the code becomes harder for developers to understand or feel confident editing later – that’s a negative outcome.

This is one of the areas we’ll be tracking through our quarterly developer satisfaction surveys. It won’t give us a precise signal, but it will help us understand whether the tools are improving or degrading the experience of working in our codebases.

Takeaway #4: Cost needs to be tracked along with value.

AI tools come with direct and indirect costs. Direct costs like licensing and usage fees are pretty obvious. What’s not so obvious are the indirect costs, discussed mostly as opportunity costs. If a team spends weeks evaluating and learning a tool that ends up getting abandoned, that lost time costs us and our clients money. It’s not always easy to quantify that impact, but it should be at least acknowledged and factored into future decisions.

This has changed how I think about our own pilot process. We’ll need to define what success looks like before rolling out a tool more broadly and be more deliberate about tracking usage patterns and impact throughout the trial period.
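
As a sketch of how we might roll those costs up at the end of a pilot, here’s a back-of-the-envelope calculation in Ruby. Every number below is made up; the part worth copying is that the indirect line items (evaluation time, cleanup time) sit right next to the licensing fees instead of being forgotten.

    # Hypothetical end-of-pilot numbers; only the shape of the calculation matters.
    HOURLY_RATE = 150.0 # made-up blended hourly rate

    pilot = {
      seats: 8,
      months: 3,
      license_per_seat_month: 19.0, # direct cost
      evaluation_hours: 40,         # indirect: setup, configuration, training
      cleanup_hours: 25             # indirect: reworking AI-generated code
    }

    direct   = pilot[:seats] * pilot[:months] * pilot[:license_per_seat_month]
    indirect = (pilot[:evaluation_hours] + pilot[:cleanup_hours]) * HOURLY_RATE
    total    = direct + indirect

    puts "Direct: $#{direct.round}, indirect: $#{indirect.round}, " \
         "total per seat: $#{(total / pilot[:seats]).round}"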

Adapting these takeaways for our agency-style engineering team.

Okay, so this was all great for a standard product-based development team, but we’ll need to make certain tweaks at Planet Argon to fit our agency model and the way we work across multiple client applications. Here’s how I’m thinking of implementing the DX framework:

  1. Work in phases.
    Not all of our clients’ repos are within our GitHub company org, and not all of our clients use the same deployment pipelines or cloud services. So in order to track PRs that include AI-generated code, we’ll need to use the same PR flags and pull the same deployment metrics across multiple accounts and tools. This will certainly be a challenge, so we'll likely need to roll it out in waves, starting with a handful of clients that have similar setups (there’s a rough sketch of what that could look like after this list).

  2. Gather team feedback.
    We’re expanding our quarterly developer experience surveys to include questions about AI tool usage, confidence in reviewing AI-generated code, and perceptions of maintainability. This will hopefully shed some light on how the team feels about the tools they use. After all, if our productivity metrics are improving but the team hates using the tools, that’s bad for morale and bad for the team long-term.

  3. Measure at a team level.
    In order to avoid performance pressure, we'll track usage data by team, not by individual. This should still give us the adoption metrics we want, but will avoid singling out any one person for struggling to integrate the tool or tools into their workflow. Likewise, if we see very low usage data, we can assume the majority of the team isn’t into the tool we’re tracking and either replace it with a better one or scrap it altogether.

  4. Mind the cost vs. benefits.
    As an agency, we always keep an eye on our tooling costs and the value each tool is bringing to our clients compared to what we’re paying for it. This won’t change for our AI tools. We’ll look at cost and impact metrics at the end of each pilot period and use that information to decide if we continue with it or search for something that gets us more bang for our buck.
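
For the curious, here’s the rough sketch referenced in item 1: one way to pull cycle-time data for AI-flagged PRs across multiple client orgs and report it at the team level. It assumes the octokit gem, a token with read access to each client’s repos, and a manually applied "ai-assisted" PR label; the client names, environment variables, and label are all hypothetical.

    require "octokit"

    # Hypothetical config: one token per client, since not every repo lives in
    # our own GitHub organization.
    REPOS_BY_CLIENT = {
      "client-a" => { token: ENV["CLIENT_A_TOKEN"], repos: ["client-a/app"] },
      "client-b" => { token: ENV["CLIENT_B_TOKEN"], repos: ["client-b/storefront"] }
    }

    AI_LABEL = "ai-assisted" # a label our developers apply to PRs that include AI-generated code

    def median_hours(values)
      return nil if values.empty?
      sorted = values.sort
      mid = sorted.length / 2
      sorted.length.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
    end

    REPOS_BY_CLIENT.each do |client, config|
      gh = Octokit::Client.new(access_token: config[:token])
      gh.auto_paginate = true

      cycle_times = config[:repos].flat_map do |repo|
        gh.pull_requests(repo, state: "closed")
          .select { |pr| pr.merged_at && pr.labels.any? { |label| label.name == AI_LABEL } }
          .map { |pr| (pr.merged_at - pr.created_at) / 3600.0 }
      end

      # Report at the client/team level, never per individual developer.
      puts "#{client}: #{cycle_times.size} AI-flagged PRs, " \
           "median cycle time #{median_hours(cycle_times)&.round(1)} hours"
    end

The per-team aggregation at the end is also what keeps point 3 honest: nothing in this report attributes a PR to a specific developer.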

Like all tools, AI should work for us.

This webinar gave me a better way to think about AI’s role in our engineering workflows. It also reinforced something I’ve come to believe over the past few years: the best tools are the ones that improve developer experience and maintain code quality long-term, not just the ones that get more code out the door more quickly.

By focusing on real outcomes instead of vague promises, and by being intentional about what we measure, we can support our teams while still exploring the potential that AI has to offer.

If you’re trying to evaluate the impact of AI tools in your own organization, I recommend watching the DX webinar or reading their AI Measurement Framework. And if you’ve already rolled out something like this, I’d be interested to hear how you’re tracking success!
