
Measuring AI Tools by Their Impact on Developer Experience


Inspired by DX’s webinar: “Measuring AI Code Assistants and Agents with the AI Measurement Framework”

I first stumbled into the world of developer experience metrics at LeadDev Oakland back in 2022, chatting with a rep at the DX booth. Since then, DX’s CEO, Abi Noda, has become one of my go-to voices for clarity in a noisy metrics landscape—breaking down DevEx and DORA studies, and offering practical ways to use numbers to actually support developers, not bury them in dashboards. (If you’re not already following Abi on LinkedIn or Substack, you’re missing out!)

So when Abi and DX CTO Laura Tacho announced a webinar on measuring the impact of AI tools, I signed up on the spot. I expected a few charts and opinions about AI copilots. What I got was a framework—an actual map for navigating whether AI is genuinely helping teams. They covered where to look, what to measure, and how to make sure developer sentiment doesn’t get lost in the data shuffle.

Here’s what I learned—and how we’re thinking of using this framework at Planet Argon.

Takeaway #1: Developers might feel faster… but they're actually slower.

METR recently found that senior developers were 19% slower on familiar tasks when using AI than without it. The culprit? Time spent cleaning up AI-generated code. It felt like a speed boost… but the stopwatch told another story.

This wasn’t shocking to hear. The adoption of any new tool can slow down a team until they get over the learning curve, and anyone who’s worked with an AI coding agent can attest to what a crapshoot the suggested code can be. What really stood out for me in this section was not the findings, but Abi and Laura’s application of those findings. They stressed the importance of measuring AI’s real impact and recommended looking at metrics like PR cycle time, change failure rate, and developer sentiment about maintainability. These indicators help differentiate between perceived speed and actual improvement. They also emphasized the value of comparing current metrics with pre-AI baselines to identify whether performance is trending in the right direction.
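To make that concrete, here's a minimal sketch of comparing current metrics against a pre-AI baseline. The PR records, field names, and the baseline figure are all illustrative assumptions; in practice you'd pull this data from your Git host's API or a tool like DX.

```python
from datetime import datetime

# Hypothetical PR records; in practice these come from your Git
# hosting provider's API (fields and values here are illustrative).
prs = [
    {"opened": datetime(2024, 5, 1, 9), "merged": datetime(2024, 5, 2, 15), "caused_incident": False},
    {"opened": datetime(2024, 5, 3, 10), "merged": datetime(2024, 5, 6, 11), "caused_incident": True},
    {"opened": datetime(2024, 5, 7, 8), "merged": datetime(2024, 5, 7, 17), "caused_incident": False},
]

def cycle_time_hours(prs):
    """Average time from PR opened to PR merged, in hours."""
    total = sum((pr["merged"] - pr["opened"]).total_seconds() for pr in prs)
    return total / len(prs) / 3600

def change_failure_rate(prs):
    """Fraction of merged PRs linked to a post-deploy incident."""
    return sum(pr["caused_incident"] for pr in prs) / len(prs)

# Assumed baseline measured before the AI rollout.
PRE_AI_BASELINE_HOURS = 24.0

print(f"Cycle time: {cycle_time_hours(prs):.1f}h vs. baseline {PRE_AI_BASELINE_HOURS}h")
print(f"Change failure rate: {change_failure_rate(prs):.0%}")
```

The point isn't the specific numbers; it's that both metrics only mean something next to a baseline captured before the tool arrived.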

This point also set the tone for the rest of the webinar: it’s not enough to trust our intuition. We need data that reflects how AI tools are affecting real outcomes, not just how they make developers feel about their productivity.

Takeaway #2: We need to measure more than just adoption and usage.

Abi and Laura introduced a three-part framework: utilization, impact, and cost. Utilization refers to whether people are using the tools and how frequently they’re using them, but that data in isolation doesn’t tell you much. A high usage rate might look like success at first, but without a correlation to output quality or long-term maintainability, it can be misleading.

In the discussion, they explained that many engineering leaders stop at measuring utilization and miss the deeper picture. Just because a developer is using an AI assistant every day doesn’t mean it’s improving the quality or efficiency of their work. Usage patterns need to be linked with delivery outcomes and cost data in order to draw useful conclusions.
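As a sketch of that linkage, here's what pairing usage data with a delivery outcome might look like. The per-developer records and thresholds are invented for illustration; they're not a real DX export format.

```python
# Hypothetical per-developer records pairing tool usage with an
# outcome metric; names and fields are illustrative assumptions.
records = [
    {"dev": "a", "ai_days_per_week": 5, "failure_rate": 0.12},
    {"dev": "b", "ai_days_per_week": 5, "failure_rate": 0.20},
    {"dev": "c", "ai_days_per_week": 1, "failure_rate": 0.10},
    {"dev": "d", "ai_days_per_week": 0, "failure_rate": 0.08},
]

def mean_failure_rate(rows):
    return sum(r["failure_rate"] for r in rows) / len(rows)

heavy = [r for r in records if r["ai_days_per_week"] >= 4]
light = [r for r in records if r["ai_days_per_week"] < 4]

# High utilization alone looks like success; pairing it with an
# outcome is what reveals whether the tool is actually helping.
print(f"Heavy users: {mean_failure_rate(heavy):.0%} failure rate")
print(f"Light users: {mean_failure_rate(light):.0%} failure rate")
```

Even a toy split like this shows why utilization on its own is misleading: the heaviest users could be shipping the most rework.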

Takeaway #3: AI’s impact on team productivity depends on what you measure.

DX’s “Core 4” metrics—throughput, change failure rate, code maintainability, and developer experience—offer a practical lens for evaluating whether new tools are improving productivity or simply adding complexity.

At Planet Argon, we spend a lot of time untangling legacy codebases, so their emphasis on perceived maintainability really resonated. Code that “works” but makes future changes feel risky isn’t progress—it’s a slow-growing liability. AI-generated code can amplify this if developers don’t fully understand what’s been added, making long-term support harder.

To stay ahead of this, we’re incorporating sentiment into our assessments. Quarterly developer satisfaction surveys won’t give us a perfect metric, but they’ll help us spot early signs of friction—whether these tools are genuinely improving the developer experience or quietly making it harder to work in our codebases.

Takeaway #4: Cost needs to be tracked along with value.

AI tools come with obvious expenses—licensing fees, subscription costs, usage-based pricing—but the hidden costs can be just as significant. As the webinar pointed out, a single developer can rack up thousands of dollars in token usage surprisingly quickly.
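The back-of-the-envelope math is worth doing before a pilot. Here's a rough sketch; every number below (pricing, request volume, token counts) is an assumption, not real vendor pricing.

```python
# Rough monthly token-spend estimate for one developer.
# All figures are illustrative assumptions, not real vendor pricing.
INPUT_PRICE_PER_M = 5.00    # $ per million input tokens (assumed)
OUTPUT_PRICE_PER_M = 25.00  # $ per million output tokens (assumed)

requests_per_day = 400            # agentic workflows fire many calls per task
input_tokens_per_request = 30_000  # large context windows add up fast
output_tokens_per_request = 2_000
working_days = 21

input_tokens = requests_per_day * input_tokens_per_request * working_days
output_tokens = requests_per_day * output_tokens_per_request * working_days

cost = (input_tokens / 1e6) * INPUT_PRICE_PER_M \
     + (output_tokens / 1e6) * OUTPUT_PRICE_PER_M
print(f"Estimated monthly spend for one developer: ${cost:,.2f}")
```

Under these assumptions a single heavy user lands well north of $1,500 a month, and most of it is input tokens, which is easy to miss if you only watch the subscription line item.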

Then there are the harder-to-quantify costs: time lost to evaluating, testing, and learning a tool that ultimately gets abandoned. That’s not just internal overhead; it can also affect clients if we’ve invested in implementing a tool for their app. Even if we can’t assign a precise dollar value, acknowledging those opportunity costs helps us make smarter decisions in the future.

For me, this reinforced the importance of structure in our pilot process. We need clear success criteria before rolling out any tool broadly, and we have to track both usage and impact throughout the trial. It’s not just about what we spend today—it’s about protecting our time, resources, and focus tomorrow.

Adapting these takeaways for our agency-style engineering team

The DX framework is built with product teams in mind, but at Planet Argon, we juggle multiple client apps with different tech stacks, repos, and workflows. That means we’ll need to tweak how we measure impact to get meaningful insights.

  1. Data consistency will be a challenge. Not every client repo lives in our GitHub org, and deployment pipelines vary widely. To track PRs with AI-generated code and compare metrics like review time, deploy frequency, and post-deploy incidents, we’ll need standardized PR flags and a process for pulling metrics across multiple accounts and tools. Rolling this out in waves—starting with clients who share similar setups—will make it manageable.
  2. Developer sentiment matters as much as speed. We’ll expand our quarterly developer experience surveys to include AI-specific questions about confidence in reviewing AI-generated code and perceptions of maintainability. Productivity gains won’t mean much if the tools frustrate the team or undermine morale.
  3. We’ll measure adoption at the team level, not the individual level. This protects developers from feeling performance pressure while giving us clear insight into whether a tool is sticking. If usage is low, that’s a strong sign the tool isn’t a fit, and we’ll adjust or scrap it.
  4. Cost-versus-value stays central. As an agency, every tool is judged on ROI—not just its impact on our workflows, but the value it brings to our clients. We’ll continue evaluating costs and outcomes at the end of every pilot period to decide if a tool is worth keeping, replacing, or retiring.
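The standardized PR flag in point 1 is what makes cross-client comparison possible. A minimal sketch of how we might use it, assuming a hypothetical "ai-assisted" label (the label name and PR records are illustrative, not pulled from a real client repo):

```python
from datetime import datetime

# Hypothetical PRs carrying a standardized "ai-assisted" label;
# the label name and these records are illustrative assumptions.
prs = [
    {"labels": ["ai-assisted"], "opened": datetime(2024, 6, 3, 9), "merged": datetime(2024, 6, 4, 9)},
    {"labels": ["ai-assisted"], "opened": datetime(2024, 6, 5, 9), "merged": datetime(2024, 6, 7, 9)},
    {"labels": [], "opened": datetime(2024, 6, 3, 9), "merged": datetime(2024, 6, 3, 17)},
    {"labels": [], "opened": datetime(2024, 6, 6, 9), "merged": datetime(2024, 6, 7, 9)},
]

def avg_review_hours(rows):
    """Average hours from opened to merged for a set of PRs."""
    return sum((p["merged"] - p["opened"]).total_seconds() / 3600 for p in rows) / len(rows)

flagged = [p for p in prs if "ai-assisted" in p["labels"]]
unflagged = [p for p in prs if "ai-assisted" not in p["labels"]]

print(f"AI-flagged PRs: {avg_review_hours(flagged):.1f}h average to merge")
print(f"Unflagged PRs:  {avg_review_hours(unflagged):.1f}h average to merge")
```

Because the comparison only depends on the label and two timestamps, the same script works across repos that live outside our GitHub org, which is what makes a wave-by-wave rollout practical.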

Shouldn't tools work FOR us?

This webinar gave me a clearer lens for evaluating AI in our workflows—and reinforced something I’ve believed for years: the best tools aren’t the ones that churn out code faster, but the ones that improve developer experience and protect long-term code quality.

By focusing on measurable outcomes, not hype, we can adopt tools that truly support our team. We don’t need to chase every shiny new product, but we do need to understand how the tools we choose are shaping our work.

If you’re trying to evaluate the impact of AI tools in your own organization, I recommend watching the DX webinar or reading their AI Measurement Framework. And if you’ve already rolled out something like this, I’d be interested to hear how you’re tracking success!
