
Beyond Story Points: How to Prove Productivity Gains in an AI-Enabled SDLC

Third instalment in the “AI-Powered Engineering” series

You can see part 1 here and part 2 here

TL;DR

  1. Traditional agile metrics break down under AI acceleration and need to be reworked.
  2. Swap opinion-based inputs (story points) for data you can’t game: flow, quality, complexity, and experience.
  3. Instrument before your first AI pilot or you’ll be chasing a moving baseline forever.

1 Why “Show Me the Numbers” Suddenly Matters

Gen AI promises 2-10x output per engineer, but finance, security, and HR still ask the same question: “How do we know?” Slide-deck anecdotes won’t cut it, especially when a board is approving seven-figure gen AI budgets.

2 Legacy Metrics—and Why They Break

| Legacy Metric | Why It Breaks in an AI World |
| --- | --- |
| Story-point velocity | Humans re-calibrate estimates once AI tooling feels “normal,” creating a false regression even if output rises. |
| Bug density (bugs ÷ points or LOC) | The story-point denominator shrinks while LOC explodes or shrinks with LLM style; the ratio becomes noise. |
| Manual cycle-time sampling | Once commits land every hour, clipboard analytics can’t keep up. |

3 Principles for Next-Gen Measurement

  1. Automated: Numbers flow straight from your toolchain, not human opinion.
  2. Outcome-based: Focus on value delivered, not effort spent.
  3. Real-time & historical: Dashboards should refresh daily and let you compare to last quarter.
  4. Harder-to-game: Prefer metrics derived from immutable events (merge, deploy, restore) that no individual can re-estimate or inflate; see the sketch after this list.
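
To make principle 4 concrete, here’s a minimal sketch of a flow metric computed purely from immutable events: lead time for changes, measured from merge timestamp to deploy timestamp. The event shape and literal rows are illustrative; in practice these come out of your warehouse.

```python
from datetime import datetime
from statistics import median

# Illustrative event shape: {"sha": str, "merged_at": iso8601, "deployed_at": iso8601}.
# In practice these rows would come from your warehouse, not a literal list.
events = [
    {"sha": "a1b2c3", "merged_at": "2025-06-02T10:15:00", "deployed_at": "2025-06-02T14:40:00"},
    {"sha": "d4e5f6", "merged_at": "2025-06-03T09:00:00", "deployed_at": "2025-06-03T09:55:00"},
]

def lead_time_hours(event: dict) -> float:
    """Lead time for changes: merge event to deploy event, in hours."""
    merged = datetime.fromisoformat(event["merged_at"])
    deployed = datetime.fromisoformat(event["deployed_at"])
    return (deployed - merged).total_seconds() / 3600

# Median resists the odd outlier better than the mean does.
print(f"median lead time: {median(lead_time_hours(e) for e in events):.1f}h")
```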

4 Metric Stack for the AI Era

| Layer | What to Track | Why It Works |
| --- | --- | --- |
| Flow | DORA four keys: lead time for changes, deployment frequency, change-failure rate, MTTR | Captures speed and stability in one bundle |
| Quality | Defect rate, flaky-test trend, pass rate of AI-generated tests | Objective pass/fail; surfaces hidden regression debt |
| Complexity | Cyclomatic and cognitive complexity per KLOC; cost-to-produce via scc (sketch below) | LOC alone is noisy; complexity can normalise LLM verbosity |
| Experience | Dev-survey pulse (flow state, tool friction), Copilot adoption telemetry | Productivity ≠ happiness, but they correlate |
| Outcome | Features/week, revenue per deploy, customer-satisfaction delta | Ties engineering to business value |
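
The Complexity row is the least familiar, so here’s a minimal sketch of computing complexity per KLOC with scc. It assumes scc is on your PATH and that its JSON output carries per-language Code and Complexity fields (true of recent releases, but verify against your version).

```python
import json
import subprocess

# Run scc over the repo; --format json is supported in recent scc releases.
raw = subprocess.run(
    ["scc", "--format", "json", "."],
    capture_output=True, text=True, check=True,
).stdout

total_code = 0
total_complexity = 0
for lang in json.loads(raw):
    total_code += lang["Code"]
    total_complexity += lang["Complexity"]

# Complexity per KLOC normalises for an LLM's verbose (or terse) style.
print(f"complexity per KLOC: {1000 * total_complexity / max(total_code, 1):.1f}")
```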

5 Practical Baseline Playbook

  1. Freeze a quarter of pre-AI data—extract DORA, defect counts, complexity, NPS-style DevEx scores.
  2. Pilot with an A/B model:
    • Control team ships as usual.
    • Pilot team adopts AI-powered patterns, gen-AI tests, and auto-sizing, following prescriptive guidance.
    • Measure the same things for both teams and compare.
  3. Instrument everything (first sketch after this list):
    • Git hooks publish commit metadata.
    • MCP servers connect your tools so metric events flow without manual export.
    • CI posts DORA + complexity stats to a warehouse.
    • A weekly Slack bot triggers a one-minute DevEx survey.
  4. Run for 6–8 weeks.
  5. Compare deltas and normalise for scope creep (second sketch after this list). Google’s 2025 DORA-AI study found that a 25% bump in AI usage delivered roughly a 2% lift in individual productivity: small, yes, but it compounds over quarters.
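
For step 3, a minimal sketch of the CI half of the pipeline: publishing one immutable deploy event per release. The ingest endpoint and environment variable names are hypothetical; point it at whatever collector fronts your warehouse.

```python
import json
import os
import subprocess
import urllib.request

# Hypothetical warehouse ingest endpoint; swap in your own collector.
INGEST_URL = os.environ.get("METRICS_INGEST_URL", "https://warehouse.example.com/events")

def publish_deploy_event() -> None:
    """Run from CI after a successful deploy: one immutable row per deploy."""
    sha = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    ).stdout.strip()
    event = {
        "type": "deploy",
        "sha": sha,
        "pipeline": os.environ.get("CI_PIPELINE_ID", "local"),
    }
    req = urllib.request.Request(
        INGEST_URL,
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # fire-and-forget; add retries in real CI

if __name__ == "__main__":
    publish_deploy_event()
```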
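
And for step 5, a sketch of the delta comparison, normalising each team’s feature throughput by the complexity it actually shipped so scope creep doesn’t flatter either side. The numbers are placeholders; pull the real ones from your warehouse for the pilot window.

```python
# Placeholder numbers: pull these from the warehouse for your 6-8 week window.
control = {"features": 18, "complexity_shipped": 5400, "median_lead_time_h": 52.0}
pilot = {"features": 24, "complexity_shipped": 8100, "median_lead_time_h": 31.0}

def throughput_per_complexity(team: dict) -> float:
    # Normalising by shipped complexity stops scope creep flattering either team.
    return team["features"] / (team["complexity_shipped"] / 1000)

control_rate = throughput_per_complexity(control)
pilot_rate = throughput_per_complexity(pilot)
print(f"normalised throughput delta: {100 * (pilot_rate / control_rate - 1):+.0f}%")
print(f"lead-time delta: {100 * (pilot['median_lead_time_h'] / control['median_lead_time_h'] - 1):+.0f}%")
```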

6 Modern Alternatives to Story-Point Sizing

| Method | How It Works | Pros | Cons |
| --- | --- | --- | --- |
| LLM-based sizing agent | Prompt an LLM with codebase context + story text; it returns a complexity bucket (XS–XL) (first sketch below) | Consistent, instant, bias-free | Needs prompt tuning; harder to plan sprint capacity |
| Automated scope-diff | Measure the delta in file-touches and complexity after merge; back-fill an effort score (second sketch below) | Zero manual input | Post-factum only; can’t forecast work |
| Third-party “effort” heuristics | Tools like scc run COCOMO-style cost estimates on the PR diff | Works per-commit; language-agnostic | Estimates, not actual hours |
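
A minimal sketch of the sizing-agent row, with the LLM call left as a pluggable callable so it works with any provider; the prompt, bucket list, and fallback label are illustrative starting points, not a tuned production prompt. Wrap your provider SDK’s completion call in a one-argument function and pass it in.

```python
from typing import Callable

BUCKETS = ["XS", "S", "M", "L", "XL"]

PROMPT_TEMPLATE = """You size software changes. Given the story and a summary of the
codebase area it touches, answer with exactly one of: XS, S, M, L, XL.

Codebase context:
{context}

Story:
{story}

Bucket:"""

def size_story(story: str, context: str, llm: Callable[[str], str]) -> str:
    """llm is any text-completion callable (OpenAI, Bedrock, a local model...)."""
    answer = llm(PROMPT_TEMPLATE.format(context=context, story=story)).strip().upper()
    # Guard against chatty models: accept only a known bucket, else flag for review.
    return answer if answer in BUCKETS else "NEEDS_HUMAN_REVIEW"
```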
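
And the scope-diff row, sketched with plain git: count the files a merge commit touched by diffing it against its first parent, then back-fill a crude effort score. The weights are arbitrary starting points you’d calibrate against your own history.

```python
import subprocess

def files_touched(merge_sha: str) -> int:
    """Count files changed by a merge commit: first parent vs merge result."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{merge_sha}^1", merge_sha],
        capture_output=True, text=True, check=True,
    ).stdout
    return len([line for line in out.splitlines() if line.strip()])

def effort_score(merge_sha: str, complexity_delta: int) -> float:
    # Back-fill a crude effort score after the fact; weights are arbitrary.
    return 1.0 * files_touched(merge_sha) + 0.1 * complexity_delta
```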

Pro tip: If you must keep sprints, map your new complexity buckets to Fibonacci points, but lock the mapping in a config file so it can’t drift (a sketch follows).
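
One way to lock that mapping, assuming a checked-in JSON file (the file name and bucket labels here are illustrative): load it at runtime and fail loudly on anything outside the frozen set, so the only way the mapping drifts is through a reviewed commit.

```python
import json
from pathlib import Path

# Checked-in config (e.g. sizing.json) so the bucket-to-points mapping can't drift:
# {"XS": 1, "S": 2, "M": 3, "L": 5, "XL": 8}
MAPPING_FILE = Path("sizing.json")

def story_points(bucket: str) -> int:
    mapping = json.loads(MAPPING_FILE.read_text())
    # Fail loudly (KeyError) on unknown buckets rather than inventing a number.
    return mapping[bucket]
```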

7 Key Takeaways

  • Stop treating story points as gospel. They were a workaround for data you can now collect automatically.
  • Baseline early, iterate often. Measuring after you scale AI is like weighing luggage after the flight.
  • Use a portfolio of metrics. Flow + Quality + Complexity + Experience paints the full picture and resists gaming.