11 Aug 2025

Beyond Story Points: How to Prove Productivity Gains in an AI-Enabled SDLC

Third instalment in the “AI-Powered Engineering” series

You can see part 1 here and part 2 here

TL;DR

Traditional agile metrics are fallible under AI acceleration and need to be reworked.
Swap opinion-based inputs (story points) for data that is much harder to game: flow, quality, complexity, and experience.
Instrument before your first AI pilot or you’ll be chasing a moving baseline forever.

1 Why “Show Me the Numbers” Suddenly Matters

Gen AI promises 2-10x output per engineer, but finance, security, and HR still ask the same question: “How do we know?” Slide-deck anecdotes won’t cut it—especially when a Board is approving seven-figure Gen AI budgets.

2 Legacy Metrics—and Why They Break

Legacy Metric	Why It Breaks in an AI World
Story-point velocity	Humans re-calibrate estimates once AI tooling feels “normal,” creating a false regression even if output rises.
Bug density (bugs ÷ points or LOC)	Story-point denominator shrinks; LOC explodes or shrinks depending on LLM style—ratio becomes noise.
Manual cycle-time sampling	Once commits land every hour, clipboard analytics can’t keep up.
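To make the first two rows concrete, here is a small arithmetic sketch. Every number in it is invented purely for illustration: a team ships more stories after adopting AI tooling but has quietly started sizing similar work at fewer points, while the bug count stays flat.

```python
# Illustrative numbers only; none of these figures come from a real team.

def velocity(stories_shipped: int, avg_points_per_story: float) -> float:
    """Story-point velocity for one sprint."""
    return stories_shipped * avg_points_per_story


def bug_density(bugs: int, points: float) -> float:
    """Bugs per story point."""
    return bugs / points


# Sprint before AI tooling: 10 stories, estimated at 5 points each.
before = velocity(10, 5.0)
# Sprint after: 14 stories shipped, but similar work is now sized at 3 points.
after = velocity(14, 3.0)

print(f"stories shipped: 10 -> 14, velocity: {before:.0f} -> {after:.0f}")
print(f"bugs per point: {bug_density(5, before):.3f} -> {bug_density(5, after):.3f}")
print(f"bugs per story: {5 / 10:.3f} -> {5 / 14:.3f}")
```

In this made-up case, stories shipped rise by 40% while velocity falls from 50 to 42, and bugs per point go up even though bugs per story go down. The point-based ratios move in the opposite direction from what actually happened.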

3 Principles for Next-Gen Measurement

Automated: Numbers flow straight from your toolchain, not human opinion.
Outcome-based: Focus on value delivered, not effort spent.
Real-time & historical: Dashboards should refresh at least daily and let you compare against last quarter.
Harder-to-game: Prefer metrics derived from immutable events (merge, deploy, restore), which are free of human estimation bias.

4 Metric Stack for the AI Era

Layer	What to Track	Why It Works
Flow	DORA 4 – Lead-time for changes, deployment frequency, change-failure rate, MTTR	Captures speed & stability in one bundle
Quality	Defect rate, flaky-test trend, pass-rate of AI-generated tests	Objective pass/fail; surfaces hidden regression debt
Complexity	Cyclomatic & cognitive complexity per KLOC; cost-to-produce via scc	LOC alone is noisy; complexity can normalize LLM verbosity
Experience	Dev-survey pulse (flow state, tool friction), Copilot adoption telemetry	Productivity ≠ happiness, but the two are correlated
Outcome	Features/week, revenue per deploy, customer-satisfaction delta	Ties engineering to business value
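As a rough sketch of the Flow layer, the four DORA numbers can be derived from nothing more than timestamped deploy records. The `Deploy` record shape and the `dora_summary` helper below are hypothetical names invented for this example; a real pipeline would pull the same fields from your CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    """One production deploy (hypothetical record shape)."""
    commit_at: datetime                     # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a failure in production?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_summary(deploys: list[Deploy], window_days: int) -> dict:
    """Compute the four DORA metrics over a reporting window."""
    if not deploys:
        raise ValueError("need at least one deploy in the window")
    lead_seconds = [(d.deployed_at - d.commit_at).total_seconds() for d in deploys]
    failures = [d for d in deploys if d.failed]
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    mttr_hours = None
    if restores:
        mean_restore = sum(restores, timedelta()) / len(restores)
        mttr_hours = mean_restore.total_seconds() / 3600
    return {
        "lead_time_median_hours": median(lead_seconds) / 3600,
        "deploys_per_day": len(deploys) / window_days,
        "change_failure_rate": len(failures) / len(deploys),
        "mttr_hours": mttr_hours,
    }


# Made-up sample data: four deploys in one week, two of which failed.
t0 = datetime(2025, 1, 6, 9, 0)
h, d = timedelta(hours=1), timedelta(days=1)
sample = [
    Deploy(t0, t0 + 2 * h),
    Deploy(t0 + d, t0 + d + 4 * h, failed=True, restored_at=t0 + d + 5 * h),
    Deploy(t0 + 2 * d, t0 + 2 * d + 6 * h),
    Deploy(t0 + 3 * d, t0 + 3 * d + 8 * h, failed=True, restored_at=t0 + 3 * d + 11 * h),
]
summary = dora_summary(sample, window_days=7)
print(summary)
```

Using the median for lead time is a choice made here so that one slow change does not dominate the number; a percentile would work just as well.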

5 Practical Baseline Playbook

Freeze a quarter of pre-AI data—extract DORA, defect counts, complexity, NPS-style DevEx scores.
Pilot with an A/B model:
- Control team ships as usual.
- Pilot team adopts AI-powered patterns, gen-AI tests, and auto-sizing using prescriptive guidance.
- Collect the same metrics for both teams and compare them.
Instrument everything:
- Git hooks publish commit metadata.
- MCP (Model Context Protocol) connects your tools to the AI assistant, which cuts manual hand-offs.
- CI posts DORA + complexity stats to a warehouse.
- Weekly Slack bot triggers a one-minute DevEx survey.
Run for 6–8 weeks.
Compare deltas (and normalise for scope creep). Google’s 2025 DORA-AI study associated a 25% increase in AI usage with roughly a 2% lift in individual productivity: small, yes, but it compounds over quarters.
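One way to compare deltas, sketched here under assumptions, is a simple difference-in-differences: measure each team's change against its own frozen baseline, then subtract the control team's change so that anything hitting both teams (scope creep, seasonality, a reorg) cancels out. The function names and the sample figures are hypothetical.

```python
def pct_change(baseline: float, current: float) -> float:
    """Percentage change of a metric relative to its frozen baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) / baseline * 100.0


def net_pilot_effect(control_base: float, control_now: float,
                     pilot_base: float, pilot_now: float) -> float:
    """Pilot change minus control change, in percentage points."""
    return pct_change(pilot_base, pilot_now) - pct_change(control_base, control_now)


# Made-up example: deploys per week during the baseline quarter vs the pilot window.
effect = net_pilot_effect(control_base=10, control_now=11,
                          pilot_base=10, pilot_now=14)
print(f"net effect attributable to the pilot: {effect:.1f} percentage points")
```

In this invented example the pilot team improves by 40% and the control team by 10%, so about 30 percentage points are attributed to the pilot. For metrics where lower is better, such as lead time, a negative result is the improvement.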

6 Modern Alternatives to Story-Point Sizing

Method	How It Works	Pros	Cons
LLM-based sizing agent	Prompt an LLM with codebase context + story text; returns a complexity bucket (XS–XL)	Consistent, instant, less prone to human estimation bias	Needs prompt tuning; difficult to plan sprint capacity
Automated scope-diff	Measure delta in file-touches & complexity after merge; back-fill effort score	Zero manual input	Post-factum only; can’t forecast work
Third-party “effort” heuristics	Tools like scc run COCOMO-style cost estimates on PR diff	Works per-commit; language-agnostic	Estimates, not actual hours
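The automated scope-diff row could look something like the sketch below. The `MergeStats` shape, the `effort_score` formula, and both weights are assumptions made up for illustration; they are not a standard formula and would need calibrating against your own history.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MergeStats:
    """Post-merge facts about one change (hypothetical record shape)."""
    files_touched: int
    complexity_before: int  # e.g. summed cyclomatic complexity of touched files
    complexity_after: int


def effort_score(m: MergeStats,
                 file_weight: float = 1.0,
                 complexity_weight: float = 0.5) -> float:
    """Back-fill an effort score from what the merge actually changed.

    The absolute complexity delta is used so that simplifying code
    counts as work too. Both weights are arbitrary starting points.
    """
    complexity_delta = abs(m.complexity_after - m.complexity_before)
    return file_weight * m.files_touched + complexity_weight * complexity_delta


merge = MergeStats(files_touched=4, complexity_before=30, complexity_after=42)
score = effort_score(merge)
print(f"back-filled effort score: {score}")
```

Because the score exists only after the merge, it suits retrospective trend lines rather than sprint planning, which matches the con listed in the table.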

Pro tip: If you must keep sprints, map your new complexity buckets to Fibonacci points—but lock the mapping in a config file so it can’t drift.
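A minimal sketch of a locked mapping, assuming the config lives in version control as JSON. The bucket-to-point values here are hypothetical, and the config text is inlined only to keep the example self-contained.

```python
import json
from types import MappingProxyType

# In practice this would be read from a reviewed, version-controlled file.
CONFIG_TEXT = '{"XS": 1, "S": 2, "M": 3, "L": 5, "XL": 8}'

ALLOWED_POINTS = {1, 2, 3, 5, 8, 13, 21}  # Fibonacci values commonly used for sizing


def load_mapping(text: str) -> MappingProxyType:
    """Parse the config, validate it, and return a read-only view."""
    raw = json.loads(text)
    for bucket, points in raw.items():
        if points not in ALLOWED_POINTS:
            raise ValueError(f"{bucket}: {points} is not an allowed point value")
    return MappingProxyType(dict(raw))


BUCKET_TO_POINTS = load_mapping(CONFIG_TEXT)


def points_for(bucket: str) -> int:
    """Translate a complexity bucket (XS-XL) into story points."""
    return BUCKET_TO_POINTS[bucket.upper()]


print(points_for("M"))
```

The read-only view stops code from mutating the mapping at runtime, so the only way to change it is a reviewed edit to the config itself.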

7 Key Takeaways

Stop treating story points as gospel. They were a workaround for data you can now collect automatically.
Baseline early, iterate often. Measuring after you scale AI is like weighing luggage after the flight.
Use a portfolio of metrics. Flow + Quality + Complexity + Experience paints the full picture and resists gaming.

Clayton Davis Blog