Why Story Points Don’t Work

Alejandro Wainzinger
11 min read · Jan 4, 2020

The poor authors of the Agile Manifesto had high hopes for humanity when they outlined a way to determine a methodology for doing work. As with many hopes, they were dashed by humanity’s desire for silver bullets and an inability to grok that there is no one-size-fits-all solution to every problem. Worse yet, like incorrectly translated scripture, people chose to cement whatever they felt sounded nice, exclude any logic behind it, and go about religiously trying to convert everyone to their will. They even wrote books of their own, and made their own clergy. Plenty of ink has been spilt explaining why the Agile Manifesto’s mission failed, by one of its own authors no less. This is not about that. This is about story points, and why accurately estimating complexity in unitless numbers is not only impossible, but shouldn’t even be a goal to begin with, according even to the potential creator of story points.

After years of working in many places all claiming to be “Agile,” no two of which agreed on almost anything except the use of JIRA, none of whom ever achieved predictable burndown rates, nor reaped any of the other supposed benefits of their “Agile” approach, I feel the need to break this house of cards down once and for all. The snake oil sales must cease. The buzzwords must die. This is my final desperate attempt at convincing everyone that the Earth is not flat. If after this you are still convinced that story points and burndown rates work, then it is unlikely we will ever agree on many things related to productivity.

No True Scotsman Preamble (“this isn’t really Agile/story points”)

This being the internet, more than one person will come out to armchair critique with how story points are being done wrong, and not “the right way.” Fair enough, but all I ask of your comments is that you explicitly state what exactly you see as the right way, and why.

What’s in a story?

In the world promoted by tools like JIRA, a user story is “the smallest unit of work in an agile framework. It’s an end goal, not a feature, expressed from the software user’s perspective.” Sounds simple. It isn’t. If an end goal is the ability to do something, where do you write about features? Are features solely a concern of developers, or should the users be involved in defining them as well? And if a story is a goal, and not the set of specific tasks a developer is doing to fulfill that goal, wouldn’t the smallest unit of work be the tasks within the story? It is this kind of ambiguity that calls the entire concept into question.

But sure, let’s pretend this definition makes sense. The problem statement is this: given a user story, estimate its complexity in a unitless number. Often it is restricted to the range of 1–10, sometimes 1–5, sometimes it’s in “t-shirt” sizes or another arbitrary meaningless metric.

Let’s make some simplifying assumptions. All engineers are roughly of equivalent skill, have a similar set of skills, are present throughout an entire sprint, are not part of any other sprint, work predictable consistent hours, and agree 100% on point estimates. None of this is true in reality, but let’s assume it is. Here is an incomplete list of things that story points do not and cannot take into consideration:

  • Ramp-up time (uninterrupted time in which to understand a problem, load it into the human brain, and process it to begin tackling it)
  • Interruptions (meetings, random person comes to your desk, fires, other projects, shifting business priorities, personal emergencies)
  • Hidden complexity (“this was supposed to be easy”, upstream brokenness, infrastructure limitations)

Ah, but the burndown rate will solve for these hidden variables and will allow you to lower your expected points for the sprint, you say. Sure… with the planet-sized caveat that the hidden variables are relatively constant in their impact on the burndown rate. Reality check: they are not. Wild swings in interruptions, and large variability in ramp-up time and hidden complexity, will ensure the burndown rate fluctuates to the point of becoming meaningless (other than being an awesome graphic of a rollercoaster).
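
To make the fluctuation concrete, here is a toy Monte Carlo sketch (the noise model and all its ranges are made-up assumptions, purely for illustration): even with a fixed plan and a fixed team, random draws for interruptions, ramp-up and hidden complexity swing the completed points from sprint to sprint.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

def sprint_velocity(planned_points=30):
    """Points actually completed in one sprint, after hidden variables bite."""
    # Illustrative noise model (invented ranges, not measurements): each
    # factor eats a random fraction of the sprint's capacity.
    interruptions = random.uniform(0.0, 0.4)      # meetings, fires, drive-bys
    ramp_up = random.uniform(0.05, 0.25)          # loading the problem into a brain
    hidden_complexity = random.uniform(0.0, 0.5)  # "this was supposed to be easy"
    capacity = max(0.0, 1.0 - interruptions - ramp_up - hidden_complexity)
    return round(planned_points * capacity)

velocities = [sprint_velocity() for _ in range(12)]
print(velocities)
print("spread:", max(velocities) - min(velocities), "points, on a 30-point plan")
```

Run it a few times without the fixed seed and the “velocity” jumps around enough that averaging it tells you very little about the next sprint.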

One underlying assumption is that given a story with some number of points, one can find another story that has the same number of points (and isn’t lying). But if the tasks to be done differ greatly, and a large spectrum of tasks are performed at any given time, then there is no template story to base estimates off of. Some tasks are comparable, others are not.

“Some estimates are better than no estimates.” Incorrect. The old business tenet of underpromise and overdeliver ends up becoming overpromise and underdeliver, which makes everyone look bad. Estimates are indeed important, but are more usefully made at the quarter (3-month) level, the same granularity as business and resource planning. Ask yourself this: would you rather claim something takes 2 weeks and be a quarter late, or estimate 1 quarter and be a quarter late?

Goals and Realities of Story Points

What are story points supposedly used for? What’s the end-game here? Let’s break down the purported benefits of story points, and why none of these goals are achieved.

  • Predictable burndown rate

Assuming a fixed sprint length and accurate point estimates, there should be a predictable burndown rate. This lets a team estimate, with a high degree of certainty, what will be done within a fixed length of time, thereby enabling better resource planning, including new hires or reallocation of personnel. Sounds fantastic.

Problem: accurate point estimates don’t happen, so the burndown rate is unpredictable. Let’s assume for the sake of argument that we can tolerate a relatively high degree of unpredictability. But if this is so, and we tolerate +/- 1 quarter, then story points do not gain us anything, contradicting our assumption. Alright, let’s say we are more stringent, and assume a sprint length of 2 weeks. Ideally, if the burndown error rate is +/- 1 sprint, then our worst case scenario is that a task takes 2 sprints = 1 month. How many tasks can you think of that would take over 1 month, that you wouldn’t be able to tell ahead of time need to be split up into smaller pieces? I’d wager very few for veteran developers. “Ah, but we’re not just measuring one task in isolation, we’re measuring a group of tasks.” Fair enough, then by the above error rate assumption, a set of tasks that you thought would all take 2 weeks has now taken 4 weeks total. Again, how likely is it that you could misjudge a task or set of tasks so badly that you wouldn’t break them down into smaller pieces? “Ah hah! Very likely, because people are notoriously bad at estimat… oh I see what you did there.” Mhm.

  • Predictable quarterly planning

If you can predict what gets done during a sprint, then by induction, you can predict what gets done in a quarter. Now, you can give wonderful roadmaps to the leadership, you deliver everything on time, and everything is wonderful.

Problem: Yeah, the burndown rate not being predictable means, by induction, that the quarterly planning won’t be predictable either. But worse than that, the roadmaps are likely far too ambitious, because the errors in estimation compound over the length of your roadmap. Ouch.
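
That compounding is easy to put numbers on. A back-of-the-envelope sketch (both velocities are invented, for illustration only): suppose a team plans 30 points per 2-week sprint, but the hidden variables mean it really averages 24.

```python
planned_per_sprint = 30  # what the roadmap assumes (illustrative)
actual_per_sprint = 24   # what actually gets done on average (illustrative)

for sprints in (6, 12, 26):  # one quarter, half a year, a full year of 2-week sprints
    shortfall = (planned_per_sprint - actual_per_sprint) * sprints
    sprints_behind = shortfall / actual_per_sprint
    print(f"after {sprints:2d} sprints: {shortfall:3d} points behind "
          f"(~{sprints_behind:.1f} extra sprints needed)")
```

A modest 20% per-sprint miss quietly becomes one and a half extra sprints by the end of the quarter, and six and a half by the end of the year.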

Points of Failure

So aside from what’s already been mentioned, what are some of the things that go wrong with story points?

  • Fibonacci points

Sometimes the numbers are suggested to be Fibonacci sequence numbers. The idea is that all team members must reach a consensus on the estimate of a story, such as by playing Planning Poker (yes, this is a real thing people apparently do), and the non-linear spacing between successive numbers speeds up resolving disagreements. While this can technically reach a faster consensus, it is not guaranteed to reach a better estimate: the people estimating a task vary in skill level and skill set, and only one developer will end up working on the story, whose own estimate has now been shifted by everyone else’s input. The error is compounded if the team roster varies within a sprint. The Law of Large Numbers won’t save you either, because sprints are largely independent, unless you work on a very specific subset of repeated problems.
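
The coarseness of the scale itself is easy to demonstrate. A hypothetical snapping helper (the function and scale are mine, for illustration, not part of any real tool) shows how raw estimates get distorted when forced onto Fibonacci cards:

```python
import bisect

FIB = [1, 2, 3, 5, 8, 13, 21]  # a common Planning Poker scale

def to_fib(estimate: float) -> int:
    """Snap a raw estimate to the nearest card on the Fibonacci scale."""
    i = bisect.bisect_left(FIB, estimate)
    if i == 0:
        return FIB[0]
    if i == len(FIB):
        return FIB[-1]
    lo, hi = FIB[i - 1], FIB[i]
    return lo if estimate - lo <= hi - estimate else hi

# Estimates ~15% apart collapse onto the same card...
print(to_fib(4.5), to_fib(5.2))   # -> 5 5
# ...while estimates ~30% apart land two cards (a 62% gap) apart.
print(to_fib(8.5), to_fib(11.0))  # -> 8 13
```

The scale simultaneously erases real differences and exaggerates small ones, which is fine for ending an argument quickly and useless for improving the estimate.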

  • Multiple concurrent sprints

In many cases, a developer is working on several projects, each of which tends to have its own separate sprint. Any errors in estimation are compounded by the hidden variables at play when a developer switches context, as well as by the sum of any intrinsic variability of the respective projects. Aside from it not being obvious exactly when to context switch or which project to give higher priority, this will naturally throw off the estimated total burndown for each project in a given sprint. “Well, you shouldn’t have people assigned to multiple projects,” you say? Sure, that would work in an ideal world, but in a world of constant engineering shortages, it’s unrealistic. Which reminds me…

  • Unpredictable resourcing

People are swapped in and out of a project. Vacations are taken. People get sick. Projects get deprioritized, then suddenly reprioritized. If you’re using planning poker and having to agree on story points with others, then every time the team changes roster, or its priority relative to other projects changes, the burndown rate will swing wildly as well.

  • Point estimates tied to the context of when a story is scheduled, which varies the point cost

If estimates attempt to take into account the current level of chaos or noise (caused by anything from company politics to sudden arbitrary deadlines to inconsistent requirements gathering), this adds to the story point error, since it is reasonable to assume the chaos prediction will be less than great.

  • The sum of story points in a sprint is different than the sum of its parts

One underlying assumption beneath story points is that any two stories with the same point value are interchangeable in the sprint (allowing, of course, for dependency ordering). In reality, some stories that are not dependencies of each other, but that are more similar in nature, are more likely to be completed faster when scheduled together, because doing so involves less context switching. However, it’s not always obvious when to do this, nor is it always possible (especially if limited to a particular number of story points in a sprint), and so some work gets unnaturally split apart. Stories of similar point values are not interchangeable, except insofar as their point estimates are both likely to be off.

  • “Shirt sizes” and other nonsense as if somehow this changes anything

Needs little explanation. Excrement, by any other word, is still excrement.

Complexity

Defining complexity is not straightforward. We can certainly give binary answers of whether or not some task has a certain kind of complexity, but describing its degree with a non-qualitative number is meaningless. What is a 3? What is an 8?

It is a common claim that story points estimate complexity, not time, but in practice complexity has a high correlation to time taken. Humans can measure and understand time, and are likely to work back to complexity based on their idea of how long something would take. Telling a human not to think of time when measuring complexity is like expecting a fish to swim without taking into account water. It’s absurd.

Nevertheless, sooner or later someone insists on story point estimates, and this is how the movie plays out.

  1. Story point estimates are enforced
  2. Burndown rate is extremely variable
  3. Intervention, urging better estimates
  4. Developers give almost all stories the same high point value to take into account margin of error
  5. Parkinson’s Law dictates that humans will fill the gaps in the overestimates
  6. Burndown rate stabilizes, but actual progress is hidden behind story complexity/ambiguity
  7. Quarterly roadmaps are just as inaccurate as before, or more so
  8. Team either realizes this is meaningless and stops estimating story points, or the system continues and developers continue to game the system
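
Steps 4–6 of that movie can be sketched numerically (all numbers invented, purely for illustration): once every story gets the same padded value, the estimate stream becomes perfectly stable, and perfectly uninformative, while the real variation in effort is unchanged underneath.

```python
import random
import statistics

random.seed(7)  # fixed seed for reproducibility

# True effort of 40 stories in developer-days (unknown to the team).
true_days = [random.choice([1, 2, 3, 5, 8]) for _ in range(40)]

# Before the intervention: honest but noisy estimates -> volatile burndown.
honest = [max(1, d + random.randint(-2, 2)) for d in true_days]

# After the intervention: everything is "an 8" to absorb the margin of error.
padded = [8] * len(true_days)

print("honest estimate stdev:", round(statistics.stdev(honest), 2))
print("padded estimate stdev:", statistics.stdev(padded))  # -> 0.0, perfectly "predictable"
print("true effort stdev:    ", round(statistics.stdev(true_days), 2))
```

The padded burndown chart looks beautifully flat; it just no longer measures anything, which is exactly step 6.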

So what do we do?

“Alright alright, so story point estimates don’t work. We should all just give up and blindly march forward, right?” Well now hold on there, nobody is saying that time estimates in general are without value. Here is a better way of looking at the world of software development.

1. Priorities first

Asking how long something will take before deciding on its priority is like comparing how long heart surgery will take to how long a run will take while you’re having a serious heart attack. The surgery comes first, no matter how long it takes. Do not put the cart before the horse. Prioritize first.

2. Break down goals

Going along with JIRA, break epics into stories, and if need be, subtasks. This requires a bit of exploration, and some rapid prototyping at times, to get a better idea of what needs doing. It turns out that simply breaking things down gives a better notion of complexity of an overall goal than individual point estimates on the stories that achieve the goal.

3. Estimate time

Given a quarter (3 months), going top to bottom on the priorities list, estimate which goals you believe can be reasonably completed within the quarter. Repeat for further quarters, continuing down the priorities list as desired, with the understanding that the further into the future the quarter is, the higher the degree of error.

4. Execute

If using sprints, go down the stories in order of the priority of their corresponding epic, or necessary order of operations (e.g. create database before optimizing it). How many stories can you get done in the sprint ideally? Go with your gut. In the worst case, this is what you get from point estimation anyway. In the best case, you finish your sprint early and take on more things from the backlog. Perhaps your output per week is less predictable, but if the per-quarter estimates are accurate, this is acceptable.

“But what about the flexibility normal sprints and point estimates give us in moving unexpected projects into the quarter?” Wait, according to the story point estimate theory, wouldn’t this mean that some existing goal has to be pushed out to another quarter to accommodate this new goal? And don’t you also have the flexibility to do the same with the per-quarter estimates? In fact, it turns out that moving a problem elsewhere does not resolve it. Unpredictable business needs are unpredictable.

“Alright, I accept quarterly roadmaps, but how can I ensure that we are likely to reach our goals?” Excellent question! Now we’re reaching the heart of the matter. Developers work best when: business requirements are clear, design specifications are clear, their time is not wasted in meaningless meetings, nor wasted in arbitrary and meaningless complexity-estimating point systems, and people respect their time. If a project has a clear purpose and meaning, and coworkers are great, even better.

The hard truth is, complexity is hard. If we manage our expectations to reflect reality, we can get reasonably accurate estimates to within a quarter or two. Failing that, there are deeper problems afoot (e.g. indecisive management, ineffective personnel). If your business runs week-to-week, software development output variability may kill you. On the other hand, if you’re like so many other businesses out there just trying to get some grasp on when to expect things within a year, with careful prioritization and specifications and some hard work, you are likely to achieve your goals.

The Real Goal is Quality Delivery

There is a tendency to blindly prescribe process for its own sake, especially if a particular process has gained popularity, but a process should exist in service of a goal, and that goal in software should be quality delivery. It is better to deliver a quality experience a bit late than to deliver sub-par work a bit late but with story points.

To instead optimize for timely delivery, which seems to be the ultimate end goal of measuring things in story points, the focus should be on chiseling the business requirements to a point where developers can get to work quickly, then not shifting those requirements around, and giving developers the maximum possible focus to achieve their tasks. This is a greedy algorithm strategy. Optimize for focus, clarity and environment, and you will optimize for output. Of course, taking this too far leads to analysis paralysis, and where to draw the line is a judgment call, but one that has no need of story points. Attempts to optimize story point predictability are, well, missing the entire point of this story.
