As many organizations move to an agile approach, many coaches recommend an estimation method called “Points” or “Story Points”. This approach differs from traditional approaches, and so causes discomfort to many, especially at the beginning. This note is about capturing the basics so that, if nothing else, we have a baseline of thinking.
We need to remind ourselves why it is we estimate. Let's face it, estimation is mostly about planning and forecasting. There is a huge amount of baggage in most organizations associated with the process of estimation. There are also legitimate questions the business has:
From the business perspective, the main reason for estimates is to provide data to the business (through the Product Owner / Manager) to understand, manage, and forecast plans. They are trying to make sure that, for a given capacity for the organization (enterprise, program, or team), we make trade-off decisions about how to best use that capacity based on the strategy we have and the need for self-investment, and then understand progress against these decisions. A secondary reason is to help the organization understand their capacity so they do not over-commit; so they can establish a sustainable pace.
Agile requires that we provide the business (through the Product Owner / Product Manager) with good enough data, that we work to improve the estimates when they do not provide the data required. In other words, when we say “make estimation work” what we mean is that the business can easily plan using the estimates and velocity and can make informed business decisions. If the business cannot make it work, it is up to the team / team-of-teams to help fix the problem.
The agile approach to estimation stresses speed, full team involvement, and information that is accurate enough for the purpose intended, not pure precision. This is aimed at reducing the burden of estimation, while still providing the data needed.
For many organizations, we need to be able to forecast. But, as Ron Jeffries says:
“Yes, estimation is fraught. It is inaccurate, and politically dangerous. But we do have some knowledge and the project deserves to have it.”
Estimates are not evil in and of themselves. But the results are often used for evil. Dilbert summarizes the traditional approach and the way we feel about it. Problems include:
The agile approach to estimation is aimed at improving these outcomes.
While the initial benefit to the Team of estimating is in understanding their capacity to deliver value, there are in fact other benefits of estimating to the Team:
I like this quote from Steve McConnell:
“The point of estimation is not to predict the future but to understand if we are even within a chance of managing our way to success.”
The business view of this “chance” is the forecast use of capacity. The Team / Team-of-Teams (Train) view of this is to work to their capacity.
The basis of the estimating process is “Team based relative size estimates” or “Story Point estimation” to estimate User Stories (the need). This means:
If two work items are the same size, should they always have the same estimate?
It is worth delving into the notion of a relative size estimate for a moment. Most traditional estimates are based on duration (“it will take about 20 hours”), and that leads to a whole bunch of problems (see What Are The Problems with Estimation? for more information). But it is worse than that. If you have two pieces of work to do, both of which you think will take about 20 hours to complete, but the first piece of work is straight forward, while the second is in an area where there is a traditionally a lot of problems, should these items have the same estimate? No, you'll probably want to put some level of buffer in place to offer a more realistic estimate for the second piece of work. This is another reason to move away from pure, absolute time based estimates.
One thing that often confuses people who are used to traditional task based estimates is that the resultant estimate is a Team's view of how big this item is. When we say the estimate is a “3”, we are really saying that the Team's view of this piece of work is that it is a “3”. In particular, it is not a single Team member's view. When starting, many Teams fall into the trap of saying “There are design, implementation, and testing components of this piece of work. George will do the design so what do you think the estimate is for the design piece … Jenny will implement, so what do you think the estimate for the implementation piece is …” and then just sum up the results. In Agile the unit of execution is the “Team” and so the estimates indicate the size of work for the Team not the individuals on the Team. Sure the time to “design” and “implement” are part of that. But to deliver the value represented by the Story the Team might, for example, want to pair implementation and testing as they do the design to improve the design. Or the Team might find that testing might need an “all hands on deck” approach to assure quality. What we are estimating is the Team's ability to deliver value.
In Agile the unit of execution is the “Team” and so the estimates indicate the size of work for the Team not the individuals on the Team.
For more information on the basis of the process see Planning Poker for Estimates from Mike Cohn. Affinity mapping is based on Play, Pass, or Move approach.
Estimates need to reflect reality. Some basic characteristics of what we need from estimates include:
It is worth saying again:
The estimates consistently reflect the capacity of the Team to deliver.
They must be based on reality.
Story Points should be used instead of absolute hours because people are naturally better at relative estimation. Relative estimation with story points:
But Story Points are not equal to hours. A lot of people like to equate hours or days with Story Points. Sometimes this is because there:
If, for example, you are using SAFe, and have never done the Story Point approach before, SAFe introduces the idea of standardizing points where “the shared understanding of 1 story point 'is '~ 1 day of effort to develop & test a story'”. This ensures we were able to do reporting at higher levels by rolling up numbers. SAFe also establishes the initial Velocity for the Team by multiplying the number of people on the Team (not including PO and SM) by 8 (days) and then adjusting the result down by subtracting the number of days that people would not be available during a Iteration. The training material indicates that this was a starting point.
The problem is that many people are still using this understanding when estimating Stories by saying things like “a day is equals a point and so since this item is about 10 days, the estimate is 10 points”. They then fudge the number to an 8 or 13 because “we are only allowed to use the Fibonacci numbers.” This is not the intent, and if you head in this direction there is no benefit to using Story Points.
To be clear, if the estimate is really just a proxy for time (hours or days), we should stop calling them Story Points. After all, they are just “ideal days” or “ideal hours” with all the inherent problems. Daryl Kulak says:
If a development team is equating some number of hours to a storypoint, it is missing the point (pun intended). Saying, “Thirty hours equals one point” is ridiculous. If you’re doing that, just use hours for goodness sake. You haven’t gained anything from storypoints except to be able to claim to be doing “agile.”
Here are some things to understand when using Story Points instead of a more traditional approach to estimating.
Team Velocity is expected to increase over time as the Team learns how to deliver value more effectively. If this is grounded in time-based estimates, you cannot expect Velocity to increase as the reality is that there is only so much time available in a two week period (80 hours, right?). Velocity numbers usually increase over time as the Team learns how to work together more effectively and so their effective capacity increases.
Sure, “how long” you expect something to take is part of the estimate, but you also want to factor in risk / uncertainty, and complexity. For example, if the work looks like it will take the Team a day to do, but there is a lot of risk involved, then the Team might want to give a higher (not 1) estimate – say a 2 or a 3. If the duration is expected to be 8 Team days, and it is low risk and complexity, the Team might consider the estimate to be an 8, or the Team might say “we are rubbish at estimating week long efforts so let’s give it a 13 to address this uncertainty.” In this example the 13 hopefully leads to a discussion about splitting the story to improve its chances of delivery.
Estimates should be relative to something; a known piece of work. Start with a 1 sized piece of work (about a day, low risk, low complexity, and low uncertainty) that the whole Team can understand. Then, as you do other estimates the Team talks in terms of “in comparison to this 1 sized piece of work, I think this is about 3 times bigger, so this is a 3.” Perhaps you also identify another sized piece of work, say a common understanding of an 8 sized story. Having an understanding of two different sized stories will help the Team triangulate the estimate for any new stories – “its bigger than this 1 sized story, but smaller than this 8 sized story so it a …” Some Teams call these known stories their “keystone” stories.
Each Team will have a different view of the factors that contribute to “medium risk” but within a Team they will have a shared understanding. So one Team might say “if it involves new screen then, at a minimum, it is an 8 because we need to work UX issues, work closer with the customer, …” Another team might have UX expertise and so this may not be a reason to differentiate for that team. But they may have an approval process that seems to slow things down, so stories that have this approval step are larger. Each team will have a different view of the factors that effect “size” of work for them.
Over time, as Teams do more estimating, the Team will start to build up a table in their heads:
This chart is an example of how teams develop their view of “size.” The Team that developed this chart had specifics about what they thought size meant, including complexity and risk. I sanitized it for the article. So, for example, for this Team “risk” meant “known risks” while “uncertainty” was “unknown unknowns” (and there was more of this the further out the work was.) In their minds it was not the same thing. To be clear if your Team formalizes the development of a chart like this, their chart will look different based on the work, your Team’s make-up, etc.
What is really cool about estimating this way is that you will establish a stable Team velocity over time which will help the Team make and meet commitments. As you determine future work, you can rely on the historical velocity of the Team since the numbers already factor in different risk, complexity and uncertainty profiles.
Or “When Should We Update Our Estimates?”
Old habits dies hard. One of the interesting things that happens when we estimate using Team based Story Point estimates is that once the work is complete, Teams want to update the estimate based on what they know now, and so “improve” the estimate.
This is a mistake.
The reality is that no amount of additional planning would have resulted in a better estimate. In fact the evidence indicates otherwise; estimates get worse the more time you spend on it (mainly because you are building guesses on top of guesses as you think longer about the problem). It is simply wrong to “blame” the estimate for the fact that there is now more work than expected. The only way you could have found this out was by doing the work, getting the feedback from the customers. (See Our Estimates are Terrible! for an example of this kind of thinking)
What this tells you is that you should not update a record of the estimate once you have started real work on it. That way an estimate stays an estimate, and has the same basis (and is good as) as any other estimate that we have in the system. Using this basis we can compare all the estimates, safe in the knowledge that they have the same status - “we haven't start working on this yet”.
There is nothing wrong with collecting information about how much effort something actually takes; just don't call that an “estimate” as it represents what “actual”ly happened.
What does this mean in a practical sense? Lets take a Story as a starting point. Associated with this Story should be a field which stores the “estimate”. The estimate is filled by the Team as a result of their favorite estimation method. The “actual” field for the Story is currently “null”. The estimate can be refined while the Team is not doing the work associated with the Story:
All these things might necessitate an update to the Story estimate.
Now the Team starts working the Story. Perhaps they committed to it during Iteration (Sprint) Planning. The Team is decomposing the work, determining who does what etc. At this point you should not update the “estimate” for this Story any more. As you complete the work, you might update the “actual” field.
This kind of thinking applies the higher level structures (Epics / Features) as well.
The basic rule is that you shouldn't update the estimate once you have “started work” on it so you can look at all the data in your (Portfolio / Program) backlog with a single understanding of the status of the work effort - its a guess / no work. How you determine what it means to “start work” on an Epic or Feature will probably require some discussion since it may not be as clear cut as the Story / Team example.
Often, you need to provide estimates for items of work that are beyond the size of a typical Story, say an Epic or a Feature. For example, you might want to determine an overall road-map of intent and need to line that up with a calendar. This is only realistic if you have some kind of view of how long something will take. Or you might want to determine the Cost of Delay, which includes a sizing component in it, to determine what work you should schedule first to maximize the return.
The question therefore is “How do you go about getting size information for these bigger pieces of work?”
For the sake of discussion we are going to make a couple of assumptions:
In general there is a different focus in the “why we estimate” question as we scale to epics and features. At the team level you are interested capacity to take on work. At the program level the main use is to prioritize the work. At the portfolio level its to help with overall budgeting. Pictorially this is represented as:
(Note: Thanks to Steve Sanchez for the graphic.)
There are two basic ways to scale estimates:
In practice, the two approaches really are not that far apart from each other. What you will find is that even with the summated approach, you will end up with Features in the 10's (ie 10, 20, etc) and Epics in the 100's. The hard part is getting people to really do relative size estimates that include a view of risk and complexity at all levels. As a general note SAFe starts with the “Summated Story point” approach, but it is not that specific over the long haul. In fact SAFe assumes a relative size approach (not duration) as it says “Start with day is a point then never look back”.
What this means is that in most places I have worked, organizations end up with the Pure Feature and Epic Points approach where Feature Points are the Fibonacci numbers times 10 (10, 20, 30, …, 130) and the Epic points are the Fibonacci numbers times 100 (100, 200, 300, …, 1300). The other thing that organizations do is limited the highest number to 13 or 20, no 100’s etc, with the idea that this encourages people to split the work up if it gets to this level. So if there is a Feature Point estimate of “this is more than 130”, the discussion is “Perhaps this is an Epic? Or perhaps we need to split the work so that it will fit in a quarter.” This is a good discussion to have.
One final note on this. Many organizations I've worked with like to abstract estimation one step further by using t-shirt sizes for estimates. For many it is easy to say “in comparison to this small piece of work, this is a large”; it helps because since there are no numbers, you don't think about time. Once they have the t-shirt size, organizations usually settle on a mapping between these t-shirt sizes and Feature or Epic points. The following diagram shows a sample mapping that might be put in place:
Note that the mapping and the numbers would be validated to ensure that there is in fact a meaningful mapping between a feature we call “small” and the actual story points needed to complete that feature.
One final note. Some organizations I’ve worked with step back from an either or approach and work a both and approach. When little is know about the Feature, say when it is still being analyzed, they use a Feature point scaling based on S, M, L t-shirt sizing. They then equate the t-shirt size to numbers, so a S might be a 3 Feature Points, M might be 8, and so on. Then, as they learn more and understand the kind of work they have they move to a summated story point approach to Feature size; a second estimate, if you like.
We have Team based Story Point estimates. How can this information be used to forecast work?
Let's start with the simplest case first. Have a look at the following picture:
On the left you see a prioritized backlog of work in User Stories where the items in that Backlog all have Story Point estimates on them. The Team selects the set of work they think they can deliver in the next, say, 2 week period to form the “Iteration” or “Sprint” Backlog. Here the Team has selected the first 4 Stories. The estimates for these Stories are 3, 5, 8, and 5.
What the Team is saying at this point is that they think they can complete 3+5+8+5, or 21 Points of work in this 2 week period. If at the end of the 2 week period the Team actually delivers this set of work, then the 21 Points is the Team Velocity. We say “the Team's Velocity is 21 Points for this Iteration.” What this really is saying is that your Team can deliver 21 Points of User Stories in an Iteration. This is a measure of Team capacity to deliver value.
What is a good guess for how much the Team can deliver in the next Iteration? We know they can do 21 points, so we might want to use that data to say “They can deliver 21 Points in the next Iteration”. And 21 after that, and 21 after that … You get the idea. If we have a prioritized Backlog with estimates we can now forecast how many Iterations it is going to take to deliver however much of the Backlog we are interested in. (Note: This estimating approach is called “Yesterday's Weather” - What is a good guess for the weather tomorrow? Whatever it was today.)
But what happens if the Team does not deliver the whole 21 Points. Perhaps they deliver on the first three items, but miss the fourth item. One of the basic philosophies of agile is that we only record “progress” when we complete work "Working software (solutions) is the primary measure of progress". If the Team has not completed the work of the fourth item, then we should not count it in the Team Velocity. The Velocity in this case would be 3+5+8, or 16 points. And, of course, the information we use to forecast should change as well. After all, the Team was not able to do the 21 Points they originally thought. A good guess for the next Iteration, based on the data we have now, is 16 and so on going forward.
Of course the reverse is also true. The Team might have completed the 21 points before the end of the Iteration. They taken on another Story. They think “Looking at the Backlog, the 13 is too big to do with what remains of this Iteration, so is the 8, but perhaps we can take on the 3.” If they take this on and complete the work, what is the Team's Velocity? We completed 3+5+8+5 + 3, so the Team Velocity is 24 Points and we can use that information for forecasting.
The reality is that Teams do not produce the same Velocity every Iteration. Sometimes they do more. Sometimes less. Over time Teams typically get better as they learn how to work together. Every couple of weeks there is another data point. So a Team might see:
Iteration | Velocity |
---|---|
1 | 21 |
2 | 24 |
3 | 20 |
4 | 30 |
5 | 28 |
The Product Owner should leverage this information in their forecasts. A typical approach is to use three data points - the minimum, the average, and the maximum - to be more informed about the forecast. So here the minimum is 21 Points, the maximum is 30, and the average is around 24-25. We can use this information with our prioritized backlog:
The way to think about this is to look at the items between the average and maximum Velocity - these have a significant risk of not happening based on current knowledge. If you see things there that are a significant problem, then these need to be worked now. Perhaps we need to re-prioritize the Backlog to bring in the items of concern. Perhaps we need to find another Team to do the work. Perhaps …
Many people worry “But what if we get the estimates wrong?” Here's what is interesting about the approach. If we had guessed (OK, estimated) the set of items above as 5+8+13+8, instead of 3+5+8+5, and we completed this work in the iteration. Instead of a 21 point Velocity, the Team now has a 34 point Velocity. If we assume the rest of the data in the Backlog has a similar bias (a pretty good assumption), and guess 34 points for the next Iteration, then … it will produce about the same number of items in the forecast. In other words, this approach to estimating is self-correcting with respect to the forecast. That's not to say that you don't have to work to improve estimates (see What Can We Do To Improve Our Point Based Estimates?), but rather there is less concern here than at first blush.
Note that some people worry that this isn't enough data to be significant. It may not be in a statistical sense, but it is sufficient for our purposes in that it is recent and current. And the reality is that you often have more data than you think - see How Can We Forecast When We Do Not Have a Lot of Data? for more information.
It should come as no surprise that the scaling for forecasting the data takes a similar approach to the approach we take when scaling the estimating approach beyond a Team.
To create a Team Velocity, we need to have Estimates on the Stories, and we needed a time-box (2 week Iteration) at which point we measure the amount of “done” work. The Team Velocity is the sum of the Story estimates completed in the 2 week period.
We have estimates against the Features and the Epics, whether via pure Feature / Epic Points or via Summated Story Points. If we decide we want to understand the capacity of a team-of-teams (Train) organization to deliver Features, all we need to do is determine the time-box to measure this over. Most organizations settle on some kind of quarterly cadence. To calculate the Train Velocity, sum up the estimated Feature Points completed in the quarter. Lets say we get a value of 500 Points. This means the Train can deliver 500 Points of value per quarter - a good indicator for the capacity of the Train for the future.
Similar thinking can be applied to Epics at the Portfolio level.
If we are in the first quarterly increment (SAFe's Program Increment or PI) 1 and we find that we have not completed work on a Feature, what should we do with that Feature as we head into PI 2. In general, what we would like to do is determine whether it is better to work on new PI 2 Features, or finish off the remaining amount of work on the Feature from PI 1. To do this “correctly” need to determine whether it it better to invest our capacity in the remaining work on the Feature or to stop and work on new Features. This means, for the purposes of scheduling the carry-over Features, we need the “job size” to reflect remaining work when doing the WSJF calculation. This will have the natural effect of moving unfinished work higher because you'd expect it to be smaller. If we only have a couple of hours left on something and we get “all this value” then we should do this first.
This thinking is really an example of:
“what is the best use of our people’s time?“
In reality re-estimating the Features can be a lot of work and the end result is often “we need to finish the Features we have started.” Most organizations end up saying that “we will treat Feature carry over work differently”. Let's say we have some uncompleted Features in the current PI, PI 1. Rather than re-estimating the Feature for work that is carried over, we will simple treat carry-over work as “top priority” and assume that it will be scheduled first. These become the top items on the list for PI 2. Then the only thing we do a WSJF on is “new” PI 2 stuff that is coming out of the Program Backlog.
If you do this process, you would want to do a quick review of the carry over work asking “have we learned something that would say we shouldn’t just finish the work of this feature in the PI?” You are looking for things that still have a lot work to complete. For example:
These items should not be treated as carry-over but rather should be compared (WSJF) to all the other PI 2 items coming in to determine if they really are the most important things to work on in PI 2.
Similar thinking can be applied to Epics at the Portfolio level.
Requirements typically are represented in an organization as a hierarchy: Epics breaking down into a number of Features, Features into a number of Stories:
This structure is typically reflected in the work management tool you have. So Stories will be parented by Features. Features by Epics. You can then show completion of Features and Epics by showing the Story Points completed versus the Story Points still to be completed. The follow chart is an example showing progress toward completing an Epic:
Notice that the amount of work we have associated with the Epic goes up as well as down. This is expected. As we do the work, we learn more and more about the business need and how we address that need. Sometimes this means we identify additional work to do; sometimes it means we can reduce the amount of work to do. This is the point of agile - scope is not fixed - and we will evolve our plan by continuously re-planning the work based on what we know today.
Ever wondered where the story point / relative sizing based estimating approach came from. A lot of the thinking is based on the “Wideband Delphi” estimation method, a consensus-based technique for estimating effort. This, in turn, was derived from the Delphi method which was developed in the 1950-1960s at the RAND Corporation as a forecasting tool. It has since been adapted across many industries to estimate many kinds of tasks, ranging from statistical data collection results to sales and marketing forecasts.
In addition, it turns out there was significant amount of research into doing estimates, since this was considered such a core part of the software development process. The result is that this estimation approach has been validated by that research.
Looking at the approach:
The numbers refer to the following references:
And if you've ever wondered why we use the modified Fibonacci sequence instead of some kind of simple 1 to 10 scale, see Why Progressive Estimation Scale Is So Efficient For Teams.
In addition, this approach leads to:
You will also note that these references are quite old, so there really is no surprise that people continue to learn and experiment with different approaches. The reason I bring this up is that there are alternative approaches out there.
The traditional approach to estimation is based on team relative size estimates is backed by research. The approach has proven to improve outcomes associated with estimation, but that does not mean this is the only estimation approach we can use.
Alternative approaches include:
Organizations can and should improving their process to so that they produce the least amount of wasted effort in return for the information gained.
General ideas:
SAFe also talks about estimating: