Driving Efficiency in the Cloud: Designing Cost Optimization Benchmarks and KPIs

Andy
8th August 2023

Introduction

Flexibility, scalability, global reach and access to the latest technical innovations are often cited as (valid) reasons to migrate to the cloud. These enable very dynamic use of technology, but they also mean costs can quickly spiral out of control.

Measurable benchmarks and KPIs are key to staying in control. They’ll help you identify when corrective action is needed and ensure you invest the right amount of time and money to get a valuable return on your optimization activities. In this article, we’ll discuss the benchmarks and KPIs we’ve defined for our cloud cost optimization activities. These help our customers hold us to account but may also help you define valuable benchmarks and KPIs to use in your environment.

In the basic equation of cloud cost = what you use * the rates you pay for that usage, our focus is on optimizing the rates you pay. We do this for AWS cloud usage using commitment-based discounts such as Reserved Instances and Savings Plans.

Risk and reward often come hand in hand, and commitment-based discounts are no exception. For the “reward” of rates that can be discounted by over 70% you must accept the “risk” of making a commitment to use a particular resource or spend a given amount every hour for 1 or 3 years. This risk might cost money and/or limit your freedom in making technical changes. Do you make the change and pay for both the old and the new, or do you hold back and lose the benefits you expected the change to deliver? We’ll discuss why rate optimization KPIs should account for both risk and reward.

Typically we see that organizations either:

  • Focus on the rewards and make large commitments, only to start wasting money or limiting technical choices when their needs change.
  • Are aware of the risk but cannot adequately measure, manage or mitigate it, so take a cautious approach which limits the rewards and inflates their cloud spend.

Neither of these is ideal!

Key Takeaways:

  1. Risk and reward. Be sure to account for both so that you can make savings that do not limit your technical freedom to take advantage of the flexibility cloud offers.
  2. Separate usage and rate. Optimizing what you use in the cloud and the rates you pay for it require entirely different skill sets. It is important to recognize this and define KPIs that are sufficiently isolated to ensure those accountable feel they have full control to influence the results of their KPIs.
  3. Pay attention to sustainability. Efficient cloud usage minimizes energy consumption, resource wastage and carbon emissions. By isolating your usage and rate optimization KPIs you can reuse your usage KPIs to track sustainability goals. We all share a social responsibility to take this seriously and will increasingly be held to account for what we use for bigger reasons than cost alone.

Rate – Risk and Reward

Optimizing Rates

For the sake of this discussion we’ll consider two broad ways of optimizing rates paid for cloud services:

  1. Usage-based
    For example, volume discounts. These are defined and automatically applied by the cloud provider as appropriate for your usage levels. We do not factor these usage-based savings into our rate optimization KPIs because we have no ability to influence what cloud resources are actually used. We’ll come on to the importance of influence and accountability in setting KPIs later.
  2. Commitment-based
    Commitments are made to use or spend a defined level of particular services/resources in exchange for a discount. They can include AWS Savings Plans, Reserved Instances and organization-wide agreements. As we describe in more detail in our article on Understanding AWS Commitments, there are lots of options available when making commitments. Combining these options effectively to track usage and balance the risks and rewards requires expertise and purpose-built tooling. This is where rate optimization KPIs should focus.

Measuring the Rewards

Commitment performance is often discussed in relation to:

  • Coverage (%): represents how much of your usage that is eligible for a commitment discount actually receives one. Coverage alone will not tell you what discount you’re accessing or how much lock-in risk you’re exposed to.
  • Utilization (%): the benefit of commitments is applied on an hourly basis. Utilization represents the proportion of time you’ve had usage that is eligible to receive the commitment’s discount. For every hour with no eligible usage, the cost of the commitment must be subtracted from the savings generated in the hours with eligible usage. The closer you can get to 100% utilization, the more efficient the commitment is. As long as the savings made during hours with eligible usage are greater than the costs incurred in hours without it, the commitment is having an overall positive effect. This means that if you strive only for 100% utilization you will miss out on some opportunities to save. As the discount rates differ by commitment, the utilization level that still has a positive effect also varies.
  • Wastage ($): the cost associated with commitments whose utilization is so low that they cost you more than they save. This should be minimized wherever possible.
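
As a rough illustration of the three measures above, the sketch below uses a deliberately simplified model that is our assumption rather than a description of how AWS bills: a single commitment with a fixed hourly cost and a single discount rate, which in any given hour is either fully used or not used at all. All function names are hypothetical. Under that model the break-even utilization works out as 1 minus the discount, which is why the acceptable utilization level varies from one commitment to another.

```python
def coverage_pct(covered_usage: float, eligible_usage: float) -> float:
    """Share of commitment-eligible usage that actually received a discount."""
    if eligible_usage == 0:
        return 0.0
    return 100.0 * covered_usage / eligible_usage


def break_even_utilization(discount: float) -> float:
    """Utilization (0..1) below which the commitment costs more than it saves.

    Assumption: a used hour replaces on-demand spend of
    hourly_cost / (1 - discount), so break-even is 1 - discount.
    """
    return 1.0 - discount


def net_benefit(hourly_cost: float, discount: float,
                utilization: float, hours: int) -> float:
    """Savings in used hours minus the commitment cost of unused hours."""
    on_demand_equivalent = hourly_cost / (1.0 - discount)
    used_hours = utilization * hours
    unused_hours = hours - used_hours
    savings = used_hours * (on_demand_equivalent - hourly_cost)
    return savings - unused_hours * hourly_cost


def wastage(hourly_cost: float, discount: float,
            utilization: float, hours: int) -> float:
    """Dollar loss when a commitment sits below break-even, otherwise zero."""
    return max(0.0, -net_benefit(hourly_cost, discount, utilization, hours))
```

With a 40% discount, for example, this model says the commitment still pays for itself at any utilization above 60%, while at 30% utilization it shows up as wastage.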

These are all useful measures in making individual commitment decisions but in terms of defining a KPI, you need something that combines them to capture the overall effect. 

We use the “Savings Rate”, expressed as a percentage, to do this. For the period under review we sum up all the savings made by commitments and all their costs (e.g. upfront fees pro-rated for the period and under-utilization). The savings minus the costs gives us the overall commitment benefit. The savings rate expresses that benefit as a percentage of what your cloud spend would have been without it.
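
The calculation can be sketched as follows. This is a minimal illustration rather than the exact formula used in our portal: the `PeriodTotals` type and its field names are hypothetical, and it assumes that the spend "without the benefit" can be recovered by adding the net benefit back onto the actual bill for the period.

```python
from dataclasses import dataclass


@dataclass
class PeriodTotals:
    gross_savings: float     # discounts earned by commitments in the period
    commitment_costs: float  # pro-rated upfront fees plus under-utilization
    actual_spend: float      # what was actually paid for the period


def savings_rate_pct(p: PeriodTotals) -> float:
    """Net commitment benefit as a percentage of spend without that benefit."""
    benefit = p.gross_savings - p.commitment_costs
    # Assumption: actual spend already reflects the net benefit,
    # so adding it back gives the spend you would have had without it.
    spend_without_benefit = p.actual_spend + benefit
    if spend_without_benefit == 0:
        return 0.0
    return 100.0 * benefit / spend_without_benefit


# Example: $30,000 of savings, $5,000 of commitment costs, $75,000 actually paid.
august = PeriodTotals(gross_savings=30_000, commitment_costs=5_000, actual_spend=75_000)
print(f"Savings rate: {savings_rate_pct(august):.1f}%")
```

In this example the net benefit is $25,000 against a spend of $100,000 without it, so the savings rate is 25%.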

Expressing this as a percentage allows a simple comparison from one period to the next. The more advanced your commitment management approach, the more precisely it will follow your usage changes (up or down), and so the higher and more stable the savings rate will become.

We present these different costs, savings rate and history over time in our portal to clearly demonstrate the overall effect of what we’ve been doing in the background.

Managing the Risks

As we describe in our Understanding AWS Commitments article, different types of commitments have different characteristics which affect how they are applied, the nature of the lock-in they create and the options you have to change your position if your usage changes.

Whether your organization wants to create a KPI around this or not will vary, but at the very least it is important to review the “savings rate” you’re achieving against the “commitment liability” you are creating to deliver it. If you’re creating large savings today by making large, long-term commitments that tie up cash with upfront payments, you’re storing up a lot of future risk that could become very expensive if your usage changes in that time.

In financial terms we consider commitments as an accrued liability. In buying an AWS Reserved Instance or Savings Plan you are agreeing to an expense (whether paid for now or in the future), the value of which will be realized gradually over its term provided you have cloud usage eligible for that commitment’s benefits. In the worst case, where commitments cannot be sold or changed or you stop using the cloud entirely, you will still incur these costs without any return on that investment. It is therefore very important to monitor these liabilities, and particularly the freedom you have to remove them from your balance sheet if needed.

We track this liability and express it first as a total “$” amount. We also provide a breakdown of how this liability changes over the remaining term of your active commitments by commitment type. For example, you may accept a higher liability for a commitment type you could sell on the AWS Reserved Instance Marketplace should you no longer need it. You may also be more comfortable with a high liability held for a short term, as it is easier to be confident about cloud usage over shorter timescales than longer ones.
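
A breakdown like this can be sketched in a few lines. The example below is illustrative only: the `Commitment` type is hypothetical, and it assumes each commitment can be reduced to a flat monthly cost (upfront fees pro-rated plus any recurring charge) over a whole number of remaining months.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Commitment:
    kind: str              # e.g. "Savings Plan" or "Standard RI"
    monthly_cost: float    # assumed flat: pro-rated upfront plus recurring
    months_remaining: int


def total_liability(commitments: list[Commitment]) -> float:
    """Total outstanding commitment liability in dollars."""
    return sum(c.monthly_cost * c.months_remaining for c in commitments)


def liability_profile(commitments: list[Commitment],
                      horizon_months: int) -> list[dict[str, float]]:
    """Liability still outstanding at the start of each future month, by kind."""
    profile = []
    for month in range(horizon_months + 1):
        by_kind: dict[str, float] = defaultdict(float)
        for c in commitments:
            months_left = max(c.months_remaining - month, 0)
            by_kind[c.kind] += c.monthly_cost * months_left
        profile.append(dict(by_kind))
    return profile


portfolio = [
    Commitment("Savings Plan", monthly_cost=1_000.0, months_remaining=12),
    Commitment("Standard RI", monthly_cost=500.0, months_remaining=30),
]
print(total_liability(portfolio))
print(liability_profile(portfolio, horizon_months=12)[12])
```

Splitting the profile by commitment type makes it easier to see how much of the liability sits in commitments you could exit early and how quickly the rest runs off.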

Usage vs Rate, Influence and Accountability

Notice how our measures for rate optimization include no direct link to cloud usage levels. This is deliberate and we do it for a number of reasons.

Reporting for reporting’s sake has no benefit to anyone. The purpose of KPIs should be to make it clear when action is required and to ensure accountability for that action. To be effective it is important that those with accountability also have direct, independent influence over their KPIs.

Returning to our simple equation of cost = usage * rate:

  • Usage optimization includes finding efficiencies by deleting idle resources, scheduling resources to run only when needed, rightsizing to meet required performance, modernizing to use the latest technologies or harnessing cloud-native architectures. All of these require technical engineering skills.
  • Rate optimization, on the other hand, requires a different set of skills and expertise, and often an interest in matching usage trends with an in-depth understanding of the AWS pricing model.

It’s very rare that engineers will be excited about pricing or that those in finance will understand, or be interested in, the technical detail. In the dynamic world of cloud a wide range of skills is needed to maximize value. It’s important to put the right skills to use in the right areas, and KPIs should be designed to recognize and support this.

Particularly in larger organizations, technical ownership of cloud usage is often split up into groups such as product, service, project or department. Those groups are best placed to understand how to optimize their own use of cloud, so usage optimization should be distributed across your organization to reflect these groups.

However, aggregating an entire organization’s usage increases access to volume-based discounts, supports organization-wide pricing agreements and shares cloud provider credits. Most importantly though, it creates more opportunities to access commitment-based rewards whilst mitigating their risks, without the need for technical changes to usage. As a result, rate optimization should be centralized.

A common mistake we see is the creation of KPIs that mix usage and rate. How can you be held accountable for something if it is affected by actions over which you have no control? Removing this gray area is a key part of effective measurement and continuous improvement.
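
One way to keep the two apart, using the cost = usage * rate equation, is to split any change in cost into a part caused by usage and a part caused by rate. The sketch below shows one common convention for doing that (usage change valued at the old rate, rate change valued at the new usage). It is an illustration under that assumption rather than a description of our own KPIs, and the function name is hypothetical.

```python
def split_cost_change(usage_before: float, rate_before: float,
                      usage_after: float, rate_after: float) -> tuple[float, float]:
    """Split a change in cost (usage * rate) into usage and rate effects.

    Convention assumed here: the usage effect is priced at the old rate and
    the rate effect is applied to the new usage, so the two sum exactly to
    the total change in cost.
    """
    usage_effect = (usage_after - usage_before) * rate_before
    rate_effect = usage_after * (rate_after - rate_before)
    return usage_effect, rate_effect


# Usage grew from 1,000 to 1,200 hours while the effective rate
# fell from $0.10 to $0.08 per hour.
usage_effect, rate_effect = split_cost_change(1_000, 0.10, 1_200, 0.08)
print(f"Usage effect: {usage_effect:+.2f}, rate effect: {rate_effect:+.2f}")
```

In this example the bill falls by $4 overall, but that hides a $20 increase driven by usage and a $24 reduction driven by rate. Reported separately, each figure lands with the team that can actually influence it.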

The importance of this “distributed usage; centralized rate” differentiation is described further through the FinOps Principles laid out by the FinOps Foundation (the leading community of Cloud Financial Management practitioners and thought leaders).

Usage and Sustainability

No one moved to the cloud to become an expert in managing it, so simplify where you can!

When you create usage efficiencies you reduce costs, but you’ll also make significant strides in minimizing energy consumption, resource wastage and carbon emissions. If you make the usage/rate distinction in your KPIs, your usage KPIs can double up to measure progress against sustainability goals.

Embracing sustainable cloud practices contributes to a greener environment and demonstrates your commitment to environmental responsibility. Not only is this the right thing to do, but there is also an increasing tendency across different industry sectors to expect compliance with standards related to sustainability. 

For more information on this overlap see the AWS Well-Architected Framework, which includes a pillar dedicated to sustainable design and usage principles.

Conclusion

Cloud cost optimization involves a wide range of activities and expertise. Putting the right tools in the hands of the right people in the right areas is vital to taking a holistic, effective approach that gets the most value from your cloud budget.

The right approach today is not necessarily the right approach forever, so you must be sure that you’re getting a good return on your investment in your cost optimization activities and cloud cost management tools. This is why understandable, up-to-date, easily accessible and actionable KPIs are vital.

We hope you found this article useful and would welcome the opportunity to show how we surface these KPIs within our “Automate” rate optimization service.