How We Kept Datadog From Blowing Up Our AWS Bill
Some time back, our engineering leaders noticed a dramatic increase in our AWS bill that far outpaced expectations. The culprit turned out to be an observability tool sending an unexpectedly large volume of data. Finding it required a deep dive into how AWS billing works and a methodical process of elimination to track down the firehose of data. Courier’s Developer Advocate, Aydrian Howard, sat down with CTO Seth Carney to discuss the experience and how the team ultimately brought its AWS costs under control.
All right. Hello, Seth.
Hey, Aydrian. How's it going?
How AWS credits from the YC program made life easier
Good. So Courier is coming up on two years old. We're a YC company, which means we have some benefits that come along with that. Can we start by talking about the AWS benefits?
Yeah, absolutely. So one of the really fantastic benefits of going through the YC program is that there is a fairly sizable chunk of credits that you can apply to your AWS account. With such a great benefit, we've been able to focus on investing the capital we would have spent there on other areas of Courier.
Nice. So, this is really a benefit that not a lot of people get. How does this affect the way that we built Courier? If we did not have these credits, what would we have done differently from your perspective?
Interesting question. Courier has a floor as far as what we minimally have to pay to keep our infrastructure running. This includes provisioned concurrency and minimal costs for our Kinesis streams. We didn’t have to spend a lot of time thinking through how we would optimize our per-message costs. As an early-stage startup, we have to make strategic decisions about where to invest. Instead of spending thousands of dollars monthly on hosting costs, we were able to focus on growing the team. So I think the biggest thing it has enabled us to do is really focus on building a product that our customers can find value in.
What are our real costs?
So like I said, we're coming up on two years, and the credits aren't going to last forever. I know that we have started looking into our costs, but what really kicked off this analysis of what we were paying AWS?
Yeah. It was an interesting question that our CEO [Troy] posed to me: what was our per-message cost? This is a slightly more difficult question to answer than it seems on the surface. To answer it, you have to understand what our floor cost is and what kind of volume that floor supports. And we didn't have that information. We couldn't look at our various sources and say: this is what it costs us to send each and every message that comes out of our platform.
When you're on credits, this is a little harder to figure out than you would think. You start to go back and look at historical costs and all you see is zero, because you didn't pay anything.
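The per-message question breaks down into a fixed floor plus variable spend, divided by volume. A back-of-the-envelope sketch (all numbers here are invented for illustration, not Courier's actuals):

```python
def per_message_cost(floor_monthly: float, variable_monthly: float,
                     messages: int) -> float:
    """Blended cost per message: fixed floor (provisioned concurrency,
    Kinesis shards, etc.) plus variable spend, divided by volume."""
    if messages <= 0:
        raise ValueError("messages must be positive")
    return (floor_monthly + variable_monthly) / messages

# Hypothetical: $2,000/mo floor, $3,000/mo variable, 100M messages/mo
cost = per_message_cost(2_000.0, 3_000.0, 100_000_000)
print(f"${cost * 1000:.3f} per 1,000 messages")  # $0.050 per 1,000 messages
```

The tricky part, as Seth notes, is that on credits the billing history reads zero, so you have to reconstruct the floor and variable components from usage data rather than invoices.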
Exactly. So in looking into this, I guess, how did you go about figuring out what our cost was? And what surprised you about that?
We cast a broad net at first, just trying to see where we might be able to get some cost information. We obviously started with the AWS billing utilities for cost data. You could see costs in real time by month, but some of the more historical data was the challenge. We had a product called Datadog in place that did some cost estimations for us, but they weren’t particularly accurate given how some of our configuration is set up.
For example, we have a pretty high provisioned concurrency rate set up on our authorizers, since that's a choke point in our application: all of our public-facing Lambda functions use a custom authorizer. We wanted to make sure we had enough capacity to handle normal traffic, in addition to having auto-scaling in place to handle spikes in usage. Datadog's cost estimation doesn't really take things like that into account. We ended up connecting with a company called CloudZero that thought they could help us gain some of this insight. We integrated their product and started using it to analyze cost breakdowns per developer account, per service, per tag, and along a few other dimensions to start understanding where our costs were going. We definitely found some interesting things, and definitely unexpected things as well.
One of the things that stood out immediately was how high our CloudWatch and X-Ray spend was. Our CloudWatch spend should have been around one-fifth to one-sixth of our total monthly Lambda invocation costs; ours was approximately 1.5x our Lambda invocation costs. X-Ray was also wildly out of bounds. It was even higher than our CloudWatch spend, somewhere in the neighborhood of 3x our Lambda costs.
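That kind of sanity check is easy to automate once you know the expected ratios. A minimal sketch (the ratios below are the rules of thumb from this conversation, not official AWS guidance, and the dollar figures are invented):

```python
def check_ratio(spend: float, lambda_spend: float,
                expected_max_ratio: float) -> tuple[float, bool]:
    """Return (ratio, ok): flag a cost category whose spend is out of
    bounds relative to Lambda invocation spend."""
    ratio = spend / lambda_spend
    return ratio, ratio <= expected_max_ratio

lambda_spend = 1_000.0  # hypothetical monthly Lambda invocation spend
for name, spend, max_ratio in [
    ("CloudWatch", 1_500.0, 1 / 5),  # expected ~1/5 to 1/6 of Lambda spend
    ("X-Ray", 3_000.0, 1 / 5),
]:
    ratio, ok = check_ratio(spend, lambda_spend, max_ratio)
    print(f"{name}: {ratio:.1f}x Lambda spend -> "
          f"{'OK' if ok else 'INVESTIGATE'}")
```

Run monthly against Cost Explorer exports, a check like this would have flagged both categories here immediately.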
Identifying the problem
Let's talk about the steps that we took to identify and resolve the problem. So in using CloudZero, you said it gave you some resolutions. Is that how you went about it? Like we got the analysis and some steps to resolve issues?
The product didn’t necessarily give us the solution. It gave us threads to pull on. We had to do the underlying analysis of what was causing the increase in spend. The CloudZero team was actually super helpful as well. Ultimately, we had to dig in and figure out what was causing the costs for these services to be outside of expected bounds: determine which of our Lambda functions were generating large volumes of operational data and why. Did we have logging in place that should be removed or cleaned up? Was our logging framework configured at the right level?
And how did you figure out which functions were generating the most logs or how did you zero in on these different Lambda functions to clean them up? I know we are a serverless company and we probably have lots of Lambda functions that I don't even know about. So are we talking about hundreds of Lambda functions?
That's a good question. I’d say we have around 100 functions. We started by looking at cost breakdowns by cost category to see where the outliers were. We tackled CloudWatch Logs first, analyzing them by function and log group. We found that there were a couple of things happening.
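One way to do that per-log-group analysis is to rank log groups by CloudWatch's `IncomingBytes` metric. A sketch over already-fetched data (the log group names and byte counts are invented; in practice you'd pull these via the CloudWatch metrics API):

```python
# (log group, bytes ingested last month) - e.g. the CloudWatch
# "IncomingBytes" metric summed per log group; numbers are invented.
ingestion = [
    ("/aws/lambda/send", 9_500_000_000),
    ("/aws/lambda/authorizer", 48_000_000_000),
    ("/aws/lambda/webhooks", 1_200_000_000),
]

# Rank by volume to find the outliers worth investigating first.
ranked = sorted(ingestion, key=lambda x: x[1], reverse=True)
for group, ingested in ranked:
    print(f"{group}: {ingested / 1e9:.1f} GB")
```

The heaviest groups at the top of the list are where cleanup pays off fastest.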
First, we had some cleanup to do. There were a bunch of debug messages being output in production. In development and staging, you don’t think much of it. But when you have a function being invoked millions of times, it adds up. In our case, we have functions that are executed more than a hundred million times per month.
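To put that in dollars: CloudWatch Logs charges per GB ingested (roughly $0.50/GB in us-east-1 at the time of writing; check current pricing for your region), so stray log lines scale linearly with invocations. A rough sketch with made-up numbers:

```python
def monthly_log_cost(bytes_per_invocation: int, invocations: int,
                     price_per_gb: float = 0.50) -> float:
    """CloudWatch Logs ingestion cost. price_per_gb assumes the
    us-east-1 ingestion rate; verify against current AWS pricing."""
    gb = bytes_per_invocation * invocations / 1e9
    return gb * price_per_gb

# A stray 200-byte debug line in a function invoked 100M times/month:
print(f"${monthly_log_cost(200, 100_000_000):.2f}/month")  # $10.00/month
```

Ten dollars a month per stray line sounds small, but a handful of verbose debug statements across a hundred functions compounds quickly, and that's before storage costs.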
Cleaning those up had some nominal impact on the log costs, but something was still generating large volumes of log data. It turned out that the Datadog layer we had put in a few months earlier, to get deeper insights into some of our Lambda functions, was injecting the extra data into CloudWatch.
So we started the slow process of turning configuration bits off. Unfortunately, no matter how many bits we turned off, there was still a minimal amount of data being pumped in. But with high request volumes, those extra bits were really adding up. As a test, we removed the layer entirely and saw an instantaneous reduction in both our X-Ray costs and our CloudWatch costs.
You haven't heard me talk much about X-Ray, but there was a parallel thread running under the CloudWatch work: I was constantly looking at X-Ray, where we weren’t making much progress. It turned out the Datadog layer was pumping huge amounts of data into X-Ray even though it was turned off via configuration.
The impact of removing the Datadog layer
So once you removed the Datadog layer and cleaned up, did usage drop back down to a level that you would expect, or was there more to be done?
Things were back down into nominal zones at that point. Absolutely.
Are there insights that we wish we had that we no longer do because we removed this? I know that it's not worth the cost, but are there these insights that it would have been nice to continue to monitor?
I think there are insights that would be nice to have, but not at the expense that they were generating.
We would love more accuracy around things like function timeouts, but we can get that in other ways. AWS has also recently released an alternative to the Datadog layer called CloudWatch Lambda Insights. We've toyed with replacing it with that, though we haven't gone that route yet.
Advice for other companies to reduce AWS costs
So for the companies out there who are starting with the serverless model, such as ours, what types of insights for Lambda functions and for serverless architecture would you recommend that people keep an eye on?
There are the pretty obvious things you want to monitor: errors, invocation time, memory utilization, and timeouts. But you also want to make sure you’re monitoring things like cold starts and how much of your provisioned capacity you are utilizing. In conjunction with autoscaling, this will help you keep latency low for your consumers and scale out when you have unexpected influxes in traffic.
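CloudWatch exposes a `ProvisionedConcurrencyUtilization` metric for exactly this. A sketch of the alarm logic you might build on top of it (the thresholds are illustrative, not recommendations):

```python
def pc_status(utilization: float, low: float = 0.10,
              high: float = 0.80) -> str:
    """Classify provisioned-concurrency utilization (0.0-1.0).
    Sustained high utilization risks spillover into cold starts;
    low utilization means you're paying for idle capacity."""
    if utilization >= high:
        return "scale up / check autoscaling"
    if utilization <= low:
        return "over-provisioned"
    return "healthy"

print(pc_status(0.92))  # scale up / check autoscaling
print(pc_status(0.05))  # over-provisioned
```

In practice you'd wire these thresholds into CloudWatch alarms or Application Auto Scaling target-tracking rather than polling them yourself.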
If your Lambda functions are connected to SQS queues, Kinesis streams, or DynamoDB streams, you’ll also want to monitor iterator ages, just to make sure that you're keeping up with the volume in the stream. A build-up in iterator age might mean you need to adjust your parallelization factor or batch sizes to accommodate increased traffic.
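A simple guard on the `IteratorAge` metric catches this early. A sketch, with an invented threshold of one minute:

```python
def iterator_age_alarm(age_ms: float,
                       threshold_ms: float = 60_000) -> bool:
    """True when the consumer is falling behind: records are sitting
    in the stream longer than the threshold. Remedies include raising
    the event source's parallelization factor or batch size."""
    return age_ms > threshold_ms

print(iterator_age_alarm(5_000))    # False - keeping up
print(iterator_age_alarm(300_000))  # True - falling behind
```

The right threshold depends on your latency budget; for a notification platform, minutes of iterator age means visibly late messages.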
So to summarize, we did have a problem with costs. We seem to have those under control. We are looking to add some of the insights back without adding a lot of the costs. Do you have any final tips or recommendations for engineers and startups who are looking to start using serverless and setting this up from scratch?
Both serverless and AWS have helped us focus on the product we’re trying to build at Courier. But you have to make sure you understand your ecosystem and use cases. Serverless makes it very easy to opt in to new and (sometimes) useful features, but it’s not always easy to understand the cost impact of opting into those services, especially as you start to thread multiple services together at the scale of even tens of thousands of requests.
I suppose if you're on credits, you can be a little careless (for now). Perhaps more incentive to apply to the YC program!
We would love to hear more about your own experiences and issues with sky high AWS costs. How did you solve the problem? Let us know by tweeting at us @trycourier or by joining our Discord server.