
Modeling an RTB (The DevOps Journey - 3)

What are we running?

A Kubernetes cluster is of no use if I don't have something to run on it. So today I will describe the application under test.

The service

Modeling an RTB is an interesting problem for a few reasons. The business in itself is pretty straightforward, but it also requires a fine level of precision. Any rogue parameter could lead to your exchange draining an advertiser's budget with zero impact on their sales, leading to a loss of credibility for the platform and in some cases a juicy lawsuit.

The main constraint of an Ad exchange is the multiplicative nature of the model. A single ad request can fan out to dozens of DSPs.

What is real-time bidding?

Real-time bidding is at its core advertisers fighting for a slice of a user's attention in the hopes of translating that attention into successful sales. Just like good ol' TV ads, but massively more competitive, globalized and automated.

When a user visits a website backed by ads, they generate what's called an impression. The website we'll call the publisher. When the publisher generates an impression, it is sent to the edge of the pipeline: the supply-side platform (SSP).

An SSP's job is to represent the interests of the publisher. That is, to make sure the publisher gets the most money possible out of the ad slot they're providing. The SSP will forward this request to one (or more) exchanges.

An exchange is a term you're probably familiar with, whether you're invested in the stock market or your crypto bro won't stop pestering you about them. An ad exchange is pretty much the same, the difference being that there's only a single product being sold: a user's attention.

An exchange's job is making sure an ad slot is offered to the advertisers willing to pay the most for it; those are the ones most likely to get a sale out of showing that ad. In a fraction of a second this impression will be served to anywhere between 10 and 100 DSPs, and the one that offers the best price wins.

The demand-side platform, or DSP, is the other end of this system. It is the one representing the interests of the advertiser. That is, to make sure the advertiser wins the best quality bids at the lowest price possible[1].

Now, what makes real-time bidding a good problem to model in this exercise is its brutal 100ms end-to-end tail latency requirement. Consider that for every request you receive, you will be generating an average of 50 outbound requests, each of which must come back fast enough to win the bid in time.

I will only be modeling the exchange and DSP parts of the system. Since the exchange is the actual problem to test, and the DSPs are not really worried about going bankrupt in this exercise, I can focus on the most basic job of the exchange: fulfilling an auction in time. It is not enough that our average throughput is high, though. Our tail latencies here cannot go above 100ms. At 2M requests per second, 1% of delayed requests means 20,000 lost bids every second. No DSP would want to waste their egress bandwidth on you with those stats.
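
To make that budget concrete, here's a minimal sketch of what a deadline-bounded fan-out can look like in Go. The function, URLs and numbers here are illustrative, not the actual exchange code:

import (
	"bytes"
	"context"
	"io"
	"net/http"
	"time"
)

// fanOut sends one bid request to every DSP and returns whatever
// responses made it back before the deadline. Stragglers are dropped:
// a late bid is a lost bid.
func fanOut(parent context.Context, dspURLs []string, body []byte) [][]byte {
	// Leave headroom inside the 100ms end-to-end budget for our own work.
	ctx, cancel := context.WithTimeout(parent, 80*time.Millisecond)
	defer cancel()

	results := make(chan []byte, len(dspURLs))
	for _, u := range dspURLs {
		go func(u string) {
			req, err := http.NewRequestWithContext(ctx, http.MethodPost, u, bytes.NewReader(body))
			if err != nil {
				return
			}
			resp, err := http.DefaultClient.Do(req)
			if err != nil {
				return // timed out or failed: this DSP loses the auction
			}
			defer resp.Body.Close()
			if b, err := io.ReadAll(resp.Body); err == nil {
				results <- b
			}
		}(u)
	}

	var bids [][]byte
	for range dspURLs {
		select {
		case b := <-results:
			bids = append(bids, b)
		case <-ctx.Done():
			return bids
		}
	}
	return bids
}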

Setting performance expectations

The language used throughout this project will be Go. Go is a pretty great language with some of the best tooling out there, but it does have a very pervasive problem for this use case: the garbage collector[2]. While the Go garbage collector is tuned for latency (unlike Java's), it is still a garbage collector.

At hundreds of thousands of requests per second, any amount of allocation on the hot path of an application creates significant work for the GC. Having a GC is nice, but it makes every part of the code that allocates a source of uncertainty, and we don't like Heisenbugs here.

If GC is bad: why are we using Go?

That is a really good question. The answer is... I'm way too dumb to write memory-safe Zig or Odin. I could use Rust, but I don't feel like fighting RAII if nobody is paying me to. Granted, it is easy to get Rust code to be more predictable, but RAII carries a huge cost: it forces you to think in terms of objects when you should always be thinking in sets. Somehow somebody thought that was a good thing to port over from C++ (beats me how). Of course, in Rust you do have the option of doing data-oriented design too[3]. But at that point you've lost all the advantages of the borrow checker, and all you have left is a language that compiles super slowly and has no real answer for the ABA problem either.

But why Go?

Well, Go has all the tooling included, and it's pretty great. The profiler is very complete, and turning your allocation soup into something performant is totally doable without fighting some arcane macro syntax. Plus the language is pretty simple, and dumb people like me need simplicity in their life.

So, while I will not be fighting too many segfaults for now (why does Go still have nullable pointers, though?), I will be spending a good chunk of the time looking at a flame graph and chasing allocations.

Catching the low hanging fruit

There's this ongoing myth in the industry that you should write code that is actively slow, and that once you need to, you can just fix the performance in a few "bottlenecks". The reality of most modern software is that it's a ball of mud, the result of lazy decisions and bad coding advice (yes, I'm talking about Clean Code) from snake oil salesmen. So before reaching out to the grey beards of optimization, let's just look at the basic stuff.

JSON handling

Let's take a look at this code:

func BidHandler(
	logger *slog.Logger,
	bidService *bid.BidService,
	dspService *dsp.DspService,
) http.HandlerFunc {
	// exchangeBalance is package-level state (not shown); updates are
	// funneled through a channel to keep them off the request path.
	transactions := make(chan float64, 10)
	go func() {
		for v := range transactions {
			exchangeBalance += v
		}
	}()

	return func(w http.ResponseWriter, r *http.Request) {
		defer r.Body.Close()

		var payload openrtb2.BidRequest
		dec := json.NewDecoder(r.Body)
		err := dec.Decode(&payload)
		if err != nil {
			http.Error(w, "Invalid JSON", http.StatusBadRequest)
			return
		}

		// ... fan out to the DSPs and write the response (omitted) ...
	}
}
Pretty standard web handler: it takes a JSON payload and parses it. If you benchmark this code, however, you will find that it generates dozens of allocations on a medium-sized payload. For these tests I'm using a 3KB payload that conforms to a valid OpenRTB request.
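
For the curious, the kind of benchmark I'm talking about is a standard Go allocation benchmark. A minimal sketch, assuming the prebid openrtb2 bindings and a hypothetical testdata/bid.json sample:

package exchange_test

import (
	"bytes"
	"encoding/json"
	"os"
	"testing"

	"github.com/prebid/openrtb/v20/openrtb2"
)

func BenchmarkDecodeBidRequest(b *testing.B) {
	raw, err := os.ReadFile("testdata/bid.json") // hypothetical 3KB OpenRTB sample
	if err != nil {
		b.Fatal(err)
	}
	b.ReportAllocs() // prints allocs/op alongside ns/op
	for i := 0; i < b.N; i++ {
		var payload openrtb2.BidRequest
		if err := json.NewDecoder(bytes.NewReader(raw)).Decode(&payload); err != nil {
			b.Fatal(err)
		}
	}
}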

This is easily addressable by reusing memory between requests. The standard library's json.Decoder can't be reset and reused, so the trick is to pool the buffer you read the body into. For this, all you need is a sync.Pool.

var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

buf := bufPool.Get().(*bytes.Buffer)
buf.Reset()
defer bufPool.Put(buf)

// Read the body into the pooled buffer, then decode from its bytes.
if _, err := buf.ReadFrom(r.Body); err != nil {
	http.Error(w, "Bad request", http.StatusBadRequest)
	return
}
err := json.Unmarshal(buf.Bytes(), &payload)

When writing a response you can play the same game: pool the byte buffer you encode into before writing it out.
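
A minimal sketch of that response path (the writeJSON helper is mine and hypothetical; the pool holds buffers, since encoders can't be reset either):

import (
	"bytes"
	"encoding/json"
	"net/http"
	"sync"
)

var respBufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// writeJSON encodes v into a pooled buffer and writes it out in one shot.
func writeJSON(w http.ResponseWriter, v any) error {
	buf := respBufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer respBufPool.Put(buf)

	if err := json.NewEncoder(buf).Encode(v); err != nil {
		return err
	}
	w.Header().Set("Content-Type", "application/json")
	_, err := w.Write(buf.Bytes())
	return err
}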

Now, once you've taken these measures, your handler will start looking ugly very fast. Does it have to be this way? Of course not. If you can get a speedup this trivially, you can bet somebody else has already made it a library.

Sonic markets itself as

A blazingly fast JSON serializing & deserializing library

Does it deliver on its promise? It does. After benchmarking the original JSON package, my pooled implementation, and a naive sonic implementation that literally just replaced the native JSON package's encoder and decoder, the performance difference was quickly apparent.

dec := sonic.ConfigDefault.NewDecoder(r.Body)
err := dec.Decode(&payload)

Both my pooled implementation and sonic's did 56 allocations with my test payload. The native JSON decoder, on the other hand, did 86 allocations per operation. On the speed distribution, the pooled implementation was twice as fast as the naive one, but twice as slow as the sonic implementation.

Given these results, it occurred to me that surely, if I mixed the sonic implementation with my pooled one, I would get the highest possible performance. Sadly, this proved not to be the case. The combined implementation did 54 allocations per operation, but the overall speed of the handler remained virtually the same as the naive sonic implementation. My pooled code was simply too ugly and convoluted to justify this tiny difference, so I will stick to naively applying sonic for JSON handling throughout the rest of the application.

The server

The next possible spot of contention is the web server. The http stack included in Go's standard library is technically not the most performant, so I ran some benchmarks on it, and indeed I was not very impressed. However, after running the profiler against the server, it was immediately clear that most of the time and allocations were being spent simply on JSON parsing, which I had already optimized away. Apparently I was already sitting at the best performance I was going to get out of my server.
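
For anyone following along, the profiler here is just Go's stock pprof; exposing it takes one blank import (the side port is an arbitrary choice):

import (
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on the default mux
)

func main() {
	// Serve profiles on a side port, away from the bid traffic.
	go http.ListenAndServe("localhost:6060", nil)

	// ... start the actual exchange server here ...
	// Then: go tool pprof http://localhost:6060/debug/pprof/allocs
	// or:   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
}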

To double-check that conclusion, I also created another handler using fasthttp. It is supposed to be 6 times faster on average than Go's native http stack, and maybe it is for a basic echo server, but here I got absolutely zero performance improvement out of it, so I will be sticking to Go's native http stack throughout the experiment.
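
For the record, the fasthttp variant was nothing fancy. A minimal sketch of the shape of handler I tried (bid logic elided, it just echoes the payload):

import "github.com/valyala/fasthttp"

func bidHandler(ctx *fasthttp.RequestCtx) {
	body := ctx.PostBody() // request body without the io.Reader ceremony
	// ... decode, run the auction, encode ...
	ctx.SetContentType("application/json")
	ctx.SetBody(body) // echo, matching the baseline test
}

func main() {
	fasthttp.ListenAndServe(":3000", bidHandler)
}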

Since running a benchmarking tool alongside the unit under test is the first sin of load testing, I bought an old Lenovo M715q Gen2. It is roughly 4 times slower than my laptop (at least it was before I botched the last repaste job and had to turn off 4 threads in my i9). The baseline results with wrk against this node on the metal (k3s adds a huge performance tax we will explore later) were the following:

$ wrk -t8 -c100 -d30s --latency -s load-tests/bid.lua http://192.168.1.206:3000/bid

Running 30s test @ http://192.168.1.206:3000/bid
  8 threads and 100 connections

  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.80ms   10.57ms 262.17ms   93.85%
    Req/Sec     3.10k   841.44     6.74k    89.85%

  Latency Distribution
     50%    3.14ms
     75%    6.07ms
     90%   11.76ms
     99%   37.12ms

  739189 requests in 30.10s, 1.98GB read
Requests/sec:  24558.29
Transfer/sec:     67.26MB

This is the base performance of one node simply echoing the payload back after performing the full decode and encode. Once I add the fanout I would expect at least 10x less performance, but all of this pales in comparison to the performance tax that will be introduced by the Kubernetes stack in the next chapter.

Since it didn't even get to saturating half of my bandwidth, we can safely assume this was limited either by the host's CPU power or by its network stack. At this request rate, the interrupts from the NIC alone can steal a lot of CPU time, but since we're not going to be doing any cache-friendly computation here, there's really no good reason to try to dedicate cores to this service.

After settling on this basic performance ceiling, I proceeded to write the DSPs. The DSPs are basically the same thing as the exchange. They take the same bid request, now enriched with some new fields like the ID and telemetry data added by the exchange to help the DSP make an informed decision (in our case, bidding at random), as sketched below.
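
A minimal sketch of a DSP handler of that shape (assuming the prebid openrtb2 bindings and sonic; the fields and prices are illustrative):

import (
	"math/rand"
	"net/http"

	"github.com/bytedance/sonic"
	"github.com/prebid/openrtb/v20/openrtb2"
)

func dspHandler(w http.ResponseWriter, r *http.Request) {
	defer r.Body.Close()

	var req openrtb2.BidRequest
	if err := sonic.ConfigDefault.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, "Invalid JSON", http.StatusBadRequest)
		return
	}
	if len(req.Imp) == 0 {
		http.Error(w, "No impressions", http.StatusBadRequest)
		return
	}

	// Bid a random price on the first impression; a real DSP would run
	// models against the enriched telemetry instead.
	resp := openrtb2.BidResponse{
		ID: req.ID,
		SeatBid: []openrtb2.SeatBid{{
			Bid: []openrtb2.Bid{{
				ID:    req.ID,
				ImpID: req.Imp[0].ID,
				Price: rand.Float64() * 10,
			}},
		}},
	}
	_ = sonic.ConfigDefault.NewEncoder(w).Encode(&resp)
}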

Conclusion

Now we have the basic services that will run in our simulation. The logic in them is extremely thin, keeping only what's strictly needed to pretend to be a legitimate exchange. The only real metric of this system is guaranteeing that it can always answer with a bid in less than 100ms. I will not be putting any extra work into the application beyond performance-oriented changes until the end of the series, when I need to evaluate the cluster as an actual solution. Next week I will be exploring the hurdles of getting a k3s cluster going. Spoiler alert: k3s is not designed as a performance-first solution.

References