Pushing Further

While our initial experimentation proved that zstandard streaming outperformed zlib streaming, the remaining question we had was: “How far can we push this?” Our initial experiments used the default settings for zstandard, and we wanted to know how high we could push our compression ratio by playing around with the compression settings.

So how far did we get?

Tuning

Zstandard is highly configurable and enables us to tweak various compression parameters. We focused our efforts on three parameters that we thought would have the biggest impact on compression: chainlog, hashlog, and windowlog. These parameters offer trade-offs between compression speed, memory usage, and compression ratio. For example, increasing the value of the chainlog generally improves the compression ratio, but at the cost of increasing memory usage and compression time.

We also wanted to ensure that with the settings we decided on, the compression contexts would still fit in memory on our hosts. While it’s simple to add more hosts to soak up the extra memory usage, extra hosts cost money and, at some point, provide diminishing returns.

We settled on an overall compression level of 6, a chainlog and hashlog of 16, and a windowlog of 18. These numbers are slightly above the default settings and would comfortably fit in the memory of a gateway node.

Zstandard Dictionaries

Additionally, we wanted to investigate if we could take advantage of zstandard’s dictionary support to compress data even further. By pre-seeding zstandard with some information, it can more efficiently compress the first few kilobytes of data. However, doing this adds complexity, as both the compressor (in this case, a gateway node) and the decompressor (a Discord client) need to have the same copy of the dictionary to communicate with each other successfully.

To generate a dictionary to use, we needed data… and a lot of it.
Zstandard has a built-in way to generate dictionaries (zstd --train) from a sample of data, so we just had to collect a whooole buncha samples. Notably, the gateway supports two encoding methods for payloads, JSON and ETF, and a JSON dictionary wouldn’t perform as well on ETF (and vice versa), so we had to generate two dictionaries: one for each encoding method.

Since dictionaries contain portions of the training data and we’d have to ship the dictionaries to our clients, we needed to ensure that the samples we would generate the dictionaries from were free of any personally-identifiable user data. We collected data involving 120,000 messages, split them by ETF and JSON encoding, anonymized them, and then generated our dictionaries.

Once our dictionaries were built, we could use our gathered data to quickly evaluate and iterate on their efficacy without needing to deploy our gateway cluster.

The first payload we tried compressing was “READY.” As one of the first (and largest) payloads sent to the user, READY contains most of the information about the connecting user, such as guild membership, settings, and read states (which channels should be marked as read/unread). We compressed a single READY payload of 2,517,725 bytes down to 306,745 bytes using the default zstandard settings, which established a baseline. Utilizing the dictionary we just trained, the same payload was compressed down to 306,098 bytes, a gain of only 647 bytes.

Initially, these results seemed discouraging, but we next tried compressing a smaller payload, called TYPING_START, which is sent to the client so it can show the “XXX is typing…” notification. In this situation, a 636-byte payload compresses down to 466 bytes without the dictionary and 187 bytes with the dictionary. We saw much better results with our dictionaries against smaller payloads simply due to how zstandard operates.
Most compression algorithms “learn” from data that has already been compressed, but with small payloads, there isn’t any data for them to learn from. By preemptively informing zstandard what the payload is going to look like, it can make a more informed decision on how to compress the first few kilobytes of data before its buffers have been fully populated.

Satisfied with these findings, we deployed dictionary support to our gateway cluster and started experimenting with it. Utilizing the dark launch framework, we compared zstandard to zstandard with dictionaries.

Our production testing yielded the following results:

Ready Payload Size

We specifically looked at the READY payload size as it’s one of the first messages sent over the websocket and would be most likely to benefit from a dictionary. As shown in the table above, the compression gains were minimal for READY, so we looked at the results for more dispatch types, hoping dictionaries would give more of an edge for smaller payloads. Unfortunately, the results were a bit mixed. For example, looking at the message create payload size that we’ve been comparing throughout this post, we can see that the dictionary actually made things worse.

Ultimately, we decided not to continue with our dictionary experiments. The slightly improved compression that dictionaries would provide was outweighed by the additional complexity they would add to our gateway service and clients. Data is a big driver of engineering at Discord, and the data speaks for itself: it wasn’t worth investing more effort into.

Buffer Upgrading

Finally, we explored increasing zstandard buffers during off-peak hours. Discord’s traffic follows a diurnal pattern, and the memory we need to handle peak demand is significantly more than what’s needed during the rest of the day. On the surface, autoscaling our gateway cluster would prevent us from wasting compute resources during off-peak hours.
However, because gateway connections are long-lived, traditional autoscaling methods don’t work well for our workload. As such, we have a lot of extra memory and compute during off-peak hours. Having all this extra compute lying around raised the question: Could we take advantage of these resources to offer greater compression?

To figure this out, we built a feedback loop into the gateway cluster. This loop would run on each gateway node and monitor the memory usage by the clients connected to it. It would then determine a percentage of new connecting clients that should have their zstandard buffer upgraded. An upgraded buffer increases the windowlog, hashlog, and chainlog values by one, and since these parameters are expressed as a power of two, increasing these values by one will roughly double the amount of memory the buffer uses.

After deploying and letting the feedback loop run for a bit, the results weren’t as good as we had initially hoped. As illustrated by the graph below, over a 24-hour period, our gateway nodes had a relatively low upgrade ratio (up to 30%), significantly less than the roughly 70% we had anticipated.

After doing a bit of digging, we discovered that one of the primary issues causing the feedback loop to behave sub-optimally was memory fragmentation: the feedback loop looked at real system memory usage, but BEAM was allocating significantly more memory from the system than was needed to handle the connected clients. This caused the feedback loop to think that it had less memory to work with than was actually available.

To try and mitigate this, we did a little experimentation to tweak the BEAM allocator settings, more specifically the driver_alloc allocator, which is responsible for (shockingly) driver data allocations. The bulk of the memory used by a gateway process is the zstandard streaming context, which is implemented in C using a NIF. Memory used by NIFs is allocated by driver_alloc.
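
Since the streaming context dominates a gateway process's memory, a back-of-the-envelope estimate shows why bumping each log parameter by one roughly doubles it. The sketch below assumes the table sizes documented in zstd.h (a window of 2^windowlog bytes, and hash and chain tables of 2^(hashlog+2) and 2^(chainlog+2) bytes) and ignores every other buffer in a real context, so treat the totals as a rough lower bound rather than measured usage. The function name is hypothetical.

```python
def estimate_context_bytes(windowlog: int, hashlog: int, chainlog: int) -> int:
    """Rough lower bound on per-connection compression memory (an approximation)."""
    window = 1 << windowlog            # history window: 2^windowlog bytes
    hash_table = 1 << (hashlog + 2)    # 2^hashlog entries, 4 bytes each
    chain_table = 1 << (chainlog + 2)  # 2^chainlog entries, 4 bytes each
    return window + hash_table + chain_table


# Settings chosen in the Tuning section: windowlog 18, hashlog 16, chainlog 16.
base = estimate_context_bytes(windowlog=18, hashlog=16, chainlog=16)

# An "upgraded" buffer raises each parameter by one.
upgraded = estimate_context_bytes(windowlog=19, hashlog=17, chainlog=17)

print(f"base:     {base / 1024:.0f} KiB")
print(f"upgraded: {upgraded / 1024:.0f} KiB")
print(f"ratio:    {upgraded / base:.1f}x")
```

Because every term is a power of two, raising all three exponents by one doubles each term and therefore the total, which is what makes upgrades expensive enough to ration with a feedback loop.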
Our hypothesis was that if we could tweak the driver_alloc allocator to more effectively allocate or free memory for our zstandard contexts, we’d be able to decrease fragmentation and increase the upgrade ratio overall.

However, after messing around with the allocator settings for a little bit, we decided to revert the feedback loop. While we probably would have eventually found the right allocator settings to dial in, the amount of effort needed to tweak the allocators, combined with the overall additional complexity that this introduced into the gateway cluster, outweighed any gains that we would’ve seen if this was successful.

Implementation and Rollout

While the original plan was to only consider zstandard for mobile users, the bandwidth improvements were significant enough for us to ship to desktop users as well! Since zstandard ships as a C library, it was simply a matter of finding bindings in the target language (Java for Android, Objective-C for iOS, and Rust for Desktop) and hooking them into each client. Implementation was straightforward for Java (zstd-jni) and Desktop (zstd-safe), as bindings already existed; for iOS, however, we had to write our own bindings.

This was a risky change with the potential to render Discord completely unusable if things were to go wrong, so the rollout was gated behind an experiment. This experiment served three purposes: allow the quick rollback of these changes if things were to go wrong, validate the results we saw in the “lab,” and enable us to gauge if this change was negatively affecting any baseline metrics.

Over the course of a few months, we were able to successfully roll out zstandard to all of our users on all platforms.