News Aggregator from Scratch in 2 Weeks

Data Clustering Contest was a two-week data science and engineering competition held by Telegram in May. The Telegram team wanted participants to build a simple yet effective news aggregator that consolidates thousands of articles from various publishers and websites into a single page, which shows the latest news in real time, exactly like Google NewsBing News or Yandex News (with the Russian language inclining) do.

This was the second part of the competition whenever I totally missed the first part — and maybe for the better. I highly recommend reading the report from my Yandex ex-colleagues on how they placed 3rd place under the pseudonym Mindful Squirrel. Fortunately, now they made the worthy first place. I won the third place in this competition with the prize of 7000 Euros under the nickname Mindful Kitten (to be honest, it was quite nice that they named us, the contestants, this way 🙂 ).

Final Results
My page with the news, might be already dead

My Initial Goal

I decided to try my skills given the fact I’ve been working on the search engine for quite some time. I want to be absolutely clear. I am not a Machine Learning engineer

when you are working in the search engines, you need to know at least how text/language ML works. Overall, you will not find many cool ideas right here about the ML itself but I tried to make the infrastructure around it fast. One of the main criterias for the contest was speed and I will show how I achieved the pretty fast results except server indexing (server response was one of the fastest among contestants) and why the latter was much slower.


There were two parts of the contest — static and dynamic one. In the static one (almost the same as in the first stage of the contest) we needed to detect the language of the articles, separate news from non news, detect the categories and combine the similar news into threads and rank the articles in the threads.

Static version of the contest

In the dynamic one we needed to create a server which accepts the articles, detects their language, if it is news, detect the category, cluster news in threads and respond to the queries of the most relevant news for the given time, language and category.

How server should work

Server was tested on the x86_64 Debian machine with 8 CPUs, 16GB of RAM, HDD. The max requested time can be 30 days. The current server time is supposed to be the maximum article time (not news) in the index: it even means that the time can go backwards if we delete the most recent article from the index.

The final solution should be a binary without the connection to a network, has very few dependencies, no root (for example, this makes it harder to use some “fancy” databases like redis, etc).

Solution for a Static Version

Overall I was extensively relying on a fastText library. It is the library for learning word embeddings and training text classification models created by Facebook’s AI Research in 2016. The whole submission code was written in C++17 and some training was done through Python. But let’s start with something simpler. For example, thread management.

Thread Pool

As in the static version there are so many tasks that can be parallelized, I took a pretty decent implementation of the thread pool and patched it with my lock free queue (thanks to my university where we had to write it) which resulted in a slightly better performance rather than locking the mutex. (5.6k vs 5.9k tasks per second).


All data consisted of HTML files from Telegram Instant View which parses most popular websites (around 5000+ of them) and provides a very lightweight HTML that is easy to read, load and parse.

Typical instant view html

There are several HTML parses in the open source community. Most popular one is libxml2. However, I found this was not an efficient one and it was around 3 times slower than Google’s Gumbo Parser. It is a parser which was tested against Google Search with over 2.5 billion pages which makes me confident that if there are some problems with the text extraction, they are definitely the Telegram’s problems. And in the end I found a couple of interesting surprises.

Like in many other HTMLs I filtered out <script>, <style>, <figure>, <iframe>, <address> tags to get the raw text which surprisingly turned out to be not enough. In the picture you can see what happened after that.

For some reason <iframe> tag was self-closing that resulted in such a mess. By HTML5 standard it is not a self-closing element and all browsers add a closing element to iframe. As I did not want to rewrite in that way because it would require one additional pass through the text, I decided (of course) to patch Gumbo library to meet the standards of the Telegram contest. Luckily, there are many self-closing tags and in the end it was a one line change

After that the text was correctly handled out.

Also, after some perf investigation, I found that around 15-20% of all the parsing was spent in function calls which were not inlined by the compiler. My final perf patch you can see here and I managed to come up with a 20% faster parsing algorithm: on average, each article was parsed in 0.4ms per thread which resulted in ~0.055ms per 8 threads or, alternatively, around 16-18k articles per second. I also believe I was lucky and HDD stored the articles in the directory consecutively where I managed to remove the latency of the HDD seeks (which are typically 8-12ms) or page cache made the job which is a more likely hypothesis.

During the parsing I also extracted the og:url, og:title, og:description and published_time attributes for the future use. It is not an interesting part though worth noting.

Language Detection

I was very interested how Chromium detects the languages of the pages offline and with respect to the fact it is written in C++, I was interested if I could do something similar. Finally I found CLD3 (Compact Language Detector 3) library which is a neural network model. It extracts the ngrams of the words or text and predicts with the 3 layers of the network which language is that.

CLD3 language detection for the word ‘banana’

I failed to find how accurate it was and went to read some papers about language detection and found out a nice paper Language Identification on Massive Datasets of Short Message using an Attention Mechanism CNN. The benchmarks and the accuracy of the models for short messages were provided and it helped me a lot with the decision

Short messages language detection benchmarks

Unfortunately, the paper approach is hidden right now and I decided to look at CLD2 as it is 7.5 times faster than CLD3 and has even better accuracy. Though, I think the migration from CLD2 to CLD3 happened because of the accuracy for the web pages, though my testing of CLD2 against CLD3 showed 0.03% difference and almost all articles were highly debatable whether they are English or not because of the mess written inside.

CLD2 is a pretty abandoned repository with the latest commit 5 years ago and mostly is presented for historical reasons. CLD2 is a Naïve Bayesian classifier, using a combination of several different token algorithms. For most of the languages, sequences of four letters (quadgrams) are scored. For some big languages with lots of symbols the unigrams are taken. It is noted that “the design target is web pages of at least 200 characters (about two sentences); CLD2 is not designed to do well on very short text, lists of proper names, part numbers, etc”.

I decided to proceed with CLD2 in the end and a couple of surprises were on my way. I misread the initial statement and thought that detecting all languages was required. As it was 5 years old, some languages changed their ISO-639-1 code, for example iw -> he, jw -> jv, in -> id and mo -> ro. In order to stop language detection from failing, I was checking with the static table that language is inside the ISO-639-1 known languages. Unfortunately, there were articles with the languages that are not presented in ISO-639-1 and were only in ISO-639-2, for example, cebuano (which has a code ceb). In the end I gave up and was not printing them.

Also, I created a task in Yandex.Toloka to measure the precision and recall of my classifier. Toloka is a Russian crowdsourcing platform, just like Amazon Mechanical Turk. It turned out that they were more than 99.2% for English and Russian and I was pretty confident the results were ok. To fix several other issues I decided to do a cross validation only with the title and description of the article. Finally, the optimal algorithm for the problem was: if CLD2 is not sure (it can provide you the level of sureness) about the English language text, try out the title and description, otherwise use the first language. The metrics showed around 99.35% recall and precision for English and other articles were quite indecisive with some mess inside. That was a huge win. Performance of the library is extraordinary, it is around 0.2-0.4ms per thread per article.

In the end my language detection turned out to be the fastest across 3 days with less than 0.1ms per article. Yet, I don’t claim it is the best, other contestants might have better quality but no one knows how telegram evaluates the submissions.

Lang detection speed

News Filtering and Topic Classification

I decided to combine these two tasks into one and just added an additional label to the article as NotNews. The learning is almost exactly the same to what Mindful Squirrel did in the first stage yet I optimized the code and reestablished the models, i.e. was using the new learned ones.

I again utilized Yandex.Toloka and (a little bit) Amazon Turk to label a fraction of input data with the labels of categorization. I extended the dataset with publicly available data from the first contest, from the and news dataset for the Russian language and the ones from BBC and HuffPost for English (thanks to Kaggle contests).

In the end I pushed around 50k articles to train for, cross validated with other submissions of the first round available on Github and got a pretty good model. Russian quality improved a lot but the English one did not have a super advance (especially in detecting other and not news labels). Overall my schema was looking something like that:

Learning the categorization

With the labeled data I trained the fastText classifier. Fortunately, engineers from Facebook made this library pretty fast with all SIMD accelerations so it looked pretty good for me. Unfortunately, I almost got caught to have a penalty in the contest because it is built with -march=native flag which includes all AVX-512 extensions on my machine. Likely they are not presented in the data-centers which Telegram is using so I decided to be more conservative and only used the extensions up to SSE4.2 and lost possibly at most 2-3% of the performance which is not bad.
News RU speed

Unfortunately, I did not make the first place in terms of speed here but the submissions in top 3 did not have a (reasonably working) server so it gave me a pretty high chance to be one of the best for those who had servers.

News EN speed

In categories detection it was better, second place speed in EN categories (with a pretty decent pre final result):

Categories EN speed

And a tie (but slightly better) in RU categories.

Categories RU speed

Also, worth noting, before applying the model, please, tokenize the text and remove all punctuation marks, etc. I lost several hours understanding what was wrong in some places.

Thread Clustering

In this part the contestants needed to combine similar news into the threads and rank the news inside them accordingly.

This is the part where I am going to talk about word embeddings. It is no secret that most of the text ML produces the embeddings and it is learned the way that if the distance between two embeddings is small, then the articles/news/pages are similar to each other. There are different types of “contextual” embeddings, one of the most popular ones are Word2Vec, DSSM, Sentence-BERT embeddings. I was not satisfied with the Word2Vec approach at all possibly because I failed to tune it properly. Fortunately, I felt pretty safe because I knew that average, min, max over word vectors of the fastText works pretty nice, in the end I retrained the model with the triplet loss function using a Siamese neural network from the Mindful Squirrel’s solution. Here I don’t understand the full magic behind the scenes, yet I only wanted to have at least some quality with the follow-up autotuning.

Mindful Squirrel’s embeddings approach

This model was pretty simple in the end and was using dot products which are highly optimized in the Eigen library.

In order to combine news in threads, I used SLINK algorithm: assume all nodes are 1-norm vectors and the edges between nodes are the L2 distances between them, then if we have a threshold \alpha, in O(n^2) we can find the graph components with the property that graph components are disconnected iff the distance between their nodes in the bipartite graph is more than \alpha.

SLINK clustering

Unfortunately, it may lead to overpopulation in threading, for example, if the first node connected to the second one, and the second to the third one, it may happen that the first and the third are quite different. I firstly thought this not a big problem but in the end all COVID, accident news, earnings call reports had been combined together:

378 artices of the COVID in one thread for the past day

In the picture above news articles are connected by the topic COVID-19 but in the end it becomes a pretty huge cluster which is not easy to deal with. Though it is not accurate in some sense, I think this helped me to avoid many clusters with COVID news that are pretty irrelevant to each other and have 1 big thread talking about COVID overall.

Also, the time complexity was pretty huge for dealing with the hundreds of thousand of news and I decided to apply the algorithm with a time sliding window of 15000 docs and 3000 docs in the overlap. I assumed that if the news is not popular, it is very unlikely they are repeated after a long time.

Sliding window with the SLINK algorithm

Within the thread, the news articles are ranked by their PageRank weight and the time between the most recent article in the thread. Actually, there are pretty few really good news agencies so it was even a matter of human intervention to boost some very authoritative newspapers. Also, one of the weights was the mean similarity with other news which was pretty easy to recompute in real time but we will talk a bit later about it.

In the end, threads clustering turned out to be one of the fastest among contestants

Threads EN speed
Threads RU speed, <25 quality filtered out

Dynamic Part

As you may already forget, we also needed to write a server that indexes the news and responses with the best threads within the given range. Just a reminder what we should do

How server should work

It means we need to index all upcoming news in real time. The rules of the Telegram contest were vague in terms of what consistency we should provide, for example it was absolutely not clear if we can produce news after some time, meaning if we can update the top news once in a while like all aggregators do. Unfortunately, some of the solutions were tested against this “common sense”, the top news requests were coming right after the requests when all news were indexed.


First thing to note, that updating the article is basically a deletion+insertion request, so this was a pretty easy reduction. Unfortunately, that also increased the complexity — I could not easily parallelize the requests by category or even by language because with the article update, it may change even the language — and I needed to provide strong consistency which resulted in me using a global lock. I could not split the request into two because in the middle there can be another request modifying the article and the result can be skewed or not expected. The modifying and deletion requests were tested by Telegram but from the logs I obtained not a single article changed its language, possibly it was just done if the article was updated.

Current time of the server is the maximum time of the article (not news) so I needed to store metadata for all articles. It also means that the time can go backwards with the articles deletion so it also increased the complexity that I cannot easily update the current time in one atomic operation. Was this tested by Telegram? I doubt.

Indexing news consisted of one more parameter — TTL (Time To Live), it was stated that if TTL + article published time passed the current time, the article should be deleted. With the consequence that the time can go backwards, I needed to store all articles either way.

How to Store?

First I thought of using something external like ExtremeDB and/or SQLite3 but their locking and parallelisation mechanisms turned out to be really slow for my needs. Also, their C++ APIs are not flexible to use and require some time for careful investigation.

To store documents I used protobufs as I love this format a lot for its simplicity and I worked in the companies where protobuf is heavily used. I honestly believe this format is hundreds times better than JSON and is one of the best statically typed formats to use.

Google engineering motto
syntax = "proto3";
package tgnews_index;
message Doc {
bytes title = 1;
bytes url = 2;
string lang = 3;
string category = 4;
repeated float embed = 5;
int64 timestamp = 6;
int64 age = 7;
message Docs {
map<string, Doc> docs = 1; // names to docs
view raw tg_contest.proto hosted with ❤ by GitHub
My easy protobuf schema

Also, one more requirement to the contestants was that if the server restarts, it should load the index and respond like usual. That’s why it was important for me to store embeddings rather than text: to combine news into threads I only needed them. Also Telegram did not require store text of the article in any form so I saved lots of CPU and disk resources. Upon the restart, I used the static schema solution already knowing the category and embeds of the news.

Actually I thought I could proceed with something really simple. I did not store all in one file, all docs were sharded. When the doc comes and I need to store it, I choose 2 random shards to store and pick the one who has the lowest number of docs, store it. If I needed to extract the document, I just needed to do two lookups. Why have I chosen two random shards? That’s because in one random shard strategy the maximum number of the docs in one shard (given that the number of articles is m and the number of shards is $n$) will be around m/n + O(\frac{\sqrt{m \log n}}{n}) and with the power of 2 choices it will be m/n + O(\log \log n) with high probability, so we significantly improving the load balancing between shards. It significantly improves the bound and still allows it to be almost lock free in the shard’s choice. Given the fact there was around 10 QPS in the tests, it is almost optimal.

As a hash key I used the names of the files. As the number of shards was 10000, I found that reasonable for a monthly number of articles (up to 10mln).

Thread Updates

I highly prioritized the read responses as in the news it is ok to wait for 5-10 minutes to propagate the indexing requests but it is very user facing to respond as fast as possible. As you can see, I was top two in response times. Mad Crow, unfortunately, had hardly any working server overall and I believe I got additional points because of that decision.

I decided to store globally for all time all the threads that I have. Partially because I was paranoid of the fact that the time can go backwards, because of strong consistency, etc.

To find a thread for the article I decided to use k-NN approaches (k nearest neighbors) and especially HNSW (Hierarchical Navigable Small World) similarity embedding search algorithm. According to ANN benchmarks it is one of the fastest algorithms. Yet, right after the contest Google published its own version of nearest neighbor search called scANN which in my opinion has the advantage that it utilizes the SIMD instructions better with its design and overall it is really faster, maybe for 20-30%. HNSW was more than enough for me, also I quite nicely understand how it works. Actually…

How does HNSW work?

A naive solution to the k-NN problem is to compute the distances from a given query point to every data point within an index and then select the data points with the smallest k distances. It does not scale to the sizes of hundreds of thousands of vectors. An approximate k-NN algorithm may greatly reduce search latency at the cost of precision.

The HNSW algorithm focuses on reducing the number of comparisons by building a graph data structure on the constituent points of the data set.

With a graph data structure on the data set, approximate nearest neighbors can be found using graph traversal methods. Given a query point, we find its nearest neighbors by starting at a random point in the graph and computing its distance to the query point. From this entry point, we explore the graph, computing the distance to the query of each newly visited data point until the traversal can find no closer data points. To compute fewer distances while still retaining high accuracy, the HNSW algorithm builds on top of previous work on Navigable Small World (NSW) graphs. The NSW algorithm builds a graph with two key properties. The “small world” property is such that the number of edges in the shortest path between any pair of points grows poly-logarithmically with the number of points in the graph. The “navigable” property asserts that the greedy algorithm is likely to stay on this shortest path. Combining these two properties results in a graph structure so the greedy algorithm is likely to find the nearest data point to a query in logarithmic time.

k-NN Graph Figure 1
The dark blue edges represent long-range connections that help ensure the small-world property. Starting at the entry point, at each iteration the greedy algorithm will move to the neighbor closest to the query point. The chosen path from the entry point to the query’s nearest neighbor is highlighted in magenta and, by the “navigable” property, is likely to be the shortest path from the entry point to the query’s nearest neighbor.

HNSW extends the NSW algorithm by building multiple layers of interconnected NSW-like graphs. The top layer is a coarse graph built on a small subset of the data points in the index. Each lower layer incorporates more points in its graph until reaching the bottom layer, which consists of an NSW-like graph on every data point. To find the approximate nearest neighbors to a query, the search process finds the nearest neighbors in the graph at the top layer and uses these points as the entry points to the subsequent layer. This strategy results in a nearest neighbors search algorithm which runs logarithmically with respect to the number of data points in the index.

For full graphs it is working really nice and easily provides the insertion queries. Yet, deletion is a bit more complex because you need to somehow deal with the NSW layers and it might take some time to reconstruct them.

Thread Ranking/Updates

I utilized the online-hnsw library in order to support insertions, deletions and k-NN queries. But in a bit of an interesting way.

As you remember, SLINK clustering is transitive, it just puts two vertices in one cluster if there is a path between them where all edges are of weight \leq \alpha. In order to preserve this property I decided to do the following:

Three clusters are going to be combined in the end

I find k nearest neighbors and when I stop finding by the same threshold as in the static version, I combine all threads in one. It guarantees that eventually we will have the cluster as in the SLINK algorithm (if k-NN search is accurate, of course). Approximately it works almost perfectly and within a small time all relevant articles are combined together.

When combining, I need to do several things, for example, to update the similarity matrix which is basically done incrementally:

Adding new doc and areas which are needed to be recalculated are marked in red and purple

This recalculation method is working very nicely when we merge big clusters with very small ones. Unfortunately, I made a very silly bug there, and in the submission and full matrix was recalculated which resulted in this after 9 days of server uptime.

Sad server after 8 days of work 😦

That basically means that huge clusters (like COVID) were constantly updated and matrix multiplication was taking too long to update the similarity values. I also put the restriction on the cluster size but because of another bug it worked only when one document is merged with the big cluster.

Threads ranking is basically the sum of the pagerank weights of the articles inside plus small constant for all other news agencies. Small threads up to 5 docs are linearly penalized which works very good for small ranges of time (for example, show the news for the past 1 hour). News in clusters are sorted by sigmoid of the most recent time plus mean of the similarity matrix.

Thread Responses

We needed to show the most relevant news depending on the given time range since the latest article (in the tests there were 1 hour, 3 hours and 1 day, in the statement it could be up to 30 days). Also, people can request categorized news like Science news for the last 1 hour in Russian language.

As I was updating the threads in real time, I just was looking up through top T threads and was filtering out all the news that are not in line with the request (not the requested category, language, expired or not in the time-frame). Old news will be filtered out, new news remain and are likely relevant, ordering in the cluster preserves the same, title of thread is the title of the most relevant news (to avoid any copyright or misleading issues in the future 🙂 ), thread relevance is recalculated as it is linear in time. T was set to 50000 as I thought that there could not be more than 1000-2000 news a day (and likely hundreds) and we should response for the maximum of 1 month.

Unfortunately, when the time range is very small, it could happen that there is no very new news in the top T threads. In that case I was looking for the past T news and their associated threads so that I can fulfill the request at least with something. Worked perfectly for the seconds/minutes ranges.

So, the news responses were very fast because the threads are updated in real time and the threads are precomputed for the response. Also, I was not including more than 100 threads in the response because starting somewhere from the 50-60th threads, all threads contain only one article so that it does not make sense to show. Also, it helped to improve the performance a lot without any loss in the quality (cmon, no one looks at more than 10-20 news at once).

How I tested

The solutions were tested on pure Debian, I rented for 3-4 days the Digital Ocean Debian machine in order to better understand the performance of my application.

I wrote a bunch of tests for the static part, unfortunately, the dynamic part was written once and tested with my eyes. Build system was CMake and it was fairly easy to statically include all the projects I needed — gumbo, online-hnsw, abseil, http server, thread pool, protobuf, boost. Of course, I used all kinds of sanitizers to catch any issues which is basically a mandate in the modern C++ world.

Unfortunately, I was not able to benefit the server from multithreading because I could not do it for languages as the article may change its language and the judges may test the skew issues. So on each mutating request I acquired WriteLock, on each response request ReadLock with the higher priority not to block the clients.


Whole application was written in C++17 when learning algorithms were in Python, I spent around 5-6 days on the static part and then 8-9 days on the server part 4 of which I was debugging all the cases that I am correctly updating the pointers to threads, thread combining, etc. Of course, I made a couple of bugs that I described above.

What I liked

Finally some engineering contest and not sports-programming or pure ML competition on Kaggle. I liked that we were given some data to test on. I enjoyed writing C++ code and CMake was very easy to set up (I admit that maybe it is already easy for me as I spent enormous time in the past doing that kind of job). The problem was good and novel enough to work on from scratch.

Prize pool also was quite nice, I was immediately communicated to receive my prize.

What I did not like

Absolutely no transparency on how we can communicate with the judges about the questions that arise. For example, nothing was said about which type of disk would be used or if we could delay the index updates (turned out to be that no or not always).

No transparency in how the evaluation or testing process happens. Likely they hired many volunteers to outsource the categorization and thread quality. Yet, we did not know how with which instructions and why they are silent about this. Incremental process of evaluation would be very fun to watch, for example.

It was unclear if we can work together or not. I did not work with anybody but I would definitely want to have an ML partner to flesh out and look for bugs. Some other teams were of 4 and more people which is a bit unfair(?).

Also it would be nice to have at least some statistics from Telegram on the data, at least the view count that they have in Instant View. With the raw data it was much harder to get what is relevant and farm the data for the training stages.

Unfortunately, I cannot open source the solution for now because I need to get an approval from my employer but as soon as I get it, I will probably open source that crunchy code 🙂 .

The second place in the contest can be placed without the server side which is quite strange, even in the final message the judges said: “Clustering tasks similar to the previous round (sorting by language, news or not, categories and threads) carried a lower weight. Overall speed of the algorithms influenced the final score.”

True Conclusion

I managed to combine lots of ideas into one place and compete with many other contestants from ML sphere quite a little. I am satisfied with my results and looking forward to participating in other contests from Telegram. Of course, I am not satisfied with the code quality and what I haven’t managed to try or flesh out.

Also, this type of work gave me a good personal understanding that I show myself best when I have competition and/or tighten up by the deadlines. In the end, I show at least something acceptable.

About the author

My Telegram channel (in Russian)

Optimizing 128-bit Division

When it comes to hashing, sometimes 64 bit is not enough, for example, because of birthday paradox — the hacker can iterate through random 2^{32} entities and it can be proven that with some constant probability they will find a collision, i.e. two different objects will have the same hash. 2^{32} is around 4 billion objects and with the current power capacity in each computer it is certainly achievable. That’s why we need sometimes to advance the bitness of hashing to at least 128 bits. Unfortunately, it comes with a cost because platforms and CPUs do not support 128 bit operations natively.

Division historically is the most complex operation on CPUs and all guidelines suggest avoiding the division at all costs.

At my job I faced an interesting problem of optimizing 128 bit division from abseil library in order to split some data across buckets with the help of 128 bit hashing (the number of buckets is not fixed for some uninteresting historical reasons). I found out that the division takes a really long time. The benchmarks from abseil on Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz show some horrible results

Benchmark                       Time(ns)  CPU(ns)
BM_DivideClass128UniformDivisor     13.8     13.8  // 128 bit by 128 bit
BM_DivideClass128SmallDivisor        168      168  // 128 bit by 64 bit

150 nanoseconds for dividing the random 128 bit number by a random 64 bit number? Sounds crazy. For example, div instruction on x86-64 Skylake takes 76 cycles (also, for AMD processors it is much less), the division takes around 20-22ns.

In reality everything is slightly better because of pipeline execution and division has its own ALU, so if you divide something and do something else in the next instructions, you will get lower average latency. Still, 128 bit division cannot be 8x slower than 64 bit division. All latencies you can find in Agner Fog instruction table for most of the modern x86 CPUs. The truth is more complex and division latency can even depend on the values given.

Agner Fog instruction table for Skylake CPUs, the second but last column is the latency.

Even compilers when dividing by some constants, try to use the reciprocal (or, the same as inverse in a ring) value and multiply the reciprocal and the value with some shifts afterwards

Overall, given the fact that only some sin, cos instructions cost more than division, division is one of the most complex instructions in CPUs and optimizations in that place matter a lot. My exact case was more or less general, maybe I was dividing 128 bit by 64 bit a bit more frequent. We are going to optimize the general case in LLVM.

We need to understand how 128 bit division is working through the compiler stack.

It calls __udivti3 function. Let’s first understand how to read these functions. In runtime libraries the modes of the functions are:

QI: An integer that is as wide as the smallest addressable unit, usually 8 bits.
HI: An integer, twice as wide as a QI mode integer, usually 16 bits.
SI: An integer, four times as wide as a QI mode integer, usually 32 bits.
DI: An integer, eight times as wide as a QI mode integer, usually 64 bits.
SF: A floating point value, as wide as a SI mode integer, usually 32 bits.
DF: A floating point value, as wide as a DI mode integer, usually 64 bits.
TI: An integer, 16 times as wide as a QI mode integer, usually 128 bits.

So, udivti3 is an unsigned division of TI (128 bits) integers, last ‘3′ means that it has 3 arguments including the return value. Also, there is a function __udivmodti4 which computes the divisor and the remainder (division and modulo operation) and it has 4 arguments including the returning value. These functions are a part of runtime libraries which compilers provide by default. For example, in GCC it is libgcc, in LLVM it is compiler-rt, they are linked almost in every program if you have the corresponding toolchain. In LLVM, __udivti3 is a simple alias to __udivmodti4.

#include "int_lib.h"
typedef int si_int;
typedef unsigned su_int;
typedef long long di_int;
typedef unsigned long long du_int;
typedef int ti_int __attribute__((mode(TI))); // 128 signed
typedef unsigned tu_int __attribute__((mode(TI))); // 128 bit unsigned
#ifdef CRT_HAS_128BIT
COMPILER_RT_ABI tu_int __umodti3(tu_int a, tu_int b) {
tu_int r;
__udivmodti4(a, b, &r);
return r;
// Returns: a % b
COMPILER_RT_ABI ti_int __modti3(ti_int a, ti_int b) {
const int bits_in_tword_m1 = (int)(sizeof(ti_int) * CHAR_BIT) – 1;
ti_int s = b >> bits_in_tword_m1; // s = b < 0 ? -1 : 0
b = (b ^ s) – s; // negate if s == -1
s = a >> bits_in_tword_m1; // s = a < 0 ? -1 : 0
a = (a ^ s) – s; // negate if s == -1
tu_int r;
__udivmodti4(a, b, &r);
return ((ti_int)r ^ s) – s; // negate if s == -1
#endif // CRT_HAS_128BIT
view raw modti3.c hosted with ❤ by GitHub
#include "int_lib.h"
typedef int si_int;
typedef unsigned su_int;
typedef long long di_int;
typedef unsigned long long du_int;
typedef int ti_int __attribute__((mode(TI))); // 128 signed
typedef unsigned tu_int __attribute__((mode(TI))); // 128 bit unsigned
#ifdef CRT_HAS_128BIT
// Returns: a / b
COMPILER_RT_ABI tu_int __udivti3(tu_int a, tu_int b) {
return __udivmodti4(a, b, 0);
COMPILER_RT_ABI ti_int __divti3(ti_int a, ti_int b) {
const int bits_in_tword_m1 = (int)(sizeof(ti_int) * CHAR_BIT) – 1;
ti_int s_a = a >> bits_in_tword_m1; // s_a = a < 0 ? -1 : 0
ti_int s_b = b >> bits_in_tword_m1; // s_b = b < 0 ? -1 : 0
a = (a ^ s_a) – s_a; // negate if s_a == -1
b = (b ^ s_b) – s_b; // negate if s_b == -1
s_a ^= s_b; // sign of quotient
return (__udivmodti4(a, b, (tu_int *)0) ^ s_a) – s_a; // negate if s_a == -1
#endif // CRT_HAS_128BIT
view raw udivti3.c hosted with ❤ by GitHub

__udivmodti4 function was written with the help of Translated from Figure 3-40 of The PowerPC Compiler Writer's Guide. After looking at it here, it looks like this was written long time ago and things have changed since then

First of all, let’s come up with something easy, like shift-subtract algorithm that we have been learning since childhood. First, if divisor > dividend, then the quotient is zero and remainder is the dividend, not an interesting case.

// dividend / divisor, remainder is stored in rem.
uint128 __udivmodti4(uint128 dividend, uint128 divisor, uint128* rem) {
if (divisor > dividend) {
if (rem)
*rem = dividend;
return 0;
// Calculate the distance between most significant bits, 128 > shift >= 0.
int shift = Distance(dividend, divisor);
divisor <<= shift;
quotient = 0;
for (; shift >= 0; –shift) {
quotient <<= 1;
if (dividend >= divisor) {
dividend -= divisor;
quotient |= 1;
divisor >>= 1;
if (rem)
*rem = dividend;
return quotient;
view raw shift_subtract.c hosted with ❤ by GitHub

The algorithm is easy, we align the numbers by their most significant bits, if dividend is more than divisor, subtract and add 1 to the output, then shift by 1 and repeat. Some sort of animation can be seen like that:

For 128 bit division it will take at most 128 iterations in the for loop. Actually, the implementation in LLVM for loop is a fallback and we saw it takes 150+ns to complete it because it requires to shift many registers because 128 bit numbers are represented as two registers.

Now, let’s dive into the architecture features. I noticed that while the compiler generates the divq instructions, it frees rdx register

In the manual they say the following

divq instruction provides 128 bit division from [%rdx]:[%rax] by S. The quotient is stored in %rax and the remainder in %rdx. After some experimenting with inline asm in C/C++, I figured out that if the result does not fit in 64 bits, SIGFPE is raised. See:

#include <cinttypes>
uint64_t div(uint64_t u1, uint64_t u0, uint64_t v) {
uint64_t result;
uint64_t remainder;
__asm__("divq %[v]" : "=a"(result), "=d"(remainder) : [v] "r"(v), "a"(u0), "d"(u1));
return result;
int main() {
div(1, 0, 1); // 2**64 / 1
g++ -std=c++17 -O0 main.cpp -o main
“./main” terminated by signal SIGFPE (Floating point exception)
view raw divq.c hosted with ❤ by GitHub

Compilers don’t use this instruction in 128 bit division because they cannot know for sure if the result is going to fit in 64 bits. Yet, if the high 64 bits of the 128 bit number is smaller than the divisor, the result fits into 64 bits and we can use this instruction. As compilers don’t generate divq instruction for their own reasons, we would use inline asm for x86-64.

#if defined(__x86_64__)
inline uint64_t Divide128Div64To64(uint64_t high, uint64_t low,
uint64_t divisor, uint64_t* remainder) {
uint64_t result;
__asm__("divq %[v]"
: "=a"(result), "=d"(*remainder) // Output parametrs, =a for rax, =d for rdx, [v] is an
// alias for divisor, input paramters "a" and "d" for low and high.
: [v] "r"(divisor), "a"(low), "d"(high));
return result;
tu_int __udivmodti4(tu_int dividend, tu_int divisor, tu_int* remainder) {
#if defined(__x86_64__)
if (divisor.high == 0 && dividend.high < divisor) {
remainder->high = 0;
uint64_t quotient =
Divide128Div64To64(dividend.high, dividend.low,
divisor.low, &remainder->low);
return quotient;

What to do if the high is not less than the divisor? The right answer is to use 2 divisions because

So, first we can divide hi by divisor and then {hi_r, lo} by divisor guaranteeing that hi_r is smaller than divisor and thus the result is smaller than 2^{64}. We will get something like

#if defined(__x86_64__)
inline uint64_t Divide128Div64To64(uint64_t high, uint64_t low,
uint64_t divisor, uint64_t* remainder) {
uint64_t result;
__asm__("divq %[v]"
: "=a"(result), "=d"(*remainder) // Ouput parametrs, =a for rax, =d for rdx, [v] is an
// alias for divisor, input paramters "a" and "d" for low and high.
: [v] "r"(divisor), "a"(low), "d"(high));
return result;
tu_int __udivmodti4(tu_int dividend, tu_int divisor, tu_int* remainder) {
#if defined(__x86_64__)
if (divisor.high == 0) {
remainder->high = 0;
if (dividend.high < divisor) {
uint64_t quotient =
Divide128Div64To64(dividend.high, dividend.low,
divisor.low, &remainder->low);
return quotient;
} else {
tu_int quotient;
quotient.high = Divide128Div64To64(0, dividend.high, divisor.low,
quotient.low = Divide128Div64To64(dividend.high, dividend.low,
divisor.low, &remainder->low);
return quotient;

After that the benchmarks improved significantly

Benchmark                       Time(ns)  CPU(ns)
BM_DivideClass128UniformDivisor 11.9      11.9
BM_DivideClass128SmallDivisor   26.6      26.6

Only 26.6ns for small divisors, that’s a clear 6x win.

Then there are multiple choices to do next but we know that both dividend and divisor have at least one bit in their high registers and the shift-subtract algorithm will have at most 64 iterations. Also the quotient is guaranteed to fit in 64 bits, thus we can use only the low register of the resulting quotient and save more shifts in the shift-subtract algorithm. That’s why the uniform divisor slightly improved.

One more optimization to do in shift-subtract algorithm is to remove the branch inside the for loop (read carefully, it should be understandable).

// dividend / divisor, remainder is stored in rem.
uint128 __udivmodti4(uint128 dividend, uint128 divisor, uint128* rem) {
if (divisor > dividend) {
if (rem)
*rem = dividend;
return 0;
// 64 bit divisor implementation
// end
// Calculate the distance between most significant bits, 128 > shift >= 0.
int shift = Distance(dividend, divisor);
divisor <<= shift;
quotient.low = 0;
quotient.high = 0;
for (; shift >= 0; –shift) {
quotient.low <<= 1;
const int128 s = (int128)(divisor.all – dividend.all1) >> 127;
quotient.low |= s & 1;
dividend.all -= divisor.all & s;
divisor >>= 1;
if (rem)
*rem = dividend;
return quotient;

In the end, it gives 0.4ns more for uniform 128 bit divisor.

And finally I believe that’s one of the best algorithm to divide 128 bit by 128 bit numbers. From statistics, the case when the divisor is 64 bit is worth optimizing and we showed that additional checks on the high register of divisor has its own advantages and expansion of the invariants. Now let’s see what other libraries perform in that case.


Libdivide is a small library targeting fast division, for example, if you divide by some fixed number a lot of times, there are techniques that can precalculate reciprocal and then multiply by it. Libdivide provides a very good interface for such optimizations. Even though, it has some optimizations regarding 128 bit division. For example, function libdivide_128_div_128_to_64 computes the division 128 bit number by 128 bit number if the result fits in 64 bits. In the case where both numbers are more or equal to 2^{64} it does the following algorithm that they took from Hackers Delight book:

\begin{cases} n = MSB(\mathrm{divisor}) \geq 1 \\ \mathrm{divisor_1} = \lfloor \mathrm{divisor}/2^{64 - n} \rfloor \\ \mathrm{dividend_1} = \lfloor \mathrm{dividend}/2 \rfloor \end{cases}

With the instruction that produces the 64 bit result when the divisor is 128 bit result we can compute

\mathrm{quotient_1} = \lfloor \mathrm{dividend_1}/\mathrm{divisor_1} \rfloor

Then we compute

\mathrm{quotient_0} = \lfloor \mathrm{quotient_1}/2^{63 - n} \rfloor.

It cannot overflow because \mathrm{quotient_1} < 2^{64} because the maximum value of \mathrm{dividend_1} is 2^{127} - 1 and minimum value of \mathrm{divisor_1} is 2^{63}. Now let’s show that

\lfloor \mathrm{dividend}/\mathrm{divisor} \rfloor \leq \mathrm{quotient_0} \leq  \lfloor \mathrm{dividend}/\mathrm{divisor} \rfloor + 1

\mathrm{quotient_0} = \left\lfloor \frac{\mathrm{dividend}}{2^{64 - n}\mathrm{divisor_1}} \right\rfloor = \left\lfloor \frac{\mathrm{dividend}}{2^{64 - n}\left\lfloor\frac{\mathrm{divisor}}{2^{64 - n}}\right\rfloor} \right\rfloor = \left\lfloor \frac{\mathrm{dividend}}{\mathrm{divisor} - (\mathrm{divisor} \mathrm{\ mod\ } 2^{64 - n})} \right\rfloor = \left\lfloor \frac{\mathrm{dividend}}{\mathrm{divisor}} + \frac{\mathrm{dividend}(\mathrm{divisor} \mathrm{\ mod\ } 2^{64 - n}))}{\mathrm{divisor}(\mathrm{divisor} - (\mathrm{divisor} \mathrm{\ mod\ } 2^{64 - n}))} \right\rfloor = \left\lfloor \frac{\mathrm{dividend}}{\mathrm{divisor}} + \delta \right\rfloor.

Now we want to show that \delta < 1. \delta is the largest when the remainder in the numerator is as large as possible, it can be up to 2^{64 - n} - 1. Because of the definition of n, \mathrm{divisor} \geq 2^{127 - n}. The smallest value of \mathrm{divisor} in the denominator is 2^{127 - n} + 2^{64 - n} - 1. That’s why

\delta \leq \frac{\mathrm{dividend}(2^{64 - n} - 1)}{(2^{127 - n} + 2^{64 - n} - 1)2^{127 - n }} < \frac{\mathrm{dividend}(2^{64 - n} - 1)}{(2^{127 - n })^2}. As n iterates from 0 to 63, we can conclude that \delta < \frac{\mathrm{dividend}}{2^{128}}. So we got either the correct value, either the correct plus one. Everything else in the algorithms is just a correction of which result to choose.

Unfortunately, these corrections increase the latency of the benchmark pretty significant

Benchmark                                          Time(ns)  CPU(ns)
BM_DivideClass128UniformDivisor<LibDivideDivision>    26.3    26.3  
BM_RemainderClass128UniformDivisor<LibDivideDivision> 26.2    26.2
BM_DivideClass128SmallDivisor<LibDivideDivision>      25.8    25.8
BM_RemainderClass128SmallDivisor<LibDivideDivision>   26.3    26.3

So I decided to drop this idea after I’ve tried this.


GMP library is a standard GNU library for long arithmetic. They also have something for 128 bit by 64 bit division and in my benchmark the following code worked

struct GmpDiv {
uint64_t operator()(uint64_t u1, uint64_t u0, uint64_t v, du_int* r) const {
mp_limb_t q[2] = {u0, u1};
mp_limb_t result[2] = {0, 0};
*r = mpn_divrem_1(result, 0, q, 2, v);
return result[0];
view raw gmp_div.h hosted with ❤ by GitHub

It divides the two limbs by a uint64_t and provides the result. Unfortunately, the latency is much higher than expected, also does not work

Benchmark                                          Time(ns)  CPU(ns)
BM_DivideClass128UniformDivisor<GmpDivision>          11.5    11.5
BM_RemainderClass128UniformDivisor<GmpDivision>       10.7    10.7
BM_DivideClass128SmallDivisor<GmpDivision>            47.5    47.5
BM_RemainderClass128SmallDivisor<GmpDivision>         47.8    47.8 


In the end I’ve tried several method of 128 bit division and came up with something that is the fastest among popular alternatives. Here is the full code for benchmarks, though it is quite hard to install, maybe later I will provide an easy version of installation. The final benchmarks are

Benchmark                                           Time(ns)  CPU(ns)
BM_DivideClass128UniformDivisor<absl::uint128>        13.7      13.7
BM_RemainderClass128UniformDivisor<absl::uint128>     14.0      14.0  
BM_DivideClass128SmallDivisor<absl::uint128>          169       169    
BM_RemainderClass128SmallDivisor<absl::uint128>       153       153    
BM_DivideClass128UniformDivisor<LLVMDivision>         12.6      12.6  
BM_RemainderClass128UniformDivisor<LLVMDivision>      12.3      12.3  
BM_DivideClass128SmallDivisor<LLVMDivision>           145        145    
BM_RemainderClass128SmallDivisor<LLVMDivision>        140        140    
**BM_DivideClass128UniformDivisor<MyDivision1>          11.6      11.6**
**BM_RemainderClass128UniformDivisor<MyDivision1>       10.7      10.7**
**BM_DivideClass128SmallDivisor<MyDivision1>            25.5      25.5**
**BM_RemainderClass128SmallDivisor<MyDivision1>         26.2      26.2**
BM_DivideClass128UniformDivisor<MyDivision2>          12.7      12.7  
BM_RemainderClass128UniformDivisor<MyDivision2>       12.8      12.8  
BM_DivideClass128SmallDivisor<MyDivision2>            36.9      36.9  
BM_RemainderClass128SmallDivisor<MyDivision2>         37.0      37.1  
BM_DivideClass128UniformDivisor<GmpDivision>          11.5      11.5  
BM_RemainderClass128UniformDivisor<GmpDivision>       10.7      10.7  
BM_DivideClass128SmallDivisor<GmpDivision>            47.5      47.5  
BM_RemainderClass128SmallDivisor<GmpDivision>         47.8      47.8  
BM_DivideClass128UniformDivisor<LibDivideDivision>    26.3      26.3  
BM_RemainderClass128UniformDivisor<LibDivideDivision> 26.2      26.2  
BM_DivideClass128SmallDivisor<LibDivideDivision>      25.8      25.8  
BM_RemainderClass128SmallDivisor<LibDivideDivision>   26.3      26.3

MyDivision1 is going to be upstreamed in LLVM, MyDivision2 will be the default version for all non x86-64 platforms which also has a solid latency, much better than the previous one.

Future Work

However, benchmarks are biased in the uniform divisor case because the distance between most significant bits in the dividend and divisor falls exponentially and starting from 10-15, the benchmark becomes worse rather than libdivide approach.

I also prepared some recent research paper patch in where the reciprocal is computed beforehand and then only some multiplication happens. Yet, with the cold cache of the 512 byte lookup table it is worse than the already submitted approach. I also tried just division by d_9 in the screenshot below and it showed some inconsistent results which I don’t understand for now.

Also, subscribe to my Twitter 🙂

Why is std::pair broken?

Today we are going to talk about C++ basic std::pair class in a way how and why it is broken.


std::pair first appeared in C++98 as a pretty basic class with a very simple semantics — you have two types T1 and T2, you can write std::pair<T1, T2> and access .first and .second members with all intuitive copy, assignment, comparison operators, etc. This is used in all map containers as value type to store and can be expanded into structured bindings, for example, with the following code:

std::unordered_map<std::string, int> m;
for (const auto& [key, value] : m) {
view raw pair_easy.h hosted with ❤ by GitHub

Pretty convenient, almost all C++ programmers know about this class and use it appropriately but several style guides including Google Style Guide suggest using user defined structs where appropriate. It has several reasons but the first one is readability — I personally want to kill people who write more than one level of std::pair, for example, std::pair<int, std::pair<int, std::pair<int, int>>> and then access it with el.second.second.second. And as the structure of the objects in a maintainable code tends to grow, it is better to use structs rather than pairs and tuples. Other reasons are more interesting.


Default construction

This is a minor and debatable thing but std::pair with trivial and non default zero initialized scalars is zero initialized:

constexpr std::pair<int, int> t;
static_assert(t.first == 0 && t.second == 0);
std::array<int, 5> arr;
// assert(arr[0] == 0); // Undefined behavior
view raw pair_init.h hosted with ❤ by GitHub

In contrast, std::array does not initialize their elements with the default construction (if the underlying type is trivial) and leaves this for the user. It is debatable because some people don’t like the uninitialized memory but in the end this is an inconsistency between the containers and is a different behavior from default struct what people tend to think about std::pair (yet, I still think this should look like this):

template <typename T, typename U>
struct pair {
T first;
U second;
constexpr std::strong_ordering operator<=>(const pair& other) const {
if (auto first_comp = first <=> other.first; first_comp != 0) {
return first_comp;
} else {
return second <=> other.second;
view raw fancy_pair.h hosted with ❤ by GitHub


I found this issue when I wanted to serialize and deserialize std::pair. One of the default implementations in struct/type serialization and deserialization is the following:

template <class T>
T Deserialize(const void* memory) {
T t;
memcpy(&t, memory, sizeof(T));
return t;
template <class T>
void Serialize(const T& t, void* memory) {
memcpy(memory, &t, sizeof(T));
view raw de_serialize.h hosted with ❤ by GitHub

If we go to memcpy documentation and see the following:

If the objects are potentially-overlapping or not TriviallyCopyable, the behavior of memcpy is not specified and may be undefined.

TriviallyCopyable are objects that satisfy the following conditions:

  • Every copy constructor is trivial or deleted. pair(const pair&). So, it should be either = default, either = delete, either omitted/created by the rule of X and all underlying objects are trivially copyable.
  • Every move constructor is trivial or deleted. Same with pair(pair&&).
  • Every copy assignment operator is trivial or deleted. Same with pair& operator=(const pair&).
  • Every move assignment operator is trivial or deleted. Same with pair& operator=(pair&&).
  • At least one copy constructor, move constructor, copy assignment operator, or move assignment operator is non-deleted.
  • Trivial non-deleted destructor. This means that virtual classes cannot be trivially copyable because we need to copy the implementation vptr.

If we write our pair above and check if it is trivially copyable (with some trivial underlying types), it definitely is because we use the rule of zero and thus everything is generated automatically. Also, a small note — in the examples above you need to make sure that the types are trivially copyable, a perfect variant looks like (this is also heavily used by the compilers to generate the unaligned loads and work with unaligned memory):

template <class T>
T Deserialize(const void* memory) noexcept {
T t;
memcpy(std::addressof(t), memory, sizeof(T));
return t;
// std::remove_reference_t for non-deduced context to prevent such code to blow below:
// char first = f(); char second = g();
// Serialize(first – second, to) (int will be deduced)
template <class T>
void Serialize(const std::remove_reference_t<T>& t, void* memory) noexcept {
memcpy(to, std::addressof(t), sizeof(T));

Let’s see what std::pair guarantees. Good news is standard guarantees the copy and move constructors to be defaulted (see 7 and 8).

Also, from C++17 we have the requirement that if the types of the pair are trivially destructible, pair is also trivially destructible. This allows having many optimizations including putting the type onto registers. For example, std::variant and std::optional had a different proposal to make this possible.

With everything else, it is more complicated. Let’s see what is going on with the copy assignment, in libc++ it is implemented like this:

struct __nat {};
pair& operator=(typename conditional<
is_copy_assignable<first_type>::value &&
pair, __nat>::type const& __p)
noexcept(is_nothrow_copy_assignable<first_type>::value &&
first = __p.first;
second = __p.second;
return *this;
view raw pair_copy.h hosted with ❤ by GitHub

__nat is used for disabling the instantiation of the copy assignment operator if it is invoked — a cool C++98 hack to delete the operator.

This helps the compiler to use std::pair with the references. By default, references cannot be implicitly copied or, if said in a more clever way, if you have them in the class, the copy assignment operator is implicitly deleted unless user defined. For example:

template <typename T, typename U>
struct pair {
T first;
U second;
int main() {
int x = 10;
int y = 20;
pair<int&, int> p{x, x};
pair<int&, int> p1{y, y};
p1 = p; // ill-formed, with std::pair, well-formed
view raw pair_assignment.h hosted with ❤ by GitHub

In C++ copying the reference changes the underlying object and thus the referenced is binded to the class. Here is a good explanation why it was designed this way. So, the assignment operator is deleted because copying reference has slightly different semantics but in C++98 people (maybe) thought it was convenient. So, we cannot easily write = default to provide a default assignment operator (despite the fact this is an ABI (application binary interface) break).

All these facts lead to suboptimal performance, for example, in copying the vector of pairs. With the standard pair we have a simple loop, with trivially copyable type, we have fast memmove, the difference might be up to the register size, memmove is highly optimized.

Some attempts to fix this

First of all, all fixes will require an ABI break but as C++ standard will be (I hope) more or less ok with ABI breaks (or not :smile:), this can be done in a way.

Conceptually, we need to make a conditional default copy assignment operator with some restrictions:

  • We cannot write = default in the main class because it disallows the assignment of the references.
  • We cannot add the (possibly defaulted) template arguments to the pair because it will break the forward declarations.
  • We cannot use SFINAE on the return types of the defaulted copy assignment operators.
    • std::enable_if_t<std::is_reference<T1>::value || std::is_reference<T2>::value> operator=(const pair&) = default; is disallowed by the standard.
  • We can move the .first and .second fields to the base classes and have partial specialisations like:
template <class T1, class T2, bool is_reference>
struct MyPairBase;
// Dispatch the reference specialisation and use it as a storage.
template <class T1, class T2>
struct MyPairBase<T1, T2, false> {
T1 a;
T2 b;
MyPairBase& operator=(const MyPairBase& other) = default;
template <class T1, class T2>
struct MyPairBase<T1, T2, true> {
T1 a;
T2 b;
MyPairBase& operator=(const MyPairBase& other) {
a = other.a;
b = other.b;
return *this;
// Possibly we will need to copy many things there.
template <class T1, class T2>
struct MyPair : public MyPairBase<T1, T2, std::is_reference<T1>::value ||
std::is_reference<T2>::value> {
using Base = MyPairBase<T1, T2, std::is_reference<T1>::value ||
// … all the implementation

For example, std::optional does it this way. The downside that we need to copy lots of code and such a simple class like optional turned out to be 1300+ lines of code. Too much for std::pair, isn’t it?

Good solution I found is using C++20 concepts. They allow (at least they don’t disallow) the usage of the conditional default assignment operators.

#include <type_traits>
template<class T1, class T2>
struct MyPair {
T1 first;
T2 second;
static constexpr bool has_references = std::is_reference_v<T1> ||
MyPair(const T1& x, const T2& y) : first(x), second(y) {}
MyPair& operator=(const MyPair&) requires(!has_references) = default;
MyPair& operator=(const MyPair& other) requires(has_references) {
first = other.first;
second = other.second;
return *this;
int main() {
int x = 10;
MyPair<int&, int> a(x, 5);
MyPair<int&, int> b(x, 10);
b = a;
view raw concepts_pair.h hosted with ❤ by GitHub

This fixes the problem, does not blow up the code size and at least is very intuitive to read. In the end I found several proposals trying to suggest it in a more generic way. See one, two and see the blog where they use the feature and reduce the number of lines from 1200 to 400 for std::optional. Yet, I haven’t found that this was accepted but this is a very good idea of improving and simplifying things in C++.

The funny

I knew it already but cannot be silent about it. LLVM and GCC optimizers optimize std::pair<int, int> to use one 64 bit register. It does not do it with std::tuple<int, int>. So, if you have the choice between pair and tuple, always use pair or even user defined structs.


All that was written there also applies to std::tuple, they have the identical behavior in many places. Yet, this is one more argument why C++ old containers and even basic things are broken and ABI stability is doing more harm than good. I will finish it with a tweet that reminds me of a CAP theorem:

I really think that if at some point strong but breaking change does not happen, we will have a C++ fork that will do it.

How to crack HashCode competition with some engineering skills

Hi everyone, hashcode is over. As now I can tell anything, I want just to tell how I was solving this problem, my methodologies, overall the structure of the data sets. I am not going to provide any internals and whatsoever, all data can be checked and much of it is my personal opinion though I had more time to reflect on it rather than contestants that only had 4 hours to solve it. So, I managed to get one of the highest scores (top 2, to be precise with the result 27,189,110) while testing the problem. Let’s talk about it. I provide all my solution data sets and pieces of code (though I don’t provide compilable repository because I am lazy, ok?). Please, get that this is an engineering research rather than a true contest full of combinatoric ideas. I suck at contests and don’t participate that often.

Here is the problem, very well written and explained. If converting it to a math problem, it is exactly this:

There are B books and L libraries. Books somehow assigned to libraries, one book can be in many libraries. Each book has a score S_i, each library has install time T_i and daily throughput M_i (maximum number of books shipped per day). We can only choose one library, wait for its install time, then choose another and so on but all libraries push their books at the same time and when they are ready. We need to maximize the score of all books given D days. Look at the example if you did not get the statement.


So, this is clearly an NP-hard problem. For example, the simplest reduction is a knapsack problem — assume that all daily throughputs are infinite and libraries do not overlap in their sets of books, then you need to maximize the sum of the libraries with a given budget D. This is a very simple reduction with 2 constraints relaxed. So if we can solve this problem in polynomial time, we can solve any NP problem. The problem itself is very and very hard. There are multiple ways to optimize the score, you can choose these freedoms to gain better score:

  • In the optimal solution, books in all libraries must be sorted descending with respect to their scores. That’s easy to get and polynomial.
  • The set of libraries matters because we are given the budget D. This is exponential of the number of the libraries
  • The choice of the books in the chosen libraries matters. For the books that are inside multiple libraries, it is sometimes crucial which library (or maybe not a single one) to choose to ship this book.
  • The order of the libraries significantly matters. If you decide to push some library first with the install time T_i, all other libraries have at most D - T_i days to ship books.

Some good observations that always help for such kind of problems

  • They are easily simulated — given the answer, you can calculate the exact score — that’s the first you should do in such research problems. It helps you to check thousands of different hypotheses.
  • Freedoms are good to have some basic solutions. For example, I did not know how to optimize the books choices, I used the following technique: if there are multiple libraries for a book, choose one who has the biggest capacity for now. Overall, does not matter much which strategy, just try something reasonable and try different “reasonable” things to get fast and pretty sustainable results. This helped me to reduce the problem to just picking up a set and to start hacking.

But in order to start the actual hacking, some prerequisites.

Linear programming

This is a very sophisticated and interesting science about linear inequalities. Basically, you have a linear functional to maximize and a numerous of linear constraints to satisfy. One of the canonical forms looks like:

\text{Maximize}\  \mathbf{c}^\mathrm{T} \mathbf{x}\  \text{subject to} A \mathbf{x} \leq \mathbf{b}\  \text{and}\  \mathbf{x} \ge \mathbf{0} where x represents the vector of variables (to be determined), c and b are vectors of (known) coefficients, A is a (known) matrix of coefficients, and {\ (\cdot )^{\mathrm {T} }} is the matrix transpose.

This is not the form you should stick to, for example the positivity property of the vector is not needed or can be additionally added (because any number can be represented as the difference between two positive ones). You can have any kind of inequalities (<, >, =). The only requirement — the relations must be linear.

Why is this important? The answer is that the all feasible solutions represent a convex polytope, matrix theory is also widely known and many optimization problems can be formulated in the form of some linear programs. The first and the second help to have many algorithms to solve as this area is widely known from many perspectives (maths and computer science), the last helps to solve many problems in a generalized way. You can assume that the “real world” solvers can solve linear programs with around 5 \cdot 10^5 variables and conditions. Or at least they can find some solutions that are quite good. We will talk about solvers a bit later.

Integer linear programming

The same as linear program but the solution vector x must be an integer vector. Often there are constraints on the vector x to be a \{0, 1\}^n vector which sometimes makes the problem much better. Solvers are less happy about the integer programming — it is NP-complete partially because MAX-3SAT can be reduced (try to do it on your own, it is not difficult, or you can look up the idea here and finish it by yourself).

This area is less known and harder to optimize in the end. This happens because it is really hard to use methods from simple linear programs because the feasible solutions are extremely spare and proving that some solution is optimal is hard (and, of course, it is NP-complete which is one of the main reasons for a such complexity). Modern solvers can solve “real world” integer linear programs for around 10^4 variables. For more only luck and the variety of the solvers can help. Though, you can do something good even with that big amount of variables about which we will talk a bit later.

NP-completeness and hardness. P != NP

I hope most of my readers are familiar with these words. If not, google and read about this, right now, this is one of the most important things in the computer science overall.

A bit of philosophical reasoning. When people start programming, they express their thoughts inside the code to do something with the computer — to show some text, window, to compute some big expressions. Then, while educating, the problems become harder but still solvable. In many minds of the true beginners there sometimes can be a thought that computer can solve any practical problem (or at least, most of them). And then comes this monster, called NP-complete problems. This is a story about what the programmers cannot do.

There are several things in NP-completeness (further, NPC) right now. Discovering that some new problems are in NPC, either finding some approximations in some NPC problems or proving that some approximations of NPC problems are also NPC.

The first thing, reductions are very useful. They help to understand that if you can solve problem A, then you can solve problem B which is proven to be in NPC and then A is in NPC (I missed couple of details, does not matter). These reductions give a good understanding of what the hard problems are. A good intuition when the software developer gets that the challenging task comes to a very hard point and if the boss requires completing this task for 100%, the answer should be: “this problem is NP-complete, if I manage to solve it, I would be a millionaire and it does not matter whoever engineer you are going to hire”. Still, the “real world” is not black and white and we can something do about NPC problems — like finding approximations or trying to find some acceptable solution.

We need to admit that we need to solve these hard problems and that is what I do and don’t like about math at the same time — the math gives up and starts arguing that we need to stop solving this but the real world sometimes provide patterns inside the data which can be better approximated rather than the pure math says.


3SAT is a classical NP-complete problem. Its main goal to satisfy the formulas with n boolean variables of type (\vee means OR, \wedge means AND) 3CNF (each parenthesized expression is called clause):

(x_i \vee \neg x_j \vee x_k) \wedge \ldots \wedge (\neg x_l \vee \neg x_m \vee x_n)

The problem is known to be hard and each NP-complete problem can be polynomial (from input size) reduced to it. Though, knowing the exact reduction is mostly useless (because the formula of cubed or even squared size can be unacceptable) does not mean that the world is not still going crazy and there are yearly SAT competitions. Minisat library is a good starting point for solving SAT formulas. Minisat can solve real world SATs with around of 2 \cdot 10^4 variables and clauses (clause is a disjunction of variables). Also, there is a really good blog about SAT solvers from the SAT geek. Though, kinda specific one.


This is the same problem but now we need to satisfy as many clauses as we can. This is more practical because mostly yes/no answer is not that important and some approximation can be a still acceptable and good solution.

Why is it important?

We face these problems quite often, for example, Sudoku for a field n^2 \times n^2 with the squares n \times n is a NP-complete problem, SameGame is an NP-complete, optimal scheduling is NP-complete. All these problems have some intuition behind and this intuition is a good sign of what the software engineers can and cannot do. 3SAT is just a good generalization of it.

Complex 3SAT formulas

A very good question of producing really complex formulas still remains half open. In engineering part you can simply increase the number of variables and do some random stuff and a complex formula is going to happen at some point. Still a good question of producing rather small formulas making all solvers stuck and solve it forever. Only recently there was a paper around it with a pretty interesting results. It states that if the number of clauses (m) divided by the number of variables (n) is around 4.24, then modern SAT solvers explode in finding the solutions if the random clauses are generated. Also, it is important to know that random 3SAT is defined by the action that each clause is randomly generalized by the uniform distribution according to the clause/variable ratio r. Number 4.2 + \varepsilon is kind of important: it is shown by many different papers that this is likely a transition phase number for random 3SATs (so, if the ratio is bigger than that, 3SAT is easy to solve, if less, also easy but if it is around, 3SAT becomes complex). Random 3SAT does embrace a phase transition phenomenon! That’s kind of a funny fact and showed even within some physics laws here as well. The current lower and upper bounds are 3.52 and 4.51 as long as I could find it here and in this presentation. Practical bounds show that the value is around 4.2 + \varepsilon. The algorithm is easy, just create random 3SAT with some existing solution and then show that, for example, MiniSat cannot solve it for a very long time.

Engineering part in all this

There are many libraries and solvers that try to solve such complex problems. One of which I am used to and like the most is or-tools. It supports many languages and is a combination of many-many solvers. Has a pretty good documentation with many examples combining all the solvers such as GLOP, CLP, GLPK, SCIP and GUROBI. All these solvers are complex and require additional posts which will be interested only to some geeks. I would say that the advantage that you can try out many of them with one proxy that or-tools is providing. Many of the solvers can have a timeout and print the existing solution saying was it optimal or was not able to prove it. Also, most of the solvers accept a hint solution using which it can try something better around this solution — this is pretty useful, if you have some combinatoric idea, you can use the solvers to try out if they can find something better that you’ve come up with. Some of the solvers even support multithreaded mode but this is more an advanced feature and I haven’t seen much progress from it, solvers tend to roll down to some solution and don’t move out from it for a long time.

So, one standard thing within the optimization problems — research, check, put into the solvers, repeat. Let’s try to see how it works with the hashcode problem which is kind of optimization problem.

Into the details

Problem is hard to formulate in terms of integer programming mostly because we have libraries which order matters — linear equations don’t like ordering or otherwise we should create many new variables. I thought this is a bold strategy given the fact that the conditions are already pretty big for the integer programming. I decided to look precisely inside the datasets and let’s try to break each of them. Example is not an interesting one, so let’s skip it.

Read On

This is a pretty simple dataset, all books are of cost 100, libraries have its own books, no book in any two libraries, all libraries have the 1000 days to ship everything (1 book per each of the 1000 days), so it is ok get around 90% of the libraries. Actually, as no book is in two libraries, it is just optimal to pick all libraries in the ascending order of their installation time. This is the optimal and we get 5,822,900 points. Easy dataset to start.


This is a harder dataset. All costs seem random, the books are in many libraries but I found that the throughputs are high. That means that the order in which we are shipping the libraries does not matter, matters the sum of the installation times. In this case it is a maximum budgeted coverage problem. Indeed

maximize \sum_{b \in B} S_b \cdot y_j . (maximizing the weighted sum of covered books).

subject to \sum{T_i \cdot x_i} \leq  D; (the installation time of the selected libraries cannot exceed D — overall days).

\sum_{b_j \in L_i} x_i \geq y_j ; (if a book is chosen, i.e. y_j > 0 then at least one library where this book presents b_j \in L_i is selected).

y_j \in \{0,1\}; (if y_j=1 then book b_j is chosen)

x_i \in \{0,1\} (if x_i=1 then library L_i is selected for the cover).

But if you will try to just put this the conditions inside the problem, the solver will not find anything cool in one hour. That’s why we need some sort of hinting. In that case I just came up with some greedy solution: pick the library that will give us the most value divided by the number of days it is installing, delete all books from other libraries, subtract the number of days and repeat the iteration. After some analysis, it can be proven this is a 1 - 1/e approximation. And this solution brings us ~5,689,000 points. After that, I moved it to a SCIP solver as a hint and after 20 seconds it said it found the optimum in 5,690,556 points and that’s it. Dataset is cracked. This is the optimal solution to it.

Tough choices

As long as I have somewhat greedy solution and written linear program, let’s dive into this dataset. This dataset has all books of equal cost. All libraries contain small libraries, that means we should not care much about the ordering — most books will be deployed.

Also, 15000 books are exactly in 2 libraries (and adjacent ones), 63600 books are in exactly 3 libraries. Also, after that I found out that 63600/15000 = 4.24 which means that it is likely common to random 3SAT with a transition phase. So, basically I split 30000 libraries in pairs, each pair is a variable in 3SAT, if you choose the first one, this is true, otherwise false, other books in the libraries are the clauses numbers, in the first element there are the clauses that contain the variable, in the second the negation of this variable. Then I created a 3SAT with 15000 variables which the world believes it is hard to solve.

The world is, of course, correct, I tried pushing this 3SAT inside Minisat and it did not manage to find any good solution for a long time and I gave up. Luckily, I could push it to the integer programming solvers and optimize the coverage of the books. Greedy solution from the previous solution gives 98% coverage (which results in ~5,006,820 points) which is already great. Some code prototype of the integer programming reduction can be found here. After 2-3 hours of waiting I got 99.75% of the coverage which resulted finally in 5,096,260 points. The solution you can find here. Solving it optimally is a really hard problem but having the 99.75% coverage is quite good.

So many books

This is the dataset I liked the most, overall I didn’t manage to prove something interesting here but at least I can tell how I was solving it. In this dataset the order of the libraries matters a lot and linear program is possibly only when the order of the libraries is fixed, then we can optimally choose which books should be chosen from which libraries. I decided to run a greedy solution and only somehow chose the books and got around ~5,100,000 points. Then I did not know how to correctly assign the books to the libraries and I just used the solver of the fixed order of the libraries and got ~5,200,000 points, so it turned out that optimal assignment gives lots of more points. Then I found out that the optimal assignment ignores cheap books aggressively and in greedy solution I started to choose which books I should account for the greedy solution choice, in the end I counted only the books with 99+ score and finally got 5,235,423 points which you can check with this solution and this linear program which worked for 10 seconds, so finding the optimal score for a greedy solution was pretty doable.

I do believe this dataset has much inside it and can be cracked better than by me. Still, I got +120k from the greedy solution which possibly gave me so many points.

Libraries of the world

This dataset is a bit easier than the previous one because installation time is quite big and the day count is rather small. It turned out that likely optimal solutions are in a range from 15 to 18 libraries and this is a good number to try different permutations, some random removals and new additions. That I was doing — remove half of the libraries, add greedily to match the needs, run the solver (which solved optimally in 5ms), so I managed to try around 3-4 billion different permutations and variants. In the end simple greedy solution gave me ~5,312,895 points and I managed to get with this algorithm 5,343,950 points which you can check here. I think this value is not far from optimal, otherwise the solution is too far from the greedy one which I personally don’t believe given number of the variants I’ve tried.


I want to say that optimization problems arise in software engineering from time to time and it is essential to try some combinatoric solutions with the solvers — they can help find some interesting dependencies and optimize better your own solutions. Also, as there are many different solvers, it is easy to try them out given that or-tools already implemented a proxy where you can just change 1 parameter.

Also, hashcode is very fun to solve.

I wrote Go code for 3 weeks and you won’t believe what happened next

Historically, I am a C++ programmer and really like all this kind of C-ish stuff — from raw assembly to high level abstractions and SFINAE methods. For the past three weeks I managed to write lots of Go code — around 10k LOC without almost knowing anything when I started. I want to talk what I like, what is bad and funny. I am not a Go expert yet and this is my opinion which might be wrong, offensive and misleading.

The Good

Go is (almost) easy to learn

This language has in its DNA this fact. Once you read “Tour of Go“, “Effective Go“, look through the standard library, you already can program some really cool things. I can admit that most things I read in Go, I can understand — what this code is doing and to this level of understanding I came in 3 weeks unlike C++ where I managed to read most of the standard library parts in 4 years after starting doing it. This is a huge advantage — simplicity. And, as you know, simplicity is very and very complicated.

I cannot say Go is very rich, standard library is pretty small but for the problems Go can and is solving, this is more than is enough. I don’t know but I really think that people that are the maintainers of Go say “No” to features more often than other languages because they want to be simple and restrictive at first and then try to expand the language.

Goroutines and channels

Oh god, they are really cool. It requires some time to understand what they are and how to use them correctly. For me personally this was not a problem.

But using them in Go is so comfortable, inviting, snugly, that you want to write everything in goroutines. They are like a crux of the biscuit to start and never finish, they are begging you not to stop writing go f(), they are the hidden thing to say to others “Hey, how are you doing? – I am invoking a goroutine. – Ah, that’s cool, won’t disturb you”. So, as you see, I am impressed by the goroutine design and its ease to use. Possibly without it, it would be a pretty useless language.

I was writing code specifically doing something asynchronous, non blocking, background and I managed to use all my concurrency knowledge with a great and easy infrastructure.

Go is pretty performant

Though, it has GC, it is pretty performant. I found around 10% lower performance than C++ on my synthetic benchmarks and that is pretty decent. I believe it might be faster than Java because in Java every structure is a pointer in a slice but I might be mistaken.

defer f()

Yes, I really like this feature, it is easy to understand, read and use. It is one of the keys to understand the lifetime of the objects which is done much better than in C which usually uses goto to finish something. Though, I still think that the C++ destructor is really the best design approach for object lifetimes, defer is a good approach for non object oriented languages and it will be a definitely a good extension to C language (but I am pretty sure this will never happen).

Code formatting

There is gofmt and that is all. All, you heard? Language defined code format, not a human.

And that is nice, no debates, just invoke gofmt and that’s it. Code is written once and read hundreds of times and the faster you understand the constructions, the more code you are going to write in the end. I would say I like this so much for that particular reason — you understand code very fast because you have lots of catchy constructions.

The linters are also nice but in my opinion in C++ you can do much more through LLVM infrastructure.


Iota is a const declaration to simplify definitions of “enums” in Go. For example, this is really nice:

const (
	_           = iota // ignore first value by assigning to blank identifier
	KB ByteSize = 1 << (10 * iota)

This is one more feature for the simplicity and I like it a lot. Define some constants and you know what their transformations are. I really think other languages miss this simple feature so much. Don’t need something specific? Write iota. Need some simple formula? Write iota. Need something more complex? Write yourself.


Humans like patterns, when they see something familiar, what they did themselves, they feel it, they know exactly how it is working.

Go is full of patterns, like the very simple one:

if err != nil {
  return nil, err

Human eye is very catchy on such kind of things. So, this is one more argument for better readability. To make people use patterns in C++ is barely possible and is not a priority for the standard.

Compilation time

I believe Go can compile hundreds of thousands lines of code in seconds? I really caught myself waiting for linkage more than compiling everything. I like this scalibility with the fact it is only around 10-15% slower than C++.

Exported and unexported types

Hiding implementations by writing the function name with the lower case is pretty funny and accurate in my opinion. I was not able to understand it for the first time but then I was able to read and understand other code even better.


Easily, without any specific thing, code can be built on Linux x86_64, Linux PowerPC, MacOS, Windows with the same semantics of goroutines, network, filesystem handling, etc. I was writing client side code for many platforms and Go perfectly fitted this requirement.

In C++ in order to write coroutines that work on all the platforms, you need to spend much time make it work like you really want and the community does not help here, most of the implementations are not that tested and there is little trust overall.

The Bad

Hacks around generic types

Maps, slices and channels are generic types. And others? Possibly now I understand all the lack of generics jokes in Go.

For example, I want to use goroutine safe map sync.Map. I look at the implementation and see interfaces and I need to use casts. Okay, you make me sad. I do believe it is possible to have generics in Go without destroying the language simplicity but I might be mistaken.

I would expect atomic types being generic at least because synchronization between goroutines is very important in a such language. And for now I need to write something like

type Status int32

var status Status
atomic.StoreInt32((*int32)(&State.status), int32(status))


It is hard to find all the definitions of the interfaces, sometimes it is not clear how they are built — all I see is numerous NewSomeType functions to create a new type with some interface. To find all the instantiations is pretty hard. Possibly it is a matter of time to find hacks how to do it but I failed it completely. Maybe tooling is not the best but I am working at Google where Go was created … I don’t know why it was extremely hard to me.

Zero values initialization

Types by default are nil and I don’t like it. Either you need to check in the implementations being non nil or panic.

Possibly constructors and destructors can solve this problem but still I am not sure this is possible in Go philosophy.

Constant weirdness

I don’t understand why I can write 30 * time.Minute but cannot integerVariable * time.Minute. Of course, I can google it but understanding why is much more complex. I still think this is a great misbehavior in Golang.

Compiler infrastructure

After C++, Go has a really bad compiler in its error messages. The simplest example is lack of typos finding. If somebody misses the character or mistypes it, it takes time to find the actual error. Better suggestions would also help, for example, for non-exported functions and fields.

Also, I don’t like when there is a monopoly inside the language compilers. Yes, there are gccgo and llvm-go but they are kind of … silent, I mean. And I don’t understand why this is happening. Having the competition in all kind of areas in the world is always better than not having it.


After const qualifier in C/C++, in Go I miss it a lot. Basically, you can change any type of field whenever you want. Adding const to a language possibly is pretty late but mutability is not an answer for such mid-level abstractions language, in my opinion.

Lack of visibility support

For example, protobuf generated code for enums is ugly.

message SomeMessage {
  enum SomeEnum {
    SomeValue = 0;

To access SomeValue, you need to write SomeMessage_SomeEnum_SomeValue with the type SomeMessage_SomeEnum. I don’t like the fully qualified names here. Possibly some kind of visibility namespaces inside the structs can help. I find such code generation ugly and as a true believer in Protobuf over JSON battle, this looks as a disadvantage. Also, golang syntax for protobufs is VERY different from other languages.

Testing infrastructure

I hate writing tests in go. It is not structured at all, it is difficult to use mocks, it is difficult to test different parameters, there is no infrastructure for comparing two values, expecting some error is always done with a different test case where you specify some wantError variable to compare with. If the language prioritizes being more language specific rather than community specific, I expect having perfect testing library.

For example, this leads to a fact that it is hard to get 100% code coverage. Because Golang code is full of error checking and mocking these errors is always hard. This can be solved by a better testing library support or some macro machinery around error handling, because sometimes you are tired of writing hundreds of 3 lines of err != nil conditions without testing them properly.

Where is my memory?

After C and C++ it is really hard to understand that you don’t need to clean up the memory but you are still thinking too much how not to allocate something bigger, reuse some memory, etc. This is mostly annoying and Golang does not talk about this much but probably sometimes should — how GC works, what it is doing, that its semantics is low-latency but probably high CPU for horizontal scaling.

Messy assembly

I am a low level guy and sometimes this is hard to force the compiler to do what I really want and I was not able to understand what the compiler is doing and why. For example, TL; DR. It is really hard to read the assembly of it. For example, for StoreInt32 it generates xchgl AX, (CX) instruction

What if I want another memory ordering? No default support. Also, lots of noisy pcdata things for garbage collector.

The Funny

Time format layout

No comments, just funny. TL; DR To format time variable t you can only write the format with a specific hardcoded date.

s := t.Format("2006/01/02")

More here, for example.

Community and gopher apocalypse

A lot of stuff in the reviews like Good to Go, Gotchas or many gopher memes around the community. I mean, this is catchy, nice and friendly.


Go is not a universal language for any kind of things you want. This is good for simple server applications, client side code (with little “multithreaded” overhead), some caching stuff and not big projects. Though, I like Go a lot and if I want to write something small, networky and useful, Go is my choice right now. For big and longstanding projects, I do believe Go is bad because if you want to do something complex, Go won’t let you do this anyhow.

How to contribute to Abseil with a visible performance gain

Overview of Abseil

Abseil is a great C++ library with a special and unique philosophy inside it. While there is a lot of debate what C++ is prioritizing more – portability or performance with possibly breaking changes, Abseil library aims in a standard compatible way to provide best library features of all standards sometimes with a very good performance. No need to upgrade your compiler which may contain bugs which are extremely hard to debug (for example, if you are developing highly reliable software for rockets, being on the edge with the latest compilers is often not even a choice), no need to upgrade C++ language features with some possible breaking changes, just get Abseil and get all features from future standards which are already highly tested by millions of lines of code across Google.

As an example to show, all standards prior to C++17 lack of useful class std::optional<T> which provide an optional access to some type T. This class can be efficiently implemented even in C++11. By the way, optimizations for std::optional is another a very interesting topic – how to create such a simple class being 1500+ lines of code overall. So, Abseil implements this like

#if __cplusplus >= 201703 && __has_include(<optional>)
#include <optional>
namespace absl {
using std::bad_optional_access;
using std::optional;
using std::make_optional;
using std::nullopt_t;
using std::nullopt;
} // namespace absl
#else // __cplusplus >= 201703 && __has_include(<optional>)
// Internal abseil implementation.
// Under define not to generate more code to a compiler because std::optional
// and absl::optional claim to be compatible.
view raw optional.h hosted with ❤ by GitHub
Abseil very rough idea of implementing things

Then, you just use in your code absl::optional. After the update to C++17, it is a drop-in replacement of std::optional and you can replace one class to another. Abseil guarantees 5 years of support after the standard release and then workarounds can be removed for previous standards. Abseil does not guarantee ABI compatibility – this is a very debatable state but improving the things over the time is one of the humanity problems – we tend to make mistakes and a good thing is that we should learn how to fix them. Modern C++ compilers can do this in a very efficient way (clang-tidy, all kind of tooling, etc). That’s why if Abseil introduces some breaking change, it comes with a tool to fix all of your code. Google has more than 250 millions lines of C++ code – all breaking changes are firstly tested on Google’s codebase and then ported to open source. Dealing with that amount of code cannot be done with a human interaction and automation is one of the main priorities to change something – you break, you provide an automated tool to fix this. We will also talk about this a bit later.

Of course, not everything is that cool, I decided to spend an hour or so and to find some incompatibilities, and I managed to find them. Abseil provides std::string_view replacement absl::string_view which claims to be compatible with a standard version. And consider the following example: one is working in C++17 mode, the other is not in C++11 mode:

We just assign to std::string an object of type absl::string_view. It fails to compile in C++11 and is successful in C++17. Which, I mean, is not that big problem because people always move forward with a standard rather than backward.

This is a very neat example shows that forward compatibility guarantee is hard because of multiple interaction between classes. One of these is shown above – std::string provides an assignment operator accepting std::string_view in C++17. Personally I don’t understand why. std::string_view claims to be a class that if you want a conversion to an owned string, you should write it explicitly. And… There are very very few options in C++ doing so with an assignment operator efficiently.

For example, writing something like

s = std::string{sv};

Would be much less efficient because we create a temporary string.

For constructors there is a choice to make the constructor explicit or implicit. For assignment there is only one syntax, unfortunately.

Of course, we have assign(). Traditionally assign() mirrors constructors (not necessarily all, e.g. assigning a default-constructed string is s.clear() rather than s.assign()), and operator=() is a subset of assign() with a single parameter. I do not think there is a precedent for making some single-parameter assign() overload not available as operator=(). People from standard could have introduced such convention for explicit constructors but they did not. Actually, libc++ implementation calls assign(const char*, size_t) as you can see in the godbolt link. I claim this as a defect of a standard but not a huge one, of course. That way Abseil does not provide a full C++17 replacement for the types.


After some brief overview of the Abseil library, we can continue digging inside it even more. I want to talk more about the command line flags library. In the documentation it is stated (with my comments in italic):

The Abseil flags library allows programmatic access to flag values passed on the command-line to binaries. The Abseil Flags library provides the following features:

  • Access to Abseil flags in a thread-safe manner. This is a very interesting property to have for all flags.
  • Access to flag values that are valid at any point during a program’s lifetime. This means right here that they are global variables and we know that globals are harmful…
  • Allows distributed declaration and definition of flags, though this usage has drawbacks and should generally be avoided. extern hello darkness, my old friend

Values for these flags are parsed from the command line by absl::ParseCommandLine(). The resulting value for each flag is stored in a global variable of an unspecified type absl::Flag<T>.

TL; DR. Flags are global moduled variables that must be defined in .cc|.cpp|.cxx files (otherwise you will get an ODR violation) that can be declared and accessed by any code if you declare them in a translation unit and link with the definition.

I think this is one of the most convenient way to turn on/off features in big codebases if you are depending on some other APIs – it is easy to test features, to setup an environment. The other option I see in production binaries is to provide a context to propagate it through the entire program stack and have some kind of global config. Either way, we need something global and to avoid bloat, Google decided to use global variables.

And now come the most sophisticated parts. First, global variables are bad for multiple reasons

  • You should not define them in headers unless they are inlined (weak symbols) because you will have ODR violations (which variable to choose across translation units is an open question to a linker). And it is an undefined behavior in C++.
  • Global variables with dynamic storage duration should be created and used very carefully. The flags have default values and you can specify your own flags with any type. Many guidelines and even universities (especially in Russia) suggest creating singletons (and flags are singletons in a way) with the BadSingleton() function:
// Very bad because the destruction order is not specified.
// From experience, these are one of the hardest to debug bugs.
template <class T>
T BadSingleton() {
static T t;
return t;
// Better. It is NOT a leak because the compiler will clear it after the shutdown.
// Downside — destuctors are not called. But you should never do complicated
// things in destructors.
template <class T>
T BetterSingleton() {
static T* t = new T();
return *t;
view raw singleton.h hosted with ❤ by GitHub
Bad and Good singletons

Flags library solves the last problem in an interesting way. With a definition they don’t create anything, literally anything. Let’s see how it is done:

// A variable declaration annotated with the `ABSL_CONST_INIT` attribute will
// not compile (on supported platforms) unless the variable has a constant
// initializer. This is useful for variables with static and thread storage
// duration, because it guarantees that they will not suffer from the so-called
// "static init order fiasco". Prefer to put this attribute on the most visible
// declaration of the variable, if there's more than one, because code that
// accesses the variable can then use the attribute for optimization.
// Note that this attribute is redundant if the variable is declared constexpr.
#define ABSL_CONST_INIT [[clang::require_constant_initialization]]
// This struct and corresponding overload to MakeDefaultValue are used to
// facilitate usage of {} as default value in ABSL_FLAG macro.
struct EmptyBraces {};
template <typename T>
T* MakeFromDefaultValue(T t) {
return new T(std::move(t));
template <typename T>
T* MakeFromDefaultValue(EmptyBraces) {
return new T;
#define ABSL_FLAG_IMPL_DECLARE_DEF_VAL_WRAPPER(name, Type, default_value) \
static void* AbslFlagsInitFlag##name() { \
return absl::flags_internal::MakeFromDefaultValue<Type>(default_value); \
#define ABSL_FLAG(Type, name, default_value, help) \
namespace absl /* block flags in namespaces */ {} \
ABSL_FLAG_IMPL_DECLARE_DEF_VAL_WRAPPER(name, Type, default_value) \ // Creating a function to create a default flag
ABSL_CONST_INIT absl::Flag<Type> FLAGS_##name{ \ // Const init initialization (see above)
ABSL_FLAG_IMPL_FLAGNAME(#name), ABSL_FLAG_IMPL_FILENAME(), \ // Both compile time pointers.
&absl::flags_internal::FlagMarshallingOps<Type>, \ // Compile time pointer to dispatch the operations.
absl::flags_internal::HelpArg<AbslFlagHelpGenFor##name>(0), \ // Compile time struct to generate –help values.
&AbslFlagsInitFlag##name}; // Compile time pointer from line 34.
view raw flags.h hosted with ❤ by GitHub
Rough absl::Flag definition

And you can see that flag is a compile time definition – all pointers, all data is known at compile time and no memory is allocated and it is nicely generated in assembly code then. Also, remember that flags are thread safe, at least for big types it should contain mutex (actually, mutex is always needed because initialization requires more than just replacing the value). As mutex is not a trivial type, in this situation, absl::FlagImpl does a great trick:

// This is reserved space for an absl::Mutex to guard flag data. It will be
// initialized in FlagImpl::Init via placement new.
// We can't use "absl::Mutex data_guard_", since this class is not literal.
// We do not want to use "absl::Mutex* data_guard_", since this would require
// heap allocation during initialization, which is both slows program startup
// and can fail. Using reserved space + placement new allows us to avoid both
// problems.
alignas(absl::Mutex) mutable char data_guard_[sizeof(absl::Mutex)];
view raw flag_trick.h hosted with ❤ by GitHub
If you don’t know that the placement new is, look into

All this is done not only for safety reasons. My guess that some programs at Google can contains hundreds of thousands of different modules with multiple flags in each of it. If we allocate even something during the initialization, the binary startups linearly slow down which is a so huge and bad effect on the user experience and performance overall (even recovery latency, sometimes if the binary segfaults on a killer query, for example, search engine, it is better to have a very fast startup to continue working ASAP). As only a small fraction of flags are accessed during the execution, the decision to make the initialization lazy is a great performance and user experience solution.

Flags are not meant to be performance critical and used in tight loops though the reality can show something different. Many APIs create their options and access the flags in the constructors (for example, to distinguish test and prod environment). And it becomes a performance critical feature because options can be created thousands of times with a multi-thread flags access. Though, I think calling absl::SetFlag happens really rare and might not be have that performance hefty.

What we know for now about flags:

  • They are global variables with lazy initialization.
  • They might be read performance critical.
  • They are thread safe in their access.

These properties are a bit controversy because thread safety property is a really complex issue with a templated arbitrary type. In general case it is done by holding a mutex because it cannot be done in any other way. To make the situation a bit better, flags did an optimization for trivial types like integers and floats (I put a lot of comments inside the snippet, please read them all):

// This macro is the "source of truth" for the list of supported flag types we
// expect to perform lock free operations on. Specifically it generates code,
// a one argument macro operating on a type, supplied as a macro argument, for
// each type in the list.
A(bool) \
A(short) \
A(unsigned short) \
A(int) \
A(unsigned int) \
A(long) \
A(unsigned long) \
A(long long) \
A(unsigned long long) \
A(double) \
constexpr int64_t AtomicInit() { return 0xababababababababll; }
class FlagImpl {
// Init this atomic with some garbage where
// the collisions are negligible.
std::atomic<int64_t> atomic_{AtomicInit()};
void FlagImpl::StoreAtomic() {
// Get the size of a type.
size_t data_size = Sizeof(op_);
// If it fits into the word on 64 bit systems.
if (data_size <= sizeof(int64_t)) {
int64_t t = 0;
// Copy it inside this atomic.
std::memcpy(&t, cur_, data_size);, std::memory_order_release);
template <typename T>
bool AtomicGet(T* v) const {
// Load the byte representation.
const int64_t r = atomic_.load(std::memory_order_acquire);
// If it is not atomic init which means that the atomic
// value was initialized with something but not a garbage.
if (r != flags_internal::AtomicInit()) {
// Copy it inside the value.
// Compilers are very smart enough to generate
// couple of cmov, mov instructions when the memcpy size is known.
std::memcpy(v, &r, sizeof(T));
// Return that we successfully returned a true value depending
// on a "garbage heuristic".
return true;
// In that case we do believe that the flag was not yet initialized.
return false;
T GetFlag(const absl::Flag<T>& flag) { \
T result; \
/* Try to get the heuristic */ \
if (flag.AtomicGet(&result)) { \
/* If it was initialized, return*/ \
return result; \
} \
/* Use slow mutex holding impl */ \
return flag.Get(); \
// For other types use mutex holding implementation.
template <typename T>
ABSL_MUST_USE_RESULT T GetFlag(const absl::Flag<T>& flag) {
return flag.Get();
view raw lock_free_flags.h hosted with ❤ by GitHub

So, the implementation is holding one atomic of 8 bytes and tries to get the fast path if the flag was initialized based on a heuristic in that way that if AtomicGet returned true, that means that flag was initialized and we can safely return.

This generates very good assembly with few instructions and the benchmark for flags access showed the following results for all the one worded trivial types (I show only for bool for simplicity):

Benchmark                                     Time(ns) CPU(ns) Iterations
(BM_ManyReadersFlag<bool, FLAG_b>)/threads:1  0.464    0.464   1000000000
(BM_ManyReadersFlag<bool, FLAG_b>)/threads:2  0.233    0.465   1445775046
(BM_ManyReadersFlag<bool, FLAG_b>)/threads:4  0.126    0.476   1427829560
(BM_ManyReadersFlag<bool, FLAG_b>)/threads:8  0.136    0.875    576009808
(BM_ManyReadersFlag<bool, FLAG_b>)/threads:16 0.097    0.963    673725472
(BM_ManyReadersFlag<bool, FLAG_b>)/threads:32 0.086    0.963    691627808

We can conclude that it is extremely fast and does not degrade much with the number of threads increased. It is achieved because of std::memory_order_acquire in atomic load and x86_64 with its strong memory model (almost each instruction guarantees release and acquire semantics) generates around the following code which is efficient, small and extremely fast:

movq (%rdi), %rax # load from the input
xorl %ecx, %ecx
movabsq $-6076574518398440533, %rdx # imm = 0xABABABABABABABAB
cmpq %rdx, %rax # Check with a garbage value
cmovel %ecx, %eax # return it in %rax
view raw atomic_get.s hosted with ❤ by GitHub

The benchmark of course is far from ideal (guess why).

For all other types absl::Mutex is held for accessing the value which has around 17ns latency for holding in one threaded mode and around 150ns in 4 threaded mode. Hi, latency.

My concerns about all above

And here comes my biggest concern – I don’t feel this API guarantees the right thing enough (I mean, atomic hack). Though, in a true way this is not a guarantee but looking at the implementation while programming and profiling absolutely unrelated thing, I put this thing into the list to look into because I felt through my veins the lack of completeness around the API and it bothered me a lot. I got the feeling about all this but the most fascinating part that I was not able to formulate it in the next couple of hours but I caught that feeling and it got inside me that in my background thoughts I was thinking and trying to get all the right words to understand what really bothered me. Okay, let’s roast a bit this implementation.

  1. This works bad for other trivial types like enums that fit into the 8 byte register. And why are we holding mutex inside it for them, are these types special?
  2. Why do we have really special types? Yes, they are built-in, but they are not general, they optimize many things but not all.
  3. This is not scalable and this is the biggest concern. What if I want to provide other lock free types? I should list them in that list. What? Why? I mean, I should not, this involves circular dependencies and additional macro expansions. C++ is much more powerful to deal with such things at compile time.
  4. absl::Flag provides absl::Duration type implementation which is 12 bytes in size. C’mon, modern processors have 16 and 32 byte registers, can we operate with them in an atomic way? I should investigate all these issues. Also, absl::Duration turned out to be very popular flag and it potentially can save many CPUs and latency which is hard to measure just running some benchmarks.
  5. Oh, what if I can do this for 16 bytes registers, I can hack small string optimization (will talk about this a bit later).

When I told all this to one of the Abseil team members, I got around this feedback (stolen Russian meme):

Some of the Abseil team members were skeptical at first but I decided not to say any more words and make a prototype of this.

Let’s implement this

First, we need to create a trait to understand whether the type can fit into the atomic. As we use memcpy, the type must be trivially copyable according to a C++ standard. So, I came up with something like:

// The minimum atomic size we believe to generate lock free code, i.e. all
// trivially copyable types not bigger this size generate lock free code.
static constexpr int kMinLockFreeAtomicSize = 8;
template <typename T>
struct IsAtomicFlagTypeTrait {
static constexpr bool value = (sizeof(T) <= kMinLockFreeAtomicSize
&& std::is_trivially_copyable<T>::value);
// If not atomic fittable type, use mutex implementation.
template <typename T,
typename std::enable_if<
!flags_internal::IsAtomicFlagTypeTrait<T>::value, int>::type = 0>
ABSL_MUST_USE_RESULT T GetFlag(const absl::Flag<T>& flag) {
return flag.Get();
// If it is a trivially copyable type, use atomic hack.
template <typename T,
typename std::enable_if<
flags_internal::IsAtomicFlagTypeTrait<T>::value, int>::type = 0>
ABSL_MUST_USE_RESULT T GetFlag(const absl::Flag<T>& flag) {
T result;
if (flag.AtomicGet(&result.value)) {
return result.value;
return flag.Get();
view raw flag_trait.h hosted with ❤ by GitHub

And it failed so many times during the testing because … because it is C++ and it is going to punch you in the face all the time when you are not doing really correct things.

First, trivially copyable type might not be default constructible, i.e. we cannot write T value;. To work around this, we need to either now how it can be worked around either remember some other examples. For example, optional<T> can be created with a non default constructible type. Looking into the Abseil implementation, we can see a very interesting thing:

struct empty_struct {};
struct dummy_type {
static_assert(sizeof(T) % sizeof(empty_struct) == 0, "");
// Use an array to avoid GCC 6 placement-new warning.
empty_struct data[sizeof(T) / sizeof(empty_struct)];
// Data storage
union {
T data_;
dummy_type dummy_;
view raw union_hack.h hosted with ❤ by GitHub

It is using a union type, indeed if we put a type into the union, it just prepares the memory layout for it. So, I ended up in something like:

// The minimum atomic size we believe to generate lock free code, i.e. all
// trivially copyable types not bigger this size generate lock free code.
static constexpr int kMinLockFreeAtomicSize = 8;
template <typename T>
struct IsAtomicFlagTypeTrait {
static constexpr bool value = (sizeof(T) <= kMinLockFreeAtomicSize
&& std::is_trivially_copyable<T>::value);
// If not atomic fittable type, use mutex implementation.
template <typename T,
typename std::enable_if<
!flags_internal::IsAtomicFlagTypeTrait<T>::value, int>::type = 0>
ABSL_MUST_USE_RESULT T GetFlag(const absl::Flag<T>& flag) {
return flag.Get();
// If it is a trivially copyable type, use atomic hack.
template <typename T,
typename std::enable_if<
flags_internal::IsAtomicFlagTypeTrait<T>::value, int>::type = 0>
ABSL_MUST_USE_RESULT T GetFlag(const absl::Flag<T>& flag) {
// T might not be default constructible.
union U {
T value;
U() {}
U result;
if (flag.AtomicGet(&result.value)) {
return result.value;
return flag.Get();
view raw flat_trait_2.h hosted with ❤ by GitHub

Run tests again and they again failed. ARGH, now it is not working on all compilers. GCC 4.x+ is failing but luckily Abseil has a workaround this (look at line 8).

namespace type_traits_internal {
// is_trivially_copyable()
// Determines whether the passed type `T` is trivially copyable.
// This metafunction is designed to be a drop-in replacement for the C++11
// `std::is_trivially_copyable()` metafunction for platforms that have
// incomplete C++11 support (such as libstdc++ 4.x). We use the C++17 definition
// of TriviallyCopyable.
// NOTE: `is_trivially_copyable<T>::value` is `true` if all of T's copy/move
// constructors/assignment operators are trivial or deleted, T has at least
// one non-deleted copy/move constructor/assignment operator, and T is trivially
// destructible. Arrays of trivially copyable types are trivially copyable.
// We expose this metafunction only for internal use within absl.
template <typename T>
class is_trivially_copyable_impl {
using ExtentsRemoved = typename std::remove_all_extents<T>::type;
static constexpr bool kIsCopyOrMoveConstructible =
std::is_copy_constructible<ExtentsRemoved>::value ||
static constexpr bool kIsCopyOrMoveAssignable =
absl::is_copy_assignable<ExtentsRemoved>::value ||
static constexpr bool kValue =
(__has_trivial_copy(ExtentsRemoved) || !kIsCopyOrMoveConstructible) &&
(__has_trivial_assign(ExtentsRemoved) || !kIsCopyOrMoveAssignable) &&
(kIsCopyOrMoveConstructible || kIsCopyOrMoveAssignable) &&
is_trivially_destructible<ExtentsRemoved>::value &&
// We need to check for this explicitly because otherwise we'll say
// references are trivial copyable when compiled by MSVC.
template <typename T>
struct is_trivially_copyable
: std::integral_constant<
bool, type_traits_internal::is_trivially_copyable_impl<T>::kValue> {};
} // namespace type_traits_internal
view raw gcc_bug.h hosted with ❤ by GitHub

So, after all miscompilations, bugs, finally we have a version working for all trivially copyable types.

And it turned out to be breaking again. 🙂

Before that all trivial types were defined in .cc file which has external linkage. Assume now that you have a shared library that accesses some trivial flag. Briefly, see

// flag.h
template <typename T>
class Flag {
T Get() const {
// See implementation notes in CommandLineFlag::Get().
union U {
T value;
U() {}
~U() { value.~T(); }
U u;
// Operations are good to check that the flag was declared with the same type.
impl_.Read(&u.value, &flags_internal::FlagOps<T>);
return std::move(u.value);
void FlagImpl::Read(void* dst, const flags_internal::FlagOpFn dst_op) const {
absl::ReaderMutexLock l(DataGuard());
// `dst_op` is the unmarshaling operation corresponding to the declaration
// visibile at the call site. `op` is the Flag's defined unmarshalling
// operation. They must match for this operation to be well-defined.
if (ABSL_PREDICT_FALSE(dst_op != op_)) {
absl::StrCat("Flag '", Name(),
"' is defined as one type and declared as another"));
CopyConstruct(op_, cur_, dst);
view raw flag.h hosted with ❤ by GitHub

As now the linkage is internal, Get on its first initialization will get the pointer that was not firstly provided inside the flag definition because shared libraries are loaded with another addresses dynamically. So, it will result in a log statement which can pollute a bit the logs of the execution. This check is here because we want to check that the flag definition is the same as its declaration, to do this we need to check that types are the same and it is done in not the best way. The best way will be to check the typeids but this comes with some cost. Overall, we decided not to patch it right now as shared libraries for flags are already broken (for all non trivial types) and rare.

But this is funny – I changed 3 places writing pretty understandable code and it turned out that all 3 places are breaking changes and ended up with something ugly and crunchy. C++ in a nutshell. And

Okay, let’s assume we are finished with trivial types. Now comes the boss called 16 byte atomics. I really think that the theoretical knowledge helps in this case. Everyone should know that in order to create an atomic register we need a CAS operation on the register. For example, Introduction to Reliable and Secure Distributed Programming book helped me a lot to understand how memory models work from its theoretical to practical parts. To be dull, I just googled CAS 16 bytes and figured out that there is an instruction cmpxch16b on modern x86_64 platforms (by modern I mean westmere from 2010).

It turned out that even windows 8.1 requires this instruction for installation. This instruction comes with a bigger cost (by the way, I still don’t know why 16 byte loads do not meet strong model memory consistency but this is an absolutely another topic).

However, this instruction comes with a cost because it requires lock prefix. I am not an x86_64 memory model expert but my understanding what will happen:

The cost of the x86 LOCK prefix, or CAS, before PentiumPro, is a memory access (like a cache miss), + stopping memory operations by other processors, + any contention with other processors trying to LOCK the bus. However, since PentiumPro, for Writeback (i.e. cacheable) memory (all memory an app deals with, unless you talk directly with hardware), instead of blocking all memory operations, only the relevant cacheline is blocked (based on the link posted above).

Actually, the CAS case can be more complicated, as explained on this page, with no timings but an insightful description by a trustworthy engineer.

Not going into too much detail, I’ll say that a LOCKed operation costs one cache miss + the possible contention with other processor on the same cacheline, while CAS + the preceding load (which is almost always required except on mutexes, where you always CAS 0 and 1) can cost two cache misses. L1 Cache miss is around 4ns, so we can wait for 8ns latency for the lock prefix. Cachelines are always at least 16 bytes, so probably we should expect something like this.

It is no way <=1ns with one word atomics. I decided that we need a dispatch for this. So, for trivially copyable types >8 bytes and not bigger than 16 we can generate two words atomic while for others we should stick with a very nice and fast mov instructions. I ended up implementing something like this:

// The minimum atomic size we believe to generate lock free code, i.e. all
// trivially copyable types not bigger this size generate lock free code.
static constexpr int kMinLockFreeAtomicSize = 8;
// The same as kMinLockFreeAtomicSize but maximum atomic size. As double words
// might use two registers, we want to dispatch the logic for them.
static constexpr int kMaxLockFreeAtomicSize = 16;
static constexpr int kMaxLockFreeAtomicSize = 8;
// We can use atomic in cases when it fits in the register, trivially copyable
// in order to make memcpy operations.
template <typename T>
struct IsAtomicFlagTypeTrait {
static constexpr bool value =
(sizeof(T) <= kMaxLockFreeAtomicSize &&
// Clang does not always produce cmpxchg16b instruction when alignment of a 16
// bytes type is not 16.
struct alignas(16) FlagsInternalTwoWordsType {
int64_t first;
int64_t second;
constexpr bool operator==(const FlagsInternalTwoWordsType& that,
const FlagsInternalTwoWordsType& other) {
return that.first == other.first && that.second == other.second;
constexpr bool operator!=(const FlagsInternalTwoWordsType& that,
const FlagsInternalTwoWordsType& other) {
return !(that == other);
constexpr int64_t SmallAtomicInit() { return 0xababababababababll; }
template <typename T, typename S = void>
struct BestAtomicType {
using type = int64_t;
static constexpr int64_t AtomicInit() { return SmallAtomicInit(); }
template <typename T>
struct BestAtomicType<
T, typename std::enable_if<(kMinLockFreeAtomicSize < sizeof(T) &&
sizeof(T) <= kMaxLockFreeAtomicSize),
void>::type> {
using type = FlagsInternalTwoWordsType;
static constexpr FlagsInternalTwoWordsType AtomicInit() {
return {SmallAtomicInit(), SmallAtomicInit()};
template <typename T>
class Flag {
// For some types, a copy of the current value is kept in an atomically
// accessible field.
union Atomics {
// Using small atomic for small types.
std::atomic<int64_t> small_atomic;
template <typename T,
typename K = typename std::enable_if<
(sizeof(T) <= kMinLockFreeAtomicSize), void>::type>
int64_t load() const {
return small_atomic.load(std::memory_order_acquire);
// Using big atomics for big types.
std::atomic<FlagsInternalTwoWordsType> big_atomic;
template <typename T, typename K = typename std::enable_if<
(kMinLockFreeAtomicSize < sizeof(T) &&
sizeof(T) <= kMaxLockFreeAtomicSize),
FlagsInternalTwoWordsType load() const {
return big_atomic.load(std::memory_order_acquire);
constexpr Atomics()
: big_atomic{FlagsInternalTwoWordsType{SmallAtomicInit(),
SmallAtomicInit()}} {}
constexpr Atomics() : small_atomic{SmallAtomicInit()} {}
Atomics atomics_{};
view raw flag_final.h hosted with ❤ by GitHub

The benchmark showed 7.6ns accessing absl::Duration flag in 1 threaded mode and 10 times better in 16 threaded mode. So, we saved a lot of contention, provided lock free implementation for bigger types increasing the flag size only by 8 bytes.

Funny fact that GCC since version 7.1 does not generate 16 byte atomic lock free code because

IIRC this was removed as the instruction cannot be used for read only memory.

Andrew Pinski 2017-05-25 12:49:51 UTC.

I barely understand what this means but the fact remains the fact. GCC does not provide this instruction in any of the last implementations. Sorry, Clang, you win this game. See And I turned it on for all clang compilers that have -mcx16 option inside. Windows does a complete garbage here and I don’t even think about it right now.

After I sent the patch on the review, we got 2 additional meetings, around 3000 words pretty tough arguing if we need this, checking that nothing else breaks. (My side note – I find sometimes very hard to convince people at Google doing some crazy things but with enough intention, good intuition, it is possible in short amount of time). I was polishing it and making the code more readable. In the end it turned out completely insane and now I FEEL this is way clearer in its internals rather than it was before.

As it is a well known fact that Google uses LLVM infrastructure, my change might possibly be useful to all services. This patch was open sourced couple of days ago under the name absl::GetFlag is lock free for small trivially copyable types. with other internal changes 🙂 .


Maybe this is not the end. The way I told you, it might be possible to hack libc++ small strings <=14 bytes. They are stored on stack and first byte of the representation indicates if the string is small and you can get the string size. If it is small, we can atomically load 16 bytes to return a valid small string to a user. This requires using undocumented function __is_long() and very robust code which might end in really crazy things and bugs.

Also, on other platforms there are 16 byte atomic loads but they are not generated by all compilers and for all platforms. For example, for PowerPC GCC generates stq instruction (store quad) but Clang does not. For ARM the situation is the opposite. Also, it is hard to get old compilers for all the platforms and I decided to give up a bit on this. But this is absolutely possible.


To write the working code took me around 2 hours from scratch. Research on how the things are going around took me around a day. Then 5 days of arguing with people and why we need or don’t need this. But this definitely is one of the greatest ideas that came to my mind for the past couple of years. Also thanks to C++ that it is possible to do this crazy thing and gain the maximum we can do.

What am I doing with my life?

About me and this blog

I am Danila Kutenin. I am an engineer working at Google and doing distributed and high performance C++ software. My main interests are distributed systems, operating systems design, hardware testing, compilers, APIs, algorithms and theoretical computer science.

I worked for almost for 2 years at Yandex doing the core search engine – one of the biggest runtime services in Russia facing million of daily users and thousands QPS. Also, I was one of the C++ library maintainers fixing, optimizing and providing cleaner APIs for all C++ code at Yandex. There I learned a lot how to do high performance and distributed things, how to manage huge clusters and to find my way to what I like and what I don’t.

Now I am working at Google doing internal distributed file system facing all Googlers all around the world, providing best user experience in latency, speed and synchronization. My side projects include patching compilers, LLVM, C++ main library, documentation engine, Chromium and everything that I find funny enough to do and to investigate.

This blog is a way of expressing and combining my interests and feelings about practical engineering explaining a lot of “why”s from its theoretical, hardware and software side – from getting a brilliant math idea to an actual implementation, assembly code generation and how it is dealing with a hardware revealing all the magic that is happening through all the stack.

If you want something to discuss, everything you want me to ask/suggest, please reach me out at kutdanila at gmail dot com. Also, you can always find me on Twitter @Danlark1.