## How a Bug(?) in the Linux CRC-32 Checksum Turned out not to be a Bug

A friend of mine nerd sniped me with a good story to investigate which kinda made me giggle and want to tell to the public some cool things happened to me during my experience with hashing and other stuff.

The story is simple and complex at the same time, I’d try to stick to a simple version with the links for those who do understand the underlying computations better.

A Cyclic Redundancy Check (CRC) is the remainder of binary division of a potentially long message, by a CRC polynomial. This technique is employed in network and storage applications due to its effectiveness at detecting errors. It has some good properties as linearity.

Speaking in mathematical language CRC is calculated in the following way:

$CRC (M (x)) = x^{deg(P(x))} \cdot M(x) \ \mathrm{ mod} \ P (x)$

where the polynomial $P(x)$ defines the CRC algorithm and the symbol $\cdot$ denotes carry-less multiplication. Most of the times the polynomial $P(x)$ is irreducible, it just leads to fewer errors in some applications. CRC-32 is basically a CRC with $P(x)$ of degree 32 over Galois Field GF(2). Nevermind, it still does not matter much, mostly for the context.

x86 processors for quite some time have instructions CLMUL which are responsible for multiplication in GF(2) and can be enabled with -mplcmul in your C/C++ compilers. They are useful, for example, in erasure coding to speed up the recovery from the parity parts. As CRC uses something similar to this, there are algorithms that speed up the CRC computation with such instructions.

The main and almost the only one source of knowledge how to do it properly with other SIMD magic is Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction from Intel which likely hasn’t been changed since 2009, it has lots of pictures and it is nice to read when you understand the maths. But even if you don’t, it has a well defined algorithm which you can implement and check with other naive approaches that it is correct.

When it tries to define the final result, it uses so called scary Barrett reduction which is basically the subtraction of the inverse times the value. Just some maths.

Still not so relevant though. But here comes the most interesting part. In the guide there are some constants for that reduction for gzip CRC. They are looking this way

If we look closely, we would notice that k6′ and u’ are ending in 640 and 641 respectively. So far, so good, yet, in the Linux kernel the constants are slightly different, let me show you

It is stated to be written from the guide in the header, so they should be same. The constant is 0x1DB710641 vs 0x1DB710640 stated in the guide, the off by 1 but with the same 3 digits in the end as u’.

Two is that there’s also this line a little earlier on the same page: $P(x)'$ = 0x1DB710641 This $P(x)'$ number differs from $k6'$ only in the final bit: 0x41 vs 0x40. This isn’t coincidental. Calculating $k6'$ (in the Galois Field GF(2) space) means dividing (i.e. bitwise XOR-ing) a 33-bit polynomial ($x^{32}$, with only one ‘on’ bit) by the 33-bit polynomial $P(x)'$.

I thought for a while this is a legitimate bug. But wait a second, how can this be even possible given that CRC is used worldwide.

0x1DB710640 camp:

• Rust
• Intel with a guide and code
• Other “clean” implementations

0x1DB710641 camp:

I decided that this is too good to be true to be a bug, even though the number is changed, I also tried 0x42 and the tests fail. After that I started looking at the code and managed to prove that this constant +-1 does not matter

Let’s look the snippet where this constant is used from zlib:

Speaking about CLMUL instruction, it has the following signature __m128i _mm_clmulepi64_si128 (__m128i a, __m128i b, const int imm8), i.e. takes two 16 byte registers and mask and returns 16 byte register. The algorithm is the following

While executing the 15th line in gist x2 = _mm_clmulepi64_si128(x2, x0, 0x10); the mask imm8 takes the high 8 bytes in TEMP2 and thus the result isn’t changed, no need to worry here.

The most interesting part is the second _mm_clmulepi64_si128 with the third argument 0x00 which takes first 8 bytes from the operation in TEMP2. Actually the resulting values would be different but all we need to prove is that the bytes from 4 to 8 are the same because return happens with _mm_extract_epi32 which returns exactly uint32_t of that bytes (to be clear, the xor from x1 and x2 but if we prove bytes from 4 to 8 are the same for x2, it would be sufficient).

The bytes from 4 to 8 are only used in one loop in the operation:

TEMP2 is now our “magic” k6 value and TEMP1 is just some input. Note that when changing from 0x1DB710640 to 0x1DB710641 we only swap bit TEMP2[0]. Given it makes AND with all bits when i equals to j, the result would not change if and only if TEMP1[j] is zero for all j from 32 to 63.

And this turns out to be true because before the second CLMUL happens the following: x2 = _mm_and_si128(x2, x3);. And as you can see, x3 has bits zero from 32 to 63. And the returning result isn’t changed. What a coincidence! Given the conditions, if the last byte is changed to 0x42, only the highest bit can differ at the very most as it changes TEMP2[1].

For now I don’t know for 100% if it was made on purpose, to me looks like a human issue where the value was copy pasted and accidentally it worked. I wish during the interviews I also may miss any +-1 because of such bit magic 🙂

This is not for the first time I face some weird bit magic issues with CRC, for example, look at the following code:

It seeds CRC64 hash with FNV hash and the results are the same. Though the last byte of the 8 word strings are different. I faced it once in URL hashing and it was failing for urls with very short domain and path. The proof is left as an exercise to the reader, try it out, really, it is some good bit twiddling. More hashing is sometimes bad hashing, be careful 🙂

Thanks to Nigel Tao who first suspected an issue with the Linux kernel and described it in the wuffs repository. I think no one should do anything as it perfectly works or at the very most fix the constants to better match the guideline.

## Miniselect: Practical and Generic Selection Algorithms

Today I present a big effort from my side to publish miniselect — generic C++ library to support multiple selection and partial sorting algorithms. It is already used in ClickHouse with huge performance benefits. Exact benchmarks and results will be later in this post and now let’s tell some stories about how it all arose. I publish this library under Boost License and any contributions are highly welcome.

## It all started with sorting

While reading lots of articles, papers, and posts from Hacker News, I found it pretty funny each several months new “shiny”, “fastest”, “generic” sorting algorithms to come or remembered from old papers such as the recent paper on learned sorting, Kirkpatrick-Reisch sort or pdqsort. It is that we are essentially 65+ years into writing sorting algorithms, and we still find improvements. Shouldn’t sorting items be a “solved” problem by now? Unfortunately, not. New hardware features come, we find that sorting numbers can be actually done faster than best comparison $O(n \log n)$ time complexity and we still find improvements in sorting algorithms like avoiding branches in partitions and trying to find good pivots as pdqsort does. Also, there are many open questions in that area as “what is the minimum number of comparisons needed?”.

Huge competition is still going on in sorting algorithms and I believe we are not near the optimal sorting and learned sorting looks like the next step. But it uses the fundamental fact that no one expects sorting to be completed in a couple of passes and we can understand something about data during first array passes. We will understand why it matters later.

My favorite general sorting is pdqsort, it proves to be currently the best general sorting algorithm and it shows a significant boost over all standard sorts that are provided in C++. It is also used in Rust.

## Selection and Partial Sorting

Nearly a couple of months ago I started thinking about a slightly different approach when it comes to sorting — partial sorting algorithms. It means that you don’t need to sort all $n$ elements but only find $k$ smallest and sort them. For example, it is widely used in SQL queries when you do ORDER BY LIMIT N and N is often small, from 1-10 to ideally couple of thousands, bigger values still happen but rare. And, oh god, how little engineering and theoretical research has been done there compared to full sorting algorithms. In fact, the question of specifically finding $k$th order statistics when $k$ is small is open and no good solution is presented. Also, partial sorting is quite easy to obtain after that, you need to sort the first $k$ elements by some sorting algorithm to get optimal $O(n + k \log k)$ comparisons and we will look at only one example when it is not the case. Yes, there are a bunch of median algorithms that can be generalized to find the $k$th smallest element. So, what are they? Yeah, you may know some of them but let’s revise, it is useful to know your enemies.

### QuickSelect

This is almost the very first algorithm for finding the $k$th smallest element, just do like QuickSort but don’t go recursively in two directions, that’s it. Pick middle or even random element and partition by this element, see in which of two parts $k$ is located, update the one of the borders, voila, after maximum of $n$ partitions you will find $k$th smallest element. Good news that on average it takes $O(n)$ comparisons if we pick random pivot. That is because if we define $C(n, k)$ is the expected number of comparisons for finding $k$th element in $n$ elements and $C(n) = \max_{1}^{n} C(n, k)$, then during one stage we do $n - 1$ comparisons and uniformly pick any pivot, then even if we pick the biggest part on each step

$\displaystyle C(n) \leq n - 1 + \frac{2}{n}\sum_{n/2}^{n - 1} C(i)$

$\displaystyle C(n) \leq (n - 1) + \mathrm{avg}(C(n/2), \ldots, C(n))$

If assuming by induction that $C(i) \leq 4i$ with an obvious induction base, we get

$C(n) \leq (n - 1) + \mathrm{avg}(C(n/2), \ldots, C(n)) \leq n - 1 + 4(3n/4) < 4n$

Bad news is that the worst case will still be $O(n^2)$ if we are unfortunate and always pick the biggest element as a pivot, thus partitioning .

In that sense that algorithm provides lots of pivot “strategies” that are used nowadays, for example, picking pivot as a $n/2$ element of the array or picking pivot from 3 random elements . Or do like std::nth_element from libcxx — choose the middle out out of $A[0], A[n/2], A[n - 1]$.

I decided to visualize all algorithms I am going to talk about today, so quickselect with a median of 3 strategy on random input looks something like this:

And random pivot out of 3 elements works similar

For a strategy like libcxx (C++ llvm standard library) does, there are quadratic counterexamples that are pretty easy to detect, such patterns also appear in real data. The counterexample looks like that:

This is definitely quadratic. By the way, this is perfectly ok with the C++ standard wording as it says:

## Median of Medians

For a long time, computer scientists thought that it is impossible to find medians in worst-case linear time, however, Blum, Floyd, Pratt, Rivest, Tarjan came up with BFPRT algorithm or like sometimes it is called, median of medians algorithm.

Median of medians algorithm: Given array $A$ of size $n$ and integer $k \leq n$,

1. Group the array into $n/5$ groups of size 5 and find the median of each group. (For simplicity, we will ignore integrality issues.)
2. Recursively, find the true median of the medians. Call this $p$.
3. Use $p$ as a pivot to partition the array.
4. Recurse on the appropriate piece.

When we find the median $p$ of $g = n/5$ groups, at least $g/2$ of them have at least 3 out of 5 elements that are smaller or equal than $p$, that said the biggest out of 2 partitioned chunks have size $7n/10$ and we have the reccurence

$C(n) \leq cn + C(n/5) + C(7n/10)$

If we appropriately build the recurse tree we will see that

This is the geometric series with $cn(1 + 9/10 + (9/10)^2 + (9/10)^3 + \ldots ...)$ which gives us the result $C(n) \leq 10 c n$.

Actually, this constant 10 is really big. For example, if we look a bit closer, $c$ is at least 1 because we need to partition the array, then finding median out of 5 elements cannot be done in less than 6 comparisons (can be proven by only brute-forcing) and in 6 comparisons it can be done in the following way

1. Use three comparisons and shuffle around the numbers so that $a[1] < a[2], a[4] < a[5]$, and $a[1] < a[4]$.
2. If $a[3] > a[2]$, then the problem is fairly easy. If $a[2] < a[4]$, the median value is the smaller of $a[3]$ and $a[4]$. If not, the median value is the smaller of $a[2]$ and $a[5]$.
3. So $a[3] < a[2]$. If $a[3] > a[4]$, then the solution is the smaller of $a[3]$ and $a[5]$. Otherwise, the solution is the smaller of $a[2]$ and $a[4]$.

So that maximum $c$ can be $11/5$ and it gives us the upper bound $22n$ comparisons which looks like it can be achieved. Some other tricks can be done in place to achieve a bit lower constants like $18n$ (for example, sorting arrays of 5 and comparing less afterwards). In practice, the constant is really big and you can see it from the following demonstration which was even fastened because it took quite a few seconds:

## HeapSelect

Another approach to finding $k$th element is to create a heap on an array of size $k$ and push other $n - k$ elements into this heap. C++ std::partial_sort works that way (with additional heap sorting of the first heap). It shows good results for very small $k$ and random/ascending arrays, however starts to significantly degrade with growing $k$ and becomes impractical. Best case $O(n)$, worst $O(n \log k)$, average $O(n \log k)$.

## IntroSelect

As the previous algorithm is not very much practical and QuickSelect is really good on average, in 1997 “Introspective Sorting and Selection Algorithms” from David Musser came out with a sorting algorithm called “IntroSelect”.

IntroSelect works by optimistically starting out with QuickSelect and only switching to MedianOfMedians if it recurses too many times without making sufficient progress. Simply limiting the recursion to constant depth is not good enough, since this would make the algorithm switch on all sufficiently large arrays. Musser discusses a couple of simple approaches:

• Keep track of the list of sizes of the subpartitions processed so far. If at any point $t$ recursive calls have been made without halving the list size, for some small positive $t$, switch to the worst-case linear algorithm.
• Sum the size of all partitions generated so far. If this exceeds the list size times some small positive constant $t$, switch to the worst-case linear algorithm.

This algorithm came into libstdcxx and guess which strategy was chosen? Correct, none of them. Instead, they try $2\log n$ QuickSelect steps and if not successful, fallback to HeapSelect algorithm. So, worst case $O(n \log n)$, average $O(n)$

## PDQSelect

Now that most of the known algorithms come to an end 😈, we can start looking into something special and extraordinary. And the first one to look at is pdqselect which comes pretty straightforward from pdqsort, the algorithm is basically QuickSelect but with some interesting ideas on how to choose an appropriate pivot:

1. If there are $n < 24$ elements, use insertion sort to partition or even sort them. As insertion sort is really fast for a small amount of elements, it is reasonable
2. If it is more, choose $p$ — pivot:
1. If there are less or equal than 128 elements, choose pseudomedian (or “ninther”, or median of medians which are all them same) of the following 3 groups:
1. begin, mid, end
2. begin + 1, mid – 1, end – 1
3. begin + 2, mid + 1, end – 2
2. If there are more than 128 elements, choose median of 3 from begin, mid, end
3. Partition the array by the chosen pivot with avoiding branches:
1. The partition is called bad if it splits less than $1/8n$ elements
2. If the total number of bad partitions exceeds $\log n$, use std::nth_element or any other fallback algorithm and return
3. Otherwise, try to defeat some patterns in the partition by (sizes are l_size and r_size respectively):
1. Swapping begin, begin + l_size / 4
2. Swapping p – 1 and p – l_size / 4
3. And if the number of elements is more than 128
1. begin + 1, begin + l_size / 4 + 1
2. begin + 2, begin + l_size / 4 + 2
3. p – 2, p – l_size / 4 + 1
4. p – 3, p – l_size / 4 + 2
4. Do the same with the right partition
4. Choose the right partition part and repeat like in QuickSelect

## Median Of Ninthers or Alexandrescu algorithm

For a long time, there were no practical improvements in finding $k$th element, and only in 2017 very well recognized among C++ community Andrei Alexandrescu published a paper on Fast Deterministic Selection where worst case median algorithm becomes practical and can be used in real code.

There are 2 key ideas:

• We now find the pseudomedian (or ninther, or median of medians which are all the same) of 9 elements as it was done similarly in pdqsort. Use that partition when $n / 6 \leq k \leq 5n/6$
• Introduce MedianOfMinima for $k < n/6$. MedianOfMedians computes medians of small groups and takes their median to find a pivot approximating the median of the array. In this case, we pursue an order statistic skewed to the left, so instead of the median of each group, we compute its minimum; then, we obtain the pivot by computing the median of those groupwise minima.

$n/6$ is not chosen arbitrarily because in order to preserve the linearity of the algorithm we need to make sure that while recursing on $2k$ elements we partition more than $2k$ elements and thus $n/2 - k > 2k$. MedianOfMaxima is done the same way and for $k > 5n/6$. The resulting algorithm turns out to be the following

Turns out it is a better algorithm than all above (except it did not know about pdqselect) and shows good results. My advice that if you need a deterministic worst-case linear algorithm this one is the best (we will talk about a couple of more randomized algorithms later).

## Small k

All these algorithms are good and linear but they require lots of comparisons, like, minimum $2n$ for all $k$. However, I know a good algorithm for $k = 2$ which requires only $n + \lceil \log n \rceil - 2$ comparisons (I am also not going to prove it is minimal but it is). Let’s quickly revise how it works.

For finding a minimum you just compare linearly the winner with all others and basically the second place can be anyone who lost to the winner, so we need to compare them within each other. Unfortunately, the winner may have won linear number of others and we will not get the desired amount of comparisons. To mitigate this, we need to make a knockout tournament where the winner only plays $\log n$ games like that:

And all we need to do next is to compare all losers to the winner

And any of them can be the second. And we use only $\log n - 1$ comparisons for that.

What can we do to find the third and other elements? Possibly not optimal in comparison count but at least not so bad can follow the strategy:

First set up a binary tree for a knockout tournament on $n - k + 2$ items. (This takes $n - k + 1$ comparisons.) The largest item is greater than $n - k + 1$ others, so it can’t be $k$th largest. Replace it, where it appears at an external node of the tree, by one of the $k - 2$ elements held in reserve, and find the largest element of the resulting $n - k + 2$; this requires at most $\log(n + 2 - k)$ comparisons because we need to recompute only one path in the tree. Repeat this operation $k - 2$ times in all, for each element held in reserve.

It will give us the estimation of $n - k + (k - 1)\log (n + 2 - k)$ comparisons. Assume you need to find top 10 out of millions of long strings and this might be a good solution to this instead of comparing at least $2n$ times. However, it requires additional memory to remember the path of the winner and I currently do not know how to remove it thus making the algorithm impractical because of allocations or additional level of indirections.

At that time my knowledge of selection algorithms ended and I decided to address one known guy.

In The Art of Computer Programming, Volume 3, Sorting and Searching I read almost 100-150 pages in order to understand what the world knows about minimal comparison sorting and selection algorithms and found a pretty interesting one called Floyd-Rivest algorithm. Actually, even Alexandrescu paper cites it but in an unusual way:

## Floyd-Rivest algorithm

This algorithm is essentially the following

The general steps are:

1. Select a small random sample $S$ of size $n^{2/3}$ from the array.
2. From $S$, recursively select two elements, $u$ and $v$, such that $u < v$ (essentially they take $k - n^{1/3}\log^{1/2}n$ and $k + n^{1/3}\log^{1/2}n$). These two elements will be the pivots for the partition and are expected to contain the $k$th smallest element of the entire list between them.
3. Using $u$ and $v$, partition $S$ into three sets (less, more and between).
4. Partition the remaining elements in the array by comparing them to u or v and placing them into the appropriate set. If $k$ is smaller than half the number of the elements in the array, then the remaining elements should be compared to $v$ first and then only to $u$ if they are smaller than $v$. Otherwise, the remaining elements should be compared to $u$ first and only to $v$ if they are greater than $u$.
5. Apply the algorithm recursively to the appropriate set to select the $k$th smallest in the array.

Then in 2004 it was proven that this method (slighly modified in bound selection) will have $n + \min(k, n - k) + O(\sqrt{n \log n})$ comparisons with probability at least $1 - 2n^{-1/2}$ (and the constant in the power can be tuned).

This algorithm tries to find the appropriate subsamples and proves that the $k$th element will be there with high probability.

Yet the worst case of the algorithm is still $O(n^2)$ but it tries to optimize the minimum amount of comparisons on average case, not the worst case.

For small $k$ it is really good as it only used $n + k + o(n)$ comparisons which is significantly better than all “at least $2n$ comparisons” algorithms and even for median it is $3n/2$ which is significantly better.

## What it resulted in

So I decided to code all these algorithms with the C++ standard API and to test against each other and possibly submit to something performance heavy as DBMS for ORDER BY LIMIT N clauses.

I ended up doing miniselect library. For now, it is header-only but I don’t guarantee that in the future, it contains almost all algorithms except the tournament one which is very hard to do in the general case.

I tested on Intel(R) Core(TM) i5-4200H CPU @ 2.80GHz (yeah, a bit old, sorry). We are going to find median and 1000th elements out of $10^7$ and $2^16$ arrays. Benchmark data:

• Ascending
• Descending
• Median3Killer which I described above
• PipeOrgan: half of the values are ascending, the other half is descending
• PushFront: ascending but the smallest element added to the end
• PushMiddle: ascending but the middle element added to the end
• Random
• Random01: random only with values 0 and 1
• Shuffled16: random only with 16 overall values

You can see other values also here.

As you see, Floyd-Rivest outperforms in time all other algorithms and even Alexandrescu in most of the cases. One downside is that Floyd-Rivest performs worse on data where there are many equal elements, that is expected and probably can be fixed as it is pointed out in Kiwiel’s paper.

## Real World

This effort also resulted in a patch to ClickHouse where we got the following benchmarks:

As you see, most of the queries have ORDER BY LIMIT N. Topmost queries were significantly optimized because std::partial_sort works worst when the data is descending, real-world usage as you can see in string_sort benchmarks was optimized by 10-15%. Also, there were many queries that have been optimized by 5% where sorting is not the bottleneck but still, it is nice. It ended up with a 2.5% overall performance boost across all queries.

Other algorithms showed worse performance and you can see it from the benchmarks above. So, now Floyd-Rivest is used in production, and for a good reason. But, of course, it does not diminish the results of Mr. Alexandrescu and his contributions are very valuable when it comes to determinism and worst case.

## Library

I publish miniselect under Boost License for everybody to use and adjust to the projects. I spent several days in the last week trying to make it work everywhere (except Windows for now but it should be good). We support C++11 and further with GCC 7+ and Clang 6+. We carefully test the algorithms for standard compliance, with fuzzing, sanitizers, etc. Fuzzing managed to find bugs that I thought only at the last moment so it really helps.

If you want to use it, please read the API but it should be an easy sed/perl/regex replacement with zero issues except for the ties between elements might resolve in a different way, however, C++ standard says it is ok and you should not rely on that.

Any contributions and other algorithms that I might miss are highly welcome, I intend to get this library as a reference to many implementations of selection algorithms so that the users can try and choose the best options for them. (Should I do the same for sorting algorithms?)

## Conclusion

This was a long story of me trying to find the solutions to the problems that were puzzling me for the past month or so. I probably read 300+ pages of math/algorithms/discussions/forums to finally find everything that the world knows about it and to make it work for real-world applications with huge performance benefits. I think I can come up with something better in a long term after researching this stuff for a while longer but I will let you know in this blog if anything arises 🙂

#### References I find useful (together with zero mentions in the post)

1. Selection Algorithms with Small Groups where the Median of Medians algorithms are a little bit twisted for groups of size 3 and 4 with linear time worst-case guarantee
2. Understandable proof of second biggest element minimum comparisons bound
3. On Floyd and Rivest’s SELECT algorithm by Krzysztof C. Kiwiel
4. The Art of Computer Programming, Volume 3, Sorting and Searching, Chapter about Minimum-Comparison Sorting and Selection
5. Original Floyd-Rivest paper
6. Simple proof of a bit worse bound for Floyd-Rivest algorithm by Sam Buss
7. Fast Deterministic Selection by Andrei Alexandrescu
8. An efficient algorithm for the approximate median selection problem by S. Battiato, D. Cantone, D. Catalano and G. Cincotti
9. An efficient implementation of Blum, Floyd, Pratt, Rivest, and Tarjan’s worst-case linear selection algorithm
10. Selection from read-only memory and sorting with minimum data movement by J. Ian Munro, Venkatesh Raman

## I need extra C/C++ performance now. How?

Hello everyone, today we are going to talk about C++ performance. Again, another “usual” perf blog as you may think. I promise to give really practical advice and don’t get bullshitty (okay, a little bit) talks about how C++ is performant. We are gonna roast the compilers and think about what we should do about it (or should we?). Let’s do a bit of fact-checking:

## Fact #1. Compilers are smarter than the programmers

### Resolution: it’s exaggerated

Of course, we do have really awesome compiler infrastructure as GNU or LLVM, they are doing a massively great job in optimizing programs. Hundreds of PhDs were wasted done on researching that and what the community achieved is flabbergastingly.

However, the problem of finding the optimal program is undecidable, and nothing we can do about it. Even small cases of invariant loads are not simplified, after looking at the examples below it seems that compilers cannot do anything useful:

None of the expressions were optimized to one load but they clearly should be. I was always thinking that compilers should find all possible invariants in the code and simplify the constructions to minimize the execution cost.

Compilers cannot understand the complex properties of the ranges being sorted or increasing which can help understanding more structures. Actually, that’s because of the compilation times — understanding the sorted property is rarely used and is not so easy to check or make the programmers annotate.

Also, of course, no compiler will optimize your quadratic sort into a O(n log n) sort because it is too much for them to understand what you really want and other factors can play a role — for example, the stability of sorting which affects the order of equal elements, etc.

## Fact #2. C/C++ produces really fast code

### Resolution: false

I don’t agree that C/C++ itself provides the best performance in the world, the only thing I am convinced in: C/C++ provides you the control over your performance. When you really really want to find the last nanoseconds in your program, C++ is a great tool to achieve this, for example, you can look at my blog entry on how I managed to put 16 byte atomics in the Abseil, doing it in other languages require much more time and not that expressive.

## Fact #3. inline keyword is useless when concerning C/C++ performance

### Resolution: partly but false

I heard that so many times that inline keyword is useless when it comes to C/C++ performance. Actually, it has a bit of history, inline really does not mean that function should be “inlined” in the code, it is not even a hint, it is a keyword that solves the following problem: assume you have two cpp files with the include of any entity that might have a one definition rule conflict: it might be a function or even variable, then you will get a compiler error, inline keyword promises that it is ok to duplicate the function in different translation units and have the same address.

The usual understanding of the performance of inline functions comes from the C standard

A function declared with an inline function specifier is an inline function … Making a function an inline function suggests that calls to the function be as fast as possible.(118) The extent to which such suggestions are effective is implementation-defined.(119)

Page 112 of ISO/IEC 9899:TC2

So, let’s once for and forever for the most used cases get:

• constexpr functions are implicitly “inline”d
• general template functions are implicitly “inline”d
• function definitions in classes are implicitly “inline”d
• static constexpr variables in classes are “inline”d from C++17

Don’t put inline specifiers for these entities and force your developers to remove them within the code review, educate them, put in styleguides, etc.

Ok, inline is good but did you know that inline functions have a better threshold for inlining? For example, in clang, functions marked as inline have a better threshold in actual code inlining than usual functions, for example, see this:

So, inline keyword might help only in free functions within one translation unit, otherwise you either need to mark the function inline or it will be implicitly that way. And yes, it does help the compilers to optimize your programs. Yet, if you are having an interview and somebody asks about the inline keyword, tell about the performance because most of the interviewers expect this and as we all know, interviews are not about the actual knowledge, they are about understanding what people want to hear and find out about you. I actually once told the whole lecture about inline keyword to the interviewer and they were kind of impressed but all they wanted to hear is that it helps the performance. Unfortunately, the truth is more complex and boring.

Though the inlining in the compilers makes a big deal in optimizations and code size and we will look at this a bit later.

## Fact #4. -march=native gives a good stable performance boost

### Resolution: yeah, probably true but …

These flags allow you to tell the compilers what kind of hardware you have and optimize the program specifically for this type of hardware.

But it is not portable, once you compile with that flag you must verify it does not do any harm to the hardware you are running your code with. Or you compiled the code on the right hardware.

I personally don’t recommend using this flag anywhere as it will 100% break things and in my experience, it does not provide lots of benefits for already well-tuned programs. Most of the programs are SSE optimized and using AVX-AVX-512 can cause real troubles within the transition as explained in this paper. Well, if you are sure, use AVX everywhere and make sure non of the SSE code follows your critical path or the benefits from AVX are significant.

## Enough talking, give me perf

In recent times I really found myself repeating to many different projects the same thing and they all helped to gain around 5-15% performance boost just in a few lines of toolchain settings but first, a little bit of overview

### GCC vs Clang

I personally don’t believe in a fast moving future for GCC toolchain and for now this compiler wins mostly for supporting a huge variety of platforms. Yet, the overwhelming majority of all machines are now one of x86, ARM, Power, RISC-V and both compilers support these architectures. Clang is giving more and more competition, for example, Linux kernel can be built with clang and is used in some 6 letter companies and Android phones. Clang is battle-tested continuously from head within at least 3 companies out of FAANG which contain around a billion (rough estimation) of C/C++ code overall.

I do want to admit that GCC became much better than it was several years ago with better diagnostics, tooling, etc. Yet, even later news for GCC such as a huge bug in memcmp optimizations which, for example, already can contain at least 10 bugs in your default OS software. As I understand correctly, the bug was fixed after 3-5 weeks of identifying which is already an enormous amount of time for such a bug to be alive. People suggest using the option -fno-builtin but this will definitely lead to huge performance regressions, for example, all memcpy small loads are going to be replaced by a call and even the code in the fastest decompression algorithm LZ4 will become less efficient: in the upper code memcpy is replaced by a 16 byte vectorized load, in the bottom picture this is an entire jump to memcpy which can be a huge penalty

For example, in my experiments even -fno-builtin-memcpy can cause up to 10% performance penalty, for example in ClickHouse benchmark.

The situation with memcmp is pretty much the same but if you definitely need to mitigate the bug — update the compiler, use Clang or at the very least use -fno-builtin-memcmp.

Overall it means that GCC was once extremely unlucky with the bug and no prerelease software caught it. There are likely several reasons for that: hard to test continuously GCC from head, hard to build the compiler overall (against LLVM where you just write 1 command to build the whole toolchain).

### GCC vs Clang tooling

I feel that Clang is an absolute winner here. Sanitizers are much more convenient to use in the ecosystem, static analysis tools together with the API which help the code to evolve and recover from mistakes.

### GCC vs Clang perf

Still, there are debates about what is faster between these two compilers. Phoronix benchmarks are the good estimations of what is happening on a broad range of software. Want to admit that compile times for Clang are significantly faster than GCC ones, especially on C++ code. In my practice I see around 10-15% faster compile times which can be used for cross-module optimizations such as ThinLTO (incremental and scalable link time optimizations).

Yet, with clang-11 and a little bit of my help we finally managed to achieve the parity between gcc and clang in performance at ClickHouse:

Most of the performance came from just updating the compiler, however the last 0.5% we got from ThinLTO and the compile time was still faster than GCC without any cross-module optimization

So, after that I decided to go a bit further, and with a couple of lines of toolchain code to experiment with the main ClickHouse binary, I managed to get an extra 4% performance. Let’s talk about how to achieve this.

## Clang perf options I recommend

• -O3 by default, obviously
• -flto=thin. This is a scalable cross module optimization ThinLTO. As a bonus we also got the binary size reduction from 1920mb to 1780mb and after that GCC was not faster than Clang. In my practice, this should give from 1% to 5% of performance and it always decreases the binary size and consumes less memory during the compilation than LTO which can easily OOM and painfully slow. By the way, here is a good presentation on ThinLTO and the memory consumption vs full link time optimization is the following

And compile times

As it is scalable, you can easily put it in your CI or even build once in a while for release to check that it is fine. In my practice, ThinLTO provides almost the same performance optimization and is much more stable. For ClickHouse it gave 0.5% performance gain and smaller binary.

• -fexperimental-new-pass-manager. LLVM IR optimizations consist of passes that transform code trying to reduce the overall execution cost. The old pass manager was just containing a list of passes and running them in a hardcoded order, the new pass manager does a bit more complex things such as pass caching, analysis passes to determine better inlining and order. More info you can find in this video and in these two pdfs (one, two). The development of new pass manager finished in 2017 and Chandler Carruth sent the RFC at that time, since then it has been used at Google permanently and polished by the compiler developers. In August 2019 it was once again to be said to be by default and as of October 2020 it is still not by default but the bitflip diff is ready. In my practice that gives 5-10% compile time reduction and 0-1% performance. In ClickHouse together with ThinLTO it produced 1% performance in total, 7% compile time reduction and 3% binary size reduction from 1780mb to 1737mb. I highly recommend using this option and to test it now — you can possibly catch the latest bugs and it will be by default anyway in the near future (I really hope so).
• -mllvm -inline-threshold=1000. This option controls the actual inlining threshold. In clang by default, this number is 225. Inlining turns out to be really important even for middle size functions. In my experience this really performs well for numbers around 500-2000 and 1000 is somewhat a sweet spot. This, of course, will increase the binary size and compile times by a fair amount of percent (10-30%), yet it is a universal option to find the performance here and now if you really need it. From my experience it gives from 3% to 10% performance wins. In ClickHouse it gave exactly 4% performance win:

It’s up to you to decide if you are able to trade the compile time and binary size for extra several percent of performance.

• A bit tricky one: Use libcxx as a standard library and define the macros _LIBCPP_ABI_ENABLE_UNIQUE_PTR_TRIVIAL_ABI, _LIBCPP_ABI_ENABLE_SHARED_PTR_TRIVIAL_ABI

In C++, unfortunately, smart pointers come with an additional cost, for example:

In SysV ABI it is said, that if a C++ class has a non-trivial destructor or assignment operator then it should be passed by an invisible pointer. Because of that smart pointers are not so cheap. You can try to change it by assigning the trivial ABI to these pointers and Google added the macro option in libcxx to do this. Some consequences of that will be:

The last failure is a bit more interesting, in order to guarantee the triviality, unfortunately, you need to call the destructor in a bit different way. Currently, the following would “work” because the argument to run_worker() is deleted at the end of func() With the new calling convention, it will be deleted at the end of run_worker(), making this an access to freed memory.

Google saw 1.6% performance gains fleetwide and only several failures (out of hundreds of millions of code) which can be easily fixed. Also, the binary sizes reduced by several percent.

Overall I haven’t tried these macros yet in ClickHouse for a couple of failures but switching to libcxx itself gave 2% performance boost:

In total, I believe you can find from 5 to 15 performance boost easily by adopting the correct options, fresh libraries and adequate code options.

I don’t recommend -ffast-math if you are not sure in your precision calculations but in some scientific heavy computing programs I’ve seen that to find another percent. Also, I don’t recommend -Ofast for the same reasons: the optimizations can be unsafe there. I read not so long ago an article Spying on the floating point behavior of existing, unmodified scientific applications which only proves that unsafe operations with floats do happen in any sophisticated software.

## Conclusion

Of course, nothing will help you if you use quadratic algorithms instead of linear and the best performance optimizations come from algorithmic approaches rather than the compiler ones. However, I found myself telling about these options to several teams and people so that I decided to put the knowledge right there for a broader audience to know. What to use, how to trade compile time vs run time is up to you but always think about the time of the developers, ease of migrations, usage, etc.

For GCC I haven’t found such flexibility but possibly some of them are hidden in the documentation under the --param option. I personally believe that happens because people don’t usually tell or try them relying on the compiler a lot like a black box and a source of truth. However, as in every complex system, the truth is more complicated (and boring) and compiler engineers are also not the ones who are writing perfect code or are able to tell about their research to massive public.

I personally feel that LLVM infrastructure is going to stick with me for quite some time because it is just easier, more modern, and provides the ability to move faster than GCC where I am always lost to find the code I need.

## News Aggregator from Scratch in 2 Weeks

Data Clustering Contest was a two-week data science and engineering competition held by Telegram in May. The Telegram team wanted participants to build a simple yet effective news aggregator that consolidates thousands of articles from various publishers and websites into a single page, which shows the latest news in real time, exactly like Google NewsBing News or Yandex News (with the Russian language inclining) do.

This was the second part of the competition whenever I totally missed the first part — and maybe for the better. I highly recommend reading the report from my Yandex ex-colleagues on how they placed 3rd place under the pseudonym Mindful Squirrel. Fortunately, now they made the worthy first place. I won the third place in this competition with the prize of 7000 Euros under the nickname Mindful Kitten (to be honest, it was quite nice that they named us, the contestants, this way 🙂 ).

## My Initial Goal

I decided to try my skills given the fact I’ve been working on the search engine for quite some time. I want to be absolutely clear. I am not a Machine Learning engineer

when you are working in the search engines, you need to know at least how text/language ML works. Overall, you will not find many cool ideas right here about the ML itself but I tried to make the infrastructure around it fast. One of the main criterias for the contest was speed and I will show how I achieved the pretty fast results except server indexing (server response was one of the fastest among contestants) and why the latter was much slower.

## Requirements

There were two parts of the contest — static and dynamic one. In the static one (almost the same as in the first stage of the contest) we needed to detect the language of the articles, separate news from non news, detect the categories and combine the similar news into threads and rank the articles in the threads.

In the dynamic one we needed to create a server which accepts the articles, detects their language, if it is news, detect the category, cluster news in threads and respond to the queries of the most relevant news for the given time, language and category.

Server was tested on the x86_64 Debian machine with 8 CPUs, 16GB of RAM, HDD. The max requested time can be 30 days. The current server time is supposed to be the maximum article time (not news) in the index: it even means that the time can go backwards if we delete the most recent article from the index.

The final solution should be a binary without the connection to a network, has very few dependencies, no root (for example, this makes it harder to use some “fancy” databases like redis, etc).

## Solution for a Static Version

Overall I was extensively relying on a fastText library. It is the library for learning word embeddings and training text classification models created by Facebook’s AI Research in 2016. The whole submission code was written in C++17 and some training was done through Python. But let’s start with something simpler. For example, thread management.

As in the static version there are so many tasks that can be parallelized, I took a pretty decent implementation of the thread pool and patched it with my lock free queue (thanks to my university where we had to write it) which resulted in a slightly better performance rather than locking the mutex. (5.6k vs 5.9k tasks per second).

## Parsing

All data consisted of HTML files from Telegram Instant View which parses most popular websites (around 5000+ of them) and provides a very lightweight HTML that is easy to read, load and parse.

There are several HTML parses in the open source community. Most popular one is libxml2. However, I found this was not an efficient one and it was around 3 times slower than Google’s Gumbo Parser. It is a parser which was tested against Google Search with over 2.5 billion pages which makes me confident that if there are some problems with the text extraction, they are definitely the Telegram’s problems. And in the end I found a couple of interesting surprises.

Like in many other HTMLs I filtered out <script>, <style>, <figure>, <iframe>, <address> tags to get the raw text which surprisingly turned out to be not enough. In the picture you can see what happened after that.

For some reason <iframe> tag was self-closing that resulted in such a mess. By HTML5 standard it is not a self-closing element and all browsers add a closing element to iframe. As I did not want to rewrite in that way because it would require one additional pass through the text, I decided (of course) to patch Gumbo library to meet the standards of the Telegram contest. Luckily, there are many self-closing tags and in the end it was a one line change

After that the text was correctly handled out.

Also, after some perf investigation, I found that around 15-20% of all the parsing was spent in function calls which were not inlined by the compiler. My final perf patch you can see here and I managed to come up with a 20% faster parsing algorithm: on average, each article was parsed in 0.4ms per thread which resulted in ~0.055ms per 8 threads or, alternatively, around 16-18k articles per second. I also believe I was lucky and HDD stored the articles in the directory consecutively where I managed to remove the latency of the HDD seeks (which are typically 8-12ms) or page cache made the job which is a more likely hypothesis.

During the parsing I also extracted the og:url, og:title, og:description and published_time attributes for the future use. It is not an interesting part though worth noting.

## Language Detection

I was very interested how Chromium detects the languages of the pages offline and with respect to the fact it is written in C++, I was interested if I could do something similar. Finally I found CLD3 (Compact Language Detector 3) library which is a neural network model. It extracts the ngrams of the words or text and predicts with the 3 layers of the network which language is that.

I failed to find how accurate it was and went to read some papers about language detection and found out a nice paper Language Identification on Massive Datasets of Short Message using an Attention Mechanism CNN. The benchmarks and the accuracy of the models for short messages were provided and it helped me a lot with the decision

Unfortunately, the paper approach is hidden right now and I decided to look at CLD2 as it is 7.5 times faster than CLD3 and has even better accuracy. Though, I think the migration from CLD2 to CLD3 happened because of the accuracy for the web pages, though my testing of CLD2 against CLD3 showed 0.03% difference and almost all articles were highly debatable whether they are English or not because of the mess written inside.

CLD2 is a pretty abandoned repository with the latest commit 5 years ago and mostly is presented for historical reasons. CLD2 is a Naïve Bayesian classifier, using a combination of several different token algorithms. For most of the languages, sequences of four letters (quadgrams) are scored. For some big languages with lots of symbols the unigrams are taken. It is noted that “the design target is web pages of at least 200 characters (about two sentences); CLD2 is not designed to do well on very short text, lists of proper names, part numbers, etc”.

I decided to proceed with CLD2 in the end and a couple of surprises were on my way. I misread the initial statement and thought that detecting all languages was required. As it was 5 years old, some languages changed their ISO-639-1 code, for example iw -> he, jw -> jv, in -> id and mo -> ro. In order to stop language detection from failing, I was checking with the static table that language is inside the ISO-639-1 known languages. Unfortunately, there were articles with the languages that are not presented in ISO-639-1 and were only in ISO-639-2, for example, cebuano (which has a code ceb). In the end I gave up and was not printing them.

Also, I created a task in Yandex.Toloka to measure the precision and recall of my classifier. Toloka is a Russian crowdsourcing platform, just like Amazon Mechanical Turk. It turned out that they were more than 99.2% for English and Russian and I was pretty confident the results were ok. To fix several other issues I decided to do a cross validation only with the title and description of the article. Finally, the optimal algorithm for the problem was: if CLD2 is not sure (it can provide you the level of sureness) about the English language text, try out the title and description, otherwise use the first language. The metrics showed around 99.35% recall and precision for English and other articles were quite indecisive with some mess inside. That was a huge win. Performance of the library is extraordinary, it is around 0.2-0.4ms per thread per article.

In the end my language detection turned out to be the fastest across 3 days with less than 0.1ms per article. Yet, I don’t claim it is the best, other contestants might have better quality but no one knows how telegram evaluates the submissions.

## News Filtering and Topic Classification

I decided to combine these two tasks into one and just added an additional label to the article as NotNews. The learning is almost exactly the same to what Mindful Squirrel did in the first stage yet I optimized the code and reestablished the models, i.e. was using the new learned ones.

I again utilized Yandex.Toloka and (a little bit) Amazon Turk to label a fraction of input data with the labels of categorization. I extended the dataset with publicly available data from the first contest, from the Ria.ru and Lenta.ru news dataset for the Russian language and the ones from BBC and HuffPost for English (thanks to Kaggle contests).

In the end I pushed around 50k articles to train for, cross validated with other submissions of the first round available on Github and got a pretty good model. Russian quality improved a lot but the English one did not have a super advance (especially in detecting other and not news labels). Overall my schema was looking something like that:

With the labeled data I trained the fastText classifier. Fortunately, engineers from Facebook made this library pretty fast with all SIMD accelerations so it looked pretty good for me. Unfortunately, I almost got caught to have a penalty in the contest because it is built with -march=native flag which includes all AVX-512 extensions on my machine. Likely they are not presented in the data-centers which Telegram is using so I decided to be more conservative and only used the extensions up to SSE4.2 and lost possibly at most 2-3% of the performance which is not bad.

Unfortunately, I did not make the first place in terms of speed here but the submissions in top 3 did not have a (reasonably working) server so it gave me a pretty high chance to be one of the best for those who had servers.

In categories detection it was better, second place speed in EN categories (with a pretty decent pre final result):

And a tie (but slightly better) in RU categories.

Also, worth noting, before applying the model, please, tokenize the text and remove all punctuation marks, etc. I lost several hours understanding what was wrong in some places.

In this part the contestants needed to combine similar news into the threads and rank the news inside them accordingly.

This is the part where I am going to talk about word embeddings. It is no secret that most of the text ML produces the embeddings and it is learned the way that if the distance between two embeddings is small, then the articles/news/pages are similar to each other. There are different types of “contextual” embeddings, one of the most popular ones are Word2Vec, DSSM, Sentence-BERT embeddings. I was not satisfied with the Word2Vec approach at all possibly because I failed to tune it properly. Fortunately, I felt pretty safe because I knew that average, min, max over word vectors of the fastText works pretty nice, in the end I retrained the model with the triplet loss function using a Siamese neural network from the Mindful Squirrel’s solution. Here I don’t understand the full magic behind the scenes, yet I only wanted to have at least some quality with the follow-up autotuning.

This model was pretty simple in the end and was using dot products which are highly optimized in the Eigen library.

In order to combine news in threads, I used SLINK algorithm: assume all nodes are 1-norm vectors and the edges between nodes are the L2 distances between them, then if we have a threshold $\alpha$, in $O(n^2)$ we can find the graph components with the property that graph components are disconnected iff the distance between their nodes in the bipartite graph is more than $\alpha$.

Unfortunately, it may lead to overpopulation in threading, for example, if the first node connected to the second one, and the second to the third one, it may happen that the first and the third are quite different. I firstly thought this not a big problem but in the end all COVID, accident news, earnings call reports had been combined together:

In the picture above news articles are connected by the topic COVID-19 but in the end it becomes a pretty huge cluster which is not easy to deal with. Though it is not accurate in some sense, I think this helped me to avoid many clusters with COVID news that are pretty irrelevant to each other and have 1 big thread talking about COVID overall.

Also, the time complexity was pretty huge for dealing with the hundreds of thousand of news and I decided to apply the algorithm with a time sliding window of 15000 docs and 3000 docs in the overlap. I assumed that if the news is not popular, it is very unlikely they are repeated after a long time.

Within the thread, the news articles are ranked by their PageRank weight and the time between the most recent article in the thread. Actually, there are pretty few really good news agencies so it was even a matter of human intervention to boost some very authoritative newspapers. Also, one of the weights was the mean similarity with other news which was pretty easy to recompute in real time but we will talk a bit later about it.

In the end, threads clustering turned out to be one of the fastest among contestants

## Dynamic Part

As you may already forget, we also needed to write a server that indexes the news and responses with the best threads within the given range. Just a reminder what we should do

It means we need to index all upcoming news in real time. The rules of the Telegram contest were vague in terms of what consistency we should provide, for example it was absolutely not clear if we can produce news after some time, meaning if we can update the top news once in a while like all aggregators do. Unfortunately, some of the solutions were tested against this “common sense”, the top news requests were coming right after the requests when all news were indexed.

## Indexing

First thing to note, that updating the article is basically a deletion+insertion request, so this was a pretty easy reduction. Unfortunately, that also increased the complexity — I could not easily parallelize the requests by category or even by language because with the article update, it may change even the language — and I needed to provide strong consistency which resulted in me using a global lock. I could not split the request into two because in the middle there can be another request modifying the article and the result can be skewed or not expected. The modifying and deletion requests were tested by Telegram but from the logs I obtained not a single article changed its language, possibly it was just done if the article was updated.

Current time of the server is the maximum time of the article (not news) so I needed to store metadata for all articles. It also means that the time can go backwards with the articles deletion so it also increased the complexity that I cannot easily update the current time in one atomic operation. Was this tested by Telegram? I doubt.

Indexing news consisted of one more parameter — TTL (Time To Live), it was stated that if TTL + article published time passed the current time, the article should be deleted. With the consequence that the time can go backwards, I needed to store all articles either way.

## How to Store?

First I thought of using something external like ExtremeDB and/or SQLite3 but their locking and parallelisation mechanisms turned out to be really slow for my needs. Also, their C++ APIs are not flexible to use and require some time for careful investigation.

To store documents I used protobufs as I love this format a lot for its simplicity and I worked in the companies where protobuf is heavily used. I honestly believe this format is hundreds times better than JSON and is one of the best statically typed formats to use.

Also, one more requirement to the contestants was that if the server restarts, it should load the index and respond like usual. That’s why it was important for me to store embeddings rather than text: to combine news into threads I only needed them. Also Telegram did not require store text of the article in any form so I saved lots of CPU and disk resources. Upon the restart, I used the static schema solution already knowing the category and embeds of the news.

Actually I thought I could proceed with something really simple. I did not store all in one file, all docs were sharded. When the doc comes and I need to store it, I choose 2 random shards to store and pick the one who has the lowest number of docs, store it. If I needed to extract the document, I just needed to do two lookups. Why have I chosen two random shards? That’s because in one random shard strategy the maximum number of the docs in one shard (given that the number of articles is $m$ and the number of shards is $n$) will be around $m/n + O(\frac{\sqrt{m \log n}}{n})$ and with the power of 2 choices it will be $m/n + O(\log \log n)$ with high probability, so we significantly improving the load balancing between shards. It significantly improves the bound and still allows it to be almost lock free in the shard’s choice. Given the fact there was around 10 QPS in the tests, it is almost optimal.

As a hash key I used the names of the files. As the number of shards was 10000, I found that reasonable for a monthly number of articles (up to 10mln).

I highly prioritized the read responses as in the news it is ok to wait for 5-10 minutes to propagate the indexing requests but it is very user facing to respond as fast as possible. As you can see, I was top two in response times. Mad Crow, unfortunately, had hardly any working server overall and I believe I got additional points because of that decision.

I decided to store globally for all time all the threads that I have. Partially because I was paranoid of the fact that the time can go backwards, because of strong consistency, etc.

To find a thread for the article I decided to use k-NN approaches (k nearest neighbors) and especially HNSW (Hierarchical Navigable Small World) similarity embedding search algorithm. According to ANN benchmarks it is one of the fastest algorithms. Yet, right after the contest Google published its own version of nearest neighbor search called scANN which in my opinion has the advantage that it utilizes the SIMD instructions better with its design and overall it is really faster, maybe for 20-30%. HNSW was more than enough for me, also I quite nicely understand how it works. Actually…

## How does HNSW work?

A naive solution to the k-NN problem is to compute the distances from a given query point to every data point within an index and then select the data points with the smallest k distances. It does not scale to the sizes of hundreds of thousands of vectors. An approximate k-NN algorithm may greatly reduce search latency at the cost of precision.

The HNSW algorithm focuses on reducing the number of comparisons by building a graph data structure on the constituent points of the data set.

With a graph data structure on the data set, approximate nearest neighbors can be found using graph traversal methods. Given a query point, we find its nearest neighbors by starting at a random point in the graph and computing its distance to the query point. From this entry point, we explore the graph, computing the distance to the query of each newly visited data point until the traversal can find no closer data points. To compute fewer distances while still retaining high accuracy, the HNSW algorithm builds on top of previous work on Navigable Small World (NSW) graphs. The NSW algorithm builds a graph with two key properties. The “small world” property is such that the number of edges in the shortest path between any pair of points grows poly-logarithmically with the number of points in the graph. The “navigable” property asserts that the greedy algorithm is likely to stay on this shortest path. Combining these two properties results in a graph structure so the greedy algorithm is likely to find the nearest data point to a query in logarithmic time.

HNSW extends the NSW algorithm by building multiple layers of interconnected NSW-like graphs. The top layer is a coarse graph built on a small subset of the data points in the index. Each lower layer incorporates more points in its graph until reaching the bottom layer, which consists of an NSW-like graph on every data point. To find the approximate nearest neighbors to a query, the search process finds the nearest neighbors in the graph at the top layer and uses these points as the entry points to the subsequent layer. This strategy results in a nearest neighbors search algorithm which runs logarithmically with respect to the number of data points in the index.

For full graphs it is working really nice and easily provides the insertion queries. Yet, deletion is a bit more complex because you need to somehow deal with the NSW layers and it might take some time to reconstruct them.

I utilized the online-hnsw library in order to support insertions, deletions and k-NN queries. But in a bit of an interesting way.

As you remember, SLINK clustering is transitive, it just puts two vertices in one cluster if there is a path between them where all edges are of weight $\leq \alpha$. In order to preserve this property I decided to do the following:

I find k nearest neighbors and when I stop finding by the same threshold as in the static version, I combine all threads in one. It guarantees that eventually we will have the cluster as in the SLINK algorithm (if k-NN search is accurate, of course). Approximately it works almost perfectly and within a small time all relevant articles are combined together.

When combining, I need to do several things, for example, to update the similarity matrix which is basically done incrementally:

This recalculation method is working very nicely when we merge big clusters with very small ones. Unfortunately, I made a very silly bug there, and in the submission and full matrix was recalculated which resulted in this after 9 days of server uptime.

That basically means that huge clusters (like COVID) were constantly updated and matrix multiplication was taking too long to update the similarity values. I also put the restriction on the cluster size but because of another bug it worked only when one document is merged with the big cluster.

Threads ranking is basically the sum of the pagerank weights of the articles inside plus small constant for all other news agencies. Small threads up to 5 docs are linearly penalized which works very good for small ranges of time (for example, show the news for the past 1 hour). News in clusters are sorted by sigmoid of the most recent time plus mean of the similarity matrix.

We needed to show the most relevant news depending on the given time range since the latest article (in the tests there were 1 hour, 3 hours and 1 day, in the statement it could be up to 30 days). Also, people can request categorized news like Science news for the last 1 hour in Russian language.

As I was updating the threads in real time, I just was looking up through top $T$ threads and was filtering out all the news that are not in line with the request (not the requested category, language, expired or not in the time-frame). Old news will be filtered out, new news remain and are likely relevant, ordering in the cluster preserves the same, title of thread is the title of the most relevant news (to avoid any copyright or misleading issues in the future 🙂 ), thread relevance is recalculated as it is linear in time. $T$ was set to 50000 as I thought that there could not be more than 1000-2000 news a day (and likely hundreds) and we should response for the maximum of 1 month.

Unfortunately, when the time range is very small, it could happen that there is no very new news in the top $T$ threads. In that case I was looking for the past $T$ news and their associated threads so that I can fulfill the request at least with something. Worked perfectly for the seconds/minutes ranges.

So, the news responses were very fast because the threads are updated in real time and the threads are precomputed for the response. Also, I was not including more than 100 threads in the response because starting somewhere from the 50-60th threads, all threads contain only one article so that it does not make sense to show. Also, it helped to improve the performance a lot without any loss in the quality (cmon, no one looks at more than 10-20 news at once).

## How I tested

The solutions were tested on pure Debian, I rented for 3-4 days the Digital Ocean Debian machine in order to better understand the performance of my application.

I wrote a bunch of tests for the static part, unfortunately, the dynamic part was written once and tested with my eyes. Build system was CMake and it was fairly easy to statically include all the projects I needed — gumbo, online-hnsw, abseil, http server, thread pool, protobuf, boost. Of course, I used all kinds of sanitizers to catch any issues which is basically a mandate in the modern C++ world.

Unfortunately, I was not able to benefit the server from multithreading because I could not do it for languages as the article may change its language and the judges may test the skew issues. So on each mutating request I acquired WriteLock, on each response request ReadLock with the higher priority not to block the clients.

## Conclusion

Whole application was written in C++17 when learning algorithms were in Python, I spent around 5-6 days on the static part and then 8-9 days on the server part 4 of which I was debugging all the cases that I am correctly updating the pointers to threads, thread combining, etc. Of course, I made a couple of bugs that I described above.

## What I liked

Finally some engineering contest and not sports-programming or pure ML competition on Kaggle. I liked that we were given some data to test on. I enjoyed writing C++ code and CMake was very easy to set up (I admit that maybe it is already easy for me as I spent enormous time in the past doing that kind of job). The problem was good and novel enough to work on from scratch.

Prize pool also was quite nice, I was immediately communicated to receive my prize.

## What I did not like

Absolutely no transparency on how we can communicate with the judges about the questions that arise. For example, nothing was said about which type of disk would be used or if we could delay the index updates (turned out to be that no or not always).

No transparency in how the evaluation or testing process happens. Likely they hired many volunteers to outsource the categorization and thread quality. Yet, we did not know how with which instructions and why they are silent about this. Incremental process of evaluation would be very fun to watch, for example.

It was unclear if we can work together or not. I did not work with anybody but I would definitely want to have an ML partner to flesh out and look for bugs. Some other teams were of 4 and more people which is a bit unfair(?).

Also it would be nice to have at least some statistics from Telegram on the data, at least the view count that they have in Instant View. With the raw data it was much harder to get what is relevant and farm the data for the training stages.

Unfortunately, I cannot open source the solution for now because I need to get an approval from my employer but as soon as I get it, I will probably open source that crunchy code 🙂 .

The second place in the contest can be placed without the server side which is quite strange, even in the final message the judges said: “Clustering tasks similar to the previous round (sorting by language, news or not, categories and threads) carried a lower weight. Overall speed of the algorithms influenced the final score.”

## True Conclusion

I managed to combine lots of ideas into one place and compete with many other contestants from ML sphere quite a little. I am satisfied with my results and looking forward to participating in other contests from Telegram. Of course, I am not satisfied with the code quality and what I haven’t managed to try or flesh out.

Also, this type of work gave me a good personal understanding that I show myself best when I have competition and/or tighten up by the deadlines. In the end, I show at least something acceptable.

My Telegram channel (in Russian)

## Optimizing 128-bit Division

When it comes to hashing, sometimes 64 bit is not enough, for example, because of birthday paradox — the hacker can iterate through random $2^{32}$ entities and it can be proven that with some constant probability they will find a collision, i.e. two different objects will have the same hash. $2^{32}$ is around 4 billion objects and with the current power capacity in each computer it is certainly achievable. That’s why we need sometimes to advance the bitness of hashing to at least 128 bits. Unfortunately, it comes with a cost because platforms and CPUs do not support 128 bit operations natively.

Division historically is the most complex operation on CPUs and all guidelines suggest avoiding the division at all costs.

At my job I faced an interesting problem of optimizing 128 bit division from abseil library in order to split some data across buckets with the help of 128 bit hashing (the number of buckets is not fixed for some uninteresting historical reasons). I found out that the division takes a really long time. The benchmarks from abseil on Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz show some horrible results

Benchmark                       Time(ns)  CPU(ns)
BM_DivideClass128UniformDivisor     13.8     13.8  // 128 bit by 128 bit
BM_DivideClass128SmallDivisor        168      168  // 128 bit by 64 bit

150 nanoseconds for dividing the random 128 bit number by a random 64 bit number? Sounds crazy. For example, div instruction on x86-64 Skylake takes 76 cycles (also, for AMD processors it is much less), the division takes around 20-22ns.

In reality everything is slightly better because of pipeline execution and division has its own ALU, so if you divide something and do something else in the next instructions, you will get lower average latency. Still, 128 bit division cannot be 8x slower than 64 bit division. All latencies you can find in Agner Fog instruction table for most of the modern x86 CPUs. The truth is more complex and division latency can even depend on the values given.

Even compilers when dividing by some constants, try to use the reciprocal (or, the same as inverse in a ring) value and multiply the reciprocal and the value with some shifts afterwards

Overall, given the fact that only some sin, cos instructions cost more than division, division is one of the most complex instructions in CPUs and optimizations in that place matter a lot. My exact case was more or less general, maybe I was dividing 128 bit by 64 bit a bit more frequent. We are going to optimize the general case in LLVM.

We need to understand how 128 bit division is working through the compiler stack.

It calls __udivti3 function. Let’s first understand how to read these functions. In runtime libraries the modes of the functions are:

QI: An integer that is as wide as the smallest addressable unit, usually 8 bits.
HI: An integer, twice as wide as a QI mode integer, usually 16 bits.
SI: An integer, four times as wide as a QI mode integer, usually 32 bits.
DI: An integer, eight times as wide as a QI mode integer, usually 64 bits.
SF: A floating point value, as wide as a SI mode integer, usually 32 bits.
DF: A floating point value, as wide as a DI mode integer, usually 64 bits.
TI: An integer, 16 times as wide as a QI mode integer, usually 128 bits.

So, udivti3 is an unsigned division of TI (128 bits) integers, last ‘3′ means that it has 3 arguments including the return value. Also, there is a function __udivmodti4 which computes the divisor and the remainder (division and modulo operation) and it has 4 arguments including the returning value. These functions are a part of runtime libraries which compilers provide by default. For example, in GCC it is libgcc, in LLVM it is compiler-rt, they are linked almost in every program if you have the corresponding toolchain. In LLVM, __udivti3 is a simple alias to __udivmodti4.

__udivmodti4 function was written with the help of Translated from Figure 3-40 of The PowerPC Compiler Writer's Guide. After looking at it here, it looks like this was written long time ago and things have changed since then

First of all, let’s come up with something easy, like shift-subtract algorithm that we have been learning since childhood. First, if divisor > dividend, then the quotient is zero and remainder is the dividend, not an interesting case.

The algorithm is easy, we align the numbers by their most significant bits, if dividend is more than divisor, subtract and add 1 to the output, then shift by 1 and repeat. Some sort of animation can be seen like that:

For 128 bit division it will take at most 128 iterations in the for loop. Actually, the implementation in LLVM for loop is a fallback and we saw it takes 150+ns to complete it because it requires to shift many registers because 128 bit numbers are represented as two registers.

Now, let’s dive into the architecture features. I noticed that while the compiler generates the divq instructions, it frees rdx register

In the manual they say the following

divq instruction provides 128 bit division from [%rdx]:[%rax] by S. The quotient is stored in %rax and the remainder in %rdx. After some experimenting with inline asm in C/C++, I figured out that if the result does not fit in 64 bits, SIGFPE is raised. See:

Compilers don’t use this instruction in 128 bit division because they cannot know for sure if the result is going to fit in 64 bits. Yet, if the high 64 bits of the 128 bit number is smaller than the divisor, the result fits into 64 bits and we can use this instruction. As compilers don’t generate divq instruction for their own reasons, we would use inline asm for x86-64.

What to do if the high is not less than the divisor? The right answer is to use 2 divisions because

So, first we can divide hi by divisor and then {hi_r, lo} by divisor guaranteeing that hi_r is smaller than divisor and thus the result is smaller than $2^{64}$. We will get something like

After that the benchmarks improved significantly

Benchmark                       Time(ns)  CPU(ns)
BM_DivideClass128UniformDivisor 11.9      11.9
BM_DivideClass128SmallDivisor   26.6      26.6

Only 26.6ns for small divisors, that’s a clear 6x win.

Then there are multiple choices to do next but we know that both dividend and divisor have at least one bit in their high registers and the shift-subtract algorithm will have at most 64 iterations. Also the quotient is guaranteed to fit in 64 bits, thus we can use only the low register of the resulting quotient and save more shifts in the shift-subtract algorithm. That’s why the uniform divisor slightly improved.

One more optimization to do in shift-subtract algorithm is to remove the branch inside the for loop (read carefully, it should be understandable).

In the end, it gives 0.4ns more for uniform 128 bit divisor.

And finally I believe that’s one of the best algorithm to divide 128 bit by 128 bit numbers. From statistics, the case when the divisor is 64 bit is worth optimizing and we showed that additional checks on the high register of divisor has its own advantages and expansion of the invariants. Now let’s see what other libraries perform in that case.

## LibDivide

Libdivide is a small library targeting fast division, for example, if you divide by some fixed number a lot of times, there are techniques that can precalculate reciprocal and then multiply by it. Libdivide provides a very good interface for such optimizations. Even though, it has some optimizations regarding 128 bit division. For example, function libdivide_128_div_128_to_64 computes the division 128 bit number by 128 bit number if the result fits in 64 bits. In the case where both numbers are more or equal to $2^{64}$ it does the following algorithm that they took from Hackers Delight book:

$\begin{cases} n = MSB(\mathrm{divisor}) \geq 1 \\ \mathrm{divisor_1} = \lfloor \mathrm{divisor}/2^{64 - n} \rfloor \\ \mathrm{dividend_1} = \lfloor \mathrm{dividend}/2 \rfloor \end{cases}$

With the instruction that produces the 64 bit result when the divisor is 128 bit result we can compute

$\mathrm{quotient_1} = \lfloor \mathrm{dividend_1}/\mathrm{divisor_1} \rfloor$

Then we compute

$\mathrm{quotient_0} = \lfloor \mathrm{quotient_1}/2^{63 - n} \rfloor$.

It cannot overflow because $\mathrm{quotient_1} < 2^{64}$ because the maximum value of $\mathrm{dividend_1}$ is $2^{127} - 1$ and minimum value of $\mathrm{divisor_1}$ is $2^{63}$. Now let’s show that

$\lfloor \mathrm{dividend}/\mathrm{divisor} \rfloor \leq \mathrm{quotient_0} \leq \lfloor \mathrm{dividend}/\mathrm{divisor} \rfloor + 1$

$\mathrm{quotient_0} = \left\lfloor \frac{\mathrm{dividend}}{2^{64 - n}\mathrm{divisor_1}} \right\rfloor = \left\lfloor \frac{\mathrm{dividend}}{2^{64 - n}\left\lfloor\frac{\mathrm{divisor}}{2^{64 - n}}\right\rfloor} \right\rfloor = \left\lfloor \frac{\mathrm{dividend}}{\mathrm{divisor} - (\mathrm{divisor} \mathrm{\ mod\ } 2^{64 - n})} \right\rfloor = \left\lfloor \frac{\mathrm{dividend}}{\mathrm{divisor}} + \frac{\mathrm{dividend}(\mathrm{divisor} \mathrm{\ mod\ } 2^{64 - n}))}{\mathrm{divisor}(\mathrm{divisor} - (\mathrm{divisor} \mathrm{\ mod\ } 2^{64 - n}))} \right\rfloor = \left\lfloor \frac{\mathrm{dividend}}{\mathrm{divisor}} + \delta \right\rfloor$.

Now we want to show that $\delta < 1$. $\delta$ is the largest when the remainder in the numerator is as large as possible, it can be up to $2^{64 - n} - 1$. Because of the definition of $n$, $\mathrm{divisor} \geq 2^{127 - n}$. The smallest value of $\mathrm{divisor}$ in the denominator is $2^{127 - n} + 2^{64 - n} - 1$. That’s why

$\delta \leq \frac{\mathrm{dividend}(2^{64 - n} - 1)}{(2^{127 - n} + 2^{64 - n} - 1)2^{127 - n }} < \frac{\mathrm{dividend}(2^{64 - n} - 1)}{(2^{127 - n })^2}$. As n iterates from 0 to 63, we can conclude that $\delta < \frac{\mathrm{dividend}}{2^{128}}$. So we got either the correct value, either the correct plus one. Everything else in the algorithms is just a correction of which result to choose.

Unfortunately, these corrections increase the latency of the benchmark pretty significant

Benchmark                                          Time(ns)  CPU(ns)
BM_DivideClass128UniformDivisor<LibDivideDivision>    26.3    26.3
BM_RemainderClass128UniformDivisor<LibDivideDivision> 26.2    26.2
BM_DivideClass128SmallDivisor<LibDivideDivision>      25.8    25.8
BM_RemainderClass128SmallDivisor<LibDivideDivision>   26.3    26.3

So I decided to drop this idea after I’ve tried this.

## GMP

GMP library is a standard GNU library for long arithmetic. They also have something for 128 bit by 64 bit division and in my benchmark the following code worked

It divides the two limbs by a uint64_t and provides the result. Unfortunately, the latency is much higher than expected, also does not work

Benchmark                                          Time(ns)  CPU(ns)
BM_DivideClass128UniformDivisor<GmpDivision>          11.5    11.5
BM_RemainderClass128UniformDivisor<GmpDivision>       10.7    10.7
BM_DivideClass128SmallDivisor<GmpDivision>            47.5    47.5
BM_RemainderClass128SmallDivisor<GmpDivision>         47.8    47.8 

## Conclusion

In the end I’ve tried several method of 128 bit division and came up with something that is the fastest among popular alternatives. Here is the full code for benchmarks, though it is quite hard to install, maybe later I will provide an easy version of installation. The final benchmarks are

Benchmark                                           Time(ns)  CPU(ns)
---------------------------------------------------------------------
BM_DivideClass128UniformDivisor<absl::uint128>        13.7      13.7
BM_RemainderClass128UniformDivisor<absl::uint128>     14.0      14.0
BM_DivideClass128SmallDivisor<absl::uint128>          169       169
BM_RemainderClass128SmallDivisor<absl::uint128>       153       153
BM_DivideClass128UniformDivisor<LLVMDivision>         12.6      12.6
BM_RemainderClass128UniformDivisor<LLVMDivision>      12.3      12.3
BM_DivideClass128SmallDivisor<LLVMDivision>           145        145
BM_RemainderClass128SmallDivisor<LLVMDivision>        140        140
**BM_DivideClass128UniformDivisor<MyDivision1>          11.6      11.6**
**BM_RemainderClass128UniformDivisor<MyDivision1>       10.7      10.7**
**BM_DivideClass128SmallDivisor<MyDivision1>            25.5      25.5**
**BM_RemainderClass128SmallDivisor<MyDivision1>         26.2      26.2**
BM_DivideClass128UniformDivisor<MyDivision2>          12.7      12.7
BM_RemainderClass128UniformDivisor<MyDivision2>       12.8      12.8
BM_DivideClass128SmallDivisor<MyDivision2>            36.9      36.9
BM_RemainderClass128SmallDivisor<MyDivision2>         37.0      37.1
BM_DivideClass128UniformDivisor<GmpDivision>          11.5      11.5
BM_RemainderClass128UniformDivisor<GmpDivision>       10.7      10.7
BM_DivideClass128SmallDivisor<GmpDivision>            47.5      47.5
BM_RemainderClass128SmallDivisor<GmpDivision>         47.8      47.8
BM_DivideClass128UniformDivisor<LibDivideDivision>    26.3      26.3
BM_RemainderClass128UniformDivisor<LibDivideDivision> 26.2      26.2
BM_DivideClass128SmallDivisor<LibDivideDivision>      25.8      25.8
BM_RemainderClass128SmallDivisor<LibDivideDivision>   26.3      26.3

MyDivision1 is going to be upstreamed in LLVM, MyDivision2 will be the default version for all non x86-64 platforms which also has a solid latency, much better than the previous one.

## Future Work

However, benchmarks are biased in the uniform divisor case because the distance between most significant bits in the dividend and divisor falls exponentially and starting from 10-15, the benchmark becomes worse rather than libdivide approach.

I also prepared some recent research paper patch in https://reviews.llvm.org/D83547 where the reciprocal is computed beforehand and then only some multiplication happens. Yet, with the cold cache of the 512 byte lookup table it is worse than the already submitted approach. I also tried just division by $d_9$ in the screenshot below and it showed some inconsistent results which I don’t understand for now.

## Why is std::pair broken?

Today we are going to talk about C++ basic std::pair class in a way how and why it is broken.

## Story

std::pair first appeared in C++98 as a pretty basic class with a very simple semantics — you have two types T1 and T2, you can write std::pair<T1, T2> and access .first and .second members with all intuitive copy, assignment, comparison operators, etc. This is used in all map containers as value type to store and can be expanded into structured bindings, for example, with the following code:

Pretty convenient, almost all C++ programmers know about this class and use it appropriately but several style guides including Google Style Guide suggest using user defined structs where appropriate. It has several reasons but the first one is readability — I personally want to kill people who write more than one level of std::pair, for example, std::pair<int, std::pair<int, std::pair<int, int>>> and then access it with el.second.second.second. And as the structure of the objects in a maintainable code tends to grow, it is better to use structs rather than pairs and tuples. Other reasons are more interesting.

## Performance

### Default construction

This is a minor and debatable thing but std::pair with trivial and non default zero initialized scalars is zero initialized:

In contrast, std::array does not initialize their elements with the default construction (if the underlying type is trivial) and leaves this for the user. It is debatable because some people don’t like the uninitialized memory but in the end this is an inconsistency between the containers and is a different behavior from default struct what people tend to think about std::pair (yet, I still think this should look like this):

### Copying

I found this issue when I wanted to serialize and deserialize std::pair. One of the default implementations in struct/type serialization and deserialization is the following:

If we go to memcpy documentation and see the following:

If the objects are potentially-overlapping or not TriviallyCopyable, the behavior of memcpy is not specified and may be undefined.

https://en.cppreference.com/w/cpp/string/byte/memcpy

TriviallyCopyable are objects that satisfy the following conditions:

• Every copy constructor is trivial or deleted. pair(const pair&). So, it should be either = default, either = delete, either omitted/created by the rule of X and all underlying objects are trivially copyable.
• Every move constructor is trivial or deleted. Same with pair(pair&&).
• Every copy assignment operator is trivial or deleted. Same with pair& operator=(const pair&).
• Every move assignment operator is trivial or deleted. Same with pair& operator=(pair&&).
• At least one copy constructor, move constructor, copy assignment operator, or move assignment operator is non-deleted.
• Trivial non-deleted destructor. This means that virtual classes cannot be trivially copyable because we need to copy the implementation vptr.

If we write our pair above and check if it is trivially copyable (with some trivial underlying types), it definitely is because we use the rule of zero and thus everything is generated automatically. Also, a small note — in the examples above you need to make sure that the types are trivially copyable, a perfect variant looks like (this is also heavily used by the compilers to generate the unaligned loads and work with unaligned memory):

Let’s see what std::pair guarantees. Good news is standard guarantees the copy and move constructors to be defaulted (see 7 and 8).

Also, from C++17 we have the requirement that if the types of the pair are trivially destructible, pair is also trivially destructible. This allows having many optimizations including putting the type onto registers. For example, std::variant and std::optional had a different proposal to make this possible.

With everything else, it is more complicated. Let’s see what is going on with the copy assignment, in libc++ it is implemented like this:

__nat is used for disabling the instantiation of the copy assignment operator if it is invoked — a cool C++98 hack to delete the operator.

This helps the compiler to use std::pair with the references. By default, references cannot be implicitly copied or, if said in a more clever way, if you have them in the class, the copy assignment operator is implicitly deleted unless user defined. For example:

In C++ copying the reference changes the underlying object and thus the referenced is binded to the class. Here is a good explanation why it was designed this way. So, the assignment operator is deleted because copying reference has slightly different semantics but in C++98 people (maybe) thought it was convenient. So, we cannot easily write = default to provide a default assignment operator (despite the fact this is an ABI (application binary interface) break).

All these facts lead to suboptimal performance, for example, in copying the vector of pairs. With the standard pair we have a simple loop, with trivially copyable type, we have fast memmove, the difference might be up to the register size, memmove is highly optimized.

#### Some attempts to fix this

First of all, all fixes will require an ABI break but as C++ standard will be (I hope) more or less ok with ABI breaks (or not :smile:), this can be done in a way.

Conceptually, we need to make a conditional default copy assignment operator with some restrictions:

• We cannot write = default in the main class because it disallows the assignment of the references.
• We cannot add the (possibly defaulted) template arguments to the pair because it will break the forward declarations.
• We cannot use SFINAE on the return types of the defaulted copy assignment operators.
• std::enable_if_t<std::is_reference<T1>::value || std::is_reference<T2>::value> operator=(const pair&) = default; is disallowed by the standard.
• We can move the .first and .second fields to the base classes and have partial specialisations like:

For example, std::optional does it this way. The downside that we need to copy lots of code and such a simple class like optional turned out to be 1300+ lines of code. Too much for std::pair, isn’t it?

Good solution I found is using C++20 concepts. They allow (at least they don’t disallow) the usage of the conditional default assignment operators.

This fixes the problem, does not blow up the code size and at least is very intuitive to read. In the end I found several proposals trying to suggest it in a more generic way. See one, two and see the blog where they use the feature and reduce the number of lines from 1200 to 400 for std::optional. Yet, I haven’t found that this was accepted but this is a very good idea of improving and simplifying things in C++.

### The funny

I knew it already but cannot be silent about it. LLVM and GCC optimizers optimize std::pair<int, int> to use one 64 bit register. It does not do it with std::tuple<int, int>. So, if you have the choice between pair and tuple, always use pair or even user defined structs.

## Conclusion

All that was written there also applies to std::tuple, they have the identical behavior in many places. Yet, this is one more argument why C++ old containers and even basic things are broken and ABI stability is doing more harm than good. I will finish it with a tweet that reminds me of a CAP theorem:

I really think that if at some point strong but breaking change does not happen, we will have a C++ fork that will do it.

## How to crack HashCode competition with some engineering skills

Hi everyone, hashcode is over. As now I can tell anything, I want just to tell how I was solving this problem, my methodologies, overall the structure of the data sets. I am not going to provide any internals and whatsoever, all data can be checked and much of it is my personal opinion though I had more time to reflect on it rather than contestants that only had 4 hours to solve it. So, I managed to get one of the highest scores (top 2, to be precise with the result 27,189,110) while testing the problem. Let’s talk about it. I provide all my solution data sets and pieces of code (though I don’t provide compilable repository because I am lazy, ok?). Please, get that this is an engineering research rather than a true contest full of combinatoric ideas. I suck at contests and don’t participate that often.

Here is the problem, very well written and explained. If converting it to a math problem, it is exactly this:

There are $B$ books and $L$ libraries. Books somehow assigned to libraries, one book can be in many libraries. Each book has a score $S_i$, each library has install time $T_i$ and daily throughput $M_i$ (maximum number of books shipped per day). We can only choose one library, wait for its install time, then choose another and so on but all libraries push their books at the same time and when they are ready. We need to maximize the score of all books given $D$ days. Look at the example if you did not get the statement.

So, this is clearly an NP-hard problem. For example, the simplest reduction is a knapsack problem — assume that all daily throughputs are infinite and libraries do not overlap in their sets of books, then you need to maximize the sum of the libraries with a given budget $D$. This is a very simple reduction with 2 constraints relaxed. So if we can solve this problem in polynomial time, we can solve any NP problem. The problem itself is very and very hard. There are multiple ways to optimize the score, you can choose these freedoms to gain better score:

• In the optimal solution, books in all libraries must be sorted descending with respect to their scores. That’s easy to get and polynomial.
• The set of libraries matters because we are given the budget $D$. This is exponential of the number of the libraries
• The choice of the books in the chosen libraries matters. For the books that are inside multiple libraries, it is sometimes crucial which library (or maybe not a single one) to choose to ship this book.
• The order of the libraries significantly matters. If you decide to push some library first with the install time $T_i$, all other libraries have at most $D - T_i$ days to ship books.

Some good observations that always help for such kind of problems

• They are easily simulated — given the answer, you can calculate the exact score — that’s the first you should do in such research problems. It helps you to check thousands of different hypotheses.
• Freedoms are good to have some basic solutions. For example, I did not know how to optimize the books choices, I used the following technique: if there are multiple libraries for a book, choose one who has the biggest capacity for now. Overall, does not matter much which strategy, just try something reasonable and try different “reasonable” things to get fast and pretty sustainable results. This helped me to reduce the problem to just picking up a set and to start hacking.

But in order to start the actual hacking, some prerequisites.

## Linear programming

This is a very sophisticated and interesting science about linear inequalities. Basically, you have a linear functional to maximize and a numerous of linear constraints to satisfy. One of the canonical forms looks like:

$\text{Maximize}\ \mathbf{c}^\mathrm{T} \mathbf{x}\ \text{subject to} A \mathbf{x} \leq \mathbf{b}\ \text{and}\ \mathbf{x} \ge \mathbf{0}$ where $x$ represents the vector of variables (to be determined), $c$ and $b$ are vectors of (known) coefficients, $A$ is a (known) matrix of coefficients, and ${\ (\cdot )^{\mathrm {T} }}$ is the matrix transpose.

This is not the form you should stick to, for example the positivity property of the vector is not needed or can be additionally added (because any number can be represented as the difference between two positive ones). You can have any kind of inequalities ($<, >, =$). The only requirement — the relations must be linear.

Why is this important? The answer is that the all feasible solutions represent a convex polytope, matrix theory is also widely known and many optimization problems can be formulated in the form of some linear programs. The first and the second help to have many algorithms to solve as this area is widely known from many perspectives (maths and computer science), the last helps to solve many problems in a generalized way. You can assume that the “real world” solvers can solve linear programs with around $5 \cdot 10^5$ variables and conditions. Or at least they can find some solutions that are quite good. We will talk about solvers a bit later.

## Integer linear programming

The same as linear program but the solution vector $x$ must be an integer vector. Often there are constraints on the vector $x$ to be a $\{0, 1\}^n$ vector which sometimes makes the problem much better. Solvers are less happy about the integer programming — it is NP-complete partially because MAX-3SAT can be reduced (try to do it on your own, it is not difficult, or you can look up the idea here and finish it by yourself).

This area is less known and harder to optimize in the end. This happens because it is really hard to use methods from simple linear programs because the feasible solutions are extremely spare and proving that some solution is optimal is hard (and, of course, it is NP-complete which is one of the main reasons for a such complexity). Modern solvers can solve “real world” integer linear programs for around $10^4$ variables. For more only luck and the variety of the solvers can help. Though, you can do something good even with that big amount of variables about which we will talk a bit later.

## NP-completeness and hardness. P != NP

A bit of philosophical reasoning. When people start programming, they express their thoughts inside the code to do something with the computer — to show some text, window, to compute some big expressions. Then, while educating, the problems become harder but still solvable. In many minds of the true beginners there sometimes can be a thought that computer can solve any practical problem (or at least, most of them). And then comes this monster, called NP-complete problems. This is a story about what the programmers cannot do.

There are several things in NP-completeness (further, NPC) right now. Discovering that some new problems are in NPC, either finding some approximations in some NPC problems or proving that some approximations of NPC problems are also NPC.

The first thing, reductions are very useful. They help to understand that if you can solve problem A, then you can solve problem B which is proven to be in NPC and then A is in NPC (I missed couple of details, does not matter). These reductions give a good understanding of what the hard problems are. A good intuition when the software developer gets that the challenging task comes to a very hard point and if the boss requires completing this task for 100%, the answer should be: “this problem is NP-complete, if I manage to solve it, I would be a millionaire and it does not matter whoever engineer you are going to hire”. Still, the “real world” is not black and white and we can something do about NPC problems — like finding approximations or trying to find some acceptable solution.

We need to admit that we need to solve these hard problems and that is what I do and don’t like about math at the same time — the math gives up and starts arguing that we need to stop solving this but the real world sometimes provide patterns inside the data which can be better approximated rather than the pure math says.

## 3SAT

3SAT is a classical NP-complete problem. Its main goal to satisfy the formulas with $n$ boolean variables of type ($\vee$ means OR, $\wedge$ means AND) 3CNF (each parenthesized expression is called clause):

$(x_i \vee \neg x_j \vee x_k) \wedge \ldots \wedge (\neg x_l \vee \neg x_m \vee x_n)$

The problem is known to be hard and each NP-complete problem can be polynomial (from input size) reduced to it. Though, knowing the exact reduction is mostly useless (because the formula of cubed or even squared size can be unacceptable) does not mean that the world is not still going crazy and there are yearly SAT competitions. Minisat library is a good starting point for solving SAT formulas. Minisat can solve real world SATs with around of $2 \cdot 10^4$ variables and clauses (clause is a disjunction of variables). Also, there is a really good blog about SAT solvers from the SAT geek. Though, kinda specific one.

## MAX-3SAT

This is the same problem but now we need to satisfy as many clauses as we can. This is more practical because mostly yes/no answer is not that important and some approximation can be a still acceptable and good solution.

## Why is it important?

We face these problems quite often, for example, Sudoku for a field $n^2 \times n^2$ with the squares $n \times n$ is a NP-complete problem, SameGame is an NP-complete, optimal scheduling is NP-complete. All these problems have some intuition behind and this intuition is a good sign of what the software engineers can and cannot do. 3SAT is just a good generalization of it.

## Complex 3SAT formulas

A very good question of producing really complex formulas still remains half open. In engineering part you can simply increase the number of variables and do some random stuff and a complex formula is going to happen at some point. Still a good question of producing rather small formulas making all solvers stuck and solve it forever. Only recently there was a paper around it with a pretty interesting results. It states that if the number of clauses ($m$) divided by the number of variables ($n$) is around 4.24, then modern SAT solvers explode in finding the solutions if the random clauses are generated. Also, it is important to know that random 3SAT is defined by the action that each clause is randomly generalized by the uniform distribution according to the clause/variable ratio $r$. Number $4.2 + \varepsilon$ is kind of important: it is shown by many different papers that this is likely a transition phase number for random 3SATs (so, if the ratio is bigger than that, 3SAT is easy to solve, if less, also easy but if it is around, 3SAT becomes complex). Random 3SAT does embrace a phase transition phenomenon! That’s kind of a funny fact and showed even within some physics laws here as well. The current lower and upper bounds are 3.52 and 4.51 as long as I could find it here and in this presentation. Practical bounds show that the value is around $4.2 + \varepsilon$. The algorithm is easy, just create random 3SAT with some existing solution and then show that, for example, MiniSat cannot solve it for a very long time.

## Engineering part in all this

There are many libraries and solvers that try to solve such complex problems. One of which I am used to and like the most is or-tools. It supports many languages and is a combination of many-many solvers. Has a pretty good documentation with many examples combining all the solvers such as GLOP, CLP, GLPK, SCIP and GUROBI. All these solvers are complex and require additional posts which will be interested only to some geeks. I would say that the advantage that you can try out many of them with one proxy that or-tools is providing. Many of the solvers can have a timeout and print the existing solution saying was it optimal or was not able to prove it. Also, most of the solvers accept a hint solution using which it can try something better around this solution — this is pretty useful, if you have some combinatoric idea, you can use the solvers to try out if they can find something better that you’ve come up with. Some of the solvers even support multithreaded mode but this is more an advanced feature and I haven’t seen much progress from it, solvers tend to roll down to some solution and don’t move out from it for a long time.

So, one standard thing within the optimization problems — research, check, put into the solvers, repeat. Let’s try to see how it works with the hashcode problem which is kind of optimization problem.

## Into the details

Problem is hard to formulate in terms of integer programming mostly because we have libraries which order matters — linear equations don’t like ordering or otherwise we should create many new variables. I thought this is a bold strategy given the fact that the conditions are already pretty big for the integer programming. I decided to look precisely inside the datasets and let’s try to break each of them. Example is not an interesting one, so let’s skip it.

This is a pretty simple dataset, all books are of cost 100, libraries have its own books, no book in any two libraries, all libraries have the 1000 days to ship everything (1 book per each of the 1000 days), so it is ok get around 90% of the libraries. Actually, as no book is in two libraries, it is just optimal to pick all libraries in the ascending order of their installation time. This is the optimal and we get 5,822,900 points. Easy dataset to start.

### Incunabula

This is a harder dataset. All costs seem random, the books are in many libraries but I found that the throughputs are high. That means that the order in which we are shipping the libraries does not matter, matters the sum of the installation times. In this case it is a maximum budgeted coverage problem. Indeed

maximize $\sum_{b \in B} S_b \cdot y_j$. (maximizing the weighted sum of covered books).

subject to $\sum{T_i \cdot x_i} \leq D$; (the installation time of the selected libraries cannot exceed $D$ — overall days).

$\sum_{b_j \in L_i} x_i \geq y_j$; (if a book is chosen, i.e. $y_j > 0$ then at least one library where this book presents $b_j \in L_i$ is selected).

$y_j \in \{0,1\}$; (if $y_j=1$ then book $b_j$ is chosen)

$x_i \in \{0,1\}$ (if $x_i=1$ then library $L_i$ is selected for the cover).

But if you will try to just put this the conditions inside the problem, the solver will not find anything cool in one hour. That’s why we need some sort of hinting. In that case I just came up with some greedy solution: pick the library that will give us the most value divided by the number of days it is installing, delete all books from other libraries, subtract the number of days and repeat the iteration. After some analysis, it can be proven this is a $1 - 1/e$ approximation. And this solution brings us ~5,689,000 points. After that, I moved it to a SCIP solver as a hint and after 20 seconds it said it found the optimum in 5,690,556 points and that’s it. Dataset is cracked. This is the optimal solution to it.

### Tough choices

As long as I have somewhat greedy solution and written linear program, let’s dive into this dataset. This dataset has all books of equal cost. All libraries contain small libraries, that means we should not care much about the ordering — most books will be deployed.

Also, 15000 books are exactly in 2 libraries (and adjacent ones), 63600 books are in exactly 3 libraries. Also, after that I found out that $63600/15000 = 4.24$ which means that it is likely common to random 3SAT with a transition phase. So, basically I split 30000 libraries in pairs, each pair is a variable in 3SAT, if you choose the first one, this is true, otherwise false, other books in the libraries are the clauses numbers, in the first element there are the clauses that contain the variable, in the second the negation of this variable. Then I created a 3SAT with 15000 variables which the world believes it is hard to solve.

The world is, of course, correct, I tried pushing this 3SAT inside Minisat and it did not manage to find any good solution for a long time and I gave up. Luckily, I could push it to the integer programming solvers and optimize the coverage of the books. Greedy solution from the previous solution gives 98% coverage (which results in ~5,006,820 points) which is already great. Some code prototype of the integer programming reduction can be found here. After 2-3 hours of waiting I got 99.75% of the coverage which resulted finally in 5,096,260 points. The solution you can find here. Solving it optimally is a really hard problem but having the 99.75% coverage is quite good.

### So many books

This is the dataset I liked the most, overall I didn’t manage to prove something interesting here but at least I can tell how I was solving it. In this dataset the order of the libraries matters a lot and linear program is possibly only when the order of the libraries is fixed, then we can optimally choose which books should be chosen from which libraries. I decided to run a greedy solution and only somehow chose the books and got around ~5,100,000 points. Then I did not know how to correctly assign the books to the libraries and I just used the solver of the fixed order of the libraries and got ~5,200,000 points, so it turned out that optimal assignment gives lots of more points. Then I found out that the optimal assignment ignores cheap books aggressively and in greedy solution I started to choose which books I should account for the greedy solution choice, in the end I counted only the books with 99+ score and finally got 5,235,423 points which you can check with this solution and this linear program which worked for 10 seconds, so finding the optimal score for a greedy solution was pretty doable.

I do believe this dataset has much inside it and can be cracked better than by me. Still, I got +120k from the greedy solution which possibly gave me so many points.

### Libraries of the world

This dataset is a bit easier than the previous one because installation time is quite big and the day count is rather small. It turned out that likely optimal solutions are in a range from 15 to 18 libraries and this is a good number to try different permutations, some random removals and new additions. That I was doing — remove half of the libraries, add greedily to match the needs, run the solver (which solved optimally in 5ms), so I managed to try around 3-4 billion different permutations and variants. In the end simple greedy solution gave me ~5,312,895 points and I managed to get with this algorithm 5,343,950 points which you can check here. I think this value is not far from optimal, otherwise the solution is too far from the greedy one which I personally don’t believe given number of the variants I’ve tried.

## Conclusion

I want to say that optimization problems arise in software engineering from time to time and it is essential to try some combinatoric solutions with the solvers — they can help find some interesting dependencies and optimize better your own solutions. Also, as there are many different solvers, it is easy to try them out given that or-tools already implemented a proxy where you can just change 1 parameter.

Also, hashcode is very fun to solve.

## I wrote Go code for 3 weeks and you won’t believe what happened next

Historically, I am a C++ programmer and really like all this kind of C-ish stuff — from raw assembly to high level abstractions and SFINAE methods. For the past three weeks I managed to write lots of Go code — around 10k LOC without almost knowing anything when I started. I want to talk what I like, what is bad and funny. I am not a Go expert yet and this is my opinion which might be wrong, offensive and misleading.

## The Good

### Go is (almost) easy to learn

This language has in its DNA this fact. Once you read “Tour of Go“, “Effective Go“, look through the standard library, you already can program some really cool things. I can admit that most things I read in Go, I can understand — what this code is doing and to this level of understanding I came in 3 weeks unlike C++ where I managed to read most of the standard library parts in 4 years after starting doing it. This is a huge advantage — simplicity. And, as you know, simplicity is very and very complicated.

I cannot say Go is very rich, standard library is pretty small but for the problems Go can and is solving, this is more than is enough. I don’t know but I really think that people that are the maintainers of Go say “No” to features more often than other languages because they want to be simple and restrictive at first and then try to expand the language.

### Goroutines and channels

Oh god, they are really cool. It requires some time to understand what they are and how to use them correctly. For me personally this was not a problem.

But using them in Go is so comfortable, inviting, snugly, that you want to write everything in goroutines. They are like a crux of the biscuit to start and never finish, they are begging you not to stop writing go f(), they are the hidden thing to say to others “Hey, how are you doing? – I am invoking a goroutine. – Ah, that’s cool, won’t disturb you”. So, as you see, I am impressed by the goroutine design and its ease to use. Possibly without it, it would be a pretty useless language.

I was writing code specifically doing something asynchronous, non blocking, background and I managed to use all my concurrency knowledge with a great and easy infrastructure.

### Go is pretty performant

Though, it has GC, it is pretty performant. I found around 10% lower performance than C++ on my synthetic benchmarks and that is pretty decent. I believe it might be faster than Java because in Java every structure is a pointer in a slice but I might be mistaken.

### defer f()

Yes, I really like this feature, it is easy to understand, read and use. It is one of the keys to understand the lifetime of the objects which is done much better than in C which usually uses goto to finish something. Though, I still think that the C++ destructor is really the best design approach for object lifetimes, defer is a good approach for non object oriented languages and it will be a definitely a good extension to C language (but I am pretty sure this will never happen).

### Code formatting

There is gofmt and that is all. All, you heard? Language defined code format, not a human.

And that is nice, no debates, just invoke gofmt and that’s it. Code is written once and read hundreds of times and the faster you understand the constructions, the more code you are going to write in the end. I would say I like this so much for that particular reason — you understand code very fast because you have lots of catchy constructions.

The linters are also nice but in my opinion in C++ you can do much more through LLVM infrastructure.

### iota

Iota is a const declaration to simplify definitions of “enums” in Go. For example, this is really nice:

const (
_           = iota // ignore first value by assigning to blank identifier
KB ByteSize = 1 << (10 * iota)
MB
GB
TB
PB
EB
ZB
YB
)


This is one more feature for the simplicity and I like it a lot. Define some constants and you know what their transformations are. I really think other languages miss this simple feature so much. Don’t need something specific? Write iota. Need some simple formula? Write iota. Need something more complex? Write yourself.

### Patterns

Humans like patterns, when they see something familiar, what they did themselves, they feel it, they know exactly how it is working.

Go is full of patterns, like the very simple one:

if err != nil {
return nil, err
}


Human eye is very catchy on such kind of things. So, this is one more argument for better readability. To make people use patterns in C++ is barely possible and is not a priority for the standard.

### Compilation time

I believe Go can compile hundreds of thousands lines of code in seconds? I really caught myself waiting for linkage more than compiling everything. I like this scalibility with the fact it is only around 10-15% slower than C++.

### Exported and unexported types

Hiding implementations by writing the function name with the lower case is pretty funny and accurate in my opinion. I was not able to understand it for the first time but then I was able to read and understand other code even better.

### Portability

Easily, without any specific thing, code can be built on Linux x86_64, Linux PowerPC, MacOS, Windows with the same semantics of goroutines, network, filesystem handling, etc. I was writing client side code for many platforms and Go perfectly fitted this requirement.

In C++ in order to write coroutines that work on all the platforms, you need to spend much time make it work like you really want and the community does not help here, most of the implementations are not that tested and there is little trust overall.

### Hacks around generic types

Maps, slices and channels are generic types. And others? Possibly now I understand all the lack of generics jokes in Go.

For example, I want to use goroutine safe map sync.Map. I look at the implementation and see interfaces and I need to use casts. Okay, you make me sad. I do believe it is possible to have generics in Go without destroying the language simplicity but I might be mistaken.

I would expect atomic types being generic at least because synchronization between goroutines is very important in a such language. And for now I need to write something like

type Status int32

var status Status
...
atomic.StoreInt32((*int32)(&State.status), int32(status))


### Interfaces

It is hard to find all the definitions of the interfaces, sometimes it is not clear how they are built — all I see is numerous NewSomeType functions to create a new type with some interface. To find all the instantiations is pretty hard. Possibly it is a matter of time to find hacks how to do it but I failed it completely. Maybe tooling is not the best but I am working at Google where Go was created … I don’t know why it was extremely hard to me.

### Zero values initialization

Types by default are nil and I don’t like it. Either you need to check in the implementations being non nil or panic.

Possibly constructors and destructors can solve this problem but still I am not sure this is possible in Go philosophy.

### Constant weirdness

I don’t understand why I can write 30 * time.Minute but cannot integerVariable * time.Minute. Of course, I can google it but understanding why is much more complex. I still think this is a great misbehavior in Golang.

### Compiler infrastructure

After C++, Go has a really bad compiler in its error messages. The simplest example is lack of typos finding. If somebody misses the character or mistypes it, it takes time to find the actual error. Better suggestions would also help, for example, for non-exported functions and fields.

Also, I don’t like when there is a monopoly inside the language compilers. Yes, there are gccgo and llvm-go but they are kind of … silent, I mean. And I don’t understand why this is happening. Having the competition in all kind of areas in the world is always better than not having it.

### Mutability

After const qualifier in C/C++, in Go I miss it a lot. Basically, you can change any type of field whenever you want. Adding const to a language possibly is pretty late but mutability is not an answer for such mid-level abstractions language, in my opinion.

### Lack of visibility support

For example, protobuf generated code for enums is ugly.

message SomeMessage {
enum SomeEnum {
SomeValue = 0;
}
}


To access SomeValue, you need to write SomeMessage_SomeEnum_SomeValue with the type SomeMessage_SomeEnum. I don’t like the fully qualified names here. Possibly some kind of visibility namespaces inside the structs can help. I find such code generation ugly and as a true believer in Protobuf over JSON battle, this looks as a disadvantage. Also, golang syntax for protobufs is VERY different from other languages.

### Testing infrastructure

I hate writing tests in go. It is not structured at all, it is difficult to use mocks, it is difficult to test different parameters, there is no infrastructure for comparing two values, expecting some error is always done with a different test case where you specify some wantError variable to compare with. If the language prioritizes being more language specific rather than community specific, I expect having perfect testing library.

For example, this leads to a fact that it is hard to get 100% code coverage. Because Golang code is full of error checking and mocking these errors is always hard. This can be solved by a better testing library support or some macro machinery around error handling, because sometimes you are tired of writing hundreds of 3 lines of err != nil conditions without testing them properly.

### Where is my memory?

After C and C++ it is really hard to understand that you don’t need to clean up the memory but you are still thinking too much how not to allocate something bigger, reuse some memory, etc. This is mostly annoying and Golang does not talk about this much but probably sometimes should — how GC works, what it is doing, that its semantics is low-latency but probably high CPU for horizontal scaling.

### Messy assembly

I am a low level guy and sometimes this is hard to force the compiler to do what I really want and I was not able to understand what the compiler is doing and why. For example, https://go.godbolt.org/z/z-T6Ck. TL; DR. It is really hard to read the assembly of it. For example, for StoreInt32 it generates xchgl AX, (CX) instruction

What if I want another memory ordering? No default support. Also, lots of noisy pcdata things for garbage collector.

### Time format layout

No comments, just funny. TL; DR To format time variable t you can only write the format with a specific hardcoded date.

s := t.Format("2006/01/02")


More here, for example.

### Community and gopher apocalypse

A lot of stuff in the reviews like Good to Go, Gotchas or many gopher memes around the community. I mean, this is catchy, nice and friendly.

### Conclusion

Go is not a universal language for any kind of things you want. This is good for simple server applications, client side code (with little “multithreaded” overhead), some caching stuff and not big projects. Though, I like Go a lot and if I want to write something small, networky and useful, Go is my choice right now. For big and longstanding projects, I do believe Go is bad because if you want to do something complex, Go won’t let you do this anyhow.

## Overview of Abseil

Abseil is a great C++ library with a special and unique philosophy inside it. While there is a lot of debate what C++ is prioritizing more – portability or performance with possibly breaking changes, Abseil library aims in a standard compatible way to provide best library features of all standards sometimes with a very good performance. No need to upgrade your compiler which may contain bugs which are extremely hard to debug (for example, if you are developing highly reliable software for rockets, being on the edge with the latest compilers is often not even a choice), no need to upgrade C++ language features with some possible breaking changes, just get Abseil and get all features from future standards which are already highly tested by millions of lines of code across Google.

As an example to show, all standards prior to C++17 lack of useful class std::optional<T> which provide an optional access to some type T. This class can be efficiently implemented even in C++11. By the way, optimizations for std::optional is another a very interesting topic – how to create such a simple class being 1500+ lines of code overall. So, Abseil implements this like

Then, you just use in your code absl::optional. After the update to C++17, it is a drop-in replacement of std::optional and you can replace one class to another. Abseil guarantees 5 years of support after the standard release and then workarounds can be removed for previous standards. Abseil does not guarantee ABI compatibility – this is a very debatable state but improving the things over the time is one of the humanity problems – we tend to make mistakes and a good thing is that we should learn how to fix them. Modern C++ compilers can do this in a very efficient way (clang-tidy, all kind of tooling, etc). That’s why if Abseil introduces some breaking change, it comes with a tool to fix all of your code. Google has more than 250 millions lines of C++ code – all breaking changes are firstly tested on Google’s codebase and then ported to open source. Dealing with that amount of code cannot be done with a human interaction and automation is one of the main priorities to change something – you break, you provide an automated tool to fix this. We will also talk about this a bit later.

Of course, not everything is that cool, I decided to spend an hour or so and to find some incompatibilities, and I managed to find them. Abseil provides std::string_view replacement absl::string_view which claims to be compatible with a standard version. And consider the following example: one is working in C++17 mode, the other is not in C++11 mode:

We just assign to std::string an object of type absl::string_view. It fails to compile in C++11 and is successful in C++17. Which, I mean, is not that big problem because people always move forward with a standard rather than backward.

This is a very neat example shows that forward compatibility guarantee is hard because of multiple interaction between classes. One of these is shown above – std::string provides an assignment operator accepting std::string_view in C++17. Personally I don’t understand why. std::string_view claims to be a class that if you want a conversion to an owned string, you should write it explicitly. And… There are very very few options in C++ doing so with an assignment operator efficiently.

For example, writing something like

s = std::string{sv};


Would be much less efficient because we create a temporary string.

For constructors there is a choice to make the constructor explicit or implicit. For assignment there is only one syntax, unfortunately.

Of course, we have assign(). Traditionally assign() mirrors constructors (not necessarily all, e.g. assigning a default-constructed string is s.clear() rather than s.assign()), and operator=() is a subset of assign() with a single parameter. I do not think there is a precedent for making some single-parameter assign() overload not available as operator=(). People from standard could have introduced such convention for explicit constructors but they did not. Actually, libc++ implementation calls assign(const char*, size_t) as you can see in the godbolt link. I claim this as a defect of a standard but not a huge one, of course. That way Abseil does not provide a full C++17 replacement for the types.

## ABSL_FLAG and HOW_TO_GET_CRAZY

After some brief overview of the Abseil library, we can continue digging inside it even more. I want to talk more about the command line flags library. In the documentation it is stated (with my comments in italic):

The Abseil flags library allows programmatic access to flag values passed on the command-line to binaries. The Abseil Flags library provides the following features:

• Access to Abseil flags in a thread-safe manner. This is a very interesting property to have for all flags.
• Access to flag values that are valid at any point during a program’s lifetime. This means right here that they are global variables and we know that globals are harmful…
• Allows distributed declaration and definition of flags, though this usage has drawbacks and should generally be avoided. extern hello darkness, my old friend

Values for these flags are parsed from the command line by absl::ParseCommandLine(). The resulting value for each flag is stored in a global variable of an unspecified type absl::Flag<T>.

TL; DR. Flags are global moduled variables that must be defined in .cc|.cpp|.cxx files (otherwise you will get an ODR violation) that can be declared and accessed by any code if you declare them in a translation unit and link with the definition.

I think this is one of the most convenient way to turn on/off features in big codebases if you are depending on some other APIs – it is easy to test features, to setup an environment. The other option I see in production binaries is to provide a context to propagate it through the entire program stack and have some kind of global config. Either way, we need something global and to avoid bloat, Google decided to use global variables.

And now come the most sophisticated parts. First, global variables are bad for multiple reasons

• You should not define them in headers unless they are inlined (weak symbols) because you will have ODR violations (which variable to choose across translation units is an open question to a linker). And it is an undefined behavior in C++.
• Global variables with dynamic storage duration should be created and used very carefully. The flags have default values and you can specify your own flags with any type. Many guidelines and even universities (especially in Russia) suggest creating singletons (and flags are singletons in a way) with the BadSingleton() function:

Flags library solves the last problem in an interesting way. With a definition they don’t create anything, literally anything. Let’s see how it is done:

And you can see that flag is a compile time definition – all pointers, all data is known at compile time and no memory is allocated and it is nicely generated in assembly code then. Also, remember that flags are thread safe, at least for big types it should contain mutex (actually, mutex is always needed because initialization requires more than just replacing the value). As mutex is not a trivial type, in this situation, absl::FlagImpl does a great trick:

All this is done not only for safety reasons. My guess that some programs at Google can contains hundreds of thousands of different modules with multiple flags in each of it. If we allocate even something during the initialization, the binary startups linearly slow down which is a so huge and bad effect on the user experience and performance overall (even recovery latency, sometimes if the binary segfaults on a killer query, for example, search engine, it is better to have a very fast startup to continue working ASAP). As only a small fraction of flags are accessed during the execution, the decision to make the initialization lazy is a great performance and user experience solution.

Flags are not meant to be performance critical and used in tight loops though the reality can show something different. Many APIs create their options and access the flags in the constructors (for example, to distinguish test and prod environment). And it becomes a performance critical feature because options can be created thousands of times with a multi-thread flags access. Though, I think calling absl::SetFlag happens really rare and might not be have that performance hefty.

What we know for now about flags:

• They are global variables with lazy initialization.
• They might be read performance critical.
• They are thread safe in their access.

These properties are a bit controversy because thread safety property is a really complex issue with a templated arbitrary type. In general case it is done by holding a mutex because it cannot be done in any other way. To make the situation a bit better, flags did an optimization for trivial types like integers and floats (I put a lot of comments inside the snippet, please read them all):

So, the implementation is holding one atomic of 8 bytes and tries to get the fast path if the flag was initialized based on a heuristic in that way that if AtomicGet returned true, that means that flag was initialized and we can safely return.

This generates very good assembly with few instructions and the benchmark for flags access showed the following results for all the one worded trivial types (I show only for bool for simplicity):

Benchmark                                     Time(ns) CPU(ns) Iterations
-------------------------------------------------------------------------
(BM_ManyReadersFlag<bool, FLAG_b>)/threads:32 0.086    0.963    691627808

We can conclude that it is extremely fast and does not degrade much with the number of threads increased. It is achieved because of std::memory_order_acquire in atomic load and x86_64 with its strong memory model (almost each instruction guarantees release and acquire semantics) generates around the following code which is efficient, small and extremely fast:

The benchmark of course is far from ideal (guess why).

For all other types absl::Mutex is held for accessing the value which has around 17ns latency for holding in one threaded mode and around 150ns in 4 threaded mode. Hi, latency.

## My concerns about all above

And here comes my biggest concern – I don’t feel this API guarantees the right thing enough (I mean, atomic hack). Though, in a true way this is not a guarantee but looking at the implementation while programming and profiling absolutely unrelated thing, I put this thing into the list to look into because I felt through my veins the lack of completeness around the API and it bothered me a lot. I got the feeling about all this but the most fascinating part that I was not able to formulate it in the next couple of hours but I caught that feeling and it got inside me that in my background thoughts I was thinking and trying to get all the right words to understand what really bothered me. Okay, let’s roast a bit this implementation.

1. This works bad for other trivial types like enums that fit into the 8 byte register. And why are we holding mutex inside it for them, are these types special?
2. Why do we have really special types? Yes, they are built-in, but they are not general, they optimize many things but not all.
3. This is not scalable and this is the biggest concern. What if I want to provide other lock free types? I should list them in that list. What? Why? I mean, I should not, this involves circular dependencies and additional macro expansions. C++ is much more powerful to deal with such things at compile time.
4. absl::Flag provides absl::Duration type implementation which is 12 bytes in size. C’mon, modern processors have 16 and 32 byte registers, can we operate with them in an atomic way? I should investigate all these issues. Also, absl::Duration turned out to be very popular flag and it potentially can save many CPUs and latency which is hard to measure just running some benchmarks.
5. Oh, what if I can do this for 16 bytes registers, I can hack small string optimization (will talk about this a bit later).

When I told all this to one of the Abseil team members, I got around this feedback (stolen Russian meme):

Some of the Abseil team members were skeptical at first but I decided not to say any more words and make a prototype of this.

## Let’s implement this

First, we need to create a trait to understand whether the type can fit into the atomic. As we use memcpy, the type must be trivially copyable according to a C++ standard. So, I came up with something like:

And it failed so many times during the testing because … because it is C++ and it is going to punch you in the face all the time when you are not doing really correct things.

First, trivially copyable type might not be default constructible, i.e. we cannot write T value;. To work around this, we need to either now how it can be worked around either remember some other examples. For example, optional<T> can be created with a non default constructible type. Looking into the Abseil implementation, we can see a very interesting thing:

It is using a union type, indeed if we put a type into the union, it just prepares the memory layout for it. So, I ended up in something like:

Run tests again and they again failed. ARGH, now it is not working on all compilers. GCC 4.x+ is failing but luckily Abseil has a workaround this (look at line 8).

So, after all miscompilations, bugs, finally we have a version working for all trivially copyable types.

And it turned out to be breaking again. 🙂

Before that all trivial types were defined in .cc file which has external linkage. Assume now that you have a shared library that accesses some trivial flag. Briefly, see

As now the linkage is internal, Get on its first initialization will get the pointer that was not firstly provided inside the flag definition because shared libraries are loaded with another addresses dynamically. So, it will result in a log statement which can pollute a bit the logs of the execution. This check is here because we want to check that the flag definition is the same as its declaration, to do this we need to check that types are the same and it is done in not the best way. The best way will be to check the typeids but this comes with some cost. Overall, we decided not to patch it right now as shared libraries for flags are already broken (for all non trivial types) and rare.

But this is funny – I changed 3 places writing pretty understandable code and it turned out that all 3 places are breaking changes and ended up with something ugly and crunchy. C++ in a nutshell. And https://www.hyrumslaw.com/.

Okay, let’s assume we are finished with trivial types. Now comes the boss called 16 byte atomics. I really think that the theoretical knowledge helps in this case. Everyone should know that in order to create an atomic register we need a CAS operation on the register. For example, Introduction to Reliable and Secure Distributed Programming book helped me a lot to understand how memory models work from its theoretical to practical parts. To be dull, I just googled CAS 16 bytes and figured out that there is an instruction cmpxch16b on modern x86_64 platforms (by modern I mean westmere from 2010).

It turned out that even windows 8.1 requires this instruction for installation. This instruction comes with a bigger cost (by the way, I still don’t know why 16 byte loads do not meet strong model memory consistency but this is an absolutely another topic).

However, this instruction comes with a cost because it requires lock prefix. I am not an x86_64 memory model expert but my understanding what will happen:

The cost of the x86 LOCK prefix, or CAS, before PentiumPro, is a memory access (like a cache miss), + stopping memory operations by other processors, + any contention with other processors trying to LOCK the bus. However, since PentiumPro, for Writeback (i.e. cacheable) memory (all memory an app deals with, unless you talk directly with hardware), instead of blocking all memory operations, only the relevant cacheline is blocked (based on the link posted above).

Actually, the CAS case can be more complicated, as explained on this page, with no timings but an insightful description by a trustworthy engineer.

Not going into too much detail, I’ll say that a LOCKed operation costs one cache miss + the possible contention with other processor on the same cacheline, while CAS + the preceding load (which is almost always required except on mutexes, where you always CAS 0 and 1) can cost two cache misses. L1 Cache miss is around 4ns, so we can wait for 8ns latency for the lock prefix. Cachelines are always at least 16 bytes, so probably we should expect something like this.

It is no way <=1ns with one word atomics. I decided that we need a dispatch for this. So, for trivially copyable types >8 bytes and not bigger than 16 we can generate two words atomic while for others we should stick with a very nice and fast mov instructions. I ended up implementing something like this:

The benchmark showed 7.6ns accessing absl::Duration flag in 1 threaded mode and 10 times better in 16 threaded mode. So, we saved a lot of contention, provided lock free implementation for bigger types increasing the flag size only by 8 bytes.

Funny fact that GCC since version 7.1 does not generate 16 byte atomic lock free code because

IIRC this was removed as the instruction cannot be used for read only memory.

Andrew Pinski 2017-05-25 12:49:51 UTC. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

I barely understand what this means but the fact remains the fact. GCC does not provide this instruction in any of the last implementations. Sorry, Clang, you win this game. See https://gcc.godbolt.org/z/zzicCu. And I turned it on for all clang compilers that have -mcx16 option inside. Windows does a complete garbage here and I don’t even think about it right now.

After I sent the patch on the review, we got 2 additional meetings, around 3000 words pretty tough arguing if we need this, checking that nothing else breaks. (My side note – I find sometimes very hard to convince people at Google doing some crazy things but with enough intention, good intuition, it is possible in short amount of time). I was polishing it and making the code more readable. In the end it turned out completely insane and now I FEEL this is way clearer in its internals rather than it was before.

As it is a well known fact that Google uses LLVM infrastructure, my change might possibly be useful to all services. This patch was open sourced couple of days ago under the name absl::GetFlag is lock free for small trivially copyable types. with other internal changes 🙂 .

## Next

Maybe this is not the end. The way I told you, it might be possible to hack libc++ small strings <=14 bytes. They are stored on stack and first byte of the representation indicates if the string is small and you can get the string size. If it is small, we can atomically load 16 bytes to return a valid small string to a user. This requires using undocumented function __is_long() and very robust code which might end in really crazy things and bugs.

Also, on other platforms there are 16 byte atomic loads but they are not generated by all compilers and for all platforms. For example, for PowerPC GCC generates stq instruction (store quad) but Clang does not. For ARM the situation is the opposite. Also, it is hard to get old compilers for all the platforms and I decided to give up a bit on this. But this is absolutely possible.

## Epilogue

To write the working code took me around 2 hours from scratch. Research on how the things are going around took me around a day. Then 5 days of arguing with people and why we need or don’t need this. But this definitely is one of the greatest ideas that came to my mind for the past couple of years. Also thanks to C++ that it is possible to do this crazy thing and gain the maximum we can do.

What am I doing with my life?

## About me and this blog

I am Danila Kutenin. I am an engineer working at Google and doing distributed and high performance C++ software. My main interests are distributed systems, operating systems design, hardware testing, compilers, APIs, algorithms and theoretical computer science.

I worked for almost for 2 years at Yandex doing the core search engine – one of the biggest runtime services in Russia facing million of daily users and thousands QPS. Also, I was one of the C++ library maintainers fixing, optimizing and providing cleaner APIs for all C++ code at Yandex. There I learned a lot how to do high performance and distributed things, how to manage huge clusters and to find my way to what I like and what I don’t.

Now I am working at Google doing internal distributed file system facing all Googlers all around the world, providing best user experience in latency, speed and synchronization. My side projects include patching compilers, LLVM, C++ main library, documentation engine, Chromium and everything that I find funny enough to do and to investigate.

This blog is a way of expressing and combining my interests and feelings about practical engineering explaining a lot of “why”s from its theoretical, hardware and software side – from getting a brilliant math idea to an actual implementation, assembly code generation and how it is dealing with a hardware revealing all the magic that is happening through all the stack.

If you want something to discuss, everything you want me to ask/suggest, please reach me out at kutdanila at gmail dot com. Also, you can always find me on Twitter @Danlark1.