128 bit division

When it comes to hashing, sometimes 64 bits are not enough, for example, because of the birthday paradox: an attacker can iterate through around 2^{32} random entities, and it can be proven that with some constant probability they will find a collision, i.e. two different objects with the same hash. 2^{32} is around 4 billion objects, and with the computing power available in every machine today it is certainly achievable. That’s why we sometimes need to increase the bitness of hashing to at least 128 bits. Unfortunately, it comes at a cost because platforms and CPUs do not support 128 bit operations natively.

Division has historically been the most complex operation on CPUs, and all guidelines suggest avoiding it at all costs.

At my job, I faced an interesting problem: optimizing 128 bit division from the abseil library in order to split some data across buckets with the help of 128 bit hashing (the number of buckets is not fixed for some uninteresting historical reasons). I found out that the division takes a really long time. The benchmarks from abseil on an Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz show some horrible results:

Benchmark                       Time(ns)  CPU(ns)
BM_DivideClass128UniformDivisor     13.8     13.8  // 128 bit by 128 bit
BM_DivideClass128SmallDivisor        168      168  // 128 bit by 64 bit

150+ nanoseconds for dividing a random 128 bit number by a random 64 bit number? Sounds crazy. For comparison, the div instruction on x86-64 Skylake takes up to 76 cycles (on AMD processors it is much less), so a 64 bit division takes around 20-22ns at this frequency.

https://godbolt.org/z/o2vTZr

In reality everything is slightly better because of pipelined execution, and division has its own ALU, so if you divide something and do something else in the next instructions, you will see a lower average latency. Still, 128 bit division cannot be 8x slower than 64 bit division. You can find all latencies in Agner Fog's instruction tables for most modern x86 CPUs. The truth is more complex: division latency can even depend on the values given.

Agner Fog's instruction table for Skylake CPUs; the second-to-last column is the latency.

Even compilers, when dividing by a constant, try to use the reciprocal (the multiplicative inverse in the ring of integers modulo 2^{64}) and then multiply by it, with some shifts afterwards:

https://gcc.godbolt.org/z/PRibsx
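
To make this concrete, here is a small sketch of the trick for dividing by 3; a magic constant of this flavor is what you can see in the compiler output above, and the hand-written variant is for illustration only (unsigned __int128 is a GCC/Clang extension).

#include <cstdint>

uint64_t DivideBy3(uint64_t x) { return x / 3; }

// Roughly what the compiler emits for the function above: multiply by the
// precomputed reciprocal ceil(2^65 / 3) = 0xAAAAAAAAAAAAAAAB and keep the
// high bits, which is exact for all 64 bit values of x.
uint64_t DivideBy3ByHand(uint64_t x) {
  return static_cast<uint64_t>(
      (static_cast<unsigned __int128>(x) * 0xAAAAAAAAAAAAAAABull) >> 65);
}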

Overall, given that only a few instructions like sin and cos cost more, division is one of the most complex instructions in CPUs, and optimizations in that place matter a lot. My exact case was more or less general, except that maybe I was dividing 128 bit by 64 bit a bit more frequently. We are going to optimize the general case in LLVM.

We need to understand how 128 bit division works through the compiler stack.

https://gcc.godbolt.org/z/fB3aq2

It calls the __udivti3 function. Let's first understand how to read these functions. In runtime libraries, the modes of the functions are:

QI: An integer that is as wide as the smallest addressable unit, usually 8 bits.
HI: An integer, twice as wide as a QI mode integer, usually 16 bits.
SI: An integer, four times as wide as a QI mode integer, usually 32 bits.
DI: An integer, eight times as wide as a QI mode integer, usually 64 bits.
SF: A floating point value, as wide as a SI mode integer, usually 32 bits.
DF: A floating point value, as wide as a DI mode integer, usually 64 bits.
TI: An integer, 16 times as wide as a QI mode integer, usually 128 bits.

So, __udivti3 is an unsigned division of TI (128 bit) integers; the trailing ‘3′ means that it has 3 arguments including the return value. Also, there is a function __udivmodti4 which computes both the quotient and the remainder (division and modulo operations), and it has 4 arguments including the return value. These functions are a part of the runtime libraries which compilers provide by default. For example, in GCC it is libgcc, in LLVM it is compiler-rt; they are linked into almost every program if you have the corresponding toolchain. In LLVM, __udivti3 is a simple alias to __udivmodti4.

The __udivmodti4 function was translated from Figure 3-40 of The PowerPC Compiler Writer's Guide. Looking at it, it seems this code was written a long time ago and things have changed since then.

First of all, let’s come up with something easy, like the shift-subtract algorithm that we learned in school. First, if divisor > dividend, then the quotient is zero and the remainder is the dividend — not an interesting case.

The algorithm is simple: we align the numbers by their most significant bits; if the dividend is greater than or equal to the divisor, we subtract and add 1 to the output, then shift by 1 and repeat.
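
A minimal sketch of this loop, using GCC/Clang's unsigned __int128 for brevity (real runtime libraries operate on pairs of 64 bit registers instead):

unsigned __int128 ShiftSubtractDivide(unsigned __int128 dividend,
                                      unsigned __int128 divisor) {
  if (divisor > dividend) return 0;  // Quotient is zero, not interesting.
  unsigned __int128 quotient = 0, remainder = 0;
  for (int i = 127; i >= 0; --i) {
    // Bring down the next bit of the dividend.
    remainder = (remainder << 1) | ((dividend >> i) & 1);
    if (remainder >= divisor) {  // The branch we will remove later.
      remainder -= divisor;
      quotient |= static_cast<unsigned __int128>(1) << i;
    }
  }
  return quotient;
}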

For 128 bit division the loop takes at most 128 iterations. The loop implementation in LLVM is actually a fallback, and we saw it takes 150+ns to complete, because every iteration has to shift several registers: a 128 bit number is represented as two 64 bit registers.

Now, let’s dive into the architecture features. I noticed that when the compiler generates the div instruction for a 64 bit division, it zeroes out the %rdx register first.

In the manual they say the following

The divq instruction divides the 128 bit number in %rdx:%rax by S. The quotient is stored in %rax and the remainder in %rdx. After some experimenting with inline asm in C/C++, I figured out that if the quotient does not fit in 64 bits, SIGFPE is raised.

Compilers don’t use this instruction for 128 bit division because they cannot know in advance whether the result will fit in 64 bits. Yet, if the high 64 bits of the 128 bit dividend are smaller than the divisor, the result does fit into 64 bits and we can use this instruction. As compilers don’t generate the divq instruction here for their own reasons, we will use inline asm for x86-64.
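
A minimal sketch (GCC/Clang inline asm; the helper name is mine):

#include <cstdint>

// Divides {high, low} by divisor assuming high < divisor, so the quotient
// fits in 64 bits; otherwise divq raises #DE and the process gets SIGFPE.
inline uint64_t Div128By64(uint64_t high, uint64_t low, uint64_t divisor,
                           uint64_t* remainder) {
  uint64_t quotient;
  asm("divq %[d]"
      : "=a"(quotient), "=d"(*remainder)
      : [d] "r"(divisor), "a"(low), "d"(high)
      : "cc");
  return quotient;
}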

What should we do if the high part is not less than the divisor? The right answer is to use two divisions: since \mathrm{hi} = \lfloor \mathrm{hi}/\mathrm{divisor} \rfloor \cdot \mathrm{divisor} + (\mathrm{hi} \mathrm{\ mod\ } \mathrm{divisor}), we have \{\mathrm{hi}, \mathrm{lo}\} = \lfloor \mathrm{hi}/\mathrm{divisor} \rfloor \cdot \mathrm{divisor} \cdot 2^{64} + \{\mathrm{hi} \mathrm{\ mod\ } \mathrm{divisor}, \mathrm{lo}\}.

So, first we divide hi by the divisor, and then {hi_r, lo} by the divisor, where hi_r = hi mod divisor is guaranteed to be smaller than the divisor, so the second quotient fits in 64 bits. We will get something like this:
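
(Hypothetical names; Div128By64 is the inline asm wrapper from above.)

#include <cstdint>

struct U128 {
  uint64_t high;
  uint64_t low;
};

// Divides {high, low} by a 64 bit divisor with no restriction on high.
U128 Divide128By64(uint64_t high, uint64_t low, uint64_t divisor,
                   uint64_t* remainder) {
  if (high < divisor) {
    // The quotient fits in 64 bits: a single divq does the job.
    return U128{0, Div128By64(high, low, divisor, remainder)};
  }
  // high = q1 * divisor + hi_r, hence
  // {high, low} = q1 * divisor * 2^64 + {hi_r, low}.
  const uint64_t q1 = high / divisor;    // A plain 64 bit division.
  const uint64_t hi_r = high % divisor;  // hi_r < divisor by construction.
  const uint64_t q0 = Div128By64(hi_r, low, divisor, remainder);
  return U128{q1, q0};
}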

After that the benchmarks improved significantly

Benchmark                       Time(ns)  CPU(ns)
BM_DivideClass128UniformDivisor 11.9      11.9
BM_DivideClass128SmallDivisor   26.6      26.6

Only 26.6ns for small divisors, that’s a clear 6x win.

Then there are multiple options for what to do next, but we know that now both the dividend and the divisor have at least one bit set in their high 64 bit registers, so the shift-subtract algorithm will take at most 64 iterations. Also, the quotient is guaranteed to fit in 64 bits, thus we can use only the low register of the resulting quotient and save even more shifts in the shift-subtract algorithm. That’s why the uniform divisor case slightly improved.

One more optimization to the shift-subtract algorithm is to remove the branch inside the for loop (read the sketch below carefully, it should be understandable).
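
The idea: turn the comparison into a 0 or all-ones mask, so that both the subtraction and the quotient bit are applied unconditionally; compiler-rt achieves the same effect with a sign-extending shift. In the notation of the sketch above:

// Inside the loop, instead of the if statement:
for (int i = iterations - 1; i >= 0; --i) {
  remainder = (remainder << 1) | ((dividend >> i) & 1);
  // s is all ones iff remainder >= divisor; compilers lower this to
  // branch-free cmp/sbb-style code.
  const unsigned __int128 s =
      -static_cast<unsigned __int128>(remainder >= divisor);
  quotient |= (s & 1) << i;
  remainder -= divisor & s;
}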

In the end, this shaves off another 0.4ns for the uniform 128 bit divisor.

And finally, I believe this is one of the best algorithms for dividing 128 bit by 128 bit numbers. Statistically, the case when the divisor fits in 64 bits is worth optimizing, and we showed that the additional checks on the high register of the divisor have their own advantages and strengthen the invariants. Now let’s see how other libraries perform in that case.

LibDivide

Libdivide is a small library targeting fast division: if you divide by the same number many times, there are techniques that precalculate a reciprocal and then multiply by it, and libdivide provides a very good interface for such optimizations. It also has some optimizations regarding 128 bit division. For example, the function libdivide_128_div_128_to_64 divides a 128 bit number by a 128 bit number if the result fits in 64 bits. In the case where both numbers are greater than or equal to 2^{64}, it uses the following algorithm taken from the Hacker's Delight book:

\begin{cases} n = \mathrm{clz}(\mathrm{divisor}), \text{ the number of leading zero bits}, \ 0 \leq n \leq 63 \\ \mathrm{divisor_1} = \lfloor \mathrm{divisor}/2^{64 - n} \rfloor \\ \mathrm{dividend_1} = \lfloor \mathrm{dividend}/2 \rfloor \end{cases}

(Since \mathrm{divisor} \geq 2^{64}, we have n \leq 63 and 2^{63} \leq \mathrm{divisor_1} < 2^{64}.)

With the divq instruction, which produces a 64 bit result when dividing a 128 bit number by a 64 bit one, we can compute

\mathrm{quotient_1} = \lfloor \mathrm{dividend_1}/\mathrm{divisor_1} \rfloor

Then we compute

\mathrm{quotient_0} = \lfloor \mathrm{quotient_1}/2^{63 - n} \rfloor.

It cannot overflow: \mathrm{quotient_1} < 2^{64} because the maximum value of \mathrm{dividend_1} is 2^{127} - 1 and the minimum value of \mathrm{divisor_1} is 2^{63}. Now let’s show that

\lfloor \mathrm{dividend}/\mathrm{divisor} \rfloor \leq \mathrm{quotient_0} \leq  \lfloor \mathrm{dividend}/\mathrm{divisor} \rfloor + 1

\mathrm{quotient_0} = \left\lfloor \frac{\mathrm{dividend}}{2^{64 - n}\mathrm{divisor_1}} \right\rfloor = \left\lfloor \frac{\mathrm{dividend}}{2^{64 - n}\left\lfloor\frac{\mathrm{divisor}}{2^{64 - n}}\right\rfloor} \right\rfloor = \left\lfloor \frac{\mathrm{dividend}}{\mathrm{divisor} - (\mathrm{divisor} \mathrm{\ mod\ } 2^{64 - n})} \right\rfloor = \left\lfloor \frac{\mathrm{dividend}}{\mathrm{divisor}} + \frac{\mathrm{dividend}\,(\mathrm{divisor} \mathrm{\ mod\ } 2^{64 - n})}{\mathrm{divisor}\,(\mathrm{divisor} - (\mathrm{divisor} \mathrm{\ mod\ } 2^{64 - n}))} \right\rfloor = \left\lfloor \frac{\mathrm{dividend}}{\mathrm{divisor}} + \delta \right\rfloor.

Now we want to show that \delta < 1. \delta is largest when the remainder in the numerator is as large as possible; it can be up to 2^{64 - n} - 1. By the definition of n, \mathrm{divisor} \geq 2^{127 - n}, so the smallest value of \mathrm{divisor} compatible with that remainder is 2^{127 - n} + 2^{64 - n} - 1. That’s why

\delta \leq \frac{\mathrm{dividend}(2^{64 - n} - 1)}{(2^{127 - n} + 2^{64 - n} - 1)2^{127 - n }} < \frac{\mathrm{dividend}(2^{64 - n} - 1)}{(2^{127 - n })^2}. As n ranges from 0 to 63, one can check that (2^{64 - n} - 1) \cdot 2^{128} \leq 2^{2(127 - n)}, and thus \delta < \frac{\mathrm{dividend}}{2^{128}} < 1. So we get either the correct value or the correct value plus one; everything else in the algorithm is just a correction of which result to choose.

Unfortunately, these corrections increase the latency of the benchmark pretty significantly:

Benchmark                                          Time(ns)  CPU(ns)
BM_DivideClass128UniformDivisor<LibDivideDivision>    26.3    26.3  
BM_RemainderClass128UniformDivisor<LibDivideDivision> 26.2    26.2
BM_DivideClass128SmallDivisor<LibDivideDivision>      25.8    25.8
BM_RemainderClass128SmallDivisor<LibDivideDivision>   26.3    26.3

So I decided to drop this idea after trying it.

GMP

The GMP library is the standard GNU library for long arithmetic. It also has something for 128 bit by 64 bit division, and in my benchmark the following kind of code worked:
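
The exact snippet is not preserved here, so this is my reconstruction of the kind of call I mean, assuming 64 bit limbs (GMP_NUMB_BITS == 64): mpn_divrem_1 divides an array of limbs by a single limb, writes the quotient limbs and returns the remainder.

#include <gmp.h>

// Divides the two-limb number {high, low} by a single limb.
void GmpDivide128By64(mp_limb_t high, mp_limb_t low, mp_limb_t divisor,
                      mp_limb_t quotient[2], mp_limb_t* remainder) {
  const mp_limb_t dividend[2] = {low, high};  // Least significant limb first.
  *remainder = mpn_divrem_1(quotient, 0, dividend, 2, divisor);
}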

It divides the two limbs by a uint64_t and provides the result. Unfortunately, for 64 bit divisors the latency is much higher than expected:

Benchmark                                          Time(ns)  CPU(ns)
BM_DivideClass128UniformDivisor<GmpDivision>          11.5    11.5
BM_RemainderClass128UniformDivisor<GmpDivision>       10.7    10.7
BM_DivideClass128SmallDivisor<GmpDivision>            47.5    47.5
BM_RemainderClass128SmallDivisor<GmpDivision>         47.8    47.8 

Conclusion

In the end, I tried several methods of 128 bit division and came up with something that is the fastest among the popular alternatives. Here is the full code for the benchmarks, though it is quite hard to install; maybe later I will provide an easier way to install it. The final benchmarks are:

Benchmark                                           Time(ns)  CPU(ns)
---------------------------------------------------------------------
BM_DivideClass128UniformDivisor<absl::uint128>        13.7      13.7
BM_RemainderClass128UniformDivisor<absl::uint128>     14.0      14.0  
BM_DivideClass128SmallDivisor<absl::uint128>          169       169    
BM_RemainderClass128SmallDivisor<absl::uint128>       153       153    
BM_DivideClass128UniformDivisor<LLVMDivision>         12.6      12.6  
BM_RemainderClass128UniformDivisor<LLVMDivision>      12.3      12.3  
BM_DivideClass128SmallDivisor<LLVMDivision>           145        145    
BM_RemainderClass128SmallDivisor<LLVMDivision>        140        140    
**BM_DivideClass128UniformDivisor<MyDivision1>          11.6      11.6**
**BM_RemainderClass128UniformDivisor<MyDivision1>       10.7      10.7**
**BM_DivideClass128SmallDivisor<MyDivision1>            25.5      25.5**
**BM_RemainderClass128SmallDivisor<MyDivision1>         26.2      26.2**
BM_DivideClass128UniformDivisor<MyDivision2>          12.7      12.7  
BM_RemainderClass128UniformDivisor<MyDivision2>       12.8      12.8  
BM_DivideClass128SmallDivisor<MyDivision2>            36.9      36.9  
BM_RemainderClass128SmallDivisor<MyDivision2>         37.0      37.1  
BM_DivideClass128UniformDivisor<GmpDivision>          11.5      11.5  
BM_RemainderClass128UniformDivisor<GmpDivision>       10.7      10.7  
BM_DivideClass128SmallDivisor<GmpDivision>            47.5      47.5  
BM_RemainderClass128SmallDivisor<GmpDivision>         47.8      47.8  
BM_DivideClass128UniformDivisor<LibDivideDivision>    26.3      26.3  
BM_RemainderClass128UniformDivisor<LibDivideDivision> 26.2      26.2  
BM_DivideClass128SmallDivisor<LibDivideDivision>      25.8      25.8  
BM_RemainderClass128SmallDivisor<LibDivideDivision>   26.3      26.3

MyDivision1 is going to be upstreamed to LLVM; MyDivision2 will be the default version for all non-x86-64 platforms, and it also has solid latency, much better than the previous one.

Why is std::pair broken?

Today we are going to talk about the basic C++ std::pair class: how and why it is broken.

Story

std::pair first appeared in C++98 as a pretty basic class with very simple semantics — you have two types T1 and T2, you can write std::pair<T1, T2> and access the .first and .second members, with all the intuitive copy, assignment, comparison operators, etc. It is used as the value type in all map containers and can be expanded into structured bindings, for example, with the following code:
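
For instance, a minimal sketch (a map's value_type is a std::pair):

#include <iostream>
#include <map>
#include <string>

int main() {
  std::map<std::string, int> ages{{"alice", 30}};
  // C++17 structured bindings unpack .first and .second of each pair.
  for (const auto& [name, age] : ages) {
    std::cout << name << " is " << age << '\n';
  }
}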

Pretty convenient; almost all C++ programmers know about this class and use it, but several style guides, including the Google style guide, suggest using user defined structs where appropriate. There are several reasons, the first of which is readability — I personally want to kill people who write more than one level of std::pair, for example std::pair<int, std::pair<int, std::pair<int, int>>>, and then access it with el.second.second.second. And as the structure of objects in maintainable code tends to grow, it is better to use structs rather than pairs and tuples. The other reasons are more interesting.

Performance

Default construction

This is a minor and debatable thing, but std::pair of trivial scalar types is zero initialized by its default constructor:
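
A small illustration (the zeros for the pair are guaranteed; the array and the plain struct leave the values indeterminate):

#include <array>
#include <utility>

void Example() {
  std::pair<int, int> p;     // Value-initialized: p.first == 0, p.second == 0.
  std::array<int, 2> a;      // Elements are left uninitialized.
  struct S { int x, y; } s;  // Members are left uninitialized as well.
}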

In contrast, std::array does not initialize its elements on default construction (if the underlying type is trivial) and leaves this to the user. It is debatable because some people don’t like uninitialized memory, but in the end this is an inconsistency between the containers and different behavior from the plain default struct that people tend to think std::pair is (yet, I still think std::pair should behave like the plain struct).

Copying

I found this issue when I wanted to serialize and deserialize std::pair. One of the default implementations of struct/type serialization and deserialization is the following:
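
Something like this sketch (hypothetical helper names):

#include <cstring>

// Only valid when T is TriviallyCopyable — see the quote below.
template <class T>
void Serialize(const T& value, char* out) {
  std::memcpy(out, &value, sizeof(T));
}

template <class T>
T Deserialize(const char* in) {
  T value;
  std::memcpy(&value, in, sizeof(T));
  return value;
}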

If we go to the memcpy documentation, we see the following:

If the objects are potentially-overlapping or not TriviallyCopyable, the behavior of memcpy is not specified and may be undefined.

https://en.cppreference.com/w/cpp/string/byte/memcpy

TriviallyCopyable objects are those that satisfy the following conditions:

  • Every copy constructor is trivial or deleted. pair(const pair&). So it should be = default, = delete, or implicitly generated (the rule of zero/five), and all underlying objects must be trivially copyable.
  • Every move constructor is trivial or deleted. Same with pair(pair&&).
  • Every copy assignment operator is trivial or deleted. Same with pair& operator=(const pair&).
  • Every move assignment operator is trivial or deleted. Same with pair& operator=(pair&&).
  • At least one copy constructor, move constructor, copy assignment operator, or move assignment operator is non-deleted.
  • A trivial non-deleted destructor. This means that polymorphic classes cannot be trivially copyable, because the vptr has to be managed during copies.

If we write our own pair as above and check whether it is trivially copyable (with trivial underlying types), it definitely is, because we use the rule of zero and thus everything is generated automatically. Also, a small note — in the examples above you need to make sure that the types are trivially copyable; a perfect variant looks like the following (this pattern is also heavily used to get compilers to generate unaligned loads and work with unaligned memory):
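
For example, a variant of this shape (compilers fold the memcpy into a single unaligned load):

#include <cstring>
#include <type_traits>

template <class T>
T LoadUnaligned(const void* p) {
  static_assert(std::is_trivially_copyable<T>::value,
                "memcpy is only specified for trivially copyable types");
  T value;
  std::memcpy(&value, p, sizeof(T));  // One mov on x86-64.
  return value;
}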

Let’s see what std::pair guarantees. The good news is that the standard guarantees the copy and move constructors to be defaulted (see 7 and 8).

Also, since C++17 we have the requirement that if the types of the pair are trivially destructible, the pair is also trivially destructible. This enables many optimizations, including passing the type in registers. For example, std::variant and std::optional needed a separate proposal to make this possible.

With everything else, it is more complicated. Let’s see what is going on with the copy assignment; in libc++ it is implemented like this:
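
A simplified sketch of it (the real source has more machinery around it):

#include <type_traits>

struct __nat {};  // "Not a type": an argument no caller can ever provide.

template <class T1, class T2>
struct pair {
  T1 first;
  T2 second;

  // If either member is not copy assignable, the parameter type degrades
  // to __nat and the operator can never be invoked.
  pair& operator=(typename std::conditional<
                      std::is_copy_assignable<T1>::value &&
                          std::is_copy_assignable<T2>::value,
                      pair, __nat>::type const& p) {
    first = p.first;
    second = p.second;
    return *this;
  }
};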

__nat ("not a type") is used to make the copy assignment operator fail to instantiate if it is ever invoked — a cool C++98-era hack to "delete" the operator.

This helps the compiler make std::pair work with references. By default, references cannot be implicitly copied, or, put more precisely, if you have them in a class, the copy assignment operator is implicitly deleted unless user defined. For example:
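
A minimal example:

struct Holder {
  int& ref;  // A reference member implicitly deletes copy assignment.
};

int main() {
  int a = 1, b = 2;
  Holder x{a}, y{b};
  // x = y;  // Error: the copy assignment operator is implicitly deleted.
}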

In C++, assigning through a reference changes the underlying object, and the reference itself stays bound for the lifetime of the class. Here is a good explanation of why it was designed this way. So, the assignment operator is deleted because copying a reference has slightly different semantics, but in C++98 people (maybe) thought it was convenient. So, we cannot simply write = default to provide a default assignment operator (besides, that would be an ABI (application binary interface) break).

All these facts lead to suboptimal performance, for example, when copying a vector of pairs. With the standard pair we get a simple element-by-element loop; with a trivially copyable type we get a fast memmove. The difference can be up to the register width, as memmove is highly optimized.

https://gcc.godbolt.org/z/wtLLBU

Some attempts to fix this

First of all, any fix will require an ABI break, but as the C++ standard will (I hope) become more or less ok with ABI breaks (or not :smile:), this can be done in a way.

Conceptually, we need to make a conditional default copy assignment operator with some restrictions:

  • We cannot write = default in the main class because it would disallow assignment for pairs containing references.
  • We cannot add the (possibly defaulted) template arguments to the pair because it will break the forward declarations.
  • We cannot use SFINAE on the return types of the defaulted copy assignment operators.
    • std::enable_if_t<std::is_reference<T1>::value || std::is_reference<T2>::value> operator=(const pair&) = default; is disallowed by the standard.
  • We can move the .first and .second fields to the base classes and have partial specialisations like:
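
A sketch of the approach (hypothetical names, heavily abbreviated):

#include <type_traits>

template <class T1, class T2,
          bool HasReference = std::is_reference<T1>::value ||
                              std::is_reference<T2>::value>
struct pair_base {
  T1 first;
  T2 second;
  // No references: keep the defaulted (and potentially trivial) assignment.
  pair_base& operator=(const pair_base&) = default;
};

template <class T1, class T2>
struct pair_base<T1, T2, /*HasReference=*/true> {
  T1 first;
  T2 second;
  // With references we assign through the members ourselves.
  pair_base& operator=(const pair_base& p) {
    first = p.first;
    second = p.second;
    return *this;
  }
};

template <class T1, class T2>
struct pair : pair_base<T1, T2> { /* constructors and the rest */ };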

For example, std::optional does it this way. The downside is that we need to copy lots of code, and such a simple class as optional turned out to be 1300+ lines of code. Too much for std::pair, isn’t it?

A good solution I found uses C++20 concepts. They allow (or at least don't disallow) conditionally defaulted assignment operators.
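
A sketch of how it can look (this is essentially the "conditionally trivial member functions" machinery):

#include <type_traits>

template <class T1, class T2>
struct pair {
  T1 first;
  T2 second;

  // Preferred when the constraint is satisfied: memberwise and trivial.
  pair& operator=(const pair&)
      requires(!std::is_reference_v<T1> && !std::is_reference_v<T2>)
  = default;

  // Fallback for the reference case: assign through the members.
  pair& operator=(const pair& p) {
    first = p.first;
    second = p.second;
    return *this;
  }
};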

https://gcc.godbolt.org/z/XuvrC9

This fixes the problem, does not blow up the code size, and at least is very intuitive to read. In the end I found several proposals suggesting this in a more generic way. See one, two, and see the blog where they use the feature and reduce the number of lines from 1200 to 400 for std::optional. I haven't found whether this was accepted, but it is a very good idea for improving and simplifying things in C++.

The funny

I knew it already but cannot stay silent about it. The LLVM and GCC optimizers optimize std::pair<int, int> to use one 64 bit register. They do not do this with std::tuple<int, int>. So, if you have the choice between pair and tuple, always use pair, or even user defined structs.

https://gcc.godbolt.org/z/TCp9my

Conclusion

Everything written here also applies to std::tuple; they have identical behavior in many places. Yet, this is one more argument for why old C++ containers and even basic things are broken, and why ABI stability is doing more harm than good. I will finish with a tweet that reminds me of the CAP theorem.

I really think that if at some point strong but breaking change does not happen, we will have a C++ fork that will do it.

How to crack HashCode competition with some engineering skills

Hi everyone, HashCode is over. As now I can tell anything, I just want to tell how I was solving this problem, my methodology, and the overall structure of the data sets. I am not going to reveal any internals; all the data can be checked, and much of it is my personal opinion, though I had more time to reflect on it than the contestants, who only had 4 hours to solve it. I managed to get one of the highest scores (top 2, to be precise, with the result 27,189,110) while testing the problem. Let's talk about it. I provide all my solution data sets and pieces of code (though I don't provide a compilable repository because I am lazy, ok?). Please understand that this is engineering research rather than a true contest write-up full of combinatoric ideas. I suck at contests and don't participate that often.

Here is the problem, very well written and explained. If we convert it to a math problem, it is exactly this:

There are B books and L libraries. Books are somehow assigned to libraries; one book can be in many libraries. Each book has a score S_i; each library has an install time T_i and a daily throughput M_i (the maximum number of books shipped per day). We can only choose one library at a time, wait for its install time, then choose another and so on, but all installed libraries ship their books in parallel as soon as they are ready. We need to maximize the total score of the shipped books given D days. Look at the example if you did not get the statement.


So, this is clearly an NP-hard problem. For example, the simplest reduction is from the knapsack problem — assume that all daily throughputs are infinite and libraries do not overlap in their sets of books; then you need to maximize the sum of library scores within a given budget D. This is a very simple reduction with 2 constraints relaxed, so if we could solve this problem in polynomial time, we could solve any NP problem. The problem itself is very, very hard. There are multiple degrees of freedom you can exploit to get a better score:

  • In the optimal solution, books in all libraries must be sorted descending with respect to their scores. That’s easy to get and polynomial.
  • The set of libraries matters because we are given the budget D. This choice is exponential in the number of libraries.
  • The choice of the books in the chosen libraries matters. For the books that are inside multiple libraries, it is sometimes crucial which library (or maybe not a single one) to choose to ship this book.
  • The order of the libraries significantly matters. If you decide to push some library first with the install time T_i, all other libraries have at most D - T_i days to ship books.

Some good observations that always help with this kind of problem:

  • They are easily simulated: given an answer, you can calculate the exact score. That is the first thing you should do in such research problems; it lets you check thousands of different hypotheses.
  • Degrees of freedom are good for building basic solutions. For example, I did not know how to optimize the book choices, so I used the following technique: if a book is in multiple libraries, ship it from the one with the biggest remaining capacity. Overall, the exact strategy does not matter much; try something reasonable, and try several different "reasonable" things to get fast and fairly stable results. This helped me reduce the problem to just picking a set of libraries and start hacking.

But in order to start the actual hacking, some prerequisites.

Linear programming

This is a very sophisticated and interesting science about linear inequalities. Basically, you have a linear functional to maximize and numerous linear constraints to satisfy. One of the canonical forms looks like:

\text{Maximize}\  \mathbf{c}^\mathrm{T} \mathbf{x}\  \text{subject to} A \mathbf{x} \leq \mathbf{b}\  \text{and}\  \mathbf{x} \ge \mathbf{0} where x represents the vector of variables (to be determined), c and b are vectors of (known) coefficients, A is a (known) matrix of coefficients, and {\ (\cdot )^{\mathrm {T} }} is the matrix transpose.

This is not the only form you can use; for example, the positivity constraint on the vector is not required (any number can be represented as the difference of two non-negative ones). You can have any kind of relations (\leq, \geq, =). The only requirement is that the relations must be linear.

Why is this important? Because the set of feasible solutions forms a convex polytope, the matrix theory is widely known, and many optimization problems can be formulated as linear programs. The first two facts give us many algorithms, as this area is well studied from both the math and computer science perspectives; the last one lets us solve many problems in a generalized way. You can assume that "real world" solvers can handle linear programs with around 5 \cdot 10^5 variables and constraints, or at least find solutions that are quite good. We will talk about solvers a bit later.

Integer linear programming

The same as a linear program, but the solution vector x must be an integer vector. Often there is an additional constraint that x is a \{0, 1\}^n vector, which sometimes makes the problem nicer for solvers. Solvers are less happy about integer programming — it is NP-complete, partially because MAX-3SAT can be reduced to it (try to do the reduction on your own, it is not difficult, or you can look up the idea here and finish it by yourself).

This area is less well known and harder to optimize. It is really hard to reuse methods from simple linear programs because the feasible solutions are extremely sparse, and proving that some solution is optimal is hard (and, of course, the problem is NP-complete, which is one of the main reasons for such complexity). Modern solvers can solve "real world" integer linear programs with around 10^4 variables. Beyond that, only luck and a variety of solvers can help. Though, you can still do something useful even with that many variables, which we will talk about a bit later.

NP-completeness and hardness. P != NP

I hope most of my readers are familiar with these words. If not, google and read about this, right now, this is one of the most important things in the computer science overall.

A bit of philosophical reasoning. When people start programming, they express their thoughts in code to make the computer do something — show some text or a window, compute some big expression. Then, as they study, the problems become harder but remain solvable. Many true beginners believe that a computer can solve any practical problem (or at least most of them). And then comes this monster called NP-complete problems. This is a story about what programmers cannot do.

There are several activities in NP-completeness (further, NPC) right now: discovering that new problems are in NPC, finding approximations for NPC problems, and proving that some approximations of NPC problems are themselves NPC.

First, reductions are very useful. They help to understand that if you can solve problem A, then you can solve problem B which is proven to be in NPC, and then A is in NPC (I skipped a couple of details, it does not matter). These reductions give a good understanding of what the hard problems are. It is good intuition for a software developer to recognize when a challenging task hits a very hard point; if the boss requires completing this task 100%, the answer should be: "this problem is NP-complete; if I managed to solve it, I would be a millionaire, and it would not matter what engineer you are going to hire". Still, the "real world" is not black and white, and we can do something about NPC problems — like finding approximations or an acceptable solution.

We need to admit that we still have to solve these hard problems, and that is what I both do and don't like about math — math gives up and argues that we should stop trying, while the real world sometimes provides patterns inside the data which can be approximated better than pure math says.

3SAT

3SAT is a classical NP-complete problem. Its goal is to satisfy formulas of n boolean variables in 3CNF form (\vee means OR, \wedge means AND; each parenthesized expression is called a clause):

(x_i \vee \neg x_j \vee x_k) \wedge \ldots \wedge (\neg x_l \vee \neg x_m \vee x_n)

The problem is known to be hard, and every NP problem can be reduced to it in time polynomial in the input size. Though knowing the exact reduction is mostly useless (a formula of squared or even cubed size can be unacceptable), the world is still going crazy about it and there are yearly SAT competitions. The Minisat library is a good starting point for solving SAT formulas; it can solve real world SATs with around 2 \cdot 10^4 variables and clauses (a clause is a disjunction of variables). Also, there is a really good blog about SAT solvers from a SAT geek. Though, a kinda specific one.

MAX-3SAT

This is the same problem, but now we need to satisfy as many clauses as we can. This is more practical because the yes/no answer is mostly not that important, and an approximation can still be an acceptable and good solution.

Why is it important?

We face these problems quite often: for example, Sudoku on an n^2 \times n^2 field with n \times n squares is NP-complete, SameGame is NP-complete, optimal scheduling is NP-complete. All these problems have some intuition behind them, and this intuition is a good sign of what software engineers can and cannot do. 3SAT is just a good generalization of it.

Complex 3SAT formulas

A very good question — how to produce really complex formulas — still remains half open. On the engineering side, you can simply increase the number of variables, do some random stuff, and a complex formula will happen at some point. A better question is how to produce rather small formulas that make all solvers get stuck forever. Only recently there was a paper about it with pretty interesting results. It states that if the number of clauses (m) divided by the number of variables (n) is around 4.24, modern SAT solvers explode in finding solutions when the clauses are generated randomly. Here a random 3SAT is defined by generating each clause uniformly at random, according to the clause/variable ratio r. The number 4.2 + \varepsilon is important: many different papers show that this is likely a phase transition number for random 3SAT (if the ratio is bigger than that, 3SAT is easy to solve; if it is less, also easy; but if it is around it, 3SAT becomes complex). Random 3SAT does embrace a phase transition phenomenon! That's a funny fact, shown even within some physics laws here as well. The current lower and upper bounds are 3.52 and 4.51, as far as I could find here and in this presentation. Practical bounds show that the value is around 4.2 + \varepsilon. The algorithm is easy: just create a random 3SAT with some known solution and then show that, for example, MiniSat cannot solve it for a very long time.

Engineering part in all this

There are many libraries and solvers that try to solve such complex problems. The one I am used to and like the most is or-tools. It supports many languages and is a combination of many, many solvers. It has pretty good documentation with many examples combining all the solvers, such as GLOP, CLP, GLPK, SCIP and GUROBI. All these solvers are complex and deserve separate posts which would be interesting only to some geeks. The advantage is that you can try many of them through the single proxy that or-tools provides. Many of the solvers accept a timeout and print the best solution found so far, saying whether it is optimal or whether optimality could not be proven. Also, most of the solvers accept a hint solution, around which they try to find something better — this is pretty useful: if you have some combinatoric idea, you can use the solvers to check whether they can find something better than what you've come up with. Some of the solvers even support a multithreaded mode, but this is more of an advanced feature and I haven't seen much progress from it; solvers tend to roll down to some solution and don't move away from it for a long time.

So, one standard workflow for optimization problems: research, check, put into the solvers, repeat. Let's try to see how it works with the HashCode problem, which is an optimization problem of this kind.

Into the details

The problem is hard to formulate in terms of integer programming, mostly because the order of the libraries matters — linear equations don't like ordering, or otherwise we would have to create many new variables. I thought this was a bold strategy given that the conditions are already pretty big for integer programming. I decided to look carefully at the datasets and try to break each of them. The example dataset is not interesting, so let's skip it.

Read On

This is a pretty simple dataset: all books cost 100, each library has its own books, no book is in any two libraries, and every library has 1000 days to ship everything (1 book per each of the 1000 days), so it is ok to get around 90% of the libraries. Actually, as no book is in two libraries, it is optimal to just pick the libraries in ascending order of their installation time. This is the optimum, and we get 5,822,900 points. An easy dataset to start with.

Incunabula

This is a harder dataset. All costs seem random and the books are in many libraries, but I found that the throughputs are high. That means the order in which we install the libraries does not matter; what matters is the sum of the installation times. In this case it is a maximum budgeted coverage problem. Indeed:

maximize \sum_{b_j \in B} S_j \cdot y_j (maximizing the weighted sum of covered books);

subject to \sum_{i} T_i \cdot x_i \leq D (the total installation time of the selected libraries cannot exceed D, the overall number of days);

\sum_{i : b_j \in L_i} x_i \geq y_j for every book b_j (a book can be counted as covered, i.e. y_j = 1, only if at least one library containing it is selected);

y_j \in \{0,1\} (y_j = 1 iff book b_j is chosen);

x_i \in \{0,1\} (x_i = 1 iff library L_i is selected for the cover).
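
For illustration, a rough sketch of this model with or-tools' C++ MPSolver and the SCIP backend (the data layout — books_in_library, S, T, D — is my assumption, not the contest I/O format):

#include <vector>

#include "ortools/linear_solver/linear_solver.h"

using operations_research::MPConstraint;
using operations_research::MPObjective;
using operations_research::MPSolver;
using operations_research::MPVariable;

// books_in_library[i] lists the book ids of library i; S are book scores,
// T are library installation times, D is the day budget.
void SolveCoverage(const std::vector<std::vector<int>>& books_in_library,
                   const std::vector<double>& S, const std::vector<double>& T,
                   double D, int num_books) {
  MPSolver solver("coverage", MPSolver::SCIP_MIXED_INTEGER_PROGRAMMING);
  const int num_libraries = static_cast<int>(books_in_library.size());

  std::vector<MPVariable*> x, y;
  for (int i = 0; i < num_libraries; ++i) x.push_back(solver.MakeBoolVar(""));
  for (int j = 0; j < num_books; ++j) y.push_back(solver.MakeBoolVar(""));

  // sum(T_i * x_i) <= D: installation times must fit in the budget.
  MPConstraint* budget = solver.MakeRowConstraint(0.0, D);
  for (int i = 0; i < num_libraries; ++i) budget->SetCoefficient(x[i], T[i]);

  // For every book j: sum of x_i over libraries containing j >= y_j.
  std::vector<MPConstraint*> covered(num_books);
  for (int j = 0; j < num_books; ++j) {
    covered[j] = solver.MakeRowConstraint(0.0, solver.infinity());
    covered[j]->SetCoefficient(y[j], -1.0);
  }
  for (int i = 0; i < num_libraries; ++i)
    for (int j : books_in_library[i]) covered[j]->SetCoefficient(x[i], 1.0);

  // Maximize the weighted sum of covered books.
  MPObjective* objective = solver.MutableObjective();
  for (int j = 0; j < num_books; ++j) objective->SetCoefficient(y[j], S[j]);
  objective->SetMaximization();

  solver.Solve();
}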

But if you just put these conditions into the solver, it will not find anything good within an hour. That's why we need some sort of hinting. In this case I came up with a greedy solution: pick the library that gives the most value divided by the number of days it takes to install, delete its books from all other libraries, subtract the days from the budget, and repeat. After some analysis, it can be shown this is a 1 - 1/e approximation. This solution brings us ~5,689,000 points. After that, I fed it to the SCIP solver as a hint, and after 20 seconds it said it had found the optimum of 5,690,556 points, and that's it. Dataset cracked. This is the optimal solution for it.

Tough choices

As I already have a somewhat greedy solution and a written linear program, let's dive into this dataset. This dataset has all books of equal cost, and all libraries are small, which means we should not care much about the ordering — most books will be shipped.

Also, 15000 books are in exactly 2 libraries (and adjacent ones), and 63600 books are in exactly 3 libraries. Then I noticed that 63600/15000 = 4.24, which suspiciously matches the random 3SAT phase transition ratio. So I split the 30000 libraries into pairs; each pair is a variable in a 3SAT — choosing the first library of a pair means true, the second means false. The books in the libraries are the clause numbers: the first element of the pair holds the clauses that contain the variable, the second the clauses with its negation. In this way I created a 3SAT with 15000 variables which the world believes is hard to solve.

The world is, of course, correct: I tried pushing this 3SAT through Minisat and it did not manage to find any good solution for a long time, so I gave up. Luckily, I could push it to the integer programming solvers and optimize the coverage of the books. The greedy solution from the previous dataset gives 98% coverage (which results in ~5,006,820 points), which is already great. A prototype of the integer programming reduction can be found here. After 2-3 hours of waiting I got 99.75% coverage, which finally resulted in 5,096,260 points. You can find the solution here. Solving it optimally is a really hard problem, but having 99.75% coverage is quite good.

So many books

This is the dataset I liked the most. Overall, I didn't manage to prove anything interesting here, but at least I can tell how I was solving it. In this dataset the order of the libraries matters a lot, and a linear program is possible only when the order of the libraries is fixed; then we can optimally choose which books should be taken from which libraries. I ran a greedy solution, chose the books somewhat arbitrarily, and got around ~5,100,000 points. As I did not know how to correctly assign books to libraries, I used the solver with the order of the libraries fixed and got ~5,200,000 points; it turned out that the optimal assignment gives a lot more points. Then I found out that the optimal assignment aggressively ignores cheap books, so in the greedy solution I started to tune which books to account for in the greedy choice. In the end I counted only the books with a 99+ score and finally got 5,235,423 points, which you can check with this solution and this linear program, which ran for 10 seconds — so finding the optimal score for a fixed greedy order was pretty doable.

I do believe this dataset has much more inside it and can be cracked better than I did. Still, I got +120k over the greedy solution, which is probably what gave me so many points.

Libraries of the world

This dataset is a bit easier than the previous one because the installation times are quite big and the day count is rather small. It turned out that the likely optimal solutions use from 15 to 18 libraries, and this is a good number for trying different permutations, random removals and new additions. That is what I was doing: remove half of the libraries, add greedily to match the needs, run the solver (which solved the fixed order optimally in 5ms). I managed to try around 3-4 billion different permutations and variants. A simple greedy solution gave me ~5,312,895 points, and with this algorithm I managed to get 5,343,950 points, which you can check here. I think this value is not far from optimal; otherwise the optimum would be too far from the greedy solution, which I personally don't believe given the number of variants I've tried.

Conclusion

I want to say that optimization problems arise in software engineering from time to time, and it is essential to try some combinatoric solutions with the solvers — they can help find interesting dependencies and improve your own solutions. Also, as there are many different solvers, it is easy to try them out, given that or-tools already implements a proxy where you can just change one parameter.

Also, hashcode is very fun to solve.

I wrote Go code for 3 weeks and you won’t believe what happened next

Historically, I am a C++ programmer and really like all this kind of C-ish stuff — from raw assembly to high level abstractions and SFINAE methods. For the past three weeks I wrote lots of Go code — around 10k LOC — knowing almost nothing when I started. I want to talk about what I like, what is bad and what is funny. I am not a Go expert yet, and this is my opinion, which might be wrong, offensive and misleading.

The Good

Go is (almost) easy to learn

This language has this fact in its DNA. Once you read "Tour of Go" and "Effective Go" and look through the standard library, you can already program some really cool things. I can admit that I understand most Go code I read — what it is doing — and I came to this level of understanding in 3 weeks, unlike C++, where it took me 4 years from starting before I could read most parts of the standard library. This is a huge advantage — simplicity. And, as you know, simplicity is very and very complicated.

I cannot say Go is very rich; the standard library is pretty small, but for the problems Go can solve and is solving, it is more than enough. I really think the Go maintainers say "No" to features more often than other languages because they want to be simple and restrictive at first and only then expand the language.

Goroutines and channels

Oh god, they are really cool. It requires some time to understand what they are and how to use them correctly. For me personally this was not a problem.

But using them in Go is so comfortable, inviting, snugly, that you want to write everything in goroutines. They are like a crux of the biscuit to start and never finish, they are begging you not to stop writing go f(), they are the hidden thing to say to others “Hey, how are you doing? – I am invoking a goroutine. – Ah, that’s cool, won’t disturb you”. So, as you see, I am impressed by the goroutine design and its ease to use. Possibly without it, it would be a pretty useless language.

I was writing code specifically doing something asynchronous, non blocking, background and I managed to use all my concurrency knowledge with a great and easy infrastructure.

Go is pretty performant

Though it has a GC, it is pretty performant. I found around 10% lower performance than C++ on my synthetic benchmarks, and that is pretty decent. I believe it might be faster than Java, because in Java every structure in a collection is behind a pointer, but I might be mistaken.

defer f()

Yes, I really like this feature; it is easy to understand, read and use. It is one of the keys to understanding the lifetime of objects, done much better than in C, which usually uses goto to finish something. Though I still think the C++ destructor is the best design approach for object lifetimes, defer is a good approach for non object oriented languages and would definitely be a good extension to the C language (but I am pretty sure this will never happen).

Code formatting

There is gofmt and that is all. All, you hear? The language defines the code format, not a human.

And that is nice, no debates, just invoke gofmt and that’s it. Code is written once and read hundreds of times and the faster you understand the constructions, the more code you are going to write in the end. I would say I like this so much for that particular reason — you understand code very fast because you have lots of catchy constructions.

The linters are also nice but in my opinion in C++ you can do much more through LLVM infrastructure.

iota

iota is a constant declaration helper that simplifies definitions of "enums" in Go. For example, this is really nice:

const (
	_           = iota // ignore first value by assigning to blank identifier
	KB ByteSize = 1 << (10 * iota)
	MB
	GB
	TB
	PB
	EB
	ZB
	YB
)

This is one more feature for simplicity, and I like it a lot. Define some constants and you know what their transformations are. I really think other languages miss this simple feature. Don't need something specific? Write iota. Need some simple formula? Write iota. Need something more complex? Write it yourself.

Patterns

Humans like patterns, when they see something familiar, what they did themselves, they feel it, they know exactly how it is working.

Go is full of patterns, like the very simple one:

if err != nil {
  return nil, err
}

The human eye catches such things very quickly. So, this is one more argument for better readability. Making people use patterns in C++ is barely possible and is not a priority for the standard.

Compilation time

I believe Go can compile hundreds of thousands of lines of code in seconds. I actually caught myself waiting for linking longer than for compiling everything. I like this scalability, given that the result is only around 10-15% slower than C++.

Exported and unexported types

Hiding implementations by writing the function name in lower case is pretty funny and accurate in my opinion. I was not able to understand it at first, but then I could read and understand other people's code even better.

Portability

Easily, without anything platform specific, code can be built on Linux x86_64, Linux PowerPC, MacOS and Windows with the same semantics of goroutines, network and filesystem handling, etc. I was writing client side code for many platforms, and Go perfectly fitted this requirement.

In C++, in order to write coroutines that work on all platforms, you need to spend a lot of time making them work the way you really want, and the community does not help here; most of the implementations are not well tested and there is little trust overall.

The Bad

Hacks around generic types

Maps, slices and channels are generic types. And the others? Now I probably understand all the lack-of-generics jokes about Go.

For example, I want to use the goroutine safe map sync.Map. I look at the implementation, see interfaces everywhere, and I need to use casts. Okay, you make me sad. I do believe it is possible to have generics in Go without destroying the language's simplicity, but I might be mistaken.

I would expect at least atomic types to be generic, because synchronization between goroutines is very important in such a language. For now I need to write something like:

type Status int32

var status Status

func setStatus(s Status) {
	atomic.StoreInt32((*int32)(&status), int32(s))
}

Interfaces

It is hard to find all the definitions of the interfaces; sometimes it is not clear how they are built — all I see is numerous NewSomeType functions that create a new value implementing some interface. Finding all the implementations is pretty hard. Possibly it is just a matter of time to learn the tricks, but I failed completely. Maybe the tooling is not the best, but I work at Google, where Go was created… I don't know why it was so extremely hard for me.

Zero values initialization

Types are nil by default, and I don't like it. Either you need to check for non-nil in the implementations, or panic.

Possibly constructors and destructors can solve this problem but still I am not sure this is possible in Go philosophy.

Constant weirdness

I don’t understand why I can write 30 * time.Minute but cannot write integerVariable * time.Minute. Of course, I can google it (30 is an untyped constant that implicitly becomes a time.Duration, while a typed integer variable needs an explicit conversion), but understanding why the language works this way is much more complex. I still think this is a great misbehavior in Golang.

Compiler infrastructure

After C++, Go has a really bad compiler in terms of error messages. The simplest example is the lack of typo detection: if somebody misses a character or mistypes one, it takes time to find the actual error. Better suggestions would also help, for example, for non-exported functions and fields.

Also, I don’t like that there is a monopoly among the language's compilers. Yes, there are gccgo and llvm-go, but they are kind of… silent, I mean, and I don't understand why this is happening. Having competition in all kinds of areas in the world is always better than not having it.

Mutability

After the const qualifier in C/C++, I miss it a lot in Go. Basically, you can change any field whenever you want. Adding const to the language is probably too late, but full mutability is not the answer for a language of such mid-level abstractions, in my opinion.

Lack of visibility support

For example, protobuf generated code for enums is ugly.

message SomeMessage {
  enum SomeEnum {
    SomeValue = 0;
  }
}

To access SomeValue, you need to write SomeMessage_SomeEnum_SomeValue of type SomeMessage_SomeEnum. I don't like the fully qualified names here. Possibly some kind of visibility namespaces inside the structs could help. I find such code generation ugly, and as a true believer in the Protobuf over JSON battle, this looks like a disadvantage. Also, the golang syntax for protobufs is VERY different from other languages.

Testing infrastructure

I hate writing tests in Go. They are not structured at all, it is difficult to use mocks, it is difficult to test different parameters, there is no infrastructure for comparing two values, and expecting some error is always done with a separate test case where you specify some wantError variable to compare with. If the language prioritizes being language specific rather than community specific, I expect it to have a perfect testing library.

For example, this leads to the fact that it is hard to get 100% code coverage, because Golang code is full of error checking and mocking these errors is always hard. This could be solved by better testing library support or some macro machinery around error handling, because sometimes you get tired of writing hundreds of 3 line err != nil conditions without testing them properly.

Where is my memory?

After C and C++ it is really hard to accept that you don't need to clean up memory, yet you still think too much about how not to allocate something big, how to reuse memory, etc. This is mostly annoying; Golang does not talk about this much but probably should sometimes — how the GC works, what it is doing, and that its design favors low latency at the likely cost of more CPU, relying on horizontal scaling.

Messy assembly

I am a low level guy, and sometimes it is hard to force the compiler to do what I really want; I was not able to understand what the compiler is doing and why. For example, https://go.godbolt.org/z/z-T6Ck. TL;DR: it is really hard to read the assembly of it. For example, for StoreInt32 it generates an xchgl AX, (CX) instruction.

What if I want another memory ordering? There is no support for that by default. Also, there are lots of noisy pcdata annotations for the garbage collector.

The Funny

Time format layout

No comments, just funny. TL;DR: to format a time variable t, you write the format using one specific hardcoded reference date.

s := t.Format("2006/01/02")

More here, for example.

Community and gopher apocalypse

A lot of stuff in the reviews like Good to Go, Gotchas or many gopher memes around the community. I mean, this is catchy, nice and friendly.

Conclusion

Go is not a universal language for any kind of thing you want. It is good for simple server applications, client side code (with little "multithreaded" overhead), some caching stuff, and not-so-big projects. Though, I like Go a lot, and if I want to write something small, networky and useful, Go is my choice right now. For big and long-standing projects, I do believe Go is bad, because if you want to do something complex, Go won't let you do it anyhow.

How to contribute to Abseil with a visible performance gain

Overview of Abseil

Abseil is a great C++ library with a special and unique philosophy. While there is a lot of debate about what C++ should prioritize — portability, or performance with possibly breaking changes — Abseil aims, in a standard compatible way, to provide the best library features of all standards, often with very good performance. No need to upgrade your compiler, which may contain bugs that are extremely hard to debug (for example, if you are developing highly reliable software for rockets, being on the edge with the latest compilers is often not even a choice); no need to upgrade C++ language features with possible breaking changes. Just get Abseil and you get the features of future standards, already highly tested by millions of lines of code across Google.

As an example, all standards prior to C++17 lack the useful class std::optional<T>, which provides optional access to some type T. This class can be efficiently implemented even in C++11. By the way, optimizing std::optional is another very interesting topic — how such a simple class ends up being 1500+ lines of code. So, Abseil implements it like this:

Abseil very rough idea of implementing things
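
A sketch of the idea (simplified; the real headers key off Abseil's config macros such as ABSL_USES_STD_OPTIONAL):

#ifdef ABSL_USES_STD_OPTIONAL

#include <optional>
namespace absl {
using std::optional;  // The standard type is available: just alias it.
}  // namespace absl

#else

namespace absl {
// Otherwise, a C++11 implementation of the same interface lives here.
template <class T>
class optional { /* ... */ };
}  // namespace absl

#endif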

Then you just use absl::optional in your code. After the update to C++17, it is a drop-in replacement for std::optional and you can replace one class with the other. Abseil guarantees 5 years of support after a standard's release, after which the workarounds for previous standards can be removed. Abseil does not guarantee ABI compatibility — this is a very debatable stance, but improving things over time is a perennial human problem: we tend to make mistakes, and the good part is that we can learn how to fix them. Modern C++ tooling can do this in a very efficient way (clang-tidy, all kinds of tooling, etc). That's why, if Abseil introduces some breaking change, it comes with a tool to fix all of your code. Google has more than 250 million lines of C++ code — all breaking changes are first tested on Google's codebase and then ported to open source. Dealing with that amount of code cannot be done by hand, and automation is one of the main priorities for changing something: if you break, you provide an automated tool to fix it. We will also talk about this a bit later.

Of course, not everything is that cool. I decided to spend an hour or so trying to find some incompatibilities, and I managed to find them. Abseil provides a std::string_view replacement, absl::string_view, which claims to be compatible with the standard version. Consider the following example: it works in C++17 mode but not in C++11 mode:

https://gcc.godbolt.org/z/ius9az

We just assign an object of type absl::string_view to a std::string. It fails to compile in C++11 and succeeds in C++17. Which, I mean, is not that big a problem, because people always move forward with a standard rather than backward.

This is a very neat example showing that a forward compatibility guarantee is hard because of the multiple interactions between classes. One of these is shown above — std::string provides an assignment operator accepting std::string_view in C++17. Personally, I don't understand why: std::string_view claims to be a class where, if you want a conversion to an owned string, you should write it explicitly. And… there are very, very few options in C++ for doing that efficiently with an assignment operator.

For example, writing something like

s = std::string{sv};

would be much less efficient because it creates a temporary string.

For constructors there is a choice to make the constructor explicit or implicit. For assignment there is only one syntax, unfortunately.

Of course, we have assign(). Traditionally, assign() mirrors constructors (not necessarily all of them, e.g. assigning a default-constructed string is s.clear() rather than s.assign()), and operator=() is the subset of assign() with a single parameter. I do not think there is a precedent for making some single-parameter assign() overload not available as operator=(). The committee could have introduced such a convention for explicit constructors, but they did not. Actually, the libc++ implementation calls assign(const char*, size_t), as you can see in the godbolt link. I consider this a defect of the standard, though not a huge one, of course. Either way, Abseil does not provide a full C++17 replacement for these types.

ABSL_FLAG and HOW_TO_GET_CRAZY

After this brief overview of the Abseil library, we can continue digging inside it even more. I want to talk about the command line flags library. In the documentation it is stated (with my comments in italic):

The Abseil flags library allows programmatic access to flag values passed on the command-line to binaries. The Abseil Flags library provides the following features:

  • Access to Abseil flags in a thread-safe manner. This is a very interesting property to have for all flags.
  • Access to flag values that are valid at any point during a program’s lifetime. This means right here that they are global variables and we know that globals are harmful…
  • Allows distributed declaration and definition of flags, though this usage has drawbacks and should generally be avoided. extern hello darkness, my old friend

Values for these flags are parsed from the command line by absl::ParseCommandLine(). The resulting value for each flag is stored in a global variable of an unspecified type absl::Flag<T>.

TL;DR: flags are module-level global variables that must be defined in .cc|.cpp|.cxx files (otherwise you will get an ODR violation) and that can be declared and accessed by any code, as long as you declare them in the translation unit and link against the definition.
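
For concreteness, the canonical pattern looks roughly like this (the flag name here is just an example):

// foo.cc: the flag is defined in exactly one .cc file.
#include "absl/flags/flag.h"

ABSL_FLAG(bool, enable_feature, false, "Turns the feature on");

// bar.cc: any other translation unit declares the flag and links
// against the definition above.
#include "absl/flags/declare.h"
#include "absl/flags/flag.h"

ABSL_DECLARE_FLAG(bool, enable_feature);

bool IsFeatureEnabled() {
  return absl::GetFlag(FLAGS_enable_feature);  // thread-safe read
}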

I think this is one of the most convenient ways to turn features on and off in big codebases when you depend on other APIs – it makes features easy to test and environments easy to set up. The other option I see in production binaries is to propagate a context through the entire program stack and keep some kind of global config. Either way, we need something global, and to avoid bloat, Google decided to use global variables.

And now come the most sophisticated parts. First, global variables are bad for multiple reasons:

  • You should not define them in headers unless they are inline (weak symbols), because you will get ODR violations (which variable to choose across translation units is an open question for the linker), and that is undefined behavior in C++.
  • Global variables with dynamic storage duration should be created and used very carefully. Flags have default values, and you can define your own flags of any type. Many guidelines and even universities (especially in Russia) teach creating singletons (and flags are singletons in a way) with a BadSingleton() function:
Bad and Good singletons
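
The original snippet is not reproduced here, so below is my reconstruction of the usual "bad vs good" pair (the Config type is made up):

struct Config { explicit Config(const char* path) { (void)path; /* load the file */ } };

// Bad: a global with dynamic storage duration. Its constructor runs before
// main() in an unspecified order across translation units, and its
// destructor runs at exit, when other globals may already be destroyed.
Config bad_config("prod.cfg");

// Good: constructed lazily on first use (thread-safe since C++11) and
// intentionally leaked, so there are no destruction order problems.
Config& GoodSingleton() {
  static Config* config = new Config("prod.cfg");
  return *config;
}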

The flags library solves the last problem in an interesting way: the definition does not create anything – literally anything – at startup. Let's see how it is done:

Rough absl::Flag definition
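
Since the actual definition is not shown here, a very rough sketch of the idea (not the real Abseil code): everything the flag object holds is a compile-time constant, so the definition is pure static data with zero dynamic initialization.

template <typename T>
class Flag {
 public:
  constexpr Flag(const char* name, const char* help, T (*make_default)())
      : name_(name), help_(help), make_default_(make_default) {}

 private:
  const char* name_;      // string literal, known at compile time
  const char* help_;      // string literal, known at compile time
  T (*make_default_)();   // invoked lazily on the first access
};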

You can see that the flag is a compile-time definition – all pointers and all data are known at compile time, no memory is allocated, and it lowers nicely into assembly. Also, remember that flags are thread safe; at least for big types this requires a mutex (actually, a mutex is always needed because initialization involves more than just replacing the value). As a mutex is not a trivial type, absl::FlagImpl does a great trick here:

If you don't know what placement new is, look into https://en.cppreference.com/w/cpp/language/new#Placement_new
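
A sketch of the trick as I understand it (my reconstruction, not the exact FlagImpl code): raw storage for the mutex lives inside the object, and the mutex is constructed lazily with placement new, so the flag definition itself still needs no dynamic initialization.

#include <new>

#include "absl/base/call_once.h"
#include "absl/synchronization/mutex.h"

class LazyMutex {
 public:
  constexpr LazyMutex() = default;

  absl::Mutex* Get() {
    absl::call_once(once_, [this] { new (&storage_) absl::Mutex; });
    return reinterpret_cast<absl::Mutex*>(&storage_);
  }

 private:
  absl::once_flag once_;
  alignas(absl::Mutex) unsigned char storage_[sizeof(absl::Mutex)] = {};
};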

All this is done not only for safety reasons. My guess is that some programs at Google can contain hundreds of thousands of modules with multiple flags in each of them. If we allocate even a little during initialization, binary startup slows down linearly, which has a huge, bad effect on user experience and overall performance (even on recovery latency: if a binary segfaults on a killer query – a search engine, for example – it is better to have a very fast startup so it can continue working ASAP). As only a small fraction of flags is accessed during execution, making the initialization lazy is a great performance and user experience decision.

Flags are not meant to be performance critical or used in tight loops, though reality can show something different. Many APIs create their options and read flags in constructors (for example, to distinguish test and prod environments). That makes flag access performance critical, because options can be created thousands of times with multi-threaded flag access. Though I think calling absl::SetFlag happens really rarely and is not that performance sensitive.

What we know for now about flags:

  • They are global variables with lazy initialization.
  • They might be read performance critical.
  • They are thread safe in their access.

These properties are somewhat in tension, because thread safety is a really complex issue for an arbitrary templated type. In the general case it is provided by holding a mutex, because it cannot be done any other way. To make the situation a bit better, flags have an optimization for trivial types like integers and floats (I put a lot of comments inside the snippet, please read them all):
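
The snippet itself is not reproduced here; a reconstruction of the fast path (the names and the sentinel are mine, the real code differs in details):

#include <atomic>
#include <cstdint>
#include <cstring>

// One 8-byte atomic backs the value; a sentinel means "not initialized yet".
constexpr int64_t kUninitialized = -1;  // hypothetical sentinel

template <typename T>
bool AtomicGet(const std::atomic<int64_t>& word, T* out) {
  static_assert(sizeof(T) <= sizeof(int64_t), "one-word types only");
  const int64_t raw = word.load(std::memory_order_acquire);
  if (raw == kUninitialized) return false;  // slow path: take the mutex
  std::memcpy(out, &raw, sizeof(T));        // legal: T is trivially copyable
  return true;
}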

So, the implementation holds one 8-byte atomic and tries the fast path: if AtomicGet returns true, the flag has been initialized and we can safely return the value.

This generates very good assembly with few instructions, and the benchmark for flag access showed the following results for all one-word trivial types (I show only bool for simplicity):

Benchmark                                     Time(ns) CPU(ns) Iterations
-------------------------------------------------------------------------
(BM_ManyReadersFlag<bool, FLAG_b>)/threads:1  0.464    0.464   1000000000
(BM_ManyReadersFlag<bool, FLAG_b>)/threads:2  0.233    0.465   1445775046
(BM_ManyReadersFlag<bool, FLAG_b>)/threads:4  0.126    0.476   1427829560
(BM_ManyReadersFlag<bool, FLAG_b>)/threads:8  0.136    0.875    576009808
(BM_ManyReadersFlag<bool, FLAG_b>)/threads:16 0.097    0.963    673725472
(BM_ManyReadersFlag<bool, FLAG_b>)/threads:32 0.086    0.963    691627808

We can conclude that it is extremely fast and does not degrade much as the number of threads grows. This is achieved thanks to std::memory_order_acquire in the atomic load: x86_64, with its strong memory model (almost every load and store already guarantees acquire and release semantics), generates roughly the following code, which is efficient, small and extremely fast:
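
The assembly listing is not reproduced here; as a stand-in, this is the kind of access it comes from, with the x86_64 lowering noted in the comment:

#include <atomic>
#include <cstdint>

int64_t Load(const std::atomic<int64_t>& word) {
  // On x86_64 an acquire load needs no fence at all:
  // mov rax, qword ptr [rdi]; ret
  return word.load(std::memory_order_acquire);
}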

The benchmark of course is far from ideal (guess why).

For all other types, absl::Mutex is held while accessing the value, which costs around 17ns in single-threaded mode and around 150ns with 4 threads contending. Hi, latency.

My concerns about all of the above

And here comes my biggest concern – I don't feel this API guarantees enough of the right thing (I mean the atomic hack). Strictly speaking it is not a guarantee at all, but while looking at the implementation and profiling an absolutely unrelated thing, I put it on my list of things to look into, because I felt through my veins the lack of completeness around the API, and it bothered me a lot. The most fascinating part is that I could not formulate it for the next couple of hours; the feeling stuck with me, and in my background thoughts I kept searching for the right words to understand what really bothered me. Okay, let's roast this implementation a bit.

  1. This works badly for other trivial types, such as enums, that fit into an 8-byte register. Why are we holding a mutex for them – are these types special?
  2. Why do we have special-cased types at all? Yes, they are built-in, but they are not general: this optimizes many cases but not all.
  3. This does not scale, and that is the biggest concern. What if I want to make other types lock free? I would have to add them to that list. What? Why? I mean, I should not – that brings circular dependencies and additional macro expansions. C++ is much more powerful and can deal with such things at compile time.
  4. absl::Flag supports the absl::Duration type, which is 12 bytes in size. C'mon, modern processors have 16- and 32-byte registers – can we operate on them atomically? I should investigate all of this. Also, absl::Duration turned out to be a very popular flag type, and fixing it can potentially save many CPUs and a lot of latency that is hard to measure by just running benchmarks.
  5. Oh, and if I can do this for 16-byte registers, I can hack small string optimization (more about this a bit later).

When I told all this to one of the Abseil team members, I got roughly this feedback (a stolen Russian meme):

Some of the Abseil team members were skeptical at first, so I decided not to argue any further and to build a prototype instead.

Let’s implement this

First, we need a trait to decide whether a type can fit into the atomic. As we use memcpy, the type must be trivially copyable according to the C++ standard. So I came up with something like:
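
My reconstruction of that first attempt: memcpy through the atomic word is only legal for trivially copyable types, and the type must fit into 8 bytes.

#include <cstdint>
#include <type_traits>

template <typename T>
struct FlagUseOneWordStorage
    : std::integral_constant<bool,
                             std::is_trivially_copyable<T>::value &&
                                 sizeof(T) <= sizeof(int64_t)> {};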

And it failed so many times during testing because… because it is C++, and it is going to punch you in the face every time you are not doing things exactly right.

First, a trivially copyable type might not be default constructible, i.e. we cannot write T value;. To work around this, we need to either know the trick or remember other code that faces the same problem. For example, optional<T> can be created with a non-default-constructible type. Looking into the Abseil implementation, we can see a very interesting thing:
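
The interesting thing, roughly (a sketch of the Abseil approach, not the verbatim code): a union member is not constructed together with the union itself, so this reserves correctly sized and aligned storage for T without requiring T to be default constructible.

template <typename T>
union Storage {
  constexpr Storage() : dummy_() {}
  char dummy_;  // the active member after construction
  T value;      // never default constructed; filled via memcpy later
};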

It uses a union: if we put a type into a union, it merely reserves the memory layout for it. So I ended up with something like:
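
Roughly like this, reusing the union from the sketch above (assuming the initialization check already succeeded):

#include <atomic>
#include <cstdint>
#include <cstring>

template <typename T>
T GetOneWord(const std::atomic<int64_t>& word) {
  Storage<T> storage;  // no default construction of T required
  const int64_t raw = word.load(std::memory_order_acquire);
  std::memcpy(&storage.value, &raw, sizeof(T));
  return storage.value;
}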

Run the tests again, and they failed again. ARGH, now it does not work on all compilers. GCC 4.x+ is failing, but luckily Abseil has a workaround for this (look at line 8).

So, after all the miscompilations and bugs, we finally have a version working for all trivially copyable types.

And it turned out to be breaking again. 🙂

Before that, the code for all trivial types was defined in a .cc file and had external linkage. Now assume you have a shared library that accesses some trivial flag. Briefly, see:
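
My guess at the shape of the problem (a hypothetical sketch, not the real Abseil code): the flag stores the address of a templated helper at definition time, and the access site compares it against its own instantiation.

template <typename T>
void FlagOps() {}  // stands in for the type-specific operations helper

template <typename T>
bool SameType(void (*stored_op)()) {
  // Address comparison, not typeid: each shared object instantiates its
  // own copy of FlagOps<T> at a different address, so this check can fail
  // across .so boundaries even when the types match.
  return stored_op == &FlagOps<T>;
}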

As the linkage is now effectively internal to each shared object, Get on its first initialization sees a pointer different from the one baked into the flag definition, because shared libraries are loaded dynamically at other addresses. The result is a log statement that can pollute the execution logs a bit. This check exists to verify that the flag definition matches its declaration; for that we need to check that the types are the same, and it is done in not the best way. The best way would be to compare typeids, but that comes with some cost. Overall, we decided not to patch it right now, since flags in shared libraries are already broken (for all non-trivial types) and rare.

But this is funny – I changed 3 places, writing pretty understandable code, and it turned out that all 3 were breaking changes, so I ended up with something ugly and crunchy. C++ in a nutshell. And https://www.hyrumslaw.com/.

Okay, let's assume we are finished with trivial types. Now comes the boss called 16-byte atomics. I really think theoretical knowledge helps here. Everyone should know that in order to build an atomic register we need a CAS operation on it. For example, the book Introduction to Reliable and Secure Distributed Programming helped me a lot to understand how memory models work, from theory to practice. To be dull, I just googled "CAS 16 bytes" and found out that there is a cmpxchg16b instruction on modern x86_64 platforms (by modern I mean Westmere from 2010).

It turned out that even Windows 8.1 requires this instruction for installation. (By the way, I still don't know why 16-byte loads do not come with the strong memory model guarantees, but that is an entirely different topic.)

This instruction comes with a cost, though, because it requires the lock prefix. I am not an x86_64 memory model expert, but here is my understanding of what happens:

The cost of the x86 LOCK prefix, or CAS, before PentiumPro, is a memory access (like a cache miss), + stopping memory operations by other processors, + any contention with other processors trying to LOCK the bus. However, since PentiumPro, for Writeback (i.e. cacheable) memory (all memory an app deals with, unless you talk directly with hardware), instead of blocking all memory operations, only the relevant cacheline is blocked (based on the link posted above).

Actually, the CAS case can be more complicated, as explained on this page, with no timings but an insightful description by a trustworthy engineer.

Not going into too much detail, I'll say that a LOCKed operation costs one cache miss plus possible contention with other processors on the same cacheline, while CAS plus the preceding load (which is almost always required, except on mutexes, where you always CAS between 0 and 1) can cost two cache misses. An L1 cache miss is around 4ns, so we can expect around 8ns of latency for the lock prefix. Cachelines are always at least 16 bytes, so this is roughly what we should expect.

That is nowhere near the <=1ns of the one-word atomics, so I decided we need a dispatch: for trivially copyable types bigger than 8 bytes and not bigger than 16, we generate the two-word atomic, while for the rest we stick with the very nice and fast mov instructions. I ended up implementing something like this:
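
A sketch of what the two-word path looks like (the names are mine; clang with -mcx16 lowers the 16-byte atomic load to lock-free cmpxchg16b-based code):

#include <atomic>
#include <cstdint>
#include <cstring>

struct alignas(16) TwoWords {
  int64_t lo = 0;
  int64_t hi = 0;
};

template <typename T>
T GetTwoWord(const std::atomic<TwoWords>& word) {
  static_assert(sizeof(T) <= sizeof(TwoWords), "must fit into 16 bytes");
  const TwoWords raw = word.load(std::memory_order_acquire);
  Storage<T> storage;  // the union trick again: no default construction
  std::memcpy(&storage.value, &raw, sizeof(T));
  return storage.value;
}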

The benchmark showed 7.6ns for accessing an absl::Duration flag in single-threaded mode, and roughly 10 times better in 16-threaded mode. So we removed a lot of contention and provided a lock-free implementation for bigger types, increasing the flag size by only 8 bytes.

A funny fact: since version 7.1, GCC does not generate lock-free 16-byte atomic code, because:

IIRC this was removed as the instruction cannot be used for read only memory.

Andrew Pinski 2017-05-25 12:49:51 UTC. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

I barely understand what this means, but a fact is a fact: GCC does not emit this instruction in any recent version. Sorry, Clang, you win this game. See https://gcc.godbolt.org/z/zzicCu. I turned the optimization on for all clang compilers that support the -mcx16 option. Windows does complete garbage here, and I am not even thinking about it right now.

After I sent the patch for review, we had 2 additional meetings and around 3000 words of pretty tough arguing about whether we need this and checking that nothing else breaks. (A side note: I sometimes find it very hard to convince people at Google to do crazy things, but with enough intention and good intuition it is possible in a short amount of time.) I kept polishing it and making the code more readable. In the end it turned out completely insane, and now I FEEL the internals are way clearer than they were before.

As it is a well-known fact that Google uses the LLVM infrastructure, my change might be useful to all services. This patch was open sourced a couple of days ago under the name "absl::GetFlag is lock free for small trivially copyable types." together with other internal changes 🙂.

Next

Maybe this is not the end. As I said, it might be possible to hack libc++ small strings of <=14 bytes. They are stored on the stack, and the first byte of the representation indicates whether the string is small, from which you can also get the string size. If it is small, we can atomically load 16 bytes and return a valid small string to the user. This requires using the undocumented function __is_long() and very robust code, and it might end in really crazy things and bugs.

Also, other platforms have 16-byte atomic loads, but they are not generated by all compilers or for all platforms. For example, for PowerPC, GCC generates the stq (store quad) instruction but Clang does not; for ARM the situation is the opposite. Also, it is hard to get old compilers for all these platforms, so I decided to give up on this for a while. But it is absolutely possible.

Epilogue

Writing the working code from scratch took me around 2 hours. Researching how everything fits together took around a day. Then came 5 days of arguing with people about why we do or do not need this. But this is definitely one of the greatest ideas that has come to my mind over the past couple of years. Also, thanks to C++, it is possible to do this crazy thing and squeeze out the maximum we can.

What am I doing with my life?

About me and this blog

I am Danila Kutenin. I am an engineer working at Google on distributed and high performance C++ software. My main interests are distributed systems, operating systems design, hardware testing, compilers, APIs, algorithms and theoretical computer science.

I worked for almost 2 years at Yandex on the core search engine – one of the biggest runtime services in Russia, facing millions of daily users and thousands of QPS. I was also one of the C++ library maintainers, fixing, optimizing and providing cleaner APIs for all C++ code at Yandex. There I learned a lot about how to build high performance and distributed systems, how to manage huge clusters, and how to find my way to what I like and what I don't.

Now I am working at Google on an internal distributed file system facing all Googlers around the world, providing the best user experience in latency, speed and synchronization. My side projects include patching compilers, LLVM, the C++ standard library, documentation engines, Chromium and everything I find funny enough to do and to investigate.

This blog is a way of expressing and combining my interests and feelings about practical engineering, explaining a lot of "why"s from the theoretical, hardware and software sides – from a brilliant math idea to an actual implementation, the generated assembly and how it interacts with the hardware, revealing all the magic that happens throughout the stack.

If you want to discuss something, or have anything to ask or suggest, please reach out to me at kutdanila at gmail dot com.