How to contribute to Abseil with a visible performance gain

Overview of Abseil

Abseil is a great C++ library with a special and unique philosophy inside it. While there is a lot of debate what C++ is prioritizing more – portability or performance with possibly breaking changes, Abseil library aims in a standard compatible way to provide best library features of all standards sometimes with a very good performance. No need to upgrade your compiler which may contain bugs which are extremely hard to debug (for example, if you are developing highly reliable software for rockets, being on the edge with the latest compilers is often not even a choice), no need to upgrade C++ language features with some possible breaking changes, just get Abseil and get all features from future standards which are already highly tested by millions of lines of code across Google.

As an example to show, all standards prior to C++17 lack of useful class std::optional<T> which provide an optional access to some type T. This class can be efficiently implemented even in C++11. By the way, optimizations for std::optional is another a very interesting topic – how to create such a simple class being 1500+ lines of code overall. So, Abseil implements this like

Abseil very rough idea of implementing things

Then, you just use in your code absl::optional. After the update to C++17, it is a drop-in replacement of std::optional and you can replace one class to another. Abseil guarantees 5 years of support after the standard release and then workarounds can be removed for previous standards. Abseil does not guarantee ABI compatibility – this is a very debatable state but improving the things over the time is one of the humanity problems – we tend to make mistakes and a good thing is that we should learn how to fix them. Modern C++ compilers can do this in a very efficient way (clang-tidy, all kind of tooling, etc). That’s why if Abseil introduces some breaking change, it comes with a tool to fix all of your code. Google has more than 250 millions lines of C++ code – all breaking changes are firstly tested on Google’s codebase and then ported to open source. Dealing with that amount of code cannot be done with a human interaction and automation is one of the main priorities to change something – you break, you provide an automated tool to fix this. We will also talk about this a bit later.

Of course, not everything is that cool, I decided to spend an hour or so and to find some incompatibilities, and I managed to find them. Abseil provides std::string_view replacement absl::string_view which claims to be compatible with a standard version. And consider the following example: one is working in C++17 mode, the other is not in C++11 mode:

We just assign to std::string an object of type absl::string_view. It fails to compile in C++11 and is successful in C++17. Which, I mean, is not that big problem because people always move forward with a standard rather than backward.

This is a very neat example shows that forward compatibility guarantee is hard because of multiple interaction between classes. One of these is shown above – std::string provides an assignment operator accepting std::string_view in C++17. Personally I don’t understand why. std::string_view claims to be a class that if you want a conversion to an owned string, you should write it explicitly. And… There are very very few options in C++ doing so with an assignment operator efficiently.

For example, writing something like

s = std::string{sv};

Would be much less efficient because we create a temporary string.

For constructors there is a choice to make the constructor explicit or implicit. For assignment there is only one syntax, unfortunately.

Of course, we have assign(). Traditionally assign() mirrors constructors (not necessarily all, e.g. assigning a default-constructed string is s.clear() rather than s.assign()), and operator=() is a subset of assign() with a single parameter. I do not think there is a precedent for making some single-parameter assign() overload not available as operator=(). People from standard could have introduced such convention for explicit constructors but they did not. Actually, libc++ implementation calls assign(const char*, size_t) as you can see in the godbolt link. I claim this as a defect of a standard but not a huge one, of course. That way Abseil does not provide a full C++17 replacement for the types.


After some brief overview of the Abseil library, we can continue digging inside it even more. I want to talk more about the command line flags library. In the documentation it is stated (with my comments in italic):

The Abseil flags library allows programmatic access to flag values passed on the command-line to binaries. The Abseil Flags library provides the following features:

  • Access to Abseil flags in a thread-safe manner. This is a very interesting property to have for all flags.
  • Access to flag values that are valid at any point during a program’s lifetime. This means right here that they are global variables and we know that globals are harmful…
  • Allows distributed declaration and definition of flags, though this usage has drawbacks and should generally be avoided. extern hello darkness, my old friend

Values for these flags are parsed from the command line by absl::ParseCommandLine(). The resulting value for each flag is stored in a global variable of an unspecified type absl::Flag<T>.

TL; DR. Flags are global moduled variables that must be defined in .cc|.cpp|.cxx files (otherwise you will get an ODR violation) that can be declared and accessed by any code if you declare them in a translation unit and link with the definition.

I think this is one of the most convenient way to turn on/off features in big codebases if you are depending on some other APIs – it is easy to test features, to setup an environment. The other option I see in production binaries is to provide a context to propagate it through the entire program stack and have some kind of global config. Either way, we need something global and to avoid bloat, Google decided to use global variables.

And now come the most sophisticated parts. First, global variables are bad for multiple reasons

  • You should not define them in headers unless they are inlined (weak symbols) because you will have ODR violations (which variable to choose across translation units is an open question to a linker). And it is an undefined behavior in C++.
  • Global variables with dynamic storage duration should be created and used very carefully. The flags have default values and you can specify your own flags with any type. Many guidelines and even universities (especially in Russia) suggest creating singletons (and flags are singletons in a way) with the BadSingleton() function:
Bad and Good singletons

Flags library solves the last problem in an interesting way. With a definition they don’t create anything, literally anything. Let’s see how it is done:

Rough absl::Flag definition

And you can see that flag is a compile time definition – all pointers, all data is known at compile time and no memory is allocated and it is nicely generated in assembly code then. Also, remember that flags are thread safe, at least for big types it should contain mutex (actually, mutex is always needed because initialization requires more than just replacing the value). As mutex is not a trivial type, in this situation, absl::FlagImpl does a great trick:

If you don’t know that the placement new is, look into

All this is done not only for safety reasons. My guess that some programs at Google can contains hundreds of thousands of different modules with multiple flags in each of it. If we allocate even something during the initialization, the binary startups linearly slow down which is a so huge and bad effect on the user experience and performance overall (even recovery latency, sometimes if the binary segfaults on a killer query, for example, search engine, it is better to have a very fast startup to continue working ASAP). As only a small fraction of flags are accessed during the execution, the decision to make the initialization lazy is a great performance and user experience solution.

Flags are not meant to be performance critical and used in tight loops though the reality can show something different. Many APIs create their options and access the flags in the constructors (for example, to distinguish test and prod environment). And it becomes a performance critical feature because options can be created thousands of times with a multi-thread flags access. Though, I think calling absl::SetFlag happens really rare and might not be have that performance hefty.

What we know for now about flags:

  • They are global variables with lazy initialization.
  • They might be read performance critical.
  • They are thread safe in their access.

These properties are a bit controversy because thread safety property is a really complex issue with a templated arbitrary type. In general case it is done by holding a mutex because it cannot be done in any other way. To make the situation a bit better, flags did an optimization for trivial types like integers and floats (I put a lot of comments inside the snippet, please read them all):

So, the implementation is holding one atomic of 8 bytes and tries to get the fast path if the flag was initialized based on a heuristic in that way that if AtomicGet returned true, that means that flag was initialized and we can safely return.

This generates very good assembly with few instructions and the benchmark for flags access showed the following results for all the one worded trivial types (I show only for bool for simplicity):

Benchmark                                     Time(ns) CPU(ns) Iterations
(BM_ManyReadersFlag<bool, FLAG_b>)/threads:1  0.464    0.464   1000000000
(BM_ManyReadersFlag<bool, FLAG_b>)/threads:2  0.233    0.465   1445775046
(BM_ManyReadersFlag<bool, FLAG_b>)/threads:4  0.126    0.476   1427829560
(BM_ManyReadersFlag<bool, FLAG_b>)/threads:8  0.136    0.875    576009808
(BM_ManyReadersFlag<bool, FLAG_b>)/threads:16 0.097    0.963    673725472
(BM_ManyReadersFlag<bool, FLAG_b>)/threads:32 0.086    0.963    691627808

We can conclude that it is extremely fast and does not degrade much with the number of threads increased. It is achieved because of std::memory_order_acquire in atomic load and x86_64 with its strong memory model (almost each instruction guarantees release and acquire semantics) generates around the following code which is efficient, small and extremely fast:

The benchmark of course is far from ideal (guess why).

For all other types absl::Mutex is held for accessing the value which has around 17ns latency for holding in one threaded mode and around 150ns in 4 threaded mode. Hi, latency.

My concerns about all above

And here comes my biggest concern – I don’t feel this API guarantees the right thing enough (I mean, atomic hack). Though, in a true way this is not a guarantee but looking at the implementation while programming and profiling absolutely unrelated thing, I put this thing into the list to look into because I felt through my veins the lack of completeness around the API and it bothered me a lot. I got the feeling about all this but the most fascinating part that I was not able to formulate it in the next couple of hours but I caught that feeling and it got inside me that in my background thoughts I was thinking and trying to get all the right words to understand what really bothered me. Okay, let’s roast a bit this implementation.

  1. This works bad for other trivial types like enums that fit into the 8 byte register. And why are we holding mutex inside it for them, are these types special?
  2. Why do we have really special types? Yes, they are built-in, but they are not general, they optimize many things but not all.
  3. This is not scalable and this is the biggest concern. What if I want to provide other lock free types? I should list them in that list. What? Why? I mean, I should not, this involves circular dependencies and additional macro expansions. C++ is much more powerful to deal with such things at compile time.
  4. absl::Flag provides absl::Duration type implementation which is 12 bytes in size. C’mon, modern processors have 16 and 32 byte registers, can we operate with them in an atomic way? I should investigate all these issues. Also, absl::Duration turned out to be very popular flag and it potentially can save many CPUs and latency which is hard to measure just running some benchmarks.
  5. Oh, what if I can do this for 16 bytes registers, I can hack small string optimization (will talk about this a bit later).

When I told all this to one of the Abseil team members, I got around this feedback (stolen Russian meme):

Some of the Abseil team members were skeptical at first but I decided not to say any more words and make a prototype of this.

Let’s implement this

First, we need to create a trait to understand whether the type can fit into the atomic. As we use memcpy, the type must be trivially copyable according to a C++ standard. So, I came up with something like:

And it failed so many times during the testing because … because it is C++ and it is going to punch you in the face all the time when you are not doing really correct things.

First, trivially copyable type might not be default constructible, i.e. we cannot write T value;. To work around this, we need to either now how it can be worked around either remember some other examples. For example, optional<T> can be created with a non default constructible type. Looking into the Abseil implementation, we can see a very interesting thing:

It is using a union type, indeed if we put a type into the union, it just prepares the memory layout for it. So, I ended up in something like:

Run tests again and they again failed. ARGH, now it is not working on all compilers. GCC 4.x+ is failing but luckily Abseil has a workaround this (look at line 8).

So, after all miscompilations, bugs, finally we have a version working for all trivially copyable types.

And it turned out to be breaking again. 🙂

Before that all trivial types were defined in .cc file which has external linkage. Assume now that you have a shared library that accesses some trivial flag. Briefly, see

As now the linkage is internal, Get on its first initialization will get the pointer that was not firstly provided inside the flag definition because shared libraries are loaded with another addresses dynamically. So, it will result in a log statement which can pollute a bit the logs of the execution. This check is here because we want to check that the flag definition is the same as its declaration, to do this we need to check that types are the same and it is done in not the best way. The best way will be to check the typeids but this comes with some cost. Overall, we decided not to patch it right now as shared libraries for flags are already broken (for all non trivial types) and rare.

But this is funny – I changed 3 places writing pretty understandable code and it turned out that all 3 places are breaking changes and ended up with something ugly and crunchy. C++ in a nutshell. And

Okay, let’s assume we are finished with trivial types. Now comes the boss called 16 byte atomics. I really think that the theoretical knowledge helps in this case. Everyone should know that in order to create an atomic register we need a CAS operation on the register. For example, Introduction to Reliable and Secure Distributed Programming book helped me a lot to understand how memory models work from its theoretical to practical parts. To be dull, I just googled CAS 16 bytes and figured out that there is an instruction cmpxch16b on modern x86_64 platforms (by modern I mean westmere from 2010).

It turned out that even windows 8.1 requires this instruction for installation. This instruction comes with a bigger cost (by the way, I still don’t know why 16 byte loads do not meet strong model memory consistency but this is an absolutely another topic).

However, this instruction comes with a cost because it requires lock prefix. I am not an x86_64 memory model expert but my understanding what will happen:

The cost of the x86 LOCK prefix, or CAS, before PentiumPro, is a memory access (like a cache miss), + stopping memory operations by other processors, + any contention with other processors trying to LOCK the bus. However, since PentiumPro, for Writeback (i.e. cacheable) memory (all memory an app deals with, unless you talk directly with hardware), instead of blocking all memory operations, only the relevant cacheline is blocked (based on the link posted above).

Actually, the CAS case can be more complicated, as explained on this page, with no timings but an insightful description by a trustworthy engineer.

Not going into too much detail, I’ll say that a LOCKed operation costs one cache miss + the possible contention with other processor on the same cacheline, while CAS + the preceding load (which is almost always required except on mutexes, where you always CAS 0 and 1) can cost two cache misses. L1 Cache miss is around 4ns, so we can wait for 8ns latency for the lock prefix. Cachelines are always at least 16 bytes, so probably we should expect something like this.

It is no way <=1ns with one word atomics. I decided that we need a dispatch for this. So, for trivially copyable types >8 bytes and not bigger than 16 we can generate two words atomic while for others we should stick with a very nice and fast mov instructions. I ended up implementing something like this:

The benchmark showed 7.6ns accessing absl::Duration flag in 1 threaded mode and 10 times better in 16 threaded mode. So, we saved a lot of contention, provided lock free implementation for bigger types increasing the flag size only by 8 bytes.

Funny fact that GCC since version 7.1 does not generate 16 byte atomic lock free code because

IIRC this was removed as the instruction cannot be used for read only memory.

Andrew Pinski 2017-05-25 12:49:51 UTC.

I barely understand what this means but the fact remains the fact. GCC does not provide this instruction in any of the last implementations. Sorry, Clang, you win this game. See And I turned it on for all clang compilers that have -mcx16 option inside. Windows does a complete garbage here and I don’t even think about it right now.

After I sent the patch on the review, we got 2 additional meetings, around 3000 words pretty tough arguing if we need this, checking that nothing else breaks. (My side note – I find sometimes very hard to convince people at Google doing some crazy things but with enough intention, good intuition, it is possible in short amount of time). I was polishing it and making the code more readable. In the end it turned out completely insane and now I FEEL this is way clearer in its internals rather than it was before.

As it is a well known fact that Google uses LLVM infrastructure, my change might possibly be useful to all services. This patch was open sourced couple of days ago under the name absl::GetFlag is lock free for small trivially copyable types. with other internal changes 🙂 .


Maybe this is not the end. The way I told you, it might be possible to hack libc++ small strings <=14 bytes. They are stored on stack and first byte of the representation indicates if the string is small and you can get the string size. If it is small, we can atomically load 16 bytes to return a valid small string to a user. This requires using undocumented function __is_long() and very robust code which might end in really crazy things and bugs.

Also, on other platforms there are 16 byte atomic loads but they are not generated by all compilers and for all platforms. For example, for PowerPC GCC generates stq instruction (store quad) but Clang does not. For ARM the situation is the opposite. Also, it is hard to get old compilers for all the platforms and I decided to give up a bit on this. But this is absolutely possible.


To write the working code took me around 2 hours from scratch. Research on how the things are going around took me around a day. Then 5 days of arguing with people and why we need or don’t need this. But this definitely is one of the greatest ideas that came to my mind for the past couple of years. Also thanks to C++ that it is possible to do this crazy thing and gain the maximum we can do.

What am I doing with my life?

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: