How to contribute to Abseil with a visible performance gain

Overview of Abseil

Abseil is a great C++ library with a special and unique philosophy inside it. While there is a lot of debate what C++ is prioritizing more – portability or performance with possibly breaking changes, Abseil library aims in a standard compatible way to provide best library features of all standards sometimes with a very good performance. No need to upgrade your compiler which may contain bugs which are extremely hard to debug (for example, if you are developing highly reliable software for rockets, being on the edge with the latest compilers is often not even a choice), no need to upgrade C++ language features with some possible breaking changes, just get Abseil and get all features from future standards which are already highly tested by millions of lines of code across Google.

As an example to show, all standards prior to C++17 lack of useful class std::optional<T> which provide an optional access to some type T. This class can be efficiently implemented even in C++11. By the way, optimizations for std::optional is another a very interesting topic – how to create such a simple class being 1500+ lines of code overall. So, Abseil implements this like

	#if __cplusplus >= 201703 && __has_include(<optional>)

	#include <optional>

	namespace absl {
	using std::bad_optional_access;
	using std::optional;
	using std::make_optional;
	using std::nullopt_t;
	using std::nullopt;
	} // namespace absl

	#else // __cplusplus >= 201703 && __has_include(<optional>)

	// Internal abseil implementation.
	// Under define not to generate more code to a compiler because std::optional
	// and absl::optional claim to be compatible.

	#endif

view raw optional.h hosted with ❤ by GitHub

Abseil very rough idea of implementing things

Then, you just use in your code absl::optional. After the update to C++17, it is a drop-in replacement of std::optional and you can replace one class to another. Abseil guarantees 5 years of support after the standard release and then workarounds can be removed for previous standards. Abseil does not guarantee ABI compatibility – this is a very debatable state but improving the things over the time is one of the humanity problems – we tend to make mistakes and a good thing is that we should learn how to fix them. Modern C++ compilers can do this in a very efficient way (clang-tidy, all kind of tooling, etc). That’s why if Abseil introduces some breaking change, it comes with a tool to fix all of your code. Google has more than 250 millions lines of C++ code – all breaking changes are firstly tested on Google’s codebase and then ported to open source. Dealing with that amount of code cannot be done with a human interaction and automation is one of the main priorities to change something – you break, you provide an automated tool to fix this. We will also talk about this a bit later.

Of course, not everything is that cool, I decided to spend an hour or so and to find some incompatibilities, and I managed to find them. Abseil provides std::string_view replacement absl::string_view which claims to be compatible with a standard version. And consider the following example: one is working in C++17 mode, the other is not in C++11 mode:

We just assign to std::string an object of type absl::string_view. It fails to compile in C++11 and is successful in C++17. Which, I mean, is not that big problem because people always move forward with a standard rather than backward.

This is a very neat example shows that forward compatibility guarantee is hard because of multiple interaction between classes. One of these is shown above – std::string provides an assignment operator accepting std::string_view in C++17. Personally I don’t understand why. std::string_view claims to be a class that if you want a conversion to an owned string, you should write it explicitly. And… There are very very few options in C++ doing so with an assignment operator efficiently.

For example, writing something like

s = std::string{sv};

Would be much less efficient because we create a temporary string.

For constructors there is a choice to make the constructor explicit or implicit. For assignment there is only one syntax, unfortunately.

Of course, we have assign(). Traditionally assign() mirrors constructors (not necessarily all, e.g. assigning a default-constructed string is s.clear() rather than s.assign()), and operator=() is a subset of assign() with a single parameter. I do not think there is a precedent for making some single-parameter assign() overload not available as operator=(). People from standard could have introduced such convention for explicit constructors but they did not. Actually, libc++ implementation calls assign(const char*, size_t) as you can see in the godbolt link. I claim this as a defect of a standard but not a huge one, of course. That way Abseil does not provide a full C++17 replacement for the types.

ABSL_FLAG and HOW_TO_GET_CRAZY

After some brief overview of the Abseil library, we can continue digging inside it even more. I want to talk more about the command line flags library. In the documentation it is stated (with my comments in italic):

The Abseil flags library allows programmatic access to flag values passed on the command-line to binaries. The Abseil Flags library provides the following features:

Access to Abseil flags in a thread-safe manner. This is a very interesting property to have for all flags.
…
Access to flag values that are valid at any point during a program’s lifetime. This means right here that they are global variables and we know that globals are harmful…
Allows distributed declaration and definition of flags, though this usage has drawbacks and should generally be avoided. extern hello darkness, my old friend

Values for these flags are parsed from the command line by absl::ParseCommandLine(). The resulting value for each flag is stored in a global variable of an unspecified type absl::Flag<T>.

TL; DR. Flags are global moduled variables that must be defined in .cc|.cpp|.cxx files (otherwise you will get an ODR violation) that can be declared and accessed by any code if you declare them in a translation unit and link with the definition.

I think this is one of the most convenient way to turn on/off features in big codebases if you are depending on some other APIs – it is easy to test features, to setup an environment. The other option I see in production binaries is to provide a context to propagate it through the entire program stack and have some kind of global config. Either way, we need something global and to avoid bloat, Google decided to use global variables.

And now come the most sophisticated parts. First, global variables are bad for multiple reasons

You should not define them in headers unless they are inlined (weak symbols) because you will have ODR violations (which variable to choose across translation units is an open question to a linker). And it is an undefined behavior in C++.
Global variables with dynamic storage duration should be created and used very carefully. The flags have default values and you can specify your own flags with any type. Many guidelines and even universities (especially in Russia) suggest creating singletons (and flags are singletons in a way) with the BadSingleton() function:

	// Very bad because the destruction order is not specified.
	// From experience, these are one of the hardest to debug bugs.
	template <class T>
	T BadSingleton() {
	static T t;
	return t;
	}

	// Better. It is NOT a leak because the compiler will clear it after the shutdown.
	// Downside — destuctors are not called. But you should never do complicated
	// things in destructors.
	template <class T>
	T BetterSingleton() {
	static T* t = new T();
	return *t;
	}

view raw singleton.h hosted with ❤ by GitHub

Bad and Good singletons

Flags library solves the last problem in an interesting way. With a definition they don’t create anything, literally anything. Let’s see how it is done:

	// ABSL_CONST_INIT
	//
	// A variable declaration annotated with the `ABSL_CONST_INIT` attribute will
	// not compile (on supported platforms) unless the variable has a constant
	// initializer. This is useful for variables with static and thread storage
	// duration, because it guarantees that they will not suffer from the so-called
	// "static init order fiasco". Prefer to put this attribute on the most visible
	// declaration of the variable, if there's more than one, because code that
	// accesses the variable can then use the attribute for optimization.
	// Note that this attribute is redundant if the variable is declared constexpr.
	#define ABSL_CONST_INIT [[clang::require_constant_initialization]]

	// This struct and corresponding overload to MakeDefaultValue are used to
	// facilitate usage of {} as default value in ABSL_FLAG macro.
	struct EmptyBraces {};

	template <typename T>
	T* MakeFromDefaultValue(T t) {
	return new T(std::move(t));
	}

	template <typename T>
	T* MakeFromDefaultValue(EmptyBraces) {
	return new T;
	}

	#define ABSL_FLAG_IMPL_DECLARE_DEF_VAL_WRAPPER(name, Type, default_value) \
	static void* AbslFlagsInitFlag##name() { \
	return absl::flags_internal::MakeFromDefaultValue<Type>(default_value); \
	}

	#define ABSL_FLAG(Type, name, default_value, help) \
	namespace absl /* block flags in namespaces */ {} \
	ABSL_FLAG_IMPL_DECLARE_DEF_VAL_WRAPPER(name, Type, default_value) \ // Creating a function to create a default flag
	ABSL_FLAG_IMPL_DECLARE_HELP_WRAPPER(name, help); \
	ABSL_CONST_INIT absl::Flag<Type> FLAGS_##name{ \ // Const init initialization (see above)
	ABSL_FLAG_IMPL_FLAGNAME(#name), ABSL_FLAG_IMPL_FILENAME(), \ // Both compile time pointers.
	&absl::flags_internal::FlagMarshallingOps<Type>, \ // Compile time pointer to dispatch the operations.
	absl::flags_internal::HelpArg<AbslFlagHelpGenFor##name>(0), \ // Compile time struct to generate –help values.
	&AbslFlagsInitFlag##name}; // Compile time pointer from line 34.

view raw flags.h hosted with ❤ by GitHub

Rough absl::Flag definition

And you can see that flag is a compile time definition – all pointers, all data is known at compile time and no memory is allocated and it is nicely generated in assembly code then. Also, remember that flags are thread safe, at least for big types it should contain mutex (actually, mutex is always needed because initialization requires more than just replacing the value). As mutex is not a trivial type, in this situation, absl::FlagImpl does a great trick:

	// This is reserved space for an absl::Mutex to guard flag data. It will be
	// initialized in FlagImpl::Init via placement new.
	// We can't use "absl::Mutex data_guard_", since this class is not literal.
	// We do not want to use "absl::Mutex* data_guard_", since this would require
	// heap allocation during initialization, which is both slows program startup
	// and can fail. Using reserved space + placement new allows us to avoid both
	// problems.
	alignas(absl::Mutex) mutable char data_guard_[sizeof(absl::Mutex)];

view raw flag_trick.h hosted with ❤ by GitHub

If you don’t know that the placement new is, look into https://en.cppreference.com/w/cpp/language/new#Placement_new

All this is done not only for safety reasons. My guess that some programs at Google can contains hundreds of thousands of different modules with multiple flags in each of it. If we allocate even something during the initialization, the binary startups linearly slow down which is a so huge and bad effect on the user experience and performance overall (even recovery latency, sometimes if the binary segfaults on a killer query, for example, search engine, it is better to have a very fast startup to continue working ASAP). As only a small fraction of flags are accessed during the execution, the decision to make the initialization lazy is a great performance and user experience solution.

Flags are not meant to be performance critical and used in tight loops though the reality can show something different. Many APIs create their options and access the flags in the constructors (for example, to distinguish test and prod environment). And it becomes a performance critical feature because options can be created thousands of times with a multi-thread flags access. Though, I think calling absl::SetFlag happens really rare and might not be have that performance hefty.

What we know for now about flags:

They are global variables with lazy initialization.
They might be read performance critical.
They are thread safe in their access.

These properties are a bit controversy because thread safety property is a really complex issue with a templated arbitrary type. In general case it is done by holding a mutex because it cannot be done in any other way. To make the situation a bit better, flags did an optimization for trivial types like integers and floats (I put a lot of comments inside the snippet, please read them all):

	// This macro is the "source of truth" for the list of supported flag types we
	// expect to perform lock free operations on. Specifically it generates code,
	// a one argument macro operating on a type, supplied as a macro argument, for
	// each type in the list.
	#define ABSL_FLAGS_INTERNAL_FOR_EACH_LOCK_FREE(A) \
	A(bool) \
	A(short) \
	A(unsigned short) \
	A(int) \
	A(unsigned int) \
	A(long) \
	A(unsigned long) \
	A(long long) \
	A(unsigned long long) \
	A(double) \
	A(float)

	constexpr int64_t AtomicInit() { return 0xababababababababll; }

	class FlagImpl {
	…
	// Init this atomic with some garbage where
	// the collisions are negligible.
	std::atomic<int64_t> atomic_{AtomicInit()};
	…
	};

	void FlagImpl::StoreAtomic() {
	// Get the size of a type.
	size_t data_size = Sizeof(op_);
	// If it fits into the word on 64 bit systems.
	if (data_size <= sizeof(int64_t)) {
	int64_t t = 0;
	// Copy it inside this atomic.
	std::memcpy(&t, cur_, data_size);
	atomic_.store(t, std::memory_order_release);
	}
	}

	template <typename T>
	bool AtomicGet(T* v) const {
	// Load the byte representation.
	const int64_t r = atomic_.load(std::memory_order_acquire);
	// If it is not atomic init which means that the atomic
	// value was initialized with something but not a garbage.
	if (r != flags_internal::AtomicInit()) {
	// Copy it inside the value.
	// Compilers are very smart enough to generate
	// couple of cmov, mov instructions when the memcpy size is known.
	std::memcpy(v, &r, sizeof(T));
	// Return that we successfully returned a true value depending
	// on a "garbage heuristic".
	return true;
	}
	// In that case we do believe that the flag was not yet initialized.
	return false;
	}

	#define ABSL_FLAGS_ATOMIC_GET(T) \
	T GetFlag(const absl::Flag<T>& flag) { \
	T result; \
	/* Try to get the heuristic */ \
	if (flag.AtomicGet(&result)) { \
	/* If it was initialized, return*/ \
	return result; \
	} \
	/* Use slow mutex holding impl */ \
	return flag.Get(); \
	}

	ABSL_FLAGS_INTERNAL_FOR_EACH_LOCK_FREE(ABSL_FLAGS_ATOMIC_GET)

	// For other types use mutex holding implementation.
	template <typename T>
	ABSL_MUST_USE_RESULT T GetFlag(const absl::Flag<T>& flag) {
	return flag.Get();
	}

view raw lock_free_flags.h hosted with ❤ by GitHub

So, the implementation is holding one atomic of 8 bytes and tries to get the fast path if the flag was initialized based on a heuristic in that way that if AtomicGet returned true, that means that flag was initialized and we can safely return.

This generates very good assembly with few instructions and the benchmark for flags access showed the following results for all the one worded trivial types (I show only for bool for simplicity):

Benchmark                                     Time(ns) CPU(ns) Iterations
-------------------------------------------------------------------------
(BM_ManyReadersFlag<bool, FLAG_b>)/threads:1  0.464    0.464   1000000000
(BM_ManyReadersFlag<bool, FLAG_b>)/threads:2  0.233    0.465   1445775046
(BM_ManyReadersFlag<bool, FLAG_b>)/threads:4  0.126    0.476   1427829560
(BM_ManyReadersFlag<bool, FLAG_b>)/threads:8  0.136    0.875    576009808
(BM_ManyReadersFlag<bool, FLAG_b>)/threads:16 0.097    0.963    673725472
(BM_ManyReadersFlag<bool, FLAG_b>)/threads:32 0.086    0.963    691627808

We can conclude that it is extremely fast and does not degrade much with the number of threads increased. It is achieved because of std::memory_order_acquire in atomic load and x86_64 with its strong memory model (almost each instruction guarantees release and acquire semantics) generates around the following code which is efficient, small and extremely fast:

	movq (%rdi), %rax # load from the input
	xorl %ecx, %ecx
	movabsq $-6076574518398440533, %rdx # imm = 0xABABABABABABABAB
	cmpq %rdx, %rax # Check with a garbage value
	cmovel %ecx, %eax # return it in %rax
	retq

view raw atomic_get.s hosted with ❤ by GitHub

The benchmark of course is far from ideal (guess why).

For all other types absl::Mutex is held for accessing the value which has around 17ns latency for holding in one threaded mode and around 150ns in 4 threaded mode. Hi, latency.

My concerns about all above

And here comes my biggest concern – I don’t feel this API guarantees the right thing enough (I mean, atomic hack). Though, in a true way this is not a guarantee but looking at the implementation while programming and profiling absolutely unrelated thing, I put this thing into the list to look into because I felt through my veins the lack of completeness around the API and it bothered me a lot. I got the feeling about all this but the most fascinating part that I was not able to formulate it in the next couple of hours but I caught that feeling and it got inside me that in my background thoughts I was thinking and trying to get all the right words to understand what really bothered me. Okay, let’s roast a bit this implementation.

This works bad for other trivial types like enums that fit into the 8 byte register. And why are we holding mutex inside it for them, are these types special?
Why do we have really special types? Yes, they are built-in, but they are not general, they optimize many things but not all.
This is not scalable and this is the biggest concern. What if I want to provide other lock free types? I should list them in that list. What? Why? I mean, I should not, this involves circular dependencies and additional macro expansions. C++ is much more powerful to deal with such things at compile time.
absl::Flag provides absl::Duration type implementation which is 12 bytes in size. C’mon, modern processors have 16 and 32 byte registers, can we operate with them in an atomic way? I should investigate all these issues. Also, absl::Duration turned out to be very popular flag and it potentially can save many CPUs and latency which is hard to measure just running some benchmarks.
Oh, what if I can do this for 16 bytes registers, I can hack small string optimization (will talk about this a bit later).

When I told all this to one of the Abseil team members, I got around this feedback (stolen Russian meme):

Some of the Abseil team members were skeptical at first but I decided not to say any more words and make a prototype of this.

Let’s implement this

First, we need to create a trait to understand whether the type can fit into the atomic. As we use memcpy, the type must be trivially copyable according to a C++ standard. So, I came up with something like:

	// The minimum atomic size we believe to generate lock free code, i.e. all
	// trivially copyable types not bigger this size generate lock free code.
	static constexpr int kMinLockFreeAtomicSize = 8;

	template <typename T>
	struct IsAtomicFlagTypeTrait {
	static constexpr bool value = (sizeof(T) <= kMinLockFreeAtomicSize
	&& std::is_trivially_copyable<T>::value);
	}

	// If not atomic fittable type, use mutex implementation.
	template <typename T,
	typename std::enable_if<
	!flags_internal::IsAtomicFlagTypeTrait<T>::value, int>::type = 0>
	ABSL_MUST_USE_RESULT T GetFlag(const absl::Flag<T>& flag) {
	return flag.Get();
	}

	// If it is a trivially copyable type, use atomic hack.
	template <typename T,
	typename std::enable_if<
	flags_internal::IsAtomicFlagTypeTrait<T>::value, int>::type = 0>
	ABSL_MUST_USE_RESULT T GetFlag(const absl::Flag<T>& flag) {
	T result;
	if (flag.AtomicGet(&result.value)) {
	return result.value;
	}
	return flag.Get();
	}

view raw flag_trait.h hosted with ❤ by GitHub

And it failed so many times during the testing because … because it is C++ and it is going to punch you in the face all the time when you are not doing really correct things.

First, trivially copyable type might not be default constructible, i.e. we cannot write T value;. To work around this, we need to either now how it can be worked around either remember some other examples. For example, optional<T> can be created with a non default constructible type. Looking into the Abseil implementation, we can see a very interesting thing:

	struct empty_struct {};

	struct dummy_type {
	static_assert(sizeof(T) % sizeof(empty_struct) == 0, "");
	// Use an array to avoid GCC 6 placement-new warning.
	empty_struct data[sizeof(T) / sizeof(empty_struct)];
	};

	// Data storage
	union {
	T data_;
	dummy_type dummy_;
	};

view raw union_hack.h hosted with ❤ by GitHub

It is using a union type, indeed if we put a type into the union, it just prepares the memory layout for it. So, I ended up in something like:

	// The minimum atomic size we believe to generate lock free code, i.e. all
	// trivially copyable types not bigger this size generate lock free code.
	static constexpr int kMinLockFreeAtomicSize = 8;

	template <typename T>
	struct IsAtomicFlagTypeTrait {
	static constexpr bool value = (sizeof(T) <= kMinLockFreeAtomicSize
	&& std::is_trivially_copyable<T>::value);
	}

	// If not atomic fittable type, use mutex implementation.
	template <typename T,
	typename std::enable_if<
	!flags_internal::IsAtomicFlagTypeTrait<T>::value, int>::type = 0>
	ABSL_MUST_USE_RESULT T GetFlag(const absl::Flag<T>& flag) {
	return flag.Get();
	}

	// If it is a trivially copyable type, use atomic hack.
	template <typename T,
	typename std::enable_if<
	flags_internal::IsAtomicFlagTypeTrait<T>::value, int>::type = 0>
	ABSL_MUST_USE_RESULT T GetFlag(const absl::Flag<T>& flag) {
	// T might not be default constructible.
	union U {
	T value;
	U() {}
	};
	U result;
	if (flag.AtomicGet(&result.value)) {
	return result.value;
	}
	return flag.Get();
	}

view raw flat_trait_2.h hosted with ❤ by GitHub

Run tests again and they again failed. ARGH, now it is not working on all compilers. GCC 4.x+ is failing but luckily Abseil has a workaround this (look at line 8).

	namespace type_traits_internal {
	// is_trivially_copyable()
	//
	// Determines whether the passed type `T` is trivially copyable.
	//
	// This metafunction is designed to be a drop-in replacement for the C++11
	// `std::is_trivially_copyable()` metafunction for platforms that have
	// incomplete C++11 support (such as libstdc++ 4.x). We use the C++17 definition
	// of TriviallyCopyable.
	//
	// NOTE: `is_trivially_copyable<T>::value` is `true` if all of T's copy/move
	// constructors/assignment operators are trivial or deleted, T has at least
	// one non-deleted copy/move constructor/assignment operator, and T is trivially
	// destructible. Arrays of trivially copyable types are trivially copyable.
	//
	// We expose this metafunction only for internal use within absl.
	template <typename T>
	class is_trivially_copyable_impl {
	using ExtentsRemoved = typename std::remove_all_extents<T>::type;
	static constexpr bool kIsCopyOrMoveConstructible =
	std::is_copy_constructible<ExtentsRemoved>::value \|\|
	std::is_move_constructible<ExtentsRemoved>::value;
	static constexpr bool kIsCopyOrMoveAssignable =
	absl::is_copy_assignable<ExtentsRemoved>::value \|\|
	absl::is_move_assignable<ExtentsRemoved>::value;

	public:
	static constexpr bool kValue =
	(__has_trivial_copy(ExtentsRemoved) \|\| !kIsCopyOrMoveConstructible) &&
	(__has_trivial_assign(ExtentsRemoved) \|\| !kIsCopyOrMoveAssignable) &&
	(kIsCopyOrMoveConstructible \|\| kIsCopyOrMoveAssignable) &&
	is_trivially_destructible<ExtentsRemoved>::value &&
	// We need to check for this explicitly because otherwise we'll say
	// references are trivial copyable when compiled by MSVC.
	!std::is_reference<ExtentsRemoved>::value;
	};

	template <typename T>
	struct is_trivially_copyable
	: std::integral_constant<
	bool, type_traits_internal::is_trivially_copyable_impl<T>::kValue> {};
	} // namespace type_traits_internal

view raw gcc_bug.h hosted with ❤ by GitHub

So, after all miscompilations, bugs, finally we have a version working for all trivially copyable types.

And it turned out to be breaking again. 🙂

Before that all trivial types were defined in .cc file which has external linkage. Assume now that you have a shared library that accesses some trivial flag. Briefly, see

	// flag.h
	template <typename T>
	class Flag {
	public:
	T Get() const {
	// See implementation notes in CommandLineFlag::Get().
	union U {
	T value;
	U() {}
	~U() { value.~T(); }
	};
	U u;

	// Operations are good to check that the flag was declared with the same type.
	// IF THIS COMES FROM SHARED library.
	impl_.Read(&u.value, &flags_internal::FlagOps<T>);
	return std::move(u.value);
	}
	}

	// flag.cc
	void FlagImpl::Read(void* dst, const flags_internal::FlagOpFn dst_op) const {
	absl::ReaderMutexLock l(DataGuard());

	// `dst_op` is the unmarshaling operation corresponding to the declaration
	// visibile at the call site. `op` is the Flag's defined unmarshalling
	// operation. They must match for this operation to be well-defined.
	// THIS IS NOT TRUE ANYMORE AS POINTERS FROM SHARED LIBRARIES ARE
	// DIFFERENT FROM STATIC INITIALIZATION.
	if (ABSL_PREDICT_FALSE(dst_op != op_)) {
	ABSL_INTERNAL_LOG(
	ERROR,
	absl::StrCat("Flag '", Name(),
	"' is defined as one type and declared as another"));
	}
	CopyConstruct(op_, cur_, dst);
	}

view raw flag.h hosted with ❤ by GitHub

As now the linkage is internal, Get on its first initialization will get the pointer that was not firstly provided inside the flag definition because shared libraries are loaded with another addresses dynamically. So, it will result in a log statement which can pollute a bit the logs of the execution. This check is here because we want to check that the flag definition is the same as its declaration, to do this we need to check that types are the same and it is done in not the best way. The best way will be to check the typeids but this comes with some cost. Overall, we decided not to patch it right now as shared libraries for flags are already broken (for all non trivial types) and rare.

But this is funny – I changed 3 places writing pretty understandable code and it turned out that all 3 places are breaking changes and ended up with something ugly and crunchy. C++ in a nutshell. And https://www.hyrumslaw.com/.

Okay, let’s assume we are finished with trivial types. Now comes the boss called 16 byte atomics. I really think that the theoretical knowledge helps in this case. Everyone should know that in order to create an atomic register we need a CAS operation on the register. For example, Introduction to Reliable and Secure Distributed Programming book helped me a lot to understand how memory models work from its theoretical to practical parts. To be dull, I just googled CAS 16 bytes and figured out that there is an instruction cmpxch16b on modern x86_64 platforms (by modern I mean westmere from 2010).

It turned out that even windows 8.1 requires this instruction for installation. This instruction comes with a bigger cost (by the way, I still don’t know why 16 byte loads do not meet strong model memory consistency but this is an absolutely another topic).

However, this instruction comes with a cost because it requires lock prefix. I am not an x86_64 memory model expert but my understanding what will happen:

The cost of the x86 LOCK prefix, or CAS, before PentiumPro, is a memory access (like a cache miss), + stopping memory operations by other processors, + any contention with other processors trying to LOCK the bus. However, since PentiumPro, for Writeback (i.e. cacheable) memory (all memory an app deals with, unless you talk directly with hardware), instead of blocking all memory operations, only the relevant cacheline is blocked (based on the link posted above).

Actually, the CAS case can be more complicated, as explained on this page, with no timings but an insightful description by a trustworthy engineer.

Not going into too much detail, I’ll say that a LOCKed operation costs one cache miss + the possible contention with other processor on the same cacheline, while CAS + the preceding load (which is almost always required except on mutexes, where you always CAS 0 and 1) can cost two cache misses. L1 Cache miss is around 4ns, so we can wait for 8ns latency for the lock prefix. Cachelines are always at least 16 bytes, so probably we should expect something like this.

It is no way <=1ns with one word atomics. I decided that we need a dispatch for this. So, for trivially copyable types >8 bytes and not bigger than 16 we can generate two words atomic while for others we should stick with a very nice and fast mov instructions. I ended up implementing something like this:

	// The minimum atomic size we believe to generate lock free code, i.e. all
	// trivially copyable types not bigger this size generate lock free code.
	static constexpr int kMinLockFreeAtomicSize = 8;

	// The same as kMinLockFreeAtomicSize but maximum atomic size. As double words
	// might use two registers, we want to dispatch the logic for them.
	#if defined(ABSL_FLAGS_INTERNAL_ATOMIC_DOUBLE_WORD)
	static constexpr int kMaxLockFreeAtomicSize = 16;
	#else
	static constexpr int kMaxLockFreeAtomicSize = 8;
	#endif

	// We can use atomic in cases when it fits in the register, trivially copyable
	// in order to make memcpy operations.
	template <typename T>
	struct IsAtomicFlagTypeTrait {
	static constexpr bool value =
	(sizeof(T) <= kMaxLockFreeAtomicSize &&
	type_traits_internal::is_trivially_copyable<T>::value);
	};

	// Clang does not always produce cmpxchg16b instruction when alignment of a 16
	// bytes type is not 16.
	struct alignas(16) FlagsInternalTwoWordsType {
	int64_t first;
	int64_t second;
	};

	constexpr bool operator==(const FlagsInternalTwoWordsType& that,
	const FlagsInternalTwoWordsType& other) {
	return that.first == other.first && that.second == other.second;
	}
	constexpr bool operator!=(const FlagsInternalTwoWordsType& that,
	const FlagsInternalTwoWordsType& other) {
	return !(that == other);
	}

	constexpr int64_t SmallAtomicInit() { return 0xababababababababll; }

	template <typename T, typename S = void>
	struct BestAtomicType {
	using type = int64_t;
	static constexpr int64_t AtomicInit() { return SmallAtomicInit(); }
	};

	template <typename T>
	struct BestAtomicType<
	T, typename std::enable_if<(kMinLockFreeAtomicSize < sizeof(T) &&
	sizeof(T) <= kMaxLockFreeAtomicSize),
	void>::type> {
	using type = FlagsInternalTwoWordsType;
	static constexpr FlagsInternalTwoWordsType AtomicInit() {
	return {SmallAtomicInit(), SmallAtomicInit()};
	}
	};

	template <typename T>
	class Flag {
	…
	// For some types, a copy of the current value is kept in an atomically
	// accessible field.
	union Atomics {
	// Using small atomic for small types.
	std::atomic<int64_t> small_atomic;
	template <typename T,
	typename K = typename std::enable_if<
	(sizeof(T) <= kMinLockFreeAtomicSize), void>::type>
	int64_t load() const {
	return small_atomic.load(std::memory_order_acquire);
	}

	#if defined(ABSL_FLAGS_INTERNAL_ATOMIC_DOUBLE_WORD)
	// Using big atomics for big types.
	std::atomic<FlagsInternalTwoWordsType> big_atomic;
	template <typename T, typename K = typename std::enable_if<
	(kMinLockFreeAtomicSize < sizeof(T) &&
	sizeof(T) <= kMaxLockFreeAtomicSize),
	void>::type>
	FlagsInternalTwoWordsType load() const {
	return big_atomic.load(std::memory_order_acquire);
	}
	constexpr Atomics()
	: big_atomic{FlagsInternalTwoWordsType{SmallAtomicInit(),
	SmallAtomicInit()}} {}
	#else
	constexpr Atomics() : small_atomic{SmallAtomicInit()} {}
	#endif
	};
	Atomics atomics_{};
	…
	};

view raw flag_final.h hosted with ❤ by GitHub

The benchmark showed 7.6ns accessing absl::Duration flag in 1 threaded mode and 10 times better in 16 threaded mode. So, we saved a lot of contention, provided lock free implementation for bigger types increasing the flag size only by 8 bytes.

Funny fact that GCC since version 7.1 does not generate 16 byte atomic lock free code because

IIRC this was removed as the instruction cannot be used for read only memory.
Andrew Pinski 2017-05-25 12:49:51 UTC. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

I barely understand what this means but the fact remains the fact. GCC does not provide this instruction in any of the last implementations. Sorry, Clang, you win this game. See https://gcc.godbolt.org/z/zzicCu. And I turned it on for all clang compilers that have -mcx16 option inside. Windows does a complete garbage here and I don’t even think about it right now.

After I sent the patch on the review, we got 2 additional meetings, around 3000 words pretty tough arguing if we need this, checking that nothing else breaks. (My side note – I find sometimes very hard to convince people at Google doing some crazy things but with enough intention, good intuition, it is possible in short amount of time). I was polishing it and making the code more readable. In the end it turned out completely insane and now I FEEL this is way clearer in its internals rather than it was before.

As it is a well known fact that Google uses LLVM infrastructure, my change might possibly be useful to all services. This patch was open sourced couple of days ago under the name absl::GetFlag is lock free for small trivially copyable types. with other internal changes 🙂 .

Maybe this is not the end. The way I told you, it might be possible to hack libc++ small strings <=14 bytes. They are stored on stack and first byte of the representation indicates if the string is small and you can get the string size. If it is small, we can atomically load 16 bytes to return a valid small string to a user. This requires using undocumented function __is_long() and very robust code which might end in really crazy things and bugs.

Also, on other platforms there are 16 byte atomic loads but they are not generated by all compilers and for all platforms. For example, for PowerPC GCC generates stq instruction (store quad) but Clang does not. For ARM the situation is the opposite. Also, it is hard to get old compilers for all the platforms and I decided to give up a bit on this. But this is absolutely possible.

Epilogue

To write the working code took me around 2 hours from scratch. Research on how the things are going around took me around a day. Then 5 days of arguing with people and why we need or don’t need this. But this definitely is one of the greatest ideas that came to my mind for the past couple of years. Also thanks to C++ that it is possible to do this crazy thing and gain the maximum we can do.

What am I doing with my life?

How to contribute to Abseil with a visible performance gain

Overview of Abseil

ABSL_FLAG and HOW_TO_GET_CRAZY

My concerns about all above

Let’s implement this

Next

Epilogue

2 thoughts on “How to contribute to Abseil with a visible performance gain”

Leave a comment Cancel reply

Overview of Abseil

ABSL_FLAG and HOW_TO_GET_CRAZY

My concerns about all above

Let’s implement this

Next

Epilogue

Share this:

Related

2 thoughts on “How to contribute to Abseil with a visible performance gain”

Leave a comment Cancel reply