Optimizing 128-bit Division

When it comes to hashing, 64 bits are sometimes not enough, for example because of the birthday paradox: an attacker can iterate through roughly 2^{32} random entities and, with some constant probability, find a collision, i.e. two different objects with the same hash. 2^{32} is around 4 billion objects, and with the computing power available in every machine today it is certainly achievable. That's why we sometimes need to increase the width of the hash to at least 128 bits. Unfortunately, this comes at a cost, because platforms and CPUs do not support 128 bit operations natively.
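
Concretely, for a 64 bit hash the standard birthday bound for k = 2^{32} random objects gives

P(\text{collision}) \approx 1 - e^{-k(k-1)/(2 \cdot 2^{64})} \approx 1 - e^{-1/2} \approx 0.39.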

Division has historically been the most complex operation on CPUs, and all optimization guidelines suggest avoiding it at all costs.

At my job I faced an interesting problem: optimizing 128 bit division from the abseil library in order to split some data across buckets with the help of 128 bit hashing (the number of buckets is not fixed for some uninteresting historical reasons). I found out that the division takes a really long time. The benchmarks from abseil on an Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz show some horrible results:

Benchmark                       Time(ns)  CPU(ns)
BM_DivideClass128UniformDivisor     13.8     13.8  // 128 bit by 128 bit
BM_DivideClass128SmallDivisor        168      168  // 128 bit by 64 bit

150+ nanoseconds for dividing a random 128 bit number by a random 64 bit number? Sounds crazy. For comparison, the 64 bit div instruction on x86-64 Skylake takes up to 76 cycles (on AMD processors it is much less), which is around 20-22ns.
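
For reference, converting cycles to time at this clock rate:

76 \text{ cycles} / 3.70\,\text{GHz} \approx 20.5\,\text{ns}.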

https://godbolt.org/z/o2vTZr

In reality everything is slightly better because of pipelined execution, and division has its own ALU, so if you divide something and do other work in the following instructions, you will get a lower average latency. Still, a 128 bit division cannot be 8x slower than a 64 bit one. You can find all the latencies in Agner Fog's instruction tables for most modern x86 CPUs. The truth is more complex, and division latency can even depend on the input values.

Agner Fog's instruction table for Skylake CPUs; the second-to-last column is the latency.

Even compilers, when dividing by a constant, try to use a precomputed reciprocal (an approximation of the multiplicative inverse scaled to a power of two) and replace the division with a multiplication and a few shifts afterwards:

https://gcc.godbolt.org/z/PRibsx
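
As an illustration of that lowering (my sketch, not the exact compiler output), a 64 bit unsigned division by 3 becomes a multiplication by the precomputed constant ceil(2^{65}/3) followed by a shift:

#include <stdint.h>

// u / 3 without a div instruction: multiply by the precomputed reciprocal and
// keep the high bits of the 128 bit product; compilers emit essentially this.
uint64_t div_by_3(uint64_t u) {
  unsigned __int128 product = (unsigned __int128)u * 0xAAAAAAAAAAAAAAABull;
  return (uint64_t)(product >> 65);
}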

Overall, given that only a few instructions such as sin and cos cost more, division is one of the most complex instructions in CPUs, and optimizations in that place matter a lot. My exact case was more or less the general one, except that I was dividing 128 bit by 64 bit numbers a bit more frequently. We are going to optimize the general case in LLVM.

We need to understand how 128 bit division works through the compiler stack.

https://gcc.godbolt.org/z/fB3aq2

It calls the __udivti3 function. Let's first understand how to read these function names. In runtime libraries the integer modes are:

QI: An integer that is as wide as the smallest addressable unit, usually 8 bits.
HI: An integer, twice as wide as a QI mode integer, usually 16 bits.
SI: An integer, four times as wide as a QI mode integer, usually 32 bits.
DI: An integer, eight times as wide as a QI mode integer, usually 64 bits.
SF: A floating point value, as wide as a SI mode integer, usually 32 bits.
DF: A floating point value, as wide as a DI mode integer, usually 64 bits.
TI: An integer, 16 times as wide as a QI mode integer, usually 128 bits.

So, __udivti3 is an unsigned division of TI (128 bit) integers; the trailing '3' means that it has 3 arguments including the return value. There is also a function __udivmodti4 which computes both the quotient and the remainder (division and modulo) and has 4 arguments including the return value. These functions are part of the runtime libraries which compilers provide by default: in GCC it is libgcc, in LLVM it is compiler-rt, and they are linked into almost every program if you have the corresponding toolchain. In LLVM, __udivti3 is a simple alias to __udivmodti4.
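
A minimal illustration (my example): with gcc or clang -O2 on x86-64 the following function compiles to a call to __udivti3, which is what the godbolt link above shows.

unsigned __int128 div128(unsigned __int128 a, unsigned __int128 b) {
  return a / b;  // lowered to a __udivti3 libcall
}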

#include "int_lib.h"
/*
typedef int si_int;
typedef unsigned su_int;
typedef long long di_int;
typedef unsigned long long du_int;
typedef int ti_int __attribute__((mode(TI))); // 128 signed
typedef unsigned tu_int __attribute__((mode(TI))); // 128 bit unsigned
*/
#ifdef CRT_HAS_128BIT
COMPILER_RT_ABI tu_int __umodti3(tu_int a, tu_int b) {
tu_int r;
__udivmodti4(a, b, &r);
return r;
}
// Returns: a % b
COMPILER_RT_ABI ti_int __modti3(ti_int a, ti_int b) {
const int bits_in_tword_m1 = (int)(sizeof(ti_int) * CHAR_BIT) – 1;
ti_int s = b >> bits_in_tword_m1; // s = b < 0 ? -1 : 0
b = (b ^ s) – s; // negate if s == -1
s = a >> bits_in_tword_m1; // s = a < 0 ? -1 : 0
a = (a ^ s) – s; // negate if s == -1
tu_int r;
__udivmodti4(a, b, &r);
return ((ti_int)r ^ s) – s; // negate if s == -1
}
#endif // CRT_HAS_128BIT
#include "int_lib.h"
/*
typedef int si_int;
typedef unsigned su_int;
typedef long long di_int;
typedef unsigned long long du_int;
typedef int ti_int __attribute__((mode(TI))); // 128 signed
typedef unsigned tu_int __attribute__((mode(TI))); // 128 bit unsigned
*/
#ifdef CRT_HAS_128BIT
// Returns: a / b
COMPILER_RT_ABI tu_int __udivti3(tu_int a, tu_int b) {
return __udivmodti4(a, b, 0);
}
COMPILER_RT_ABI ti_int __divti3(ti_int a, ti_int b) {
const int bits_in_tword_m1 = (int)(sizeof(ti_int) * CHAR_BIT) – 1;
ti_int s_a = a >> bits_in_tword_m1; // s_a = a < 0 ? -1 : 0
ti_int s_b = b >> bits_in_tword_m1; // s_b = b < 0 ? -1 : 0
a = (a ^ s_a) – s_a; // negate if s_a == -1
b = (b ^ s_b) – s_b; // negate if s_b == -1
s_a ^= s_b; // sign of quotient
return (__udivmodti4(a, b, (tu_int *)0) ^ s_a) – s_a; // negate if s_a == -1
}
#endif // CRT_HAS_128BIT

The __udivmodti4 function was translated from Figure 3-40 of The PowerPC Compiler Writer's Guide. Looking at it, it is clear that it was written a long time ago and things have changed since then.

First of all, let's start with something easy, like the shift-subtract algorithm that we have been learning since childhood. First, if divisor > dividend, then the quotient is zero and the remainder is the dividend, which is not an interesting case.

// dividend / divisor, remainder is stored in rem.
uint128 __udivmodti4(uint128 dividend, uint128 divisor, uint128* rem) {
  if (divisor > dividend) {
    if (rem)
      *rem = dividend;
    return 0;
  }
  // Calculate the distance between the most significant bits, 128 > shift >= 0.
  int shift = Distance(dividend, divisor);
  divisor <<= shift;
  uint128 quotient = 0;
  for (; shift >= 0; --shift) {
    quotient <<= 1;
    if (dividend >= divisor) {
      dividend -= divisor;
      quotient |= 1;
    }
    divisor >>= 1;
  }
  if (rem)
    *rem = dividend;
  return quotient;
}

The algorithm is simple: we align the numbers by their most significant bits; if the dividend is greater than or equal to the divisor, we subtract the divisor and set the low bit of the quotient, then shift the divisor right by one and repeat.
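
To make this concrete, here is a minimal 64 bit sketch of the same loop (my code, relying on GCC/Clang builtins and assuming a non-zero divisor); the 128 bit version does exactly the same work but with two registers per value.

#include <assert.h>
#include <stdint.h>

// Shift-subtract division: the quotient is built bit by bit, the remainder ends up in *rem.
static uint64_t shift_subtract_div(uint64_t dividend, uint64_t divisor, uint64_t* rem) {
  if (divisor > dividend) {
    *rem = dividend;
    return 0;
  }
  // Align the most significant bits of the divisor and the dividend.
  int shift = __builtin_clzll(divisor) - __builtin_clzll(dividend);
  divisor <<= shift;
  uint64_t quotient = 0;
  for (; shift >= 0; --shift) {
    quotient <<= 1;
    if (dividend >= divisor) {
      dividend -= divisor;
      quotient |= 1;
    }
    divisor >>= 1;
  }
  *rem = dividend;
  return quotient;
}

int main() {
  uint64_t r;
  assert(shift_subtract_div(1000, 7, &r) == 142 && r == 6);
  assert(shift_subtract_div(5, 8, &r) == 0 && r == 5);
}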

For a 128 bit division the for loop takes at most 128 iterations. The loop in LLVM is actually the fallback implementation, and we saw that it takes 150+ns to complete, because every iteration has to shift across multiple registers: 128 bit numbers are represented as a pair of 64 bit registers.

Now, let's dive into the architecture features. I noticed that while the compiler generates the divq instruction, it first frees up the rdx register.

The manual says that the divq instruction performs a 128 bit division of %rdx:%rax by the source operand, storing the quotient in %rax and the remainder in %rdx. After some experimenting with inline asm in C/C++, I figured out that if the quotient does not fit in 64 bits, SIGFPE is raised. See:

#include <cinttypes>

uint64_t div(uint64_t u1, uint64_t u0, uint64_t v) {
  uint64_t result;
  uint64_t remainder;
  __asm__("divq %[v]" : "=a"(result), "=d"(remainder) : [v] "r"(v), "a"(u0), "d"(u1));
  return result;
}

int main() {
  div(1, 0, 1);  // 2**64 / 1
}
/*
g++ -std=c++17 -O0 main.cpp -o main
./main
"./main" terminated by signal SIGFPE (Floating point exception)
*/

Compilers don't use this instruction for 128 bit division because they cannot know in advance whether the result will fit in 64 bits. Yet, if the high 64 bits of the dividend are smaller than the divisor, the result is guaranteed to fit into 64 bits and we can use this instruction. Since compilers don't generate divq here for their own reasons, we can use inline asm for x86-64.
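
To see why the result fits in 64 bits in that case:

\mathrm{dividend} = \mathrm{high} \cdot 2^{64} + \mathrm{low} < \mathrm{divisor} \cdot 2^{64} \Rightarrow \lfloor \mathrm{dividend}/\mathrm{divisor} \rfloor < 2^{64}.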

#if defined(__x86_64__)
inline uint64_t Divide128Div64To64(uint64_t high, uint64_t low,
                                   uint64_t divisor, uint64_t* remainder) {
  uint64_t result;
  __asm__("divq %[v]"
          // Output parameters: =a for rax, =d for rdx; [v] is an alias for the
          // divisor; input parameters "a" and "d" for low and high.
          : "=a"(result), "=d"(*remainder)
          : [v] "r"(divisor), "a"(low), "d"(high));
  return result;
}
#endif

tu_int __udivmodti4(tu_int dividend, tu_int divisor, tu_int* remainder) {
#if defined(__x86_64__)
  if (divisor.high == 0 && dividend.high < divisor.low) {
    remainder->high = 0;
    uint64_t quotient =
        Divide128Div64To64(dividend.high, dividend.low,
                           divisor.low, &remainder->low);
    return quotient;
  }
#endif
  // ...
}

What should we do if the high 64 bits of the dividend are not less than the divisor? The right answer is to use 2 divisions, because

\{\mathrm{hi}, \mathrm{lo}\} = \mathrm{hi} \cdot 2^{64} + \mathrm{lo} = \left(\left\lfloor \frac{\mathrm{hi}}{\mathrm{divisor}} \right\rfloor \mathrm{divisor} + \mathrm{hi_r}\right) 2^{64} + \mathrm{lo} = \left\lfloor \frac{\mathrm{hi}}{\mathrm{divisor}} \right\rfloor \mathrm{divisor} \cdot 2^{64} + \{\mathrm{hi_r}, \mathrm{lo}\}, \quad \text{where } \mathrm{hi_r} = \mathrm{hi} \bmod \mathrm{divisor}.

So, first we divide hi by the divisor, and then we divide {hi_r, lo} by the divisor; hi_r is guaranteed to be smaller than the divisor, so the second quotient is smaller than 2^{64}. We will get something like:

#if defined(__x86_64__)
inline uint64_t Divide128Div64To64(uint64_t high, uint64_t low,
                                   uint64_t divisor, uint64_t* remainder) {
  uint64_t result;
  __asm__("divq %[v]"
          // Output parameters: =a for rax, =d for rdx; [v] is an alias for the
          // divisor; input parameters "a" and "d" for low and high.
          : "=a"(result), "=d"(*remainder)
          : [v] "r"(divisor), "a"(low), "d"(high));
  return result;
}
#endif

tu_int __udivmodti4(tu_int dividend, tu_int divisor, tu_int* remainder) {
#if defined(__x86_64__)
  if (divisor.high == 0) {
    remainder->high = 0;
    if (dividend.high < divisor.low) {
      uint64_t quotient =
          Divide128Div64To64(dividend.high, dividend.low,
                             divisor.low, &remainder->low);
      return quotient;
    } else {
      tu_int quotient;
      // First division: the remainder is written back into dividend.high,
      // so the second call divides {hi_r, lo} by the divisor.
      quotient.high = Divide128Div64To64(0, dividend.high, divisor.low,
                                         &dividend.high);
      quotient.low = Divide128Div64To64(dividend.high, dividend.low,
                                        divisor.low, &remainder->low);
      return quotient;
    }
  }
#endif
  // ...
}

After that, the benchmarks improved significantly:

Benchmark                       Time(ns)  CPU(ns)
BM_DivideClass128UniformDivisor 11.9      11.9
BM_DivideClass128SmallDivisor   26.6      26.6

Only 26.6ns for small divisors, that’s a clear 6x win.

There are multiple options for what to do next, but now we know that both the dividend and the divisor have at least one bit set in their high registers, so the shift-subtract algorithm will take at most 64 iterations. Also, the quotient is guaranteed to fit in 64 bits, so we can use only the low register of the resulting quotient and save more shifts in the shift-subtract loop. That's why the uniform divisor case also improved slightly.
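
The quotient fits because the divisor now occupies the high register:

\mathrm{divisor} \geq 2^{64} \Rightarrow \lfloor \mathrm{dividend}/\mathrm{divisor} \rfloor \leq \mathrm{dividend}/2^{64} < 2^{128}/2^{64} = 2^{64}.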

One more optimization in the shift-subtract algorithm is to remove the branch inside the for loop (read it carefully, it should be understandable):

// dividend / divisor, remainder is stored in rem.
uint128 __udivmodti4(uint128 dividend, uint128 divisor, uint128* rem) {
  if (divisor > dividend) {
    if (rem)
      *rem = dividend;
    return 0;
  }
  // 64 bit divisor implementation
  // end
  // Calculate the distance between the most significant bits, 128 > shift >= 0.
  int shift = Distance(dividend, divisor);
  divisor <<= shift;
  uint128 quotient;
  quotient.low = 0;
  quotient.high = 0;
  for (; shift >= 0; --shift) {
    quotient.low <<= 1;
    // s is all ones if dividend >= divisor and all zeros otherwise.
    const int128 s = (int128)(divisor.all - dividend.all - 1) >> 127;
    quotient.low |= s & 1;
    dividend.all -= divisor.all & s;
    divisor >>= 1;
  }
  if (rem)
    *rem = dividend;
  return quotient;
}
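
The trick is that divisor - dividend - 1 is negative exactly when dividend >= divisor, so the arithmetic shift by 127 turns the sign into an all-ones or all-zeros mask. A 64 bit sketch of the same step (my code; like the original it relies on arithmetic right shifts of negative values):

#include <assert.h>
#include <stdint.h>

// One branchless iteration: s is all ones when *dividend >= divisor and all zeros
// otherwise, so the subtraction and the new quotient bit are applied without a branch.
static void branchless_step(uint64_t* dividend, uint64_t divisor, uint64_t* quotient) {
  *quotient <<= 1;
  const int64_t s = (int64_t)(divisor - *dividend - 1) >> 63;
  *quotient |= (uint64_t)s & 1;
  *dividend -= divisor & (uint64_t)s;
}

int main() {
  uint64_t dividend = 10, quotient = 0;
  branchless_step(&dividend, 7, &quotient);  // 10 >= 7: subtract and set the low bit
  assert(dividend == 3 && quotient == 1);
  branchless_step(&dividend, 7, &quotient);  // 3 < 7: only shift the quotient
  assert(dividend == 3 && quotient == 2);
}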

In the end, it saves about 0.4ns more for the uniform 128 bit divisor case.

And finally, I believe this is one of the best algorithms for dividing 128 bit numbers by 128 bit numbers. From the statistics, the case when the divisor fits in 64 bits is worth optimizing, and we showed that the additional checks on the high register of the divisor pay off through the stronger invariants they provide. Now let's see how other libraries perform in this case.

LibDivide

Libdivide is a small library targeting fast integer division: for example, if you divide by some fixed number a lot of times, there are techniques that precalculate a reciprocal and then multiply by it, and libdivide provides a very good interface for such optimizations. It also has some optimizations for 128 bit division. For example, the function libdivide_128_div_128_to_64 divides a 128 bit number by a 128 bit number when the result fits in 64 bits. In the case where both numbers are greater than or equal to 2^{64} it uses the following algorithm, taken from the Hacker's Delight book:

\begin{cases} n = 127 - \mathrm{MSB}(\mathrm{divisor}) & \text{the number of leading zero bits of the divisor, } 0 \leq n \leq 63 \\ \mathrm{divisor_1} = \lfloor \mathrm{divisor}/2^{64 - n} \rfloor \\ \mathrm{dividend_1} = \lfloor \mathrm{dividend}/2 \rfloor \end{cases}

With the instruction that divides a 128 bit dividend by a 64 bit divisor and produces a 64 bit result, we can compute

\mathrm{quotient_1} = \lfloor \mathrm{dividend_1}/\mathrm{divisor_1} \rfloor

Then we compute

\mathrm{quotient_0} = \lfloor \mathrm{quotient_1}/2^{63 - n} \rfloor.

It cannot overflow because \mathrm{quotient_1} < 2^{64}: the maximum value of \mathrm{dividend_1} is 2^{127} - 1 and the minimum value of \mathrm{divisor_1} is 2^{63}. Now let's show that

\lfloor \mathrm{dividend}/\mathrm{divisor} \rfloor \leq \mathrm{quotient_0} \leq  \lfloor \mathrm{dividend}/\mathrm{divisor} \rfloor + 1

\mathrm{quotient_0} = \left\lfloor \frac{\mathrm{dividend}}{2^{64 - n}\mathrm{divisor_1}} \right\rfloor = \left\lfloor \frac{\mathrm{dividend}}{2^{64 - n}\left\lfloor\frac{\mathrm{divisor}}{2^{64 - n}}\right\rfloor} \right\rfloor = \left\lfloor \frac{\mathrm{dividend}}{\mathrm{divisor} - (\mathrm{divisor} \bmod 2^{64 - n})} \right\rfloor = \left\lfloor \frac{\mathrm{dividend}}{\mathrm{divisor}} + \frac{\mathrm{dividend}\,(\mathrm{divisor} \bmod 2^{64 - n})}{\mathrm{divisor}\,(\mathrm{divisor} - (\mathrm{divisor} \bmod 2^{64 - n}))} \right\rfloor = \left\lfloor \frac{\mathrm{dividend}}{\mathrm{divisor}} + \delta \right\rfloor.

Now we want to show that \delta < 1. \delta is largest when the remainder in the numerator is as large as possible; it can be up to 2^{64 - n} - 1. Because of the definition of n, \mathrm{divisor} \geq 2^{127 - n}, so the smallest value of \mathrm{divisor} in the denominator with such a remainder is 2^{127 - n} + 2^{64 - n} - 1. That's why

\delta \leq \frac{\mathrm{dividend}(2^{64 - n} - 1)}{(2^{127 - n} + 2^{64 - n} - 1)2^{127 - n}} < \frac{\mathrm{dividend}(2^{64 - n} - 1)}{(2^{127 - n})^2}. As n ranges from 0 to 63, one can check that \delta < \frac{\mathrm{dividend}}{2^{128}} < 1. So we get either the correct quotient or the correct quotient plus one; everything else in the algorithm is just a correction of which result to choose.
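
A hedged sketch of that correction (my code, not libdivide's exact source): since the estimate is the true quotient or one too large, decrement it once, multiply back and fix up.

#include <assert.h>
#include <stdint.h>

// 'estimate' is either the true quotient or the true quotient plus one.
static uint64_t correct_estimate(unsigned __int128 dividend, unsigned __int128 divisor,
                                 uint64_t estimate) {
  if (estimate == 0)
    return 0;                      // then the true quotient is 0 as well
  uint64_t q = estimate - 1;       // q <= true quotient, so q * divisor <= dividend
  unsigned __int128 rest = dividend - (unsigned __int128)q * divisor;
  if (rest >= divisor)             // one more divisor still fits
    q += 1;
  return q;
}

int main() {
  unsigned __int128 divisor = ((unsigned __int128)1 << 64) + 5;
  unsigned __int128 dividend = 7 * divisor + 3;
  assert(correct_estimate(dividend, divisor, 7) == 7);  // exact estimate
  assert(correct_estimate(dividend, divisor, 8) == 7);  // estimate one too large
}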

Unfortunately, these corrections increase the latency of the benchmark pretty significantly:

Benchmark                                          Time(ns)  CPU(ns)
BM_DivideClass128UniformDivisor<LibDivideDivision>    26.3    26.3  
BM_RemainderClass128UniformDivisor<LibDivideDivision> 26.2    26.2
BM_DivideClass128SmallDivisor<LibDivideDivision>      25.8    25.8
BM_RemainderClass128SmallDivisor<LibDivideDivision>   26.3    26.3

So I decided to drop this idea after trying it.

GMP

GMP is the standard GNU library for arbitrary-precision arithmetic. It also has something for 128 bit by 64 bit division, and in my benchmark the following code worked:

struct GmpDiv {
  uint64_t operator()(uint64_t u1, uint64_t u0, uint64_t v, du_int* r) const {
    mp_limb_t q[2] = {u0, u1};
    mp_limb_t result[2] = {0, 0};
    *r = mpn_divrem_1(result, 0, q, 2, v);
    return result[0];
  }
};

It divides the two limbs by a uint64_t and returns the result. Unfortunately, the latency is much higher than expected, and on its own it does not cover the general 128 bit divisor case, since mpn_divrem_1 only divides by a single limb:

Benchmark                                          Time(ns)  CPU(ns)
BM_DivideClass128UniformDivisor<GmpDivision>          11.5    11.5
BM_RemainderClass128UniformDivisor<GmpDivision>       10.7    10.7
BM_DivideClass128SmallDivisor<GmpDivision>            47.5    47.5
BM_RemainderClass128SmallDivisor<GmpDivision>         47.8    47.8 

Conclusion

In the end I tried several methods of 128 bit division and came up with something that is the fastest among the popular alternatives. Here is the full code for the benchmarks, though it is quite hard to build; maybe later I will provide an easier way to install it. The final benchmarks are:

Benchmark                                           Time(ns)  CPU(ns)
---------------------------------------------------------------------
BM_DivideClass128UniformDivisor<absl::uint128>        13.7      13.7
BM_RemainderClass128UniformDivisor<absl::uint128>     14.0      14.0  
BM_DivideClass128SmallDivisor<absl::uint128>          169       169    
BM_RemainderClass128SmallDivisor<absl::uint128>       153       153    
BM_DivideClass128UniformDivisor<LLVMDivision>         12.6      12.6  
BM_RemainderClass128UniformDivisor<LLVMDivision>      12.3      12.3  
BM_DivideClass128SmallDivisor<LLVMDivision>           145        145    
BM_RemainderClass128SmallDivisor<LLVMDivision>        140        140    
BM_DivideClass128UniformDivisor<MyDivision1>          11.6      11.6
BM_RemainderClass128UniformDivisor<MyDivision1>       10.7      10.7
BM_DivideClass128SmallDivisor<MyDivision1>            25.5      25.5
BM_RemainderClass128SmallDivisor<MyDivision1>         26.2      26.2
BM_DivideClass128UniformDivisor<MyDivision2>          12.7      12.7  
BM_RemainderClass128UniformDivisor<MyDivision2>       12.8      12.8  
BM_DivideClass128SmallDivisor<MyDivision2>            36.9      36.9  
BM_RemainderClass128SmallDivisor<MyDivision2>         37.0      37.1  
BM_DivideClass128UniformDivisor<GmpDivision>          11.5      11.5  
BM_RemainderClass128UniformDivisor<GmpDivision>       10.7      10.7  
BM_DivideClass128SmallDivisor<GmpDivision>            47.5      47.5  
BM_RemainderClass128SmallDivisor<GmpDivision>         47.8      47.8  
BM_DivideClass128UniformDivisor<LibDivideDivision>    26.3      26.3  
BM_RemainderClass128UniformDivisor<LibDivideDivision> 26.2      26.2  
BM_DivideClass128SmallDivisor<LibDivideDivision>      25.8      25.8  
BM_RemainderClass128SmallDivisor<LibDivideDivision>   26.3      26.3

MyDivision1 is going to be upstreamed to LLVM; MyDivision2 will be the default version for all non x86-64 platforms, and it also has solid latency, much better than the previous implementation.

Future Work

However, the benchmarks are biased in the uniform divisor case: the distance between the most significant bits of the dividend and the divisor falls off exponentially, and once that distance grows beyond about 10-15 bits, my version becomes worse than the libdivide approach.

I also prepared a patch based on a recent research paper in https://reviews.llvm.org/D83547, where the reciprocal is computed beforehand and afterwards only multiplications happen. Yet, with a cold cache for the 512 byte lookup table it is worse than the already submitted approach. I also tried just the division by d_9 from the paper and it showed some inconsistent results which I don't understand for now.

Also, subscribe to my Twitter https://twitter.com/Danlark1 🙂

12 thoughts on “Optimizing 128-bit Division”

  1. Good work!

    Somehow many of my personal projects involve integer division, and hence I have spent a lot of time writing my own optimized integer division functions, because GCC's and LLVM's implementations were not good enough for me. So it is great to see that you have upstreamed your improvements to LLVM. Integer division has been a pain for me over the past 20 years, but over the past 2 years the first CPUs (i.e. IBM POWER9, Intel Cannon Lake) have appeared on the market with tremendously better integer division performance. For this reason I am reasonably optimistic that by the end of this decade integer division performance will be a fixed problem.


  2. Many years ago I faced the same scale of birthday paradox in Varnish Cache.

    After looking at the various options, I chose SHA256 as my hash function, trading a little bit of speed, for never having to think about that part of the code ever again.


  3. “Still, 128 bit division cannot be 8x slower than 64 bit division.”

    That sentence is backwards, right? 64-bit actually has the higher latency in that chart.


      1. Is the table mislabeled then? Are we reading the same thing? The table right below the third paragraph says:

        13.8 // 128 bit by 128 bit

        168 // 128 bit by 64 bit


    1. To be completely honest, it looks like libgcc has a very similar implementation when I disassembled it, also I was not able to find the entry point in the repository and I eventually gave up


      1. Oh, I see.
        BTW, it's in libgcc/libgcc2.c, under the name __udivmoddi4, which is replaced by the preprocessor with __udivmodti4.
        libgcc uses udiv_qrnnd, which is similar to your Divide128Div64To64.


  4. For “big” divisors, you optimized but not the general case.
    See https://skanthak.homepage.t-online.de/integer.html for a faster implementation.
    Also see https://skanthak.homepage.t-online.de/division.html for the shortcomings of the implementation for non-AMD64 machines.
    And finally see https://skanthak.homepage.t-online.de/msvc.html#comparision for benchmarks of various implementations of 64-bit by 64-bit division for 32-bit Intel processors, including those used by LLVM’s Compiler-RT library.


  5. Apropos “branch-free”: EVERY self-respecting optimizing C compiler SHOULD translate a sequence

    X <<= 1; if (Y >= Z) { Y -= Z; X += 1; }

    into machine code equivalent to the following AMD64 instructions (assuming that X lives in %rcx, Y in %rdx:%rax, Z in %r8:%r9, and that %r10:%r11 are scratch):

    MOVQ %rax,%r10; MOVQ %rdx,%r11; SUBQ %rax,%r9; SBBQ %rdx,%r8; CMOVB %r10,%rax; CMOVB %r11,%rdx; CMC; ADCQ %rcx,%rcx


  6. From your first paragraph, you suggest that iterating over 2^32 random elements can find a collision with some probability, but that probability is very low, 2^-32 (ie, 1 in 4 billion that you’d get a collision). To get 50% chance, you’d need to iterate through 2^63 random elements. Do you consider 2^-32 a manageable attack?


      The collision is between any two of the 2^32 elements, not between a fixed target and the 2^32 elements. This changes the expected probability very significantly.

