Wednesday, September 14, 2016

C++ Compile, Part 1: Vectors vs Native Arrays in x86-64 Debug Builds

Learning to love assembly language

Have you ever looked at the assembly output of your program in a debugger? That assembly is what your program really does, and understanding it is crucial to getting the greatest performance out of your tools. Each short little symbol directly corresponds to an instruction on the CPU. Behind each is a number. And these days, those little numbers can actually be pretty powerful. Some of the things that the new Intel Broadwell or new AMD CPUs do are just remarkable.

"But the compiler will optimize it!" You might say. "I don't need to do this!" Not so fast! C++, and particularly the C it descended from, was a language designed to be a sort of shorthand for assembler. Back in the old days of C you could even say "use this variable in a register", and the compiler would try to oblige. Almost 50 years ago, some genuinely brilliant guys at AT&T decided they wanted to write a space war game, and along the way, they invented not only C, but Unix, to do it. C came about because they wanted something performant but also portable. All this business about pointers is an abstraction very close to how chips actually work - the developer thinks about addresses and arrays and math, and the compiler spits out the exact sequence of instructions in that architecture. So addresses and variables in memory are actually just numbers too and you can do math on them. But because of this, this shorthand heritage, C++ has a lot of rules about what it actually can and can't touch when it optimizes.

OverSAT is a good stomping ground for this kind of performance self-education. In SAT solvers, everything is crazy huge and everything needs to be crazy fast. The problem teaches us about both tight code and clever algorithms. The problem is well bounded, and it's real-world applicable. In the case of OverSAT, the string handling and conflict management have direct applications in big data, from geospatial analysis to complex financial models and even joins in in-memory databases. At the center of this, understanding how arrays and vectors work in C++ is beneficial. As I studied the problem, a big question emerged.

Question: Does it make a difference which style of looping I choose?

Answer: Yes, it does, in debug.

Viewing the Assembly Language For Your Code

You can view the assembly language for your application in Visual Studio 2015 by right-clicking over your code while you are stopped at a breakpoint in the debugger and choosing Go To Disassembly.

Comparing Styles of Looping

In all of the examples, we'll notice a few things. Register use is simplistic, owing to the debug build. The code you write is what you get - there's no optimization at all, and in fact there is extra code to take your results from registers and put them back into variables so you can step through your code and look at them. You won't find that in release builds, except where the language rules require it. Also, you can see in a debug build the extra penalty paid for having relocatable code - that is, the OS can move your stuff around when it loads it so it can squeeze your code in with other libraries. There's an extra jump when calling a static method. My guess is that's a jump table for relocation, but I reserve the right to be entirely wrong! Either way, there's no simple call instruction to call a method. It's a jump and then a call, and it's certainly not optimal. Debug is meant to be debug, but it is surprising how slow it really is.

Filling an Integer Array Using an Array Index

In this first example, we fill a simple integer array using an array index and a counter in a loop. The compiler output is vaguely reminiscent of the sort of code that a beginning assembly developer might write. It's a bit much code, but it is reasonably quick. You can see instructions there to increment the array index, and also that Intel has a swank instruction, lea, for computing the address of a simple array element. That particular instruction has historically been so peppy that you will sometimes see it used for simple math. But we're all exploring 64-bit assembly together, and that may not be the case today.
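For reference, a minimal sketch of this style (my own illustration, not the exact attached test case) looks like this:

#include <cstdio>

int main()
{
    const int kCount = 1000;
    int data[kCount];

    // Index-based fill: each pass computes the address of data[i],
    // which is where lea or a scaled addressing mode tends to show up.
    for (int i = 0; i < kCount; ++i)
        data[i] = i;

    std::printf("%d\n", data[kCount - 1]);  // use the result so the loop isn't pointless
    return 0;
}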

Filling an Integer Array Using a Pointer

In this example, we fill the array using a pointer. If we look at the generated assembly, we can see right away that the use of registers is not optimal. This will be the case with all the debug samples. What I believe is happening is that the compiler always generates code to write the register contents back to the variable, as this simplifies viewing variable contents in the debugger. That the optimizer uses registers more aggressively is probably tied to why you generally cannot view variables when looking at release code. Visual C++ probably doesn't have the mechanism needed to map register assignments back to variables for debugging purposes.
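A sketch of the pointer-based version (same caveat: an illustration, not the attached test case):

#include <cstdio>

int main()
{
    const int kCount = 1000;
    int data[kCount];

    // Pointer-based fill: walk a pointer from the first element to
    // one past the last, writing through it as it goes.
    int value = 0;
    for (int* p = data; p != data + kCount; ++p)
        *p = value++;

    std::printf("%d\n", data[kCount - 1]);
    return 0;
}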

Filling an Array Using STL Iterators

Now, let's see how a loop over a vector looks when using standard STL iterators. It's a mess! Look at all that slow code. The begin, end and operator++ in the STL are designed to mimic the "feel" of working with pointer types, except that, at least in debug, they all resolve to function calls. Instead of a speedy pointer manipulation in a register, you get slow function calls.
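Here is a sketch of the iterator style being discussed (an illustration, not the attached test case):

#include <cstdio>
#include <vector>

int main()
{
    std::vector<int> data(1000);

    // Iterator-based fill: begin(), end() and operator++ all mimic
    // pointer syntax, but in a debug build each one is a function call.
    int value = 0;
    for (std::vector<int>::iterator it = data.begin(); it != data.end(); ++it)
        *it = value++;

    std::printf("%d\n", data.back());
    return 0;
}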

Filling an Array Using STL Auto

C++11 has a brilliant new for (auto& v : collection) syntax. This syntax is designed to be a shortcut for the usual invocation of C++ iterators, and in that it works brilliantly. But how good is it for working with vectors? It is interesting to see that there appears to be no debug shortcut: for auto resolves to the same calls to the STL iterators when dealing with C++ vectors. I believe this is due to the C++ standard, which defines the loop in terms of begin() and end(). Clearly, though, we would prefer the vector case to resolve directly to a pointer- or array-based implementation. We shall see how this turns out once we investigate release builds in a subsequent post.
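A sketch of the range-based style (an illustration, not the attached test case):

#include <cstdio>
#include <vector>

int main()
{
    std::vector<int> data(1000);

    // Range-based for: the standard specifies this loop in terms of
    // begin() and end(), so in debug it expands to the same iterator calls.
    int value = 0;
    for (auto& v : data)
        v = value++;

    std::printf("%d\n", data.back());
    return 0;
}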

Filling an Integer Array Using STL Indices

While the iterators are designed for most kinds of containers, the C++ vector supports random access, as it is meant to be used in a fashion similar to a plain C [] array. Calling size() on the vector is a method call, and while it may be optimized away, in debug, hoisting it into a temporary pays off. Direct vector indexing produces the cleanest code of any of the vector iteration methods. The only call that is generated is the element access operator. That is the price you pay for bounds checking in debug mode with the C++ STL.
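A sketch of the indexed style with size() hoisted into a temporary (an illustration, not the attached test case):

#include <cstdio>
#include <vector>

int main()
{
    std::vector<int> data(1000);

    // Hoist size() into a temporary so that method call happens once,
    // then index the vector directly; only operator[] remains as a call.
    const std::size_t count = data.size();
    for (std::size_t i = 0; i < count; ++i)
        data[i] = static_cast<int>(i);

    std::printf("%d\n", data.back());
    return 0;
}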

Conclusion

In this post, we took a first crack at what 64-bit assembly looks like under the hood in debug builds of a simple set of test cases in C++. These are attached to the OverSAT project, and we'll post the link so you can download them from CodePlex and check them out for yourself. Next stop, we'll have a look at what release does. These experiments and others will help guide our thinking for the next round of OverSAT changes, and even help us with the game we shall build.

Now, tech leaders routinely preach that assembly language is a thing of the past and a dying art, the province of compiler writers that the rest of us need not be concerned with. Yet there is still room for performance code, especially when considering a choice between C++ and C-style implementations. Programming in assembly language is not for everyone. It's even weirder than C. Perhaps for many developers, the old refrain "a compiler can generate better code than you can" might well be true.

But we don't want to be just any developers on this blog! We're looking for an extra edge, and we want to understand the hardware so that we can tell our compiler how to use it.
