A collection of thoughts related to the challenges of software engineering

stay connected

Posts Tagged with ‘intel’

March 27th, 2010

This morning was one of these mornings when you wake up with that kind of strange idea only sleep can bring you. It's a concept related to software optimization at the assembly level.

Before I explore this idea with you, I'd like to spend few paragraphs to talk about the vast and inspiring field of assembly optimization.

Optimizing code

Assembly optimization is such a huge topic that I won't even try to be exhaustive. This brief tour should help you understand where the idea comes from. Experts will forgive the generalities and approximations of this quick overview. Should you wish to know more, feel free to read the abundant and excellent documentation provided by Intel.

The natural approach

This first and most straightforward one consists in replacing a set of instructions by an equivalent, but faster, set of instructions.

An example of this is when you write

1
xor    eax, eax

instead of

1
mov    eax, 0

on the IA32.

These instructions both set the eax register to 0, except that the former is faster (This xor register, register instruction also has got other special powers I will tell you later about...)

Optimizing is in that case just a matter of search and replace, but unfortunately this technique will not lead you very far.

On our heavy-pipelined processors, you will additionally do everything you can to make the most of branch prediction.

Optimizing branches

Branch prediction is a set of heuristics built inside your CPU to avoid flushing the pipeline when branching. It determines the most likely branch when reaching a conditional instruction and preloads the guessed address. Unexplored branches are predicted using static heuristics, explored branches using statistics.

Making the most of branch predictions means disposing conditions in a way favorable for these heuristics, or even avoid branching in the first place through the use of conditional instructions. Unrolling loops is another way to tackle this as it replaces the loop by the equivalent repeated instructions. Unrolling has however to be done reasonably, as excessive use of this technique may result in a larger code that no longer fits in the trace cache.

Putting branches aside, arranging instructions also has got a great impact of performances, because of stalling.

Avoiding stalls

Nota bene : the following example oversimplifies the issue as it overlooks - amongst other things - parallelization on processors featuring multiple treatment units. In other words, rearranging the above instructions might not always yield a performance improvement, but it nevertheless shows the underlying concept.

1
2
3
4
mov    eax, 3
mov    ebx, 4
add    ebx, eax ; addition of 3 + 4
xor    ecx, ecx

The addition can only occur when the ebx register is ready, in other words, the addition will have to wait for the register to be fully set before doing anything. That's a stall. If you notice the instruction on ecx, you will see there is no dependence .

If you place the operation on ecx before the add, you have a potential performance gain because the instruction can execute right away without waiting for the previous one to finish. Sometimes dependencies are more subtle and are based on the flags. This is why - for example - it is generally recommended to avoid instructions such as inc and dec that only affect flags partially.

I earlier wrote that the xor register, register instruction has got some special powers. It can indeed be used to break the dependencies we just examined. When the processor encounters this instruction, it will discard all dependencies related to the register you reset, preventing stalls.

And so much more...

Memory layout is another major topic that is very hard to sum up in few words, but I need to at least name it. Just know that aligning instructions and data on cache boundaries can yield a significant performance improvement in optimizing loads and avoiding false sharing (very important for multithreaded code).

And then, you have also the matter of mixing floating point and single instructions multiple data (SIMD) instructions, partial register stalls, vectorization, stack optimization, store forwarding, locality enhancement...

When you realize the amount of techniques that can be used to optimize a code fragment, finding the ideal (or close to ideal) set of parameters for your processor is extremely difficult. This is where we go back to instrumentation, without instrumentation, you (or most likely the compiler) optimize a priori for your given architecture. With instrumentation, you make decisions based on hard evidence.

You need to instrument representative runs of your program to make it worthwhile, and sometimes, there isn't much the optimizer can do to improve the speed of your software.

This morning's idea

What about using natural selection to optimize a routine?

If you inject random optimizations in thousands of instances of a routine, benchmark them, discard the slower ones and keep working on the faster ones, will it eventually output the fastest possible implementation?

Obviously the following requirements would have to be met:

  • The routine needs to be testable for correctness to a reasonable extent as optimizations might result in modified behavior
  • It must be possible to generate a representative set of inputs in order to evaluate performance realistically
  • Enough computing power to mutate, spawn and test the routine as waiting one month for the result might not be acceptable...

The routine mutator would need to know a comprehensive set of optimization techniques for the given architecture, but it could also simply inject random opcodes and let the test phase discard non-working versions.

The approach can also be used to generate the smallest possible program, or any feature that can easily and objectively be measured.

When would that approach be reasonable? Probably for fundamental software bricks, such as an operating system scheduler, the pager or some driver's routine.

That and God of War.