A collection of thoughts related to the challenges of software engineering

stay connected

Posts Tagged with ‘optimization’

March 27th, 2010

This morning was one of these mornings when you wake up with that kind of strange idea only sleep can bring you. It's a concept related to software optimization at the assembly level.

Before I explore this idea with you, I'd like to spend few paragraphs to talk about the vast and inspiring field of assembly optimization.

Optimizing code

Assembly optimization is such a huge topic that I won't even try to be exhaustive. This brief tour should help you understand where the idea comes from. Experts will forgive the generalities and approximations of this quick overview. Should you wish to know more, feel free to read the abundant and excellent documentation provided by Intel.

The natural approach

This first and most straightforward one consists in replacing a set of instructions by an equivalent, but faster, set of instructions.

An example of this is when you write

1
xor    eax, eax

instead of

1
mov    eax, 0

on the IA32.

These instructions both set the eax register to 0, except that the former is faster (This xor register, register instruction also has got other special powers I will tell you later about...)

Optimizing is in that case just a matter of search and replace, but unfortunately this technique will not lead you very far.

On our heavy-pipelined processors, you will additionally do everything you can to make the most of branch prediction.

Optimizing branches

Branch prediction is a set of heuristics built inside your CPU to avoid flushing the pipeline when branching. It determines the most likely branch when reaching a conditional instruction and preloads the guessed address. Unexplored branches are predicted using static heuristics, explored branches using statistics.

Making the most of branch predictions means disposing conditions in a way favorable for these heuristics, or even avoid branching in the first place through the use of conditional instructions. Unrolling loops is another way to tackle this as it replaces the loop by the equivalent repeated instructions. Unrolling has however to be done reasonably, as excessive use of this technique may result in a larger code that no longer fits in the trace cache.

Putting branches aside, arranging instructions also has got a great impact of performances, because of stalling.

Avoiding stalls

Nota bene : the following example oversimplifies the issue as it overlooks - amongst other things - parallelization on processors featuring multiple treatment units. In other words, rearranging the above instructions might not always yield a performance improvement, but it nevertheless shows the underlying concept.

1
2
3
4
mov    eax, 3
mov    ebx, 4
add    ebx, eax ; addition of 3 + 4
xor    ecx, ecx

The addition can only occur when the ebx register is ready, in other words, the addition will have to wait for the register to be fully set before doing anything. That's a stall. If you notice the instruction on ecx, you will see there is no dependence .

If you place the operation on ecx before the add, you have a potential performance gain because the instruction can execute right away without waiting for the previous one to finish. Sometimes dependencies are more subtle and are based on the flags. This is why - for example - it is generally recommended to avoid instructions such as inc and dec that only affect flags partially.

I earlier wrote that the xor register, register instruction has got some special powers. It can indeed be used to break the dependencies we just examined. When the processor encounters this instruction, it will discard all dependencies related to the register you reset, preventing stalls.

And so much more...

Memory layout is another major topic that is very hard to sum up in few words, but I need to at least name it. Just know that aligning instructions and data on cache boundaries can yield a significant performance improvement in optimizing loads and avoiding false sharing (very important for multithreaded code).

And then, you have also the matter of mixing floating point and single instructions multiple data (SIMD) instructions, partial register stalls, vectorization, stack optimization, store forwarding, locality enhancement...

When you realize the amount of techniques that can be used to optimize a code fragment, finding the ideal (or close to ideal) set of parameters for your processor is extremely difficult. This is where we go back to instrumentation, without instrumentation, you (or most likely the compiler) optimize a priori for your given architecture. With instrumentation, you make decisions based on hard evidence.

You need to instrument representative runs of your program to make it worthwhile, and sometimes, there isn't much the optimizer can do to improve the speed of your software.

This morning's idea

What about using natural selection to optimize a routine?

If you inject random optimizations in thousands of instances of a routine, benchmark them, discard the slower ones and keep working on the faster ones, will it eventually output the fastest possible implementation?

Obviously the following requirements would have to be met:

  • The routine needs to be testable for correctness to a reasonable extent as optimizations might result in modified behavior
  • It must be possible to generate a representative set of inputs in order to evaluate performance realistically
  • Enough computing power to mutate, spawn and test the routine as waiting one month for the result might not be acceptable...

The routine mutator would need to know a comprehensive set of optimization techniques for the given architecture, but it could also simply inject random opcodes and let the test phase discard non-working versions.

The approach can also be used to generate the smallest possible program, or any feature that can easily and objectively be measured.

When would that approach be reasonable? Probably for fundamental software bricks, such as an operating system scheduler, the pager or some driver's routine.

That and God of War.

April 28th, 2009

You probably heard at least once the hardware is free aphorism. Perhaps you even said it! The underlying notion is that optimizing code is generally more expensive than upgrading the machine or even buying a new one.

If you estimate the cost of an engineer at 500 € / day, 2 days of work are worth quite a nice upgrade. Spending one week to get a 20% performance boost can therefore sound crazy, notwithstanding the increased maintenance cost the optimization may incur.

Wrong.

First, your new computer has got maintenance costs. It needs to be installed and configured. It requires electricity to run. Even if this is just an upgrade you need to spend time making sure the upgrade works. And maybe upgrading the machine will break it and you will have to buy a new one. Or maybe your server room is full and you will need to find a new office, and finding a new office is something incredibly not fun.

Hardware is free means is that at some point, spending a lot of time optimizing your code is just a waste of time and money. When you start firing up assembly to get some more 10% extra "umph", it means the time to cool down has come.

The assertion is an intentional caricature of the underlying idea. Don't use it as an excuse to not improve your programs.

Are you going to request your customers to upgrade their machine?

Because, guess what, people love fast programs. Not only geeks: every user likes speed. Fast means well designed. It means clever. It means reliable. It means that the job will be done in a timely fashion. It means that if we were to be invaded by zombies, thanks to the fast program we would make it (the automatic shotgun would help as well though).

Every time I run a resource hogger I feel like I'm wasting my time. And then all sort of negative thoughts get associated with the resource hogger and the fate of the author. Think Clockwork Orange and you'll have a pretty good digest of what I mean by negative thoughts.

There is no good reason not to carefully review your design and your program when you hit a performance barrier. Maybe improving performances will require some head scratching, but hey, that's why you got two 30 inches screens and an awesome armchair. No, that wasn't so you could l33ch stuff all day while sharing with the rest of the world how awesome the last South Park episode was.

Engineering is hard. That's why it's expensive. That's why your customers pay premium for just a bunch of bytes on their hard disk.