This morning was one of these mornings when you wake up with that kind of strange idea only sleep can bring you. It's a concept related to software optimization at the assembly level.
Before I explore this idea with you, I'd like to spend few paragraphs to talk about the vast and inspiring field of assembly optimization.
Optimizing code
Assembly optimization is such a huge topic that I won't even try to be exhaustive. This brief tour should help you understand where the idea comes from. Experts will forgive the generalities and approximations of this quick overview. Should you wish to know more, feel free to read the abundant and excellent documentation provided by Intel.
The natural approach
This first and most straightforward one consists in replacing a set of instructions by an equivalent, but faster, set of instructions.
An example of this is when you write
1 | xor eax, eax |
instead of
1 | mov eax, 0 |
on the IA32.
These instructions both set the eax register to 0, except that the former is faster (This xor register, register instruction also has got other special powers I will tell you later about...)
Optimizing is in that case just a matter of search and replace, but unfortunately this technique will not lead you very far.
On our heavy-pipelined processors, you will additionally do everything you can to make the most of branch prediction.
Optimizing branches
Branch prediction is a set of heuristics built inside your CPU to avoid flushing the pipeline when branching. It determines the most likely branch when reaching a conditional instruction and preloads the guessed address. Unexplored branches are predicted using static heuristics, explored branches using statistics.
Making the most of branch predictions means disposing conditions in a way favorable for these heuristics, or even avoid branching in the first place through the use of conditional instructions. Unrolling loops is another way to tackle this as it replaces the loop by the equivalent repeated instructions. Unrolling has however to be done reasonably, as excessive use of this technique may result in a larger code that no longer fits in the trace cache.
Putting branches aside, arranging instructions also has got a great impact of performances, because of stalling.
Avoiding stalls
Nota bene : the following example oversimplifies the issue as it overlooks - amongst other things - parallelization on processors featuring multiple treatment units. In other words, rearranging the above instructions might not always yield a performance improvement, but it nevertheless shows the underlying concept.
1 2 3 4 | mov eax, 3 mov ebx, 4 add ebx, eax ; addition of 3 + 4 xor ecx, ecx |
The addition can only occur when the ebx register is ready, in other words, the addition will have to wait for the register to be fully set before doing anything. That's a stall. If you notice the instruction on ecx, you will see there is no dependence .
If you place the operation on ecx before the add, you have a potential performance gain because the instruction can execute right away without waiting for the previous one to finish. Sometimes dependencies are more subtle and are based on the flags. This is why - for example - it is generally recommended to avoid instructions such as inc and dec that only affect flags partially.
I earlier wrote that the xor register, register instruction has got some special powers. It can indeed be used to break the dependencies we just examined. When the processor encounters this instruction, it will discard all dependencies related to the register you reset, preventing stalls.
And so much more...
Memory layout is another major topic that is very hard to sum up in few words, but I need to at least name it. Just know that aligning instructions and data on cache boundaries can yield a significant performance improvement in optimizing loads and avoiding false sharing (very important for multithreaded code).
And then, you have also the matter of mixing floating point and single instructions multiple data (SIMD) instructions, partial register stalls, vectorization, stack optimization, store forwarding, locality enhancement...
When you realize the amount of techniques that can be used to optimize a code fragment, finding the ideal (or close to ideal) set of parameters for your processor is extremely difficult. This is where we go back to instrumentation, without instrumentation, you (or most likely the compiler) optimize a priori for your given architecture. With instrumentation, you make decisions based on hard evidence.
You need to instrument representative runs of your program to make it worthwhile, and sometimes, there isn't much the optimizer can do to improve the speed of your software.
This morning's idea
What about using natural selection to optimize a routine?
If you inject random optimizations in thousands of instances of a routine, benchmark them, discard the slower ones and keep working on the faster ones, will it eventually output the fastest possible implementation?
Obviously the following requirements would have to be met:
- The routine needs to be testable for correctness to a reasonable extent as optimizations might result in modified behavior
- It must be possible to generate a representative set of inputs in order to evaluate performance realistically
- Enough computing power to mutate, spawn and test the routine as waiting one month for the result might not be acceptable...
The routine mutator would need to know a comprehensive set of optimization techniques for the given architecture, but it could also simply inject random opcodes and let the test phase discard non-working versions.
The approach can also be used to generate the smallest possible program, or any feature that can easily and objectively be measured.
When would that approach be reasonable? Probably for fundamental software bricks, such as an operating system scheduler, the pager or some driver's routine.
That and God of War.
Generic algorithms for compiler optimizations are not new.
There are a few papers about a FORTRAN (yes, I know) compiler using them. There are a few talks about using them with Java bytecode.
There's even a few papers about using them for high level compilation strategies like register allocation.
One interesting idea to bring in this arena would be to do this at the opcode level, maybe putting some hardware in the mix like letting the algorithms evolve on pure, unshared CPU to prevent bias from cache effects.
In fine, I don't know if this can bring real new ideas deterministic optimization does not bring. Optimization for the examples you showed can be computed cheaper than using GAs.
Still the idea might be good to create optimization profiles specific to some CPUs.
You're right when you say the examples given would not benefit from GAs, they were given just to show that the choices are numerous (maybe not numerous enough).
To be honest, I actually expect the most from random insertions, as I believe it would bring unexpected results.
hmmmm,
Thanks Edouard, but isn't your idea behind what people call genetic programming? Tell me if I'm mistaken...