fivemack (fivemack) wrote,

Many eyes make fast code

I have just emerged from a five-hour optimisation trance, fuelled by sushi and plum wine.

Here is the code; it looks for numbers which can be written as the sum of three sixth powers in two different ways. It's multi-threaded (by constants in the code assuming you have four cores spare) and uses a cache-friendly blocked linked-list structure; it slices up the problem into chunks by sum-modulo-P and then into buckets by sum-modulo-Q, then sorts the bucket contents. It uses moderately prodigious amounts of memory - about 20N^2 bytes. It takes 1m15s wall-time on my machine (2.4GHz quad-core) to run 'sumsix_t4 2003', 10m35s wall-time for 'sumsix_t4 4007'. I'm surprised that the output from 'time' indicates that it spends non-trivial time in the OS: how? The only bits that look like OS calls are print statements and memory allocation, and it only does O(N) of those.

real 10m35.586s
user 31m20.462s
sys 5m20.408s

I'm sure it could be significantly faster - it's trivially parallel and I'm only getting x3 speedup on four CPUs - but I am not quite sure how. I've looked at profiles, I suspect I may have been foiled by out-of-order execution, where quick-op ( result of slow-op ) has to wait for slow-op to complete and so appears as a hotspot when the real hotspot is elsewhere. Hard-wiring the values of 'N' and 'bucket' so that the compiler can replace the modulo operations with multiplies by magic constants doesn't make a difference.

If there are people out there who like this sort of challenge, I'd like some input.
Tags: geek, maths
  • Post a new comment


    default userpic

    Your IP address will be recorded 

    When you submit the form an invisible reCAPTCHA check will be performed.
    You must follow the Privacy Policy and Google Terms of use.