Here is the code. It looks for numbers that can be written as the sum of three sixth powers in two different ways. It's multi-threaded (by constants in the code, assuming you have four cores spare) and uses a cache-friendly blocked linked-list structure: it slices the problem into chunks by sum modulo P, then into buckets by sum modulo Q, and then sorts the bucket contents. It uses moderately prodigious amounts of memory, about 20N^2 bytes. On my machine (2.4GHz quad-core) it takes 1m15s of wall time to run 'sumsix_t4 2003' and 10m35s for 'sumsix_t4 4007'. I'm surprised that the output from 'time' indicates it spends non-trivial time in the OS: how? The only bits that look like OS calls are the print statements and memory allocation, and it only does O(N) of those.
I'm sure it could be significantly faster - the problem is trivially parallel and I'm only getting a x3 speedup on four CPUs - but I'm not quite sure how. I've looked at profiles, but I suspect I've been foiled by out-of-order execution: quick-op(result of slow-op) has to wait for slow-op to complete, so it shows up as a hotspot when the real hotspot is elsewhere. Hard-wiring the values of 'N' and 'bucket', so that the compiler can replace the modulo operations with multiplies by magic constants, doesn't make a difference.
If there are people out there who enjoy this sort of challenge, I'd welcome some input.