Category: My Research

Posts about my research on the way to a PhD.

How I Got a 327x Speedup Of Some Python Code

I don’t usually post code stuff on this blog, but I had some fun working on this and I wanted to share!

My colleague, Abhi, is processing data from an instrument collecting actual data from the real world (you should know that this is something I have never done!) and is having some problems with how long his analysis is taking. In particular, it is taking him longer than a day to analyze a day’s worth of data, which is clearly an unacceptable rate of progress. One step in his analysis is to take the data from all ~30,000 frequency channels of the instrument, and calculate an average across a moving window 1,000 entries long. For example, if I had the list:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

and I wanted to find the average over a window of length three, I’d get:

[1, 2, 3, 4, 5, 6, 7, 8]

Notice that I started by averaging [0, 1, 2]=>1, as this is the first way to come up with a window of three values. Similarly, the last entry is [7, 8, 9]=>8. Here is a simplified version of how Abhi was doing it (all of the code snippets shown here are also available here):

def avg0(arr, l):
    # arr - the array of values
    # l - the length of the window to average over
    new = []
    for i in range(arr.size - l + 1):
        s = 0
        for j in range(i, l + i):
            s += arr[j]
        new.append(s / l)
    return new

This function is correct, but when we time it (using an IPython magic) we get:

import numpy as np
# a is [0, 1, 2, ..., 29997, 29998, 29999]
a = np.arange(30000)
%timeit avg0(a, 1000)
1 loops, best of 3: 3.19 s per loop

That isn’t so horrible in an absolute sense: 3 seconds isn’t that long compared to how much time we all waste per day. However, I neglected to mention above that Abhi must do this averaging for nearly 9,000 chunks of data per collecting day. This means that this averaging step alone takes about 8 hours of processing time per day of collected data. This is clearly taking too long!
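
To make that estimate concrete, here is a quick back-of-the-envelope helper (not from the original analysis; the 9,000-chunks-per-day figure is the one quoted above):

def daily_hours(seconds_per_chunk, chunks_per_day=9000):
    # Convert a per-chunk timing into hours of processing per day of data.
    return seconds_per_chunk * chunks_per_day / 3600.

print(daily_hours(3.19))  # about 8 hours for avg0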

The next thing I tried is to take advantage of Numpy, which is a Python package that enables much faster numerical analysis than with vanilla Python. In this function I’m essentially doing the same thing as above, but using Numpy methods, including pre-allocating space, summing values using the accelerated Numpy .sum() function, and using the xrange() iterator, which is somewhat faster than plain range():

def avg1(arr, l):
    # Pre-allocate the output and let numpy's accelerated .sum() handle
    # each window instead of a Python inner loop.
    new = np.empty(arr.size - l + 1, dtype=arr.dtype)
    for i in xrange(arr.size - l + 1):
        new[i] = arr[i:l+i].sum() / l
    return new

This provides a healthy speed up of about a factor of eight:

%timeit avg1(a, 1000)
1 loops, best of 3: 405 ms per loop

But this still isn’t fast enough; we’ve gone from 8 hours to about an hour, and we’d like to run this analysis in at most a few minutes. Below is an improved method that takes advantage of a deque (a double-ended queue), a data structure that lets us cheaply add each new value to a running sum and cheaply discard the oldest one, so the window never has to be re-summed from scratch.

from collections import deque

def avg2(arr, l):
    # Keep a running sum of the last l values in a deque; each value is
    # added once and removed once, instead of re-summing the whole window.
    new = np.empty(arr.size - l + 1, dtype=arr.dtype)
    d = deque()
    s = 0
    i = 0
    for value in arr:
        d.append(value)
        s += value
        if len(d) == l:
            new[i] = s / l
            i += 1
        elif len(d) > l:
            s -= d.popleft()
            new[i] = s / l
            i += 1
    return new

And here we can see that we’ve cut our time down by roughly another factor of 3.5:

%timeit avg2(a, 1000)
10 loops, best of 3: 114 ms per loop

This means that from 8 hours we’re now down to roughly 17 minutes. Getting better, but still not great. What else can we try? I’ve been looking for an excuse to try out Numba, which is a tool that is supposed to help speed up numerical analysis in Python, so I decided to give it a shot. What makes Numba attractive is that with a single additional line, Numba can take a function and dramatically speed it up by compiling it to fast machine code on the fly. So let’s try this on the first averaging function:

from numba import jit, int32

@jit(argtypes=[int32[:], int32], restype=int32[:])
def avg0_numba(arr, l):
    new = []
    for i in range(arr.size - l + 1):
        s = 0
        for j in range(i, l + i):
            s += arr[j]
        new.append(s / l)
    return np.array(new)

In the line beginning with @jit, all I have to do is describe the input and the output types, and it handles the rest. And here’s the result:

%timeit avg0_numba(a, 1000)
10 loops, best of 3: 21.6 ms per loop

What is incredible here is that not only is this roughly 5 times faster than the queue method above, it’s a ridiculous 147 times faster than the original method, and only one line has been added. We’ve now reduced 8 hours to just over 3 minutes. Not bad!

Let’s try this on the second averaging method, which, if you recall, was substantially faster than the original:

@jit(argtypes=[int32[:], int32], restype=int32[:])
def avg1_numba(arr, l):
    new = np.empty(arr.size - l + 1, dtype=arr.dtype)
    for i in xrange(arr.size - l + 1):
        new[i] = arr[i:l+i].sum() / l
    return new
%timeit avg1_numba(a, 1000)
1 loops, best of 3: 688 ms per loop

That’s interesting! For some reason I don’t understand, this is actually slower than the un-optimized version of avg1 (presumably Numba can’t compile the array slicing and .sum() call to native code, so it falls back to a slower path). Let’s see if Numba can speed up the queue method:

@jit(argtypes=[int32[:], int32], restype=int32[:])
def avg2_numba(arr, l):
    new = np.empty(arr.size - l + 1, dtype=arr.dtype)
    d = deque()
    s = 0
    i = 0
    for value in arr:
        d.append(value)
        s += value
        if len(d) == l:
            new[i] = s / l
            i += 1
        elif len(d) > l:
            s -= d.popleft()
            new[i] = s / l
            i += 1
    return new
%timeit avg2_numba(a, 1000)
10 loops, best of 3: 77.5 ms per loop

This is somewhat better than before, but still not as fast as avg0_numba, which comes in at roughly 20ms. But what if I really try hard to optimize the queue method by using only Numpy arrays?

@jit(argtypes=[int32[:], int32], restype=int32[:])
def avg2_numba2(arr, l):
    # Same running-sum idea as avg2, but with the deque replaced by a
    # fixed-size numpy array used as a circular buffer.
    new = np.empty(arr.size - l + 1, dtype=arr.dtype)
    d = np.empty(l + 1, dtype=arr.dtype)
    s = 0
    i = 0
    left = 0
    right = 0
    full = False
    for j in xrange(arr.size):
        d[right] = arr[j]
        s += arr[j]
        right = (right + 1) % (l+1)
        if not full and right == l:
            new[i] = s / l
            i += 1
            full = True
        elif full:
            s -= d[left]
            left = (left + 1) % (l+1)
            new[i] = s / l
            i += 1
    return new
%timeit avg2_numba2(a, 1000)
100 loops, best of 3: 9.77 ms per loop

Aha! That’s even faster, and our 3.19 s is now down to 9.77 ms, an improvement of 327 times. The original 8 hours are now reduced to less than two minutes. I’ve put all my code into an IPython Notebook, which you can download and run for yourself. I hope you’ve enjoyed this as much as I have!

Update:

After chatting with a couple of my colleagues, this appears to be the fastest way to do this kind of operation:

# The moving average is just a convolution with a length-1000 boxcar,
# which scipy can compute quickly with an FFT.
from scipy.signal import fftconvolve
import numpy as np
a = np.arange(30000)
b = np.ones(1000) / 1000.
%timeit fftconvolve(a, b, 'valid')
100 loops, best of 3: 6.25 ms per loop
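
As a quick sanity check (not part of the original timing), the same call reproduces the toy example from the top of the post, just with floating-point output:

print(fftconvolve(np.arange(10), np.ones(3) / 3., 'valid'))
# approximately [ 1.  2.  3.  4.  5.  6.  7.  8.]
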
Working Conditions

A Crane in front of Folsom Stadium

Of late, there has been a great deal of construction in and around my office. Over the past two weeks my office cluster has been repainted, doors have been refinished, window shades replaced, some carpet replaced, and some wires and pipes installed or moved. There is still more work to be done.

Beginning yesterday, and continuing for the next four to six weeks, the scoreboards for the stadium are being replaced. Pictured above is a crane removing pieces of the old one (notice the asymmetry in the solid black/shaded regions). This crane is more or less just outside my window, which, as you can imagine, is kind of neat for the first five minutes and gets tiresome and annoying after that. The construction has also closed off a large amount of space in front of the offices, which is inconvenient for us, and they even cut down two perfectly healthy trees, presumably to make space for the crane (insert tree-sitter joke here).

As a matter of fact, the conduit carrying the control wires for the new scoreboards passes by my office. The next time Stanfurd plays the Buffs, don’t be surprised if … wait, I’m saying too much.

CU Janus Supercomputer

Last night I had the opportunity to tour the supercomputer recently built here at CU named “Janus” that I’ve been using. It is a 16,000-core Dell cluster using 6-core Intel processors running RedHat Linux. It was built in an interesting way. Instead of building a machine room in a building and then filling it with cooling ducts, pipes, and power connections, the machine room is made up of standard shipping containers that had all those connections in place, similar to a pre-fab house. These were shipped from the factory (in Canada, I think) on trucks, and then dropped next to each other in a parking lot behind a campus building. Unfortunately, because it was nighttime, I don’t have a good picture of the outside, but the link above has a good picture of it.

Below are some pictures I took of Janus.

The machine racks. The door encloses the ‘hot’ side of the machines, where the air is sucked to the heat exchangers.
The cooling system.
The blinky and hot end of the machines. Lots of wires!
A close up of the back of a compute node. Notice that they have serial ports, which are based on a 40+ year old standard. At least they have USB ports, too.
It was using 415 kW of power. I think it can go much higher than that when the machine is under heavy load on a hot day.
Blob Identification

I have access to some of the fastest computers in the world. I can summon thousands of processors, petabytes of disk storage, and terabytes of memory with a few keystrokes. You’d think with that kind of power, analysis could be automated and done in massive pipelines. But that’s not always true. Sometimes things are so subtle that the most efficient method is still to use the human eye.

Today I was pretty ill, with a sore throat and congestion. It was the kind of day for lying on the couch and watching movies. Luckily, I was able to accomplish some work that didn’t require too much concentration. First, some context.

One of my current projects is looking at the centers of simulated galaxy clusters, and it is actually surprisingly difficult to find those centers. Clusters are complicated places, with clumps of matter falling in and sloshing things around, so the location of the core is not obvious. In order to semi-automate the process, I wrote a script that makes pictures of the clusters (which have been previously identified in an automated fashion) with density contours superimposed. The density contours are analogous to the lines on a topographical map.

The picture above is an example of the output of the script. The colors indicate gas density, blue to red is low to high. If you look at the full-sized image, you can see that there are two dense clumps, numbered 0 and 1. I can’t just pick the densest clump, because that might be a tiny blob of matter falling into the larger cluster; more care is needed. So I need to make the decision with my adaptable brain. The script spits out the image, then I input which clump I think is the most central one, and a record is made which I’ll use for the next step in my chain of analysis. I spent most of today looking at these pictures and entering numbers. Yay for taxpayers!
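
The actual script isn’t reproduced here, but the labeling loop is conceptually very simple; a minimal sketch might look something like this (the file names and prompt are purely illustrative):

import glob

# Show each contour image in turn, ask which numbered clump is the real
# center, and append the answer to a record file for the next analysis step.
records = open("cluster_centers.txt", "a")
for image in sorted(glob.glob("cluster_*.png")):
    print(image)  # in practice the image itself is displayed on screen
    choice = raw_input("Which clump is the central one? ")
    records.write("%s %s\n" % (image, choice))
records.close()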

This is very similar to the Galaxy Zoo project that aimed to identify the morphology of actual galaxies, but my effort is on a much smaller and simpler scale.

So – which clump do you think is the most central above?

Volume Rendered Movies

It’s show-and-tell time! Below are two movies I made using the volume rendering tools of yt. I’ve been using yt for a few years to analyze and visualize the cosmological simulations I make with Enzo, and only recently have I had time to begin to play with the new volume rendering stuff.
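
For anyone curious what the setup looks like, here is a minimal sketch using the present-day yt API (which differs from the interface that existed when this post was written); the dataset name is just a placeholder for an Enzo output:

import yt

# Load an Enzo output (placeholder name) and volume render the gas density.
ds = yt.load("DD0046/DD0046")
sc = yt.create_scene(ds, field=("gas", "density"))
sc.save("density_render.png")
# A movie like those below is made by nudging the camera slightly between
# successive saved frames.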

The first movie is a slow rotation around the entire volume of a simulation at the present epoch, which means that this image is produced from the state of the simulation at its end, 13 billion years after it started. The colors correspond to the density of matter in the volume, from dark blue to white as density increases. The simulation is a periodic cube 20 Mpc/h on a side. In comparison, the diameter of our galaxy is somewhere around one thousand times smaller. This means that the whitest areas correspond to clusters of galaxies, and our galaxy would be just a small part of one of the white blobs. Be sure to watch the movie full screen!


(Quicktime version)

Below shows the time evolution of the simulation from beginning to end. This is a thin slab through the center of the simulation (10% thickness) viewed from a corner of the cube. Notice that early on the matter is very clumpy everywhere, but it rapidly forms dense knots connected by thin filaments. This is how the real universe looks! After about the half-way point of the movie you’ll notice that not much happens. Again, this is how the real universe looks! Much of the large-scale evolution of the universe was finished about 7 billion years ago. This movie uses comoving coordinates, which compensate for the expansion of the universe. If I were to use proper coordinates, which are the kind we use every day to measure normal things with rulers, the movie would show the simulation starting very small and then blowing up. Again, use the full screen option for the best image.
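
(For reference, the two coordinate systems are related by the cosmological scale factor: d_proper(t) = a(t) × d_comoving, where a(t) is normalized to 1 today and was far smaller at early times, which is why a proper-coordinate version of the movie would appear to blow up.)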


(Quicktime version)

Back in San Diego (Temporarily)

I’m back in San Diego for the next week and a half in order to graduate. I defend in one week on the 27th. The photo above is of a tarantula that lives in the office that I used to sit in, and that I am sitting in again while I’m here. That is its full significance to this post.

If the tarantula isn’t big enough above, you can make it bigger by clicking on the image!

SDSC Optiportal

My adviser, Professor Mike Norman, as part of his job at the San Diego Supercomputer Center, purchased an optiportal system for the new SDSC building, which is opening today. An optiportal is a wall of monitors driven by networked computers such that the screens behave as one large display. Very high-resolution images, movies, and animations can be tiled across the screens, as you can see below.

