A Polars Memory Leak Trick

I am a big proponent of Polars for data analysis. It combines the ease of Python with the speed of Rust. In many cases it makes working with very large datasets very convenient. I have even written about it before on this blog.

Despite the nice things I wrote above about Polars, it is still a work in progress. One big issue is its less-than-ideal memory management when working with very large dataframes, as discussed in this ticket on GitHub. I have seen similar issues with Polars, where for no explicable reason it will run my machine out of memory. Below is a simplified example of a situation in which I have run into this problem.

This example is in the middle of a processing pipeline. I have created a set of Parquet files in which I am including an ff column because I want to be able to filter on it. The entire set of Parquet files in dirname would not fit in memory, but a subset will, and I can process these subsets separately and combine the reduced dataframes at the end to achieve the same result. The distribution of values in ff is very even, meaning that each subset uses very nearly the same amount of memory. Because ff is in the Parquet files, Polars can use predicate pushdown to filter on ff at the read step, meaning that the full dataset is never read into memory.

Indeed, the first few loops of this process work quite well. The filtered subsets fit into memory, and reduced dataframes are calculated. However, eventually Polars stops managing memory correctly and things go bad, and the machine runs out of memory.

import polars as pl

data = pl.scan_parquet(dirname)

reduced = []

# this will eventually run out of memory
for ff in range(N):
    print("doing", ff)
    # note that data is untouched each cycle
    data1 = data.filter(pl.col("ff") == ff)
    # Do some various LazyFrame manipulations on data1; we don't call
    # .collect() until below
    reduced.append(data1.collect())

reduced_data = pl.concat(reduced)

If the situation is amenable, there is a way around this. By running the Polars steps in a subprocess using multiprocessing, we can force Polars to free memory after each cycle by shutting down the process that runs Polars every cycle. Of course, there is a little bit of additional overhead here. Starting the subprocess takes a little time, as does re-scanning the directory of files in scan_parquet(). There is also some cost in un/pickling the reduced dataframe between the main process and the Polars subprocess. But if the data is large enough to resort to chunking like this, these extra costs are rounding errors, and are not worth worrying about.

from multiprocessing import Process, Queue, set_start_method

import polars as pl

def worker(dirname, ff, out_q):
    data = pl.scan_parquet(dirname)
    data = data.filter(pl.col("ff") == ff)
    # Do some various LazyFrame manipulations on data; we don't call
    # .collect() until below
    out_q.put(data.collect())

if __name__ == "__main__":
    set_start_method("spawn")

    out_q = Queue()
    reduced = []

    # this works to completion
    for ff in range(N):
        print("doing", ff)
        p = Process(target=worker, args=(dirname, ff, out_q), daemon=True)
        p.start()
        reduced.append(out_q.get())
        p.join()

    reduced_data = pl.concat(reduced)
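
The same recycle-the-process idea can also be sketched with the standard library's multiprocessing.Pool, whose maxtasksperchild=1 setting replaces the worker process after every task. This is a variation, not the code above: run_each_in_fresh_process is a hypothetical helper name, and func is assumed to be a picklable top-level function that returns its reduced result instead of putting it on a queue.

```python
import multiprocessing as mp


def run_each_in_fresh_process(func, keys):
    """Call func(key) for each key, each call in its own short-lived process."""
    # "spawn" matches the start method used in the example above.
    ctx = mp.get_context("spawn")
    # maxtasksperchild=1: the pool retires its worker after a single task and
    # starts a new one, so memory held by the old worker goes back to the OS.
    with ctx.Pool(processes=1, maxtasksperchild=1) as pool:
        # imap hands out one key per task (chunksize defaults to 1) and
        # yields the results in input order.
        return list(pool.imap(func, keys))
```

Assuming a worker(dirname, ff) that ends with return data.collect(), the loop would shrink to reduced = run_each_in_fresh_process(functools.partial(worker, dirname), range(N)), still called under the if __name__ == "__main__": guard that the spawn start method requires.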

I think this trick very simply illustrates the way that Polars is currently mismanaging memory. The two examples above are doing the same basic thing, but one works and the other doesn't. In fact, if I run the two methods above and watch memory use in real time, they behave similarly for some number of cycles. Memory use goes up and down as the filtered data is processed and cleaned up each cycle. This indicates that Polars can do memory management correctly, but for some reason, after too many cycles, it stops freeing memory. In the GitHub ticket linked above, it's suggested that a fix might have a performance cost. Hopefully something can be done about this that both prevents the memory leak and doesn't cost too much in performance. For now, this trick can be useful in the right situations.
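
For watching memory per cycle from inside the script rather than in a separate monitoring tool, here is a minimal sketch. It assumes Linux, since it reads /proc/self/statm, and the helper names current_rss_mib and log_rss are hypothetical. It is not necessarily how the observations above were made.

```python
import os


def current_rss_mib():
    """Return this process's resident set size in MiB, or None if unavailable."""
    try:
        # Linux-only: the second field of /proc/self/statm is resident pages.
        with open("/proc/self/statm") as fh:
            resident_pages = int(fh.read().split()[1])
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (OSError, ValueError, IndexError, AttributeError):
        return None
    return resident_pages * page_size / 2**20


def log_rss(label):
    """Print the current resident set size with a label and return it."""
    rss = current_rss_mib()
    shown = "unknown" if rss is None else f"{rss:.0f} MiB"
    print(f"{label}: RSS {shown}")
    return rss
```

Calling log_rss(f"after ff={ff}") right after each .collect() in the first loop should show whether memory comes back down between cycles. In the subprocess version the call belongs inside worker(), because the parent's resident set size does not include its children.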


Junior M.A.F.I.A. - Conspiracy

It's been a busy week, and I am five days late on this review. Luckily I don't have much to say about this week's #8 album. Apparently there were a couple hits off of Conspiracy by Junior M.A.F.I.A., but listening to the album didn't jog any memories. I am not disappointed because I didn't find the album terribly entertaining. I cannot endorse listening to this album, and I will never listen to it again.


No Album This Week

There are two albums in the top 10 this week that I have not reviewed. At #4 is a compilation album The Show: The Soundtrack, and at #10 Games Rednecks Play by Jeff Foxworthy. I can find neither of them on any streaming service.

The Show was a documentary about hip-hop, and the album features on the order of 20 artists. Likely due to various byzantine licensing issues with so many artists, putting the soundtrack on a streaming service would be next to impossible.

Games Rednecks Play is a comedy album, and comedy has had many licensing issues with the online music streaming services. It's not surprising that I can't find it.

Because I want to review only popular albums from 30 years ago, I don't drop below the top ten. Therefore, I have nothing to review this week.


Dangerous Minds Soundtrack

If you were alive in 1995, you heard the song Gangsta's Paradise everywhere. The popularity of this song propelled the soundtrack to the movie Dangerous Minds to #1. The song made Coolio's career, although playing Kwanzaa-bot on Futurama was pretty great, too. I'm fairly certain I have never seen the movie Dangerous Minds, but I have heard Gangsta's Paradise many, many times.

None of the other songs on the album amount to much. I guarantee that you would not recognize any of the non-Coolio artists on the album. Gin & Juice by DeVanté is far, far inferior to the Gin and Juice by Snoop Dogg you're thinking of. A Message For Your Mind by Rappin' 4-Tay samples I Want You Back by The Jackson 5 and completely misses the mark, somehow having none of the joyous energy of the original song. The final song (which I suspect was played during the closing credits), This Is The Life, is a non sequitur. It's a ballad by two white women who were part of Prince's The Revolution band. There's nothing wrong with being female, white, or working with Prince, but none of those things match anyone or anything that came before it on the album.

In summary, the album is defined by and entirely worth the value of Gangsta's Paradise, and the rest is worth forgetting.

Finally, check out this episode of Last Week Tonight with John Oliver from a few weeks ago. It's about law enforcement gang databases, and a connection to Gangsta's Paradise comes out of left field.


Alanis Morissette - Jagged Little Pill

Rising steadily in the charts, the #3 album this week will hit #1 in a few weeks. I'm not sure I have ever listened to the entirety of Jagged Little Pill by Alanis Morissette; certainly not since I started tracking plays on last.fm. I have, of course, heard all the big hits off the album, of which there are many. Jagged Little Pill was huge when it came out and You Oughta Know was everywhere on the radio in 1995 and 1996. The non-ironies in Ironic have been pointed out for decades. The album might have contributed to the English language. It's possible that the term "Friends with Benefits" originated in Head over Feet.

It's hard to form any new opinions about this album because the singles were so ubiquitous. They were earworms then, and they are earworms now. There's no harm in listening to the album; short of a Friends marathon, it's one of the best ways to transport yourself back to the mid-90s (or to get a small taste of them if you're too young).


Raekwon - Only Built 4 Cuban Linx...

The #4 album this week is the solo debut of Raekwon, a founding member of Wu-Tang Clan. Like the last album I reviewed by a Wu-Tang Clan member, my summary is that while I recognize that Only Built 4 Cuban Linx... is an important and well-regarded rap album, it's not my preference. I will almost certainly never listen to it again.


A Ride In The Life

I used to do a photography project I called "A Day In The Life," where I would take photos all throughout a single day of what I did and where I went. Photos of everyday things like buildings, street scenes, people, and the like. I haven't done one in quite a long time, mostly because my days now include my children, and I don't want to plaster them all over the internet.

I ride with a GPS cyclocomputer that I've configured to record a lap every kilometer, which includes an alerting beep and a brief status screen that shows me how long I took to ride that kilometer. I started doing this when I lived in San Diego because there was a section of rolling road in Rancho Santa Fe that I would challenge myself to maintain 30 kph (or greater) on. Instead of doing mental arithmetic while hypoxic, the cyclocomputer did the math for me.

I decided to combine these two things into a "Ride In The Life," where I would take a photo of the road ahead of me every kilometer when my GPS beeped. Well, not every kilometer precisely, only when I felt safe to pull out my phone and take a picture. I didn't take any photos while descending at high speed, and if cars were passing me closely, I delayed the photo until I felt safe. The ride was one week ago and it was into the mountains west of Boulder; here is the GPS trace. I'm not sure how interesting this is, but since I went through the effort to take dozens of photos, I'm seeing it through to post them here. I hope you find them interesting!


Bone Thugs-N-Harmony - E. 1999 Eternal

I'm a few days late on this review. Oh well!

There's only one song off the #1 album this week worth listening to: 1st of Tha Month. It's one of the songs I play to wake up anyone who needs waking up. My last.fm play history has only one play after 10am, and it was when I listened to this album for 30 Years On. It's fun that even after 30 years the overall last.fm play history for the song shows a spike in plays at the beginning of each month:

1st of Tha Month play chart

My advice is to ignore the album, but 1st of Tha Month is forever.


Updated Cycling History map

Some (many!) years ago I posted a few images showing my riding history in California (in 2008) and Colorado (in 2012). While they are interesting, these are static images and cannot be explored.

Thirteen years have passed and technology has improved. Today I added a new page showing my entire cycling history since I started using GPS over twenty years ago. Instead of separate static images, it uses dynamic web technology™️ on a single map. As before, color shows how frequently I ride past a point on the map. However, I changed the logic: my previous maps used counts of observations near points, while my new map uses counts of distinct rides that pass each point on the map. Instead of showing where I've done the most laps or where my GPS has recorded the most times (for whatever reason), this is more indicative of where I ride the most.

For the most immersive experience, here is the full screen version.


Selena - Dreaming Of You

Released four months after Selena's murder, Dreaming Of You shot to #1 upon release. I remember being aware of her murder from the news when it happened, but that really was the extent of my knowledge of her and her music.

The only interesting thing I can say about this album is that the single song I recognize is in Spanish, not English. It's Amor Prohibido, which originally appeared on her 1994 album of the same name, and therefore isn't really off this album.

I have no strong feelings about this album. I'll almost certainly never listen to it again.


Blues Traveler - four

I'm dismayed to discover that as of this writing there is no harmonica emoji. This feels like a huge omission and should be rectified with great haste. If there was a harmonica emoji, I could visually represent what it's like to listen to Blues Traveler by inserting it in this text. Instead, I'll have to make do with the musical instrument emoji we do have, and you'll have to imagine it's a harmonica.

Released in September 1994, four by Blues Traveler 🎸 is at #9 this week, one spot lower than 🪗 its eventual peak at #8. If you were alive in 🥁 the mid-90s you will remember that 🪈 the two big singles off this album, Run-Around and Hook, were everywhere on the 🎹 radio to the point that I got a bit tired of them.

Thirty years later I am 🎻 not as bothered by the songs. I think 🎷 Blues Traveler is fine; their music 🎺 feels very much of the era in the mid-90s, along with Dave Matthews Band and Hootie & 🪕 the Blowfish. Blues Traveler wasn't my 🪘 favorite band at the time, and that 🪉 hasn't changed. I don't dislike their 🪇 music, and sometimes I'm in 🪕 the mood for it, but 🎸 not often. In summary, if 🪇 you're in the mood for 🪉 Blues Traveler, this is 🪗 the album to listen to.


Goodbye Voltron

Voltron

July 2007 to July 23, 2025.

She was a good cat.


Shania Twain - The Woman In Me

The top un-reviewed album this week is not The Woman In Me by Shania Twain at #7, it's the Batman Forever Soundtrack at #5. However, I cannot find the full album on any streaming service. Many of the songs can be found on other albums, but not all, and I don't care enough to look any harder than that. Therefore, it's Shania Twain's second album we'll listen to this week!

One of my favorite television shows is (the first three seasons of) Arrested Development. One of the characters, Tobias Fünke, wrote a book called The Man Inside Me. I can't help but think of that book, which is used in various funny ways, when I read the title for this album. This is not a complimentary thing for the album.

Reading Shania Twain's Wiki page, it turns out that she's married to the ex-husband of her former best friend who had an affair with Twain's first husband. All that's missing from that soap opera is an evil twin, babies switched at birth, and someone appearing (with convenient dramatic timing) previously believed to be dead.

I have no strong opinions about the music on the album itself. It sold quite well, and I can believe that many people like it, but it's not for me. I'll almost certainly never listen to it again.


Big Boy

Today I took a drive to Greeley, Colorado to see the biggest operational steam locomotive in the world. The Union Pacific 4014, aka "Big Boy", was on a short trip from Cheyenne to Denver and made a single "whistle stop" in Greeley. The Big Boy is a remarkable locomotive. It was on static display for decades before being pulled to the Union Pacific shop in Cheyenne and returned to running condition in 2019. It's hard to overstate just how huge this thing is. As you can see from the photos and video below, this massive locomotive draws crowds wherever it goes. It was fun to see!

A few things to note as you look through the media:

  • There were Starlink antennae on a few of the passenger cars
  • A couple engineers wrote "Big Boy" using chalk on the front of the 4014, surely to reference the origin story of the name "Big Boy"
  • The Big Boy always travels with a diesel locomotive companion in case of malfunction so that the train will not be stranded on an operational freight line
  • The police seen in a few shots are railroad police, not local police
  • As far as I could tell, only a few of the passenger cars being pulled had anyone in them; most cars had their window shades drawn closed
  • The video does not convey just how loud the engine is; you have to be near it to experience it



Neil Young - Mirror Ball

This week's #5 album is just one of Neil Young's 55 studio albums. While I like some of his work, such as After the Gold Rush and Harvest, I am not a big enough fan of his to listen to all of his work. As far as I can recall, I have not heard any of the songs on this album prior to today.

Apparently this album was recorded with Pearl Jam. I didn't learn this fact until after I listened, and I couldn't have guessed it. I guess it's kind of grungy, but only if you think to listen for it.

Doing this project I've noticed that albums like this, made by musicians past their prime, have very short periods near the top of the charts. My guess is that because they were so well known, a fixed set of fans will always buy a copy as soon as it comes out. However, because the fan base isn't growing, there's no long tail of purchases. Many albums in the top sales list have been in the charts for a year or more. In one month, Mirror Ball will drop to #48, and in two months to #96, almost out of the listing entirely.

As far as this album goes, it's decent. According to this ranking, it's his thirteenth best album, which makes it above average. That list puts After the Gold Rush and Harvest at #1 and #2, which, duh. I guess if you are really into Neil Young you could do worse than Mirror Ball.