I’ve been playing around with Common Crawl data lately - training some hindi language models. So I had about 25~ gigs of text to process, which I suppose is tiny compared to English Common Crawl (in the TB range); but it’s right on the cusp of having to think a little bit on how to deal with it in the most efficient manner.
The server I was running this on had some 40 odd CPU cores and like 400 Gigs of RAM, so I wasn’t terribly worried. I just split up the files into a bunch of files at around 2gb each and just run them all in parallel on 1 core each since my code was fairly run of the mill, written to run in a single process. The file output a bunch of tab seperated data, and I needed to count unique instances of each of the tokens. My go-to method would have been to just treat each of them as tab seperated CSV files and just run
pd.concat to get a honking big dataframe to run groupby queries on.
Unfortunately as always, life isn’t quite that simple, and some of the files were malformed, they either had text which wasn’t unicode encoded, or like a bunch of characters that were messing with
pandas.read_csv so I ended up having to read them as text files, split them on tabs and then cast the entire list of lists of strings to a DataFrame. This is where I ran into the first problem, with the lists of strings from each file, the dataframe objects and the output of the pd.concat, I ran out of RAM super fast.
I figured that I could parse each tab seperated file individually, cast it to a dataframe and write it to disk. That way, even if my process got killed because I ran out of memory, I could just run it again on the data that hadn’t been written to disk and finish up the job in two batches. I also figured that list comprehension that iterated through every line of the file and split it on tabs would take a fair amount of time, so I could speed up my script a little by using multiprocessing. I’d heard of this library called Trio some time ago and thought I’d give it a go. In retrospect, this was a pretty dumb idea because all I did was run out of RAM faster than I did before, since the async code was
await-ing f.readlines() before it started trying to split on tabs.
At this point I realized that I needed to deal with the memory issue first - so I dropped the list comprehension, wrote a normal for loop, and started parsing and feathering each of the text files. I’d heard of feather being this cool file format that had super fast disk reads and that worked pretty well - Even though I was running this on a single process again, it wasn’t terribly slow. I ran into problems when I tried to concat all the disparate feathers together - also, it turns out you can’t store more than 2Gb of binary data per column in a feather file. Awesome. Well, HDF5 is another file format which is supposed to be pretty fast to write to disk so let’s use that for the combined dataframe. Nope. Some hours later, all I had to show for it was SegFault. (It’s been a hot minute since I’ve seen one of those). At this point, I ran out patience and decided to use SQLite. Which is probably what I should have done in the first place, since it only took a couple of lines of code, and lets me combine the individual dataframes on disk, instead of in memory. I can then also do my counts in SQL queries rather than pandas code, and I’m pretty sure SQL is faster than pandas in most situations.
Key Takeaways from this weekend for me -
- Just use SQL. It’s been around forever for a reason.
del dfwill actually release the memory that the dataframe was using back to the OS, without an explicit call to the garbage collector. This one is a little curious to me - del should theoretically just get rid of the reference, and the garbage collector should be along whenever it damn well pleases after the variable has gone out of scope. This might be because of numpy, but I’m not really sure why this happens at the moment.