SSDs

Speeding up grep of a 12.4 GB text file

I am working on a vocabulary-building book for SAT and GRE students. Below is a picture of the provisional cover of the book:

In order to have a wide corpus of classical texts in which to find word usage examples, I downloaded a massive ebook collection from the Gutenberg project and merged all of the text files into one big file that reached 12.4 gigabytes in size. I then wrote a PHP script that used the grep utility to search through about 250 million lines of text to find the word usages I needed.

Here is an example of the results for the word “taciturn”:

In order to find interesting examples, I use the following regular expression:

egrep -aRh  -B 20 -A 20 "\b(she|her|hers|his|he)\b.*taciturn" merged.txt

This finds lines in which a pronoun such as “she” appears somewhere before the word. This helps find usages that occur mostly in novels, rather than other types of books (the Gutenberg collection contains many non-novel files, such as encyclopedias and legal works).
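To see what the pattern accepts and rejects, here is a minimal sketch that runs it against a few made-up lines. The sample sentences and the temporary file are assumptions for illustration only; they stand in for merged.txt.

```shell
# Build a tiny stand-in for merged.txt (hypothetical sample lines).
sample=$(mktemp)
printf '%s\n' \
  'she was a taciturn girl' \
  'The taciturn clerk said nothing.' \
  'Taciturn by nature, she kept her own counsel.' \
  'In truth his father was taciturn and stern.' > "$sample"

# Same pattern as above: a pronoun as a whole word, followed later on the
# same line by "taciturn". -c prints the number of matching lines.
matches=$(grep -E -c '\b(she|her|hers|his|he)\b.*taciturn' "$sample")
echo "matching lines: $matches"

rm -f "$sample"
```

Only the first and last lines match. The second line is rejected because the "he" inside "The" is not a whole word, and the third because the pronoun comes after the word rather than before it.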

My first step toward speeding up the grep was to move the file to an old SSD that is attached to my desktop. That SSD supports read speeds of up to 200 MB/second. This was not good enough, so I eventually moved the file to my main Samsung SSD, which has read speeds of over 500 MB/second. Below is a screenshot of the iotop utility reporting a read speed of 447 M/s while grep is running:

My first idea to speed up the grep was to use GNU parallel or xargs, both of which allow grep to make use of multiple CPU cores. This was misguided, since the limiting factor in this grep was not CPU usage but disk read speed. Since my SSD was already being maxed out, there was no point in adding more CPU cores to the task.
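A rough back-of-the-envelope check supports this. The sketch below divides the file size by the read speed reported by iotop, assuming decimal units (1 GB = 1000 MB) and that the whole file has to be read from disk once.

```shell
# Figures taken from the post; decimal units are an assumption.
size_gb=12.4
read_mb_per_s=447

# Time needed just to read the file once at that speed.
seconds=$(awk -v size="$size_gb" -v rate="$read_mb_per_s" \
  'BEGIN { printf "%.1f\n", size * 1000 / rate }')
echo "minimum read time: $seconds seconds"
```

That works out to roughly 28 seconds for the read alone, which is close to the grep time of a little over 30 seconds, so there is very little CPU-bound work left for extra cores to speed up.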

Using the following grep command, it took a little over 30 seconds to finish grepping the entire file once:

Here is the output for the time command which tells how long a command takes to finish:

One of the first suggestions I found was to prefix the command with LC_ALL=C. This tells grep to use the plain C locale, so it treats the input as single bytes instead of decoding multibyte (UTF-8) characters.
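As a minimal sketch of how the prefix is used, the example below sets LC_ALL=C for a single grep invocation on a small made-up file. The sample file and its contents are assumptions for illustration; on the real data the same prefix would go in front of the egrep command and its timing.

```shell
# Hypothetical two-line sample containing a non-ASCII character.
sample=$(mktemp)
printf '%s\n' 'she grew taciturn' 'a café in Paris' > "$sample"

# The VAR=value prefix applies to this one command only; the rest of the
# shell session keeps its normal locale.
# -a treats the file as text even if it contains unusual bytes.
c_count=$(LC_ALL=C grep -a -c 'taciturn' "$sample")
echo "matches in C locale: $c_count"

rm -f "$sample"
```

Because the variable is set only for that command, there is no need to unset it afterwards.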

That seemed to make the grep very slightly faster:

Just to see what would happen, I next used the fmt utility to reformat the file. The file is currently made up of short lines separated by newlines. Using fmt, I reflowed it into lines of up to 500 characters each. I expected this to make the grep slower: with longer lines, a pronoun and the target word are more likely to fall on the same line, so there would be many more matches:

But on the upside, I was going to get a lot more results. The fmt command decreased the number of lines from 246 million to only 37 million:

In fact, when I ran the next grep, the time decreased to only 23 seconds:

I guess the reason is that grep has a lot fewer lines to go through.
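The reflow step can be sketched on a small made-up paragraph. The sample text and temporary files are assumptions for illustration; the -w option sets the maximum line width that fmt will fill to.

```shell
# Hypothetical paragraph of short lines, like those in the merged file.
sample=$(mktemp)
reflowed=$(mktemp)
printf '%s\n' \
  'It was a long' \
  'and quiet evening,' \
  'and she remained' \
  'taciturn throughout.' > "$sample"

# Reflow the paragraph into lines of up to 500 characters.
fmt -w 500 "$sample" > "$reflowed"

before=$(wc -l < "$sample")
after=$(wc -l < "$reflowed")
echo "lines before: $before, after: $after"

rm -f "$sample" "$reflowed"
```

The four short lines collapse into one, which is the same effect that reduced the line count of the full file.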

Unfortunately it looked like fmt had corrupted the text. Here is an example:

I think the reason was that some (or most, or all) of the text files were using Windows-style newlines (CRLF) rather than Unix-style ones (LF), which was perhaps confusing fmt. So I used this command to convert all Windows-style newlines into spaces:

After that operation, and after running fmt again on the result, grepping again seems to produce uncorrupted results:

And:

I also looked for the corrupted passage above to see how it looked now:

So it all seems fine now.
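For readers who want to try a similar cleanup, here is a minimal sketch of one way to do it. It is an assumption, not necessarily the command used above: it replaces each carriage return with a space using tr and leaves the line feeds in place, so fmt can then reflow the text. The file names are hypothetical.

```shell
# Hypothetical sample with Windows-style (CRLF) line endings.
crlf=$(mktemp)
clean=$(mktemp)
printf 'she was\r\ntaciturn\r\n' > "$crlf"

# Replace every carriage return with a space; line feeds are untouched.
tr '\r' ' ' < "$crlf" > "$clean"

# Count the carriage returns before and after the conversion.
cr_before=$(tr -cd '\r' < "$crlf" | wc -c)
cr_after=$(tr -cd '\r' < "$clean" | wc -c)
echo "carriage returns before: $cr_before, after: $cr_after"

rm -f "$crlf" "$clean"
```

The trailing spaces this leaves at the ends of lines disappear once the text is reflowed.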

As far as I know there is no way to speed up the grep significantly further unless I get a lot of RAM and do the grep on a ramdisk, or get a much faster SSD. Just out of curiosity I decided to try out changing the fmt command to make lines of 1500 characters each to see how that affects the grep:

That didn’t actually do anything to speed up the grep further:

When will smartphones have 1 terabyte of storage? Probably around 2021

While I’m no fan of Apple, the iPhone has so far been the leader in performance and storage. The internal storage of Apple’s latest and greatest iPhone provides a good benchmark for the current level of storage of the entire smartphone industry. When an iPhone with a new level of storage comes out, every manufacturer plays catch-up with Apple and releases a flagship phone with similar storage.

Here is a chart that shows the trend in iPhone internal storage. It extrapolates the trend into the future to predict when the iPhone will likely have one terabyte of internal storage (blue is historical storage levels, orange is predicted, and the dotted line is the trendline):

The chart assumes an exponential trend, since storage density has increased exponentially over time while storage prices have fallen exponentially.

I assumed that from 2018 through 2020, the iPhone will stay at 512 GB, similar to how it remained at 64 GB from 2011 through 2013. It is possible that instead of this, the iPhone will stay at 256 GB from 2017 through 2019. This will not significantly affect the historical trend.

Here is the same chart with the forecast extended to 2030. The trendline predicts an internal storage of 5 terabytes in 2027 and 10 terabytes in 2029.

I know that 10 terabytes in a smartphone may seem unnecessarily high. But historical trends show that every age can find good (and frivolous) uses for all the storage it can get.

Below is a table of every iPhone release date, device name, highest offered storage and battery capacity:

Release Date          Device        Storage (GB)    Battery (mAh)
June 29, 2007         iPhone 1      16              1400
July 11, 2008         iPhone 3G     16              1150
June 19, 2009         iPhone 3GS    32              1219
June 24, 2010         iPhone 4      32              1420
October 14, 2011      iPhone 4S     64              1432
September 21, 2012    iPhone 5      64              1440
September 20, 2013    iPhone 5S     64              1560
September 19, 2014    iPhone 6      128             1810
September 25, 2015    iPhone 6S     128             1715
September 16, 2016    iPhone 7      256             1960
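As a rough cross-check of the 2021 estimate, the sketch below uses only the first and last rows of the table (16 GB in 2007, 256 GB in 2016) and assumes storage keeps doubling at the same average pace. This is a simple two-point estimate, not the fitted trendline from the charts, and it treats 1 terabyte as 1024 GB.

```shell
# Two-point exponential extrapolation from the table endpoints.
# (Comments inside the awk program avoid apostrophes on purpose.)
year=$(awk 'BEGIN {
  # Doublings between 16 GB (2007) and 256 GB (2016).
  doublings = log(256 / 16) / log(2)
  # Average number of years per doubling over that period.
  years_per_doubling = (2016 - 2007) / doublings
  # Further doublings needed to go from 256 GB to 1024 GB.
  more = log(1024 / 256) / log(2)
  printf "%.1f\n", 2016 + more * years_per_doubling
}')
echo "estimated year for 1 TB: $year"
```

Four doublings in nine years gives one doubling every 2.25 years, and two more doublings from 256 GB land in mid-2020 to 2021, in line with the forecast above.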