grep

Speeding up grep of a 12.4 GB text file

I am working on a vocabulary-building book for SAT and GRE students. Below is a picture of the provisional cover of the book:

In order to have a wide corpus of classical texts to find word usage examples, I downloaded a massive ebook collection from Project Gutenberg and merged all of the text files into one big file that reached 12.4 gigabytes in size. I then wrote a PHP script that used the grep utility to search through about 250 million lines of text to find the word usages I needed.

Here is an example of the results for the word “taciturn”:

In order to find interesting examples, I use the following regular expression:

egrep -aRh  -B 20 -A 20 "\b(she|her|hers|his|he)\b.*taciturn" merged.txt

This finds lines in which a pronoun such as “she” appears somewhere before the word. This helps find usages that occur mostly in novels, rather than in other types of books (the Gutenberg collection contains many non-novel files, such as encyclopedias and legal works).
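
To make the behaviour of the pattern concrete, here is a minimal sketch that runs it against a few made-up sentences (the sample text is hypothetical, not taken from the corpus, and GNU grep is assumed for the \b word-boundary syntax). The context flags are left out so that only the matching lines are printed.

```shell
# Hypothetical sample sentences; none of these come from the real corpus.
sample=$(mktemp)
cat > "$sample" <<'EOF'
He was a taciturn man who rarely spoke at dinner.
The taciturn clerk stamped the form.
When she arrived, the host grew taciturn.
Other rivers are taciturn in winter.
All evening his answers stayed short and taciturn.
EOF

# Same pattern as in the post: a pronoun as a whole word, then anything,
# then "taciturn" later on the same line.
pattern='\b(she|her|hers|his|he)\b.*taciturn'
matches=$(grep -E "$pattern" "$sample")
count=$(grep -cE "$pattern" "$sample")

echo "$matches"
echo "matching lines: $count"
rm -f "$sample"
```

Only the "she" and "his" sentences match. "The" and "Other" are skipped because \b requires the pronoun to stand alone as a word, and the capitalised "He" is skipped because the search is case-sensitive unless -i is added.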

My first step toward speeding up the grep was to move the file to an old SSD I have that is attached to my desktop. The SSD supports up to 200 MB/second read speeds. This was not good enough, so I eventually moved it to my main Samsung SSD which has over 500 MB/second read speeds. Below is a screenshot of the iotop utility reporting a read speed of 447 M/s while grep is running:

My first idea to speed up the grep was to use GNU parallel or xargs, both of which allow grep to make use of multiple CPU cores. This was misguided since the limiting factor in this grep was not CPU usage but disk usage. Since my SSD is being maxed out, there is no point in adding more CPU cores to the task.

Using the following grep command, it took a little over 30 seconds to finish grepping the entire file once:

Here is the output for the time command which tells how long a command takes to finish:
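
As a rough check on the claim that the disk is the bottleneck, the figures mentioned above (a 12.4 GB file and an observed read rate of 447 MB/s) give a lower bound on how long one full pass over the original file can take. This is a back-of-the-envelope sketch that assumes decimal units (1 GB = 1000 MB) and a constant read rate.

```shell
# Figures from the post: 12.4 GB file, 447 MB/s observed in iotop.
# Assumption: decimal units, so 12.4 GB = 12400 MB.
min_seconds=$(LC_ALL=C awk 'BEGIN { printf "%.1f", 12.4 * 1000 / 447 }')
echo "Minimum time to read the file once: ${min_seconds} s"
```

The result is just under 28 seconds, which is close to the measured time of a little over 30 seconds. That leaves very little room for extra CPU cores to help.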

One of the first suggestions I found was to prefix the command with LC_ALL=C. This tells grep to treat the input as plain single-byte characters instead of decoding multibyte characters (such as UTF-8), which is slower.

That seemed to make the grep very slightly faster:
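
The timing output is not reproduced here. As a sketch of how the comparison can be set up, the snippet below runs the same pattern with and without LC_ALL=C on a tiny made-up file and checks that both report the same number of matching lines (the sample text is hypothetical, and GNU grep is assumed).

```shell
# Hypothetical two-line sample; the real run would use merged.txt.
sample=$(mktemp)
cat > "$sample" <<'EOF'
For an hour she sat taciturn by the fire.
The taciturn clerk stamped the form.
EOF

pattern='\b(she|her|hers|his|he)\b.*taciturn'

# Count matching lines in the current locale and in the C locale.
default_hits=$(grep -acE "$pattern" "$sample")
c_hits=$(LC_ALL=C grep -acE "$pattern" "$sample")
echo "default locale: $default_hits, C locale: $c_hits"
rm -f "$sample"

# On the real file, prefix each command with `time` to compare, e.g.:
#   time LC_ALL=C grep -acE "$pattern" merged.txt
```

For an ASCII-only pattern like this one, the two runs should find the same lines; only the speed is expected to differ.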

Just to see what would happen, I next used the fmt utility to reformat the file. The file was made up of short lines separated by newlines. Using fmt, I rewrapped it into lines of up to 500 characters each. I expected this to make the grep slower, since longer lines mean that many more lines would match:

But on the upside, I was going to get a lot more results. The fmt command decreased the number of lines from 246 million to only 37 million:
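
The exact commands are not shown above. A minimal sketch of the idea, assuming fmt's -w (maximum width) option and wc -l for counting lines, and using a made-up sample file in place of merged.txt:

```shell
# Hypothetical sample: 200 short lines forming one paragraph.
sample=$(mktemp)
wrapped=$(mktemp)
i=0
while [ "$i" -lt 200 ]; do
  echo "the quick brown fox jumps over the lazy dog"
  i=$((i + 1))
done > "$sample"

# Rewrap into lines of at most 500 characters.
fmt -w 500 "$sample" > "$wrapped"

before=$(wc -l < "$sample")
after=$(wc -l < "$wrapped")
# Length of the longest line after rewrapping.
longest=$(awk '{ if (length($0) > m) m = length($0) } END { print m + 0 }' "$wrapped")

echo "lines before: $before, lines after: $after, longest line: $longest"
rm -f "$sample" "$wrapped"
```

On the real file the equivalent would be something like fmt -w 500 merged.txt > merged-fmt.txt, where merged-fmt.txt is a hypothetical output name.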

What actually happened on the next grep was that the time dropped to only 23 seconds:

I guess the reason is that grep has a lot fewer lines to go through.

Unfortunately it looked like fmt had corrupted the text. Here is an example:

I think the reason was that some (or most, or all) of the text files were using Windows-style newlines rather than Unix-style ones which was perhaps confusing fmt. So I used this command to convert all Windows-style newlines into spaces:
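
The original command is not reproduced here. One way to do this, shown purely as a sketch, is to use tr to turn every carriage return (the extra character in a Windows-style line ending) into a space while leaving the line feeds alone, so that fmt still sees where lines end. The sample text is made up.

```shell
# Hypothetical sample with Windows-style (CRLF) line endings.
sample=$(mktemp)
cleaned=$(mktemp)
printf 'first line\r\nsecond line\r\n' > "$sample"

# Replace each carriage return with a space; line feeds are untouched.
tr '\r' ' ' < "$sample" > "$cleaned"

# Count carriage returns left in the output and grab the first line.
cr_left=$(tr -cd '\r' < "$cleaned" | wc -c)
first=$(head -n 1 "$cleaned")

echo "carriage returns remaining: $cr_left"
rm -f "$sample" "$cleaned"
```

Using tr -d '\r' instead would delete the carriage returns rather than replace them, which avoids leaving a trailing space on every line.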

After that operation and running fmt again on the result, grepping again seems to result in non-corrupt results:

And:

I also looked for the corrupted passage above to see how it looked now:

So it all seems fine now.

As far as I know there is no way to speed up the grep significantly further unless I get a lot of RAM and do the grep on a ramdisk, or get a much faster SSD. Just out of curiosity I decided to try out changing the fmt command to make lines of 1500 characters each to see how that affects the grep:

That didn’t actually do anything to speed up the grep further:

Recovering from a SAXParseException error with no data loss

I was working on a book I’m writing in LibreOffice Writer. I was using the docx format for the book, as I planned to finish its formatting in Microsoft Word, not knowing that LibreOffice Writer has a tendency to corrupt docx files. After closing and opening the file again, I received the following error:

I extracted the docx file (which is just a zip file; on Windows you can rename it to something.zip to extract it, while Ubuntu Linux allows you to extract it without renaming it). I found the document.xml file and opened it in VIM. I used the following command to jump to position 791513 on line 2:

791513l

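That is the position number followed by a lowercase L, which moves the cursor that many characters to the right. (Typing the number followed by a vertical bar, as in 791513|, jumps to that screen column directly.)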
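
An alternative to navigating inside an editor is to print only the characters around the reported position from the command line. The sketch below uses sed to pick the line and cut to pick a range of columns. The file contents and the column number are hypothetical stand-ins, and cut -c counts bytes in GNU coreutils, so the columns only line up exactly for plain ASCII text.

```shell
# Hypothetical two-line stand-in for document.xml.
xml=$(mktemp)
printf '%s\n' \
  '<?xml version="1.0"?>' \
  '<w:document><w:body><w:p>Hello</w:p></w:body></w:document>' > "$xml"

line=2
col=21        # hypothetical column; the error in the post reported 791513
start=$((col - 5))
end=$((col + 5))

# Print line 2, then keep only the columns around the reported position.
context=$(sed -n "${line}p" "$xml" | cut -c "${start}-${end}")
echo "$context"
rm -f "$xml"
```

With a wider window (say 200 characters on each side) this gives a quick look at the neighbourhood of an error position without opening a very long line in an editor.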

I didn’t see any error there, so the position LibreOffice Writer reported was wrong. I opened document.xml in Chromium, but it reported the same wrong error position. Since the error message I received was saying the “w:cstheme” attribute was redefined, I decided to use regular expressions to search for it. I spent a stressful hour trying to learn VIM’s ridiculous regular expression syntax, but couldn’t figure it out.

In the end, I decided to use egrep instead. I ran the following command on the command line, which looks for a “w:cstheme” attribute that is not separated by a forward slash from another “w:cstheme” attribute, meaning it will find tags that have duplicate “w:cstheme” attributes, which is the error that LibreOffice Writer is reporting:

egrep "w:cstheme[^/]*w:cstheme" document.xml
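
Since document.xml usually has everything on one or two enormous lines, it can help to print only the offending tags rather than the whole line. The following is a sketch of a slightly stricter variant of the same idea, not the command used above: it uses grep -o to print just the match, and [^<>] instead of [^/] so that a match cannot run across a tag boundary. The sample XML is made up.

```shell
# Hypothetical sample: the first run has a duplicated attribute, the second does not.
xml=$(mktemp)
cat > "$xml" <<'EOF'
<w:r><w:rPr><w:rFonts w:cs="Arial" w:cstheme="majorBidi" w:ascii="Arial" w:cstheme="majorBidi"/></w:rPr></w:r>
<w:r><w:rPr><w:rFonts w:cs="Arial" w:cstheme="majorBidi"/></w:rPr></w:r>
EOF

# Match a single tag that contains w:cstheme= twice.
pattern='<[^<>]*w:cstheme=[^<>]*w:cstheme=[^<>]*>'
dupes=$(grep -oE "$pattern" "$xml")
dupe_count=$(grep -cE "$pattern" "$xml")

echo "$dupes"
echo "lines with a duplicated attribute: $dupe_count"
rm -f "$xml"
```

This prints only the tag that carries the duplicated attribute, which is easier to paste into an editor's search bar than a full line of XML.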

And voila! It highlighted the error:

I copied the highlighted text (using ctrl+shift+c), opened document.xml again in VIM, and pasted the text in VIM’s search bar (first press forward slash to open the search bar, then ctrl+shift+v to paste):

Pressing enter twice, it jumped right to the line and position (“column”) where the error was, which was actually position 817157:

If you understand html/xml, you will see the issue. To correct it, change this:

<w:rFonts w:eastAsia="Times New Roman" w:cs="Times New Roman" w:cstheme="majorBidi" w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cstheme="majorBidi"/>

To this, removing one of the duplicate w:cstheme="majorBidi" attributes:

<w:rFonts w:eastAsia="Times New Roman" w:cs="Times New Roman" w:cstheme="majorBidi" w:ascii="Times New Roman" w:hAnsi="Times New Roman" />
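
When there are many occurrences, the same fix could in principle be scripted instead of being done by hand. The following is only a sketch under some assumptions: the duplicate appears as a second w:cstheme="..." attribute inside the same tag, attribute values contain no angle brackets, and sed supports -E for extended regular expressions. It should be tried on a copy of document.xml first.

```shell
# The broken tag from the post, written to a temporary file.
xml=$(mktemp)
cat > "$xml" <<'EOF'
<w:rFonts w:eastAsia="Times New Roman" w:cs="Times New Roman" w:cstheme="majorBidi" w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cstheme="majorBidi"/>
EOF

# Keep everything up to and including the first w:cstheme attribute plus any
# attributes that follow it, and drop the second w:cstheme attribute.
fixed=$(sed -E 's/(<[^<>]*w:cstheme="[^"]*"[^<>]*) w:cstheme="[^"]*"/\1/g' "$xml")

echo "$fixed"
rm -f "$xml"
```

A tag that carries the attribute three or more times would need the command to be run more than once, since each pass removes one extra copy per tag.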

I searched again for the error in VIM, to make sure there were no repeated errors. I fixed multiple other occurrences of the error until I couldn’t find any more. Now, when opening the document.xml in Chromium, no error was reported:

This was a good sign. I then made the mistake of compressing the parent folder of the document, renaming it to docx, and trying to open it. LibreOffice Writer said the document was corrupted and offered to fix it. It tried but failed. After a long time, I realized my error. I shouldn’t have compressed the parent folder. I should have compressed the files and folders inside the parent folder directly, as follows:

Above I have selected the files and folders that make up the docx document. I then right-clicked them and chose “Compress”, and chose the “zip” option. Below is the compressed file:
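
The same repackaging can be done from the command line. This is a sketch that assumes the Info-ZIP zip and unzip tools are installed. The folder contents below are minimal made-up stand-ins for a real extracted docx, and fixed.docx is a hypothetical output name.

```shell
# Hypothetical stand-in for the extracted docx folder.
work=$(mktemp -d)
mkdir -p "$work/occupy.docx_FILES/word" "$work/occupy.docx_FILES/_rels"
echo '<Types/>' > "$work/occupy.docx_FILES/[Content_Types].xml"
echo '<w:document/>' > "$work/occupy.docx_FILES/word/document.xml"
echo '<Relationships/>' > "$work/occupy.docx_FILES/_rels/.rels"

# Zip from INSIDE the folder, so that entries such as word/document.xml sit
# at the root of the archive instead of under occupy.docx_FILES/.
(cd "$work/occupy.docx_FILES" && zip -q -r ../fixed.docx .)

# List the archive entries, one name per line.
listing=$(unzip -Z1 "$work/fixed.docx")
echo "$listing"
```

If the listing shows every path starting with the parent folder's name, the archive was created one level too high, which is the mistake described above.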

Next, I renamed the file to “occupy.docx_FILES.docx”:

Then I opened the file in LibreOffice Writer, and it worked!

To prevent this in the future, I will save the file in the ODF Text Document format (.odt), which is the native format used by LibreOffice that supposedly doesn’t suffer from this issue. Once the book is done, I will then save it as docx for use in Microsoft Word.