6 Islamic articles on: linux

Speeding up grep of a 12.4 GB text file

I am working on a vocabulary-building book for SAT and GRE students. Below is a picture of the provisional cover of the book:

In order to have a wide corpus of classical texts to find word usage examples, I downloaded a massive ebook collection from the Gutenberg project and merged all of the text files into one big file that reached 12.4 gigabytes in size. I then wrote a PHP script that used the grep utility to search through about 250 million lines of text to find the word usages I needed.

Here is an example of the results for the word “taciturn”:

In order to find interesting examples, I use the following regular expression:

egrep -aRh  -B 20 -A 20 "\b(she|her|hers|his|he)\b.*taciturn" merged.txt

This finds usages of the word that start with a pronoun such as “she”. This helps find usages that occur mostly in novels, rather than other types of books (the Gutenberg collection contains many non-novel files, such as encyclopedias and legal works).

My first step toward speeding up the grep was to move the file to an old SSD I have that is attached to my desktop. The SSD supports up to 200 MB/second read speeds. This was not good enough, so I eventually moved it to my main Samsung SSD which has over 500 MB/second read speeds. Below is a screenshot of the iotop utility reporting a read speed of 447 M/s while grep is running:

My first idea to speed up the grep was to use GNU parallel or xargs, both of which allow grep to make use of multiple CPU cores. This was misguided since the limiting factor in this grep was not CPU usage but disk usage. Since my SSD is being maxed out, there is no point in adding more CPU cores to the task.

Using the following grep command, it took a little over 30 seconds to finish grepping the entire file once:

Here is the output for the time command which tells how long a command takes to finish:

One of the first suggestions I found was to prefix the command with LC_ALL=C. This tells grep to use the plain C locale and treat the text as single bytes rather than decoding multi-byte (UTF-8) characters.
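As a minimal sketch of that suggestion, the snippet below runs the same kind of search under the C locale on a tiny stand-in file. The file name sample.txt and its contents are made up for illustration; on the real corpus the command would point at merged.txt instead.

```shell
# Build a tiny stand-in corpus (hypothetical content, not from the real file).
printf '%s\n' 'All day he was taciturn.' 'Nothing to see here.' 'At dinner she grew taciturn.' > sample.txt

# LC_ALL=C applies only to this one command and makes grep work on raw bytes.
# -a: treat the file as text, -c: count matching lines, -E: extended regex.
matches=$(LC_ALL=C grep -a -c -E '\b(she|her|hers|his|he)\b.*taciturn' sample.txt)
echo "matching lines: $matches"
```

Setting the variable in front of the command keeps the rest of the shell session in its normal locale.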

That seemed to make the grep very slightly faster:

Just to see what would happen, I next used the fmt utility to reformat the file. The file was made up of short lines, all separated by newlines. Using fmt, I changed it to lines of up to 500 characters each. I expected this to make the grep slower, because longer lines are more likely to contain a match, so a larger share of the lines would match:

But on the upside, I was going to get a lot more results. The fmt command decreased the number of lines from 246 million to only 37 million:
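The fmt invocation is not reproduced here. A minimal sketch of what such a re-wrap could look like, assuming GNU fmt and its -w (maximum width) option, with made-up file names and contents:

```shell
# Hypothetical short-line input standing in for merged.txt.
printf '%s\n' 'It was a dark' 'and stormy night;' 'the rain fell' 'in torrents.' > short_lines.txt

# Re-wrap the text so that each output line holds up to 500 characters.
fmt -w 500 short_lines.txt > long_lines.txt

# Compare the line counts before and after the re-wrap.
before=$(wc -l < short_lines.txt)
after=$(wc -l < long_lines.txt)
echo "lines before: $before, after: $after"
```

fmt joins consecutive lines of the same paragraph, which is why the line count drops while the text itself stays the same.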

But actually what happened when I did the next grep was that the grep time decreased to only 23 seconds:

I guess the reason is that grep has a lot fewer lines to go through.

Unfortunately it looked like fmt had corrupted the text. Here is an example:

I think the reason was that some (or most, or all) of the text files were using Windows-style newlines rather than Unix-style ones which was perhaps confusing fmt. So I used this command to convert all Windows-style newlines into spaces:
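The exact command is not reproduced here. One way to do this kind of conversion, assuming the problem is the carriage return that Windows puts in front of each line feed, is tr. The file names below are hypothetical.

```shell
# Hypothetical two-line sample with Windows-style (CRLF) line endings.
printf 'first line\r\nsecond line\r\n' > crlf_sample.txt

# Turn every carriage return into a space; the line feeds stay in place.
tr '\r' ' ' < crlf_sample.txt > lf_sample.txt

# Count the carriage returns that remain in the converted file.
remaining=$(tr -cd '\r' < lf_sample.txt | wc -c)
echo "carriage returns left: $remaining"
```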

After that operation and running fmt again on the result, grepping again gave uncorrupted results:

And:

I also looked for the corrupted passage above to see how it looked now:

So it all seems fine now.

As far as I know there is no way to speed up the grep significantly further unless I get a lot of RAM and do the grep on a ramdisk, or get a much faster SSD. Just out of curiosity I decided to try out changing the fmt command to make lines of 1500 characters each to see how that affects the grep:

That didn’t actually do anything to speed up the grep further:

Recovering from a SAXParseException error with no data loss

I was working on a book I’m writing in LibreOffice Writer. I was using the docx format for the book, since I planned to finish its formatting in Microsoft Word, not knowing that LibreOffice Writer has a tendency to corrupt docx files. After closing and opening the file again, I received the following error:

I extracted the docx file (which is just a zip file; on Windows you can rename it to something.zip to extract it, while Ubuntu Linux allows you to extract it without renaming it). I found the document.xml file and opened it in VIM. I used the following command to jump to position 791513 on line 2:

791513l

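That is the position number followed by a lowercase L, which moves the cursor that many characters to the right. (Typing the number followed by a pipe character, 791513|, jumps to that column instead.)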
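The reported position can also be inspected without an editor. The sketch below prints a small window of text around a given line and column using sed and cut. The file demo.xml and the small numbers are made up so the example runs quickly; in the real case the values would be line 2 and column 791513. Note that some versions of cut count bytes rather than characters, so the window can be slightly off in text with multi-byte characters.

```shell
# Hypothetical stand-in for document.xml: line 2 holds the markup.
printf '%s\n' '<?xml version="1.0"?>' '<a><b x="1" x="2"/></a>' > demo.xml

line=2      # line number from the error message
col=13      # column from the error message (791513 in the real case)
span=5      # how many characters of context to show on each side

# Print only the requested line, then cut out the window around the column.
start=$(( col > span ? col - span : 1 ))
context=$(sed -n "${line}p" demo.xml | cut -c "${start}-$(( col + span ))")
echo "$context"
```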

I didn’t see any error there, so the position LibreOffice Writer reported was wrong. I opened document.xml in Chromium, but it reported the same wrong error position. Since the error message said that the “w:cstheme” attribute was redefined, I decided to use regular expressions to search for it. I spent a stressful hour trying to learn VIM’s ridiculous regular expression syntax, but couldn’t figure it out.

In the end, I decided to use egrep instead. I ran the following command on the command line. It looks for a “w:cstheme” attribute that is not separated by a forward slash from another “w:cstheme” attribute, meaning it will find tags that have duplicate “w:cstheme” attributes, which is the error that LibreOffice Writer was reporting:

egrep "w:cstheme[^/]*w:cstheme" document.xml
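Because document.xml keeps nearly everything on one enormous line, it can help to print only the matching text together with where it starts. The sketch below is a variation on the command above, assuming GNU grep, whose -o option prints just the match and whose -b option adds its byte offset. The sample file is made up for illustration.

```shell
# Hypothetical one-line sample: one clean tag, then one tag that repeats w:cstheme.
printf '%s\n' '<w:rFonts w:cs="A" w:cstheme="majorBidi"/><w:rFonts w:cstheme="majorBidi" w:hAnsi="B" w:cstheme="majorBidi"/>' > sample.xml

# -o prints only the matched text; -b prefixes it with its byte offset in the file.
hits=$(grep -obE 'w:cstheme[^/]*w:cstheme' sample.xml)
echo "$hits"
```

Only the second tag is reported, since the first tag closes with a forward slash before any second w:cstheme appears.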

And voila! It highlighted the error:

I copied the highlighted text (using ctrl+shift+c), opened document.xml again in VIM, and pasted the text in VIM’s search bar (first press forward slash to open the search bar, then ctrl+shift+v to paste):

Pressing enter twice, it jumped right to the line and position (“column”) where the error was, which was actually position 817157:

If you understand html/xml, you will see the issue. To correct it, change this:

<w:rFonts w:eastAsia="Times New Roman" w:cs="Times New Roman" w:cstheme="majorBidi" w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cstheme="majorBidi"/>

To this, removing one of the duplicate w:cstheme="majorBidi" attributes:

<w:rFonts w:eastAsia="Times New Roman" w:cs="Times New Roman" w:cstheme="majorBidi" w:ascii="Times New Roman" w:hAnsi="Times New Roman" />

I searched again for the error in VIM, to make sure there were no repeated errors. I fixed multiple other occurrences of the error until I couldn’t find any more. Now, when opening the document.xml in Chromium, no error was reported:

This was a good sign. I then made the mistake of compressing the parent folder of the document, renaming it to docx, and trying to open it. LibreOffice Writer said the document was corrupted and offered to fix it. It tried but failed. After a long time, I realized my error: I shouldn’t have compressed the parent folder. I should have compressed the files and folders inside the parent folder directly, as follows:

Above I have selected the files and folders that make up the docx document. I then right-clicked it and chose “Compress”, and chose the “zip” option. Below is the compressed file:

Next, I renamed the file to “occupy.docx_FILES.docx”:

Then I opened the file in LibreOffice Writer, and it worked!

To prevent this in the future, I will save the file in the ODF Text Document format (.odt), which is the native format used by LibreOffice that supposedly doesn’t suffer from this issue. Once the book is done, I will then save it as docx for use in Microsoft Word.

Setting the default search path for Catfish File Search

The Problem

It was surprisingly difficult to find out how to set a default search path for Catfish. On the command line, you can simply do this to always make Catfish search from the root directory:

alias catfish='/usr/bin/catfish --path=/'

But adding this line to your ~/.bashrc file will not affect the default search path for Catfish unless you launch it from the command line. What I wanted was to be able to click on the Catfish icon on the Unity launcher and have it launch in my root directory (so that it would search all my hard drives). Typically for a Linux program, Catfish’s settings do not offer a way to set this.

The Solution

The solution is to open up the file /usr/share/applications/catfish.desktop as root, for example by typing this on the command line:

gksudo gedit /usr/share/applications/catfish.desktop

Once the file opens up, change the Exec line near the bottom as follows, adding a --path=/some_path/ to the end of the line. Below, I’ve only put a forward slash as a path, meaning I want catfish to search everything, including all mounted hard drives.
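The edited line itself is not reproduced here. As a sketch, the same change can be made with sed. The example works on a made-up copy in a demo folder rather than on the real file, and the sample Exec line is an assumption, since the actual line in catfish.desktop may contain other arguments. It also assumes GNU sed for the -i option.

```shell
# Hypothetical stand-in for /usr/share/applications/catfish.desktop.
mkdir -p demo_applications
printf '%s\n' '[Desktop Entry]' 'Name=Catfish File Search' 'Exec=catfish' 'Type=Application' > demo_applications/catfish.desktop

# Append --path=/ to the Exec line (& stands for the matched line); other lines are untouched.
sed -i 's|^Exec=catfish.*$|& --path=/|' demo_applications/catfish.desktop

# Show the resulting Exec line.
grep '^Exec=' demo_applications/catfish.desktop
```

Running the sed command a second time would append the option again, so it is worth checking the Exec line before and after.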

Now, when I click the Catfish icon on the Unity launcher, the default search path is “File System”, which is how the program refers to the root directory.

How to search a document on Linux while ignoring diacritics (harakat/accents)

The Problem

Most applications are not smart enough to ignore accents when searching through the text of a document. Here is a screenshot of LibreOffice 5.2 failing to find the Arabic word “bsm” because I didn’t type in every single diacritic:

This is an especially serious problem when searching through Arabic text because the usage of diacritics is totally inconsistent as they are not strictly necessary. Different levels of diacritics are added according to the level of user-friendliness that is desired by the document creator.

Firefox is equally miserable at searching Arabic text:

The Solution

The solution is to open the document in a WebKit-based web browser, which has sensible handling of diacritics. Below is a screenshot of the open source Midori browser succeeding at finding and highlighting the Arabic word I was searching for even though I didn’t type in the diacritics:

Other browsers with WebKit-derived engines include Chromium and Chrome, both by Google (they use Blink, a fork of WebKit). I would rather use a non-Google browser personally, so Midori is my preferred option.

If your document is not in the HTML format (the format that browsers use), you can use LibreOffice or Microsoft Word (etc.) to save it as HTML.

Fixing washed out colors in Ubuntu 16.10

How I improved the colors and brightness/contrast and the appearance of fonts on my Ubuntu 16.10 PC monitor

After moving to Ubuntu from Windows, one thing that constantly annoyed me was the washed out/stark colors on my monitor and the ugly-looking fonts in Firefox. I spent hours fiddling with my monitor’s settings, color profiles in the Color tool, and various random font-related hacks in the terminal, to no benefit.

Today I finally found the solution. It was to connect my monitor to my PC using a DVI cable instead of HDMI. Apparently there is an issue with communication between the Linux kernel and most (if not all) monitors when they use HDMI and DisplayPort. At least, this is what I understood from this discussion on Kernel.org.

One other possible solution is using this command (replace HDMI1 with the name of your display as given by the command /usr/bin/xrandr -q --prop | grep ' connected'):

xrandr --output HDMI1 --set "Broadcast RGB" "Full"

When trying the above, I kept getting this error:

BadName (named color or font does not exist)
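One possible cause of this error, offered here as an assumption rather than a confirmed diagnosis, is that the display driver does not expose a “Broadcast RGB” property on that output at all. The sketch below checks for the property in saved xrandr -q --prop output. The sample text is made up for illustration; on a real system the output of that command would be saved to the file instead.

```shell
# Made-up excerpt of `xrandr -q --prop` output, saved to a file for the demo.
printf '%s\n' \
    'HDMI1 connected 1920x1080+0+0 (normal left inverted right x axis y axis) 527mm x 296mm' \
    '    Broadcast RGB: Automatic' \
    '        supported: Automatic, Full, Limited 16:235' \
    'DVI-0 connected 1920x1080+1920+0 (normal left inverted right x axis y axis) 527mm x 296mm' \
    '    underscan: off' \
    'VGA1 disconnected (normal left inverted right x axis y axis)' > xrandr_sample.txt

# For each connected output, print its name and whether a "Broadcast RGB" property follows it.
report=$(awk '
    /^[^ \t]/ { if (name != "") print name, (has ? "yes" : "no")
                name = ($2 == "connected") ? $1 : ""; has = 0 }
    /^[ \t]+Broadcast RGB:/ { has = 1 }
    END { if (name != "") print name, (has ? "yes" : "no") }
' xrandr_sample.txt)
echo "$report"
```

An output that reports “no” has nothing for the --set command to change.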

Not wanting to spend the rest of my day troubleshooting this error, I took the brute-force approach of using a DVI cable that I fortunately had lying around in a closet.

Fixing the kworker CPU usage / ACPI errors issue on a Skylake motherboard (ASRock Z170 Pro4)

In which ASRock bricks my motherboard and a random $10 Chinese device comes to the rescue, with the help of a German gentleman

Since I no longer trust the spyware that is Windows 10, I have wanted to move my main PC (6700K CPU, R9 290 graphic card, ASRock Z170 Pro4 motherboard) to Linux for months now and finally did it yesterday. Everything worked as expected until, while working inside Ubuntu, I started getting messages that the computer was low on disk space even though I had allocated 25 gigabytes to the root partition.

Using ncdu in the terminal, I found that the log folder was taking up all the space, and found that /var/log/kern.log and /var/log/syslog were being written to at what seemed to be a rate of 1 MB/second, with endless repetitions of:

ACPI Error: Method parse/execution failed [\_GPE._L6F] (Node ...), AE_NOT_FOUND (...)
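A quick way to confirm this kind of flooding from the terminal is to sort the log files by size and count how many lines carry the repeated message. The sketch below uses a made-up miniature folder called demo_log in place of /var/log.

```shell
# Hypothetical miniature log folder standing in for /var/log.
mkdir -p demo_log
for i in 1 2 3 4 5; do
    printf '%s\n' 'ACPI Error: Method parse/execution failed [\_GPE._L6F] (Node ...), AE_NOT_FOUND (...)'
done > demo_log/kern.log
printf '%s\n' 'session opened for user root' > demo_log/auth.log

# Largest files last: du -k reports sizes in KiB, sort -n orders them numerically.
du -k demo_log/* | sort -n

# Count how many lines of the kernel log are the repeated ACPI error.
acpi_lines=$(grep -c -F 'ACPI Error: Method parse/execution failed' demo_log/kern.log)
echo "ACPI error lines: $acpi_lines"
```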

Another issue was that the kworker process was constantly using 100% of one of the eight CPU cores.

Forums suggested this was a motherboard firmware issue. So I decided to do a firmware update. My firmware was a pretty early one, something like version 1.5, while the latest available firmware is 7.3. I went to the UEFI interface and tried using the “Internet Flash” utility provided by ASRock. It successfully retrieved the fact that there was a 7.5 version update available to the firmware, but when clicking on update, it would conveniently fail to connect to the internet. Somehow the geniuses at ASRock had created software that could connect to the internet to ask if an update was available, but on downloading the update it would fail to connect to the internet. Still, I am glad that we are light years ahead of the pain, anguish and days of wasted labor that we used to suffer in the 90’s to fix a simple hardware issue.

I downloaded the BIOS binary file from the ASRock website, put it on a USB flash drive, and went to the UEFI interface again, this time using the “Instant Flash” utility. The first time I tried it, the computer instantly crashed and rebooted, and nothing else happened. I tried a second time. This time it seemed to work, until the firmware update got stuck at 10%. I waited for hours to see if it would finish, but it didn’t. I left my computer on overnight, thinking that there might still be a tiny chance it would eventually finish. In the morning it hadn’t. So I hard rebooted my PC, and then nothing. It would turn on, but it wouldn’t give any output, not even the ASRock logo that shows at the beginning.

Knowing that the BIOS chip had probably become corrupted from the update and that I had probably upgraded my motherboard from an ASRock to an ASBrick, I looked to see what could be done. After yanking on the BIOS chip on the motherboard for a while, I found that it was designed to come off, so I took it out. I then learned about devices that can reflash a corrupted BIOS chip. I found out about the Chinese device CH341A that sells for about $10. I ordered one made by a company called SMAKN on Amazon with overnight delivery. This morning it arrived.

At first I was dismayed to see that there were three unattached pieces. I thought they might need soldering:

But after watching this video by UltraNSC, I found that I wouldn’t be needing those pieces. I installed the software provided in the description of the video on an old but working Windows 7 laptop that I have, inserted the device, tried installing all the drivers in the file, and still the software (CH341A.exe) wouldn’t detect the device. I unplugged the device and moved to another USB port, and this time the software detected it.

The software detected that the BIOS chip had a size of 16 megabytes, similar to the binary file provided by ASRock. This was a good sign. I erased the BIOS chip with the software, then tried to open the binary file with the software but it wouldn’t detect it because the file provided by ASRock doesn’t have a filetype extension. I renamed the file to have a .bin extension, and now the software could see it. I loaded the file and clicked “Program” to write it to the chip. Everything worked without a problem. I clicked “Verify” to make absolutely sure the data was copied without error and that came out positive.
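The idea behind the “Verify” step can be sketched on the command line: read the chip back into a file and compare it byte for byte with the image that was written. Both file names below are hypothetical, and the dummy data stands in for real 16 megabyte images.

```shell
# Hypothetical stand-ins: the image that was programmed, and a dump read back from the chip.
printf 'dummy firmware bytes' > bios_image.bin
cp bios_image.bin chip_readback.bin

# cmp -s prints nothing and exits with status 0 only when the two files are identical.
if cmp -s bios_image.bin chip_readback.bin; then
    verify_result="identical"
else
    verify_result="different"
fi
echo "readback vs image: $verify_result"
```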

I put the BIOS chip back into the motherboard and turned the computer on. A message by American Megatrends came up, and pressing F12 took me to the UEFI interface. I rebooted and was immediately taken into Windows, as the UEFI had forgotten my preferred boot device order. Windows tried to do some sort of automatic repair and then restarted the computer, at which point I went into the UEFI interface and told it to use my main SSD as the boot device. Restarting, I was taken into grub, and from there went into Ubuntu. Logging in, I saw that kworker wasn’t acting up anymore, and that the logs weren’t being flooded.

Now it is time to install Windows 7 in a networking-disabled virtual machine inside Ubuntu so that I can continue using OneNote and Photoshop without sending all my data to Microsoft. I have also kept my Windows 10 installation on another partition just in case I ever need it, for example to play Battlefield 1. It seems, though, that I’ve become enough of an adult that video games barely interest me anymore, although I still enjoy watching Stodeh on Twitch.