
How to search a document on Linux while ignoring diacritics (harakat/accents)

The Problem

Most applications are not smart enough to ignore accents when searching through the text of a document. Here is a screenshot of LibreOffice 5.2 failing to find the Arabic word “bsm” because I didn’t type in every single diacritic:

This is an especially serious problem when searching Arabic text, because diacritics are not strictly necessary and their use is therefore inconsistent: authors add more or fewer of them depending on how reader-friendly they want the document to be.

Firefox is equally miserable at searching Arabic text:

The Solution

The solution is to open the document in a WebKit-based web browser, which has sensible handling of diacritics. Below is a screenshot of the open source Midori browser succeeding at finding and highlighting the Arabic word I was searching for even though I didn’t type in the diacritics:

Other browsers in this family include Chromium and Chrome, both by Google; their Blink engine is a fork of WebKit. I would rather use a non-Google browser personally, so Midori is my preferred option.

If your document is not in the HTML format (the format that browsers use), you can use LibreOffice or Microsoft Word (etc.) to save it as HTML.
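For plain-text documents, the same diacritic-insensitive matching can be approximated in a few lines of script. The following is only a minimal sketch, not what any browser actually does internally: it assumes that the only marks to ignore are the eight common harakat (U+064B to U+0652), and the helper names strip_harakat and contains_ignoring_harakat are made up for illustration.

```php
<?php
// Hypothetical helper: delete the eight common Arabic harakat (U+064B..U+0652).
function strip_harakat(string $text): string
{
    // preg_replace() returns null on invalid UTF-8; fall back to the input.
    return preg_replace('/[\x{064B}-\x{0652}]/u', '', $text) ?? $text;
}

// Hypothetical helper: does $needle occur in $haystack once harakat are ignored?
function contains_ignoring_harakat(string $haystack, string $needle): bool
{
    return strpos(strip_harakat($haystack), strip_harakat($needle)) !== false;
}

// Fully vocalized text, written with escapes so the code points are explicit:
// بِسْمِ اللَّهِ
$line = "\u{0628}\u{0650}\u{0633}\u{0652}\u{0645}\u{0650} "
      . "\u{0627}\u{0644}\u{0644}\u{0651}\u{064E}\u{0647}\u{0650}";

// The search term typed without diacritics: بسم
$query = "\u{0628}\u{0633}\u{0645}";

var_dump(strpos($line, $query) !== false);           // bool(false): plain search misses it
var_dump(contains_ignoring_harakat($line, $query));  // bool(true): harakat are ignored
```

Stripping both the text and the search term means it does not matter which of the two carries the diacritics.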

How to ignore accents and other diacritics in WordPress/MySQL search (Arabic, French, etc.)

On my new Asmaa.org website, which is an Arabic-language baby name resource, I use a simple loop to show the posts in alphabetical order. Each post title is a baby name:

$args = array( 'paged' => $paged, 'orderby' => 'title', 'order' => 'ASC', 'cat' => $cat_id );
query_posts( $args );
while ( have_posts() ) : the_post();
    // output the post (the baby name) here
endwhile;

Since the Arabic alphabet is an abjad, most vowels are added to a word as diacritical marks. This has the unfortunate consequence that علم and عَلَم, two words that should appear right next to each other, end up miles apart in an alphabetical sort.
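The effect is easy to reproduce outside the database. The sketch below assumes a comparison by byte order (which is roughly how a code-point-based collation behaves; your actual MySQL collation may differ) and uses a third name, عمر, purely as an illustrative neighbour. The helper sort_key is hypothetical.

```php
<?php
// Names written with Unicode escapes so the exact code points are visible.
$ilm  = "\u{0639}\u{0644}\u{0645}";                   // علم  (no diacritics)
$umar = "\u{0639}\u{0645}\u{0631}";                   // عمر
$alam = "\u{0639}\u{064E}\u{0644}\u{064E}\u{0645}";   // عَلَم (two fathas)

// 1. Plain byte-wise sort: the fatha (U+064E) sorts after every letter
//    up to U+064A, so عَلَم lands after عمر.
$byteOrder = [$alam, $umar, $ilm];
usort($byteOrder, 'strcmp');
// $byteOrder is now: علم, عمر, عَلَم

// Hypothetical helper: the name with harakat (U+064B..U+0652) removed.
function sort_key(string $name): string
{
    return preg_replace('/[\x{064B}-\x{0652}]/u', '', $name) ?? $name;
}

// 2. Sort on the stripped key; ties are broken by the raw string so the
//    result does not depend on the sort being stable.
$cleanOrder = [$alam, $umar, $ilm];
usort($cleanOrder, function (string $a, string $b): int {
    return strcmp(sort_key($a), sort_key($b)) ?: strcmp($a, $b);
});
// $cleanOrder is now: علم, عَلَم, عمر

echo implode(', ', $byteOrder), "\n";
echo implode(', ', $cleanOrder), "\n";
```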

I solved the issue with this WordPress filter:

add_filter('posts_orderby', 'cleanse_diacritics');

function cleanse_diacritics($d) {
    // $d is the ORDER BY clause: 'wp_posts.post_title ASC' (or something similar)
    // in a default WordPress install, assuming an ascending alphabetical sort.
    global $wpdb; // gives the real posts table name, whatever the table prefix is

    // If the string 'title' is in the ORDER BY clause, we know that we are
    // dealing with an alphabetical sort. No need to touch other clauses such
    // as ORDER BY post_date.
    if (strpos($d, 'title') !== false) {
        // Replace the default clause WordPress passes to MySQL with one that
        // sorts on the hex form of the title, after deleting hyphens and the
        // UTF-8 byte sequences of the Arabic diacritics (D98B to D992).
        $d = "REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(HEX(REPLACE("
           . "{$wpdb->posts}.post_title, '-', '')), 'D98E', ''), 'D98B', ''), 'D98F', ''), 'D98C', ''), "
           . "'D991', ''), 'D992', ''), 'D990', ''), 'D98D', '') ASC";
    }
    return $d;
}

I got the nested MySQL replace() functions from this StackOverflow answer.

Explanation: When you call query_posts(array('orderby' => 'title')) or something similar, the posts_orderby filter can be used to modify the ORDER BY part of the MySQL query. We wrap the name of the relevant MySQL column in HEX() and then in nested REPLACE() calls that delete each diacritic by the hex form of its UTF-8 byte sequence, which results in a diacritic-insensitive sort.
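Writing the nested REPLACE() calls by hand is error-prone, so one option is to generate them from a list of marks. This is only a sketch under a few assumptions: the function name diacritic_free_sort_expr is made up, the column name is a trusted identifier (it is inserted into the SQL as-is), and the marks are passed as UTF-8 strings.

```php
<?php
// Hypothetical helper: build the HEX()/REPLACE() sort expression for $column,
// deleting hyphens and then the UTF-8 byte sequence of every mark in $marks.
// $column must be a trusted identifier; it is NOT escaped here.
function diacritic_free_sort_expr(string $column, array $marks): string
{
    $expr = "HEX(REPLACE($column, '-', ''))";
    foreach ($marks as $mark) {
        // bin2hex() yields lower case; MySQL's HEX() yields upper case.
        $hex  = strtoupper(bin2hex($mark));
        $expr = "REPLACE($expr, '$hex', '')";
    }
    return $expr;
}

// The eight harakat, in the same order as the hand-written query:
// fatha, fathatan, damma, dammatan, shadda, sukun, kasra, kasratan.
$harakat = [
    "\u{064E}", "\u{064B}", "\u{064F}", "\u{064C}",
    "\u{0651}", "\u{0652}", "\u{0650}", "\u{064D}",
];

$sql = diacritic_free_sort_expr('wp_posts.post_title', $harakat);
echo $sql . " ASC\n";
```

Inside the filter, the returned string followed by ' ASC' could take the place of the hand-written clause.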

If you are dealing with a language other than Arabic, you may need to replace a code with another code (é [C3A9] to e [65] for example) instead of replacing with an empty string.
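To see what such a mapping does to a title, the replacement can be mirrored on the PHP side. The sketch below is illustrative only: the map covers just four French letters and the helper hex_sort_key is hypothetical. In the MySQL query, each pair would become one more REPLACE(..., 'C3A9', '65') layer.

```php
<?php
// Hypothetical map from accented letters to their unaccented base letters.
$map = [
    "\u{00E9}" => 'e',   // é  C3A9 -> 65
    "\u{00E8}" => 'e',   // è  C3A8 -> 65
    "\u{00E0}" => 'a',   // à  C3A0 -> 61
    "\u{00E7}" => 'c',   // ç  C3A7 -> 63
];

// Hypothetical helper: the upper-case hex of $title (the same form MySQL's
// HEX() returns) with every accented letter's bytes replaced by its base letter's.
function hex_sort_key(string $title, array $map): string
{
    $key = strtoupper(bin2hex($title));
    foreach ($map as $from => $to) {
        $key = str_replace(
            strtoupper(bin2hex($from)),
            strtoupper(bin2hex($to)),
            $key
        );
    }
    return $key;
}

$ete = "\u{00E9}t\u{00E9}";             // été
echo hex_sort_key($ete, $map), "\n";    // 657465, the hex of "ete"
```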

Considerations

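The filter posts_orderby does not run for get_posts() by default, because get_posts() sets the suppress_filters argument to true; passing 'suppress_filters' => false in its arguments re-enables the filter. There is also a workaround; see: Using query_posts() as if it is get_posts().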