How to ignore accents and other diacritics in WordPress/MySQL search (Arabic, French, etc.)

On my new Asmaa.org website, which is an Arabic-language baby name resource, I use a simple loop to show the posts in alphabetical order. Each post title is a baby name:

$args = array( 'paged' => $paged, 'orderby' => 'title', 'order' => 'ASC', 'cat' => $cat_id );
query_posts( $args );
while ( have_posts() ) : the_post();
    // display the post here (each post title is a baby name)
endwhile;

Since the Arabic alphabet is an abjad, most vowels are added to a word as diacritical marks. As an unfortunate consequence, علم and عَلَم, two words that should appear right next to each other, end up miles apart in an alphabetical sort.

I solved the issue with this WordPress filter:

add_filter('posts_orderby', 'cleanse_diacritics');

function cleanse_diacritics($d) { // $d is the ORDER BY clause, e.g. 'wp_posts.post_title ASC' (or something
                                  // similar) in a default WordPress install, assuming an ascending title sort
    if (strpos($d, 'title') !== false) { // if the string 'title' is in the ORDER BY clause, we know that
                                         // we are dealing with an alphabetical sort;
                                         // other queries, such as ORDER BY post_date, are left alone

        // Below we replace the default ORDER BY clause that WordPress passes to MySQL
        // with a chain of REPLACE() calls that strip the diacritics before sorting
        $d = 'REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(HEX(REPLACE(
wp_posts.post_title, "-", "")), "D98E", ""), "D98B", ""), "D98F", ""), "D98C", 
""),"D991",""),"D992",""),"D990",""),"D98D","") ASC';
    }
    return $d;
}

I got the nested MySQL REPLACE() functions from this StackOverflow answer. I am not sure why I used that function name; it had something to do with the fact that I was logging the $d variable when I first created it, just to see what it was.
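One thing to note about the filter above: it always ends the clause with ASC, so a descending title sort would come out ascending as well. Here is a minimal sketch, assuming a single-column ORDER BY clause, of how the direction could be read from the incoming string. The function name orderby_direction is hypothetical and not part of WordPress.

```php
<?php
// Hypothetical helper: returns 'DESC' if the ORDER BY clause ends in DESC
// (case-insensitive), and 'ASC' otherwise. Assumes a single sort column.
function orderby_direction(string $orderby): string {
    return preg_match('/\bDESC\s*$/i', $orderby) === 1 ? 'DESC' : 'ASC';
}

echo orderby_direction('wp_posts.post_title DESC'), "\n"; // prints DESC
echo orderby_direction('wp_posts.post_title ASC'), "\n";  // prints ASC
```

Inside cleanse_diacritics(), the hardcoded ASC at the end of the string could then be replaced by this value.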

Explanation: When you call query_posts(array('orderby' => 'title')) or something similar, the posts_orderby filter can be used to modify the ORDER BY part of the MySQL query. We strip hyphens from the relevant column, convert it to hex with HEX(), and wrap it in REPLACE() calls that remove each diacritic by the hex form of its UTF-8 encoding (D98E, for example, is the fatha). The result is a diacritic-insensitive sort.
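To see why stripping those hex codes makes the two spellings sort together, here is a minimal sketch that emulates the chain in plain PHP, outside MySQL. The function name sort_key_hex is hypothetical, and PHP's bin2hex() and str_replace() stand in for MySQL's HEX() and REPLACE().

```php
<?php
// Hypothetical helper: emulates HEX(REPLACE(title, "-", "")) followed by
// the chain of REPLACE(..., code, "") calls used in the filter.
function sort_key_hex(string $title): string {
    // Hex forms of the UTF-8 encodings of the eight diacritics from the filter.
    $diacritics = array('D98E', 'D98B', 'D98F', 'D98C', 'D991', 'D992', 'D990', 'D98D');
    // bin2hex() returns lowercase, MySQL's HEX() returns uppercase.
    $hex = strtoupper(bin2hex(str_replace('-', '', $title)));
    return str_replace($diacritics, '', $hex);
}

$plain     = "\u{0639}\u{0644}\u{0645}";                 // the unvocalized word
$vocalized = "\u{0639}\u{064E}\u{0644}\u{064E}\u{0645}"; // the same letters with two fathas

echo sort_key_hex($plain), "\n";     // prints D8B9D984D985
echo sort_key_hex($vocalized), "\n"; // prints D8B9D984D985
```

Both titles reduce to the same key, so MySQL places them side by side.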

If you are dealing with a language other than Arabic, you may need to replace one code with another (é [C3A9] with e [65], for example) instead of replacing it with an empty string.
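For longer lists of replacements, writing the nested calls by hand gets tedious. Here is a minimal sketch that builds the expression from a map of hex codes. The function name build_hex_replace_expr and the two-entry French map are assumptions for illustration, and the codes are assumed to be trusted, hardcoded values rather than user input.

```php
<?php
// Hypothetical helper: wraps a column in HEX() and adds one REPLACE() per map entry.
// Keys and values are hex strings; an empty value removes the character entirely.
function build_hex_replace_expr(string $column, array $map): string {
    $expr = "HEX($column)";
    foreach ($map as $from => $to) {
        $expr = sprintf('REPLACE(%s, "%s", "%s")', $expr, $from, $to);
    }
    return $expr;
}

// Assumed example map: é (C3A9) and è (C3A8) both become e (65).
$map = array('C3A9' => '65', 'C3A8' => '65');
echo build_hex_replace_expr('wp_posts.post_title', $map), "\n";
// prints REPLACE(REPLACE(HEX(wp_posts.post_title), "C3A9", "65"), "C3A8", "65")
```

The returned string, followed by the sort direction, can be used as the return value of the posts_orderby filter.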

Considerations

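The posts_orderby filter does not work with get_posts() out of the box, because get_posts() sets suppress_filters to true by default. Passing 'suppress_filters' => false in its arguments lets the filter run. There is also a workaround; see: Using query_posts() as if it is get_posts().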
