How to ignore accents and other diacritics in WordPress/MySQL search (Arabic, French, etc.)

On my new Asmaa.org website, which is an Arabic-language baby name resource, I use a simple loop to show the posts in alphabetical order. Each post title is a baby name:

$args = array( 'paged' => $paged, 'orderby'=> 'title', 'order' => 'ASC',  'cat' => $cat_id);
query_posts($args); ?>
while ( have_posts() ) : the_post()

Since the Arabic alphabet is an abjad, most vowels are added to a word as diacritical marks. This has the unfortunate consequence of causing علم and عَلَم, two words that should be shown very close next to each other, to be shown miles apart in an alphabetical sort.

I solved the issue with this WordPress filter:

add_filter('posts_orderby', 'cleanse_diacritics');

function cleanse_diacritics($d) { //$d is this string: 'wp_posts.post_title ASC' (or sth similar) in a default WordPress install
                          //assuming you are sorting alphabetically ascending
    if(strpos($d,'title') !== false) { //if the string 'title' is in the orderby query, we know that
                                       //we are dealing with an alphabetical sort.
                                       //no need to mess with other queries like order by post_date

// below we replace the default order query WordPress passes to MySQL by
// using a whole bunch of replaces to remove diacritics from the sorting
        $d = 'REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(HEX(REPLACE(
wp_posts.post_title, "-", "")), "D98E", ""), "D98B", ""), "D98F", ""), "D98C", 
""),"D991",""),"D992",""),"D990",""),"D98D","") ASC';
    }
    return $d;
}

I got the nested MySQL replace() functions from this StackOverflow answer.

Explanation: When you run a query_posts(array('orderby' => 'title') function or something similar, the posts_orderby filter can be used to modify the order by part of the MySQL query. We wrap the name of the relevant MySQL column in replace() functions to remove all diacritics using their hex UTF-8 code units, which results in a diacritic-insensitive sort.

If you are dealing with a language other than Arabic, you may need to replace a code with another code (é [C3A9] to e [65] for example) instead of replacing with an empty string.

Considerations

The filter posts_orderby does not seem to work with get_posts(). There is a workaround however; see: Using query_posts() as if it is get_posts().

My work is made possible by your kind donations. Donate securely via Stripe (no registration required):
Commenting rules: Politeness is the only rule. We respect your right to disagree with anything we say. But comments with profanity and insults will be deleted.
Subscribe
Notify of
guest
0 Comments
Inline Feedbacks
View all comments