2/02/2010

A recent improvement for Arabic searches


This post is the latest in an ongoing series about how we harness the data we collect to improve our products and services for our users. - Ed.

We've learned that when performing a search on Google, people sometimes forget to separate words with spaces. Moreover, people often mistakenly repeat a letter within a single word. For instance, when writing the query [amazingly beautiful poem], you might write it as [amazingly beautiifullpoem].

These types of errors are much more common in languages like Arabic, whose script is cursive: the shape of each letter changes with its position in the word (initial, middle, final or isolated). Moreover, some Arabic letters never connect to the letter that follows them, so they act as word breaks, meaning that the following letter must be in an "initial" shape. In other words, if the last letter of one word is a word break, the next word already looks separate, and people often leave out the space before it.
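To make the "word break" idea concrete, here is a minimal Python sketch that lists the positions in a string where a space could plausibly have been dropped, that is, right after a letter that does not connect to the next one. The `NON_CONNECTING` set and the `candidate_breaks` helper are illustrative assumptions for this post, not part of any Google system, and the letter list is not meant to be exhaustive.

```python
# Letters that do not join to the letter after them (illustrative, not exhaustive).
NON_CONNECTING = set("اأإآدذرزوؤةىء")


def candidate_breaks(text):
    """Return each index i where a word boundary could sit just before text[i],
    because the preceding letter does not connect forward."""
    return [i for i in range(1, len(text)) if text[i - 1] in NON_CONNECTING]


if __name__ == "__main__":
    word = "وزارةالتعليم"
    for i in candidate_breaks(word):
        # Print every way the string could be split at a candidate position.
        print(word[:i], word[i:])
```

One of the printed splits is the two-word form وزارة التعليم. The rule alone only narrows down where a space might be missing; choosing the right split needs more evidence, such as a vocabulary or search history.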

For example, the queries [وزارةالتعليم] and [وزارة التعليم] have an identical meaning (Ministry of Education), and both forms are common in Arabic documents. They differ only in format: the first query is written as a single word, while the second is written as two. Google needs to understand that while they're written differently, they mean the same thing and should yield the exact same search results.

In this example, both queries were written correctly, just in different formats. But sometimes people simply make errors, like repeating a letter. For example, you might write [راائعة الجماال], doubling the letter "ا" in both query words. In this case the correct spelling is [رائعة الجمال]. It's important that Google search recognizes your query despite spelling errors.

To address issues like this, we recently developed a search ranking improvement that targets certain Arabic queries. Our algorithm employs rules of Arabic spelling and grammar along with signals from historical search data to decide when a query is missing spaces between words or contains unnecessarily repeated letters. Now, when you type a query leaving out spaces or repeating a letter, we'll return better results based not only on what you typed, but also on what our algorithm understands is the "correct" query. For example, here's what happens when you type [قصيدة راائعةالجماال] ([amazingly beautiful poem] in Arabic) with repeated letters and dropped spaces between words.


As you can see, the Google results contain the corrected query, the terms قصيدة رائعة الجمال, in bold.
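For readers who like to see the idea in code, here is a rough Python sketch of how a correction like the one above could be approximated. It is not the algorithm described in this post: the tiny `VOCAB` set stands in for the spelling rules and historical search signals, and the function names (`collapse_repeats`, `segment`, `correct`) are hypothetical.

```python
import re

# Toy vocabulary standing in for real lexical and search-history signals (assumption).
VOCAB = {"قصيدة", "رائعة", "الجمال"}


def collapse_repeats(token):
    """Collapse every run of the same character into a single character."""
    return re.sub(r"(.)\1+", r"\1", token)


def segment(token, vocab):
    """Split token into vocabulary words, trying the longest prefix first.
    Returns a list of words, or None if no full split exists."""
    if not token:
        return []
    for i in range(len(token), 0, -1):
        head = token[:i]
        if head in vocab:
            rest = segment(token[i:], vocab)
            if rest is not None:
                return [head] + rest
    return None


def correct(query, vocab=VOCAB):
    """Return a corrected query; tokens that cannot be fixed are left unchanged."""
    out = []
    for token in query.split():
        if token in vocab:
            out.append(token)
            continue
        words = segment(collapse_repeats(token), vocab)
        out.extend(words if words is not None else [token])
    return " ".join(out)


if __name__ == "__main__":
    print(correct("قصيدة راائعةالجماال"))
```

With this toy vocabulary, the example query comes back as قصيدة رائعة الجمال. Collapsing every repeated letter is deliberately crude, since some correctly spelled words contain doubled letters; a realistic system would weigh each candidate against much richer evidence before rewriting anything.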

For most people, this might seem like a small enhancement. But for us, it’s a big change. Our tests show we've improved search for 10% of Arabic language queries. Which, when you think about it, is a lot of people.





