Google uses statistical learning techniques, as opposed to a rule-based approach. From their FAQs:
Most state-of-the-art, commercial machine-translation systems in use today have been developed using a rule-based approach, and require a lot of work to define vocabularies and grammars.
Our system takes a different approach: we feed the computer billions of words of text, both monolingual text in the target language, and aligned text consisting of examples of human translations between the languages. We then apply statistical learning techniques to build a translation model. We’ve achieved very good results in research evaluations.
Machine translation is a very hard problem, and the quality can be so-so at times. That said, I’m going to reveal the most ridiculous bug I’ve ever encountered using this system. Do you notice anything strange in the following translations, first from German to English and then from German to French?
GERMAN: Output: 4 – 600 Ohm Made in Austria!! Funktionstüchtig! Die Kopfhörer haben einen Spitzen Sound der unverfälscht wieder gegeben wird!! Die Qualität der Kopfhörer ist einfach Spitze.
ENGLISH: Output: 4 – 600 ohms Made in USA! Funktionstüchtig! The headphones have a peak sound of the genuine will be given again! The quality of the headphones is simple tip.
FRENCH: Output: 4 – 600 Ohm Made in France! Fonctionne! Les casques ont un peu de son authentique sera à nouveau! La qualité des écouteurs est facile de pointe.
You should clearly see the issue here. In case you don’t, I’ll be more explicit:
Google Translate sometimes changes the country mentioned in the source text to the main country of the target language (“Made in Austria” becomes “Made in USA” in English and “Made in France” in French). That’s a pretty big bug right there. Certain terms should be translated verbatim using a dictionary mapping, especially something as simple and finite as country names.
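A minimal sketch of that dictionary-mapping idea: protect known country names with placeholders before handing the text to the statistical translator, then restore them afterwards. Everything here (the `COUNTRY_NAMES` list, the placeholder scheme, the function names) is invented for illustration, not how Google actually does it:

```python
# Hypothetical fix: shield country names from the statistical model by
# replacing them with opaque placeholders, translating, then restoring.

COUNTRY_NAMES = {"Austria", "France", "Germany", "Italy"}

def protect_entities(text):
    """Swap known country names for numbered placeholder tokens."""
    mapping = {}
    for i, name in enumerate(sorted(COUNTRY_NAMES)):
        if name in text:
            token = f"__ENT{i}__"
            text = text.replace(name, token)
            mapping[token] = name
    return text, mapping

def restore_entities(text, mapping):
    """Put the protected names back after translation."""
    for token, name in mapping.items():
        text = text.replace(token, name)
    return text

protected, mapping = protect_entities("Made in Austria!! Funktionstüchtig!")
# ... run the statistical translator on `protected` here ...
print(restore_entities(protected, mapping))
```

Since the placeholders carry no statistical associations, the model has nothing to “helpfully” localize, and “Austria” survives the round trip.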
Thanks to my friend Ludo who noticed this bug.
Google Suggest’s racial oddity
While we are on the topic of Google bugs and anomalies, I’ll add a small oddity to the mix. I must preface this part of my post by clarifying that I respect all ethnicities and colors and have good friends from all over the place. I am against racism, but not against discussions about racism. Be warned: I won’t publish anyone’s racist or offensive comments. What this post does is merely point out Google Suggest’s selective behavior, which of course gets picked up by Firefox’s Google search box in the top corner, too.
Google suggestions are based on the number of queries received and the number of results for any given query. This means that typing a term into Google Suggest will reveal the most popular queries starting with it. In Google’s own words:
Our algorithms use a wide range of information to predict the queries users are most likely to want to see.
For example, if I write “money is”, Google will suggest: “money is the root of all evil”, “money is debt”, “money is power”, “money is everything” and so on.
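The mechanism can be approximated with a toy model: keep a log of query frequencies and return the most popular entries matching the typed prefix. The query log and counts below are made up for illustration; Google’s real ranking uses far more signals:

```python
from collections import Counter

# Invented query log: query text -> how many times it was searched.
query_log = Counter({
    "money is the root of all evil": 900,
    "money is debt": 700,
    "money is power": 500,
    "money is everything": 300,
    "monkey bars": 50,
})

def suggest(prefix, k=4):
    """Return the k most popular logged queries starting with `prefix`."""
    matches = [(q, n) for q, n in query_log.items() if q.startswith(prefix)]
    matches.sort(key=lambda qn: -qn[1])  # most popular first
    return [q for q, _ in matches[:k]]

print(suggest("money is"))
```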
I’m an Italian programmer, so I tried “programmers are” and got the hilarious suggestion that “programmers are lazy”. 🙂 Alright, what about “Italians are”? Here are the results:
Some people are racist; that’s nothing new. These are stereotypes about Italians, flattering and otherwise, and they shouldn’t surprise anyone. And you can’t really blame Google for what people have been typing in the most: Google suggests, automatically, based on the most popular queries. Okay, that’s Italians. What about other nationalities? The most common stereotypes are all well represented: Americans, French, Germans, Spaniards, Chinese, Indians, etc. What about “whites” in general?
Sad, I know. The picture doesn’t change too much if you are looking for “Christians are”, “Muslims are”, “Jews are”, “gays are”, “Cops are”, “men are”, “women are” and so on.
Google won’t suggest anything if the queries are not popular enough. This means that “Caucasians are” is not going to yield any suggestions, but “Caucasians” (alone) will. Google could do one of two things: either blacklist the few dozen racial terms that are popular enough to show up in the suggestions, or simply decide, as a matter of policy, that suggestions are automated and that if you search for stupid race-based queries, you shouldn’t get offended by the suggestions you receive back.
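Both behaviors described here, the popularity threshold and a blacklist of sensitive prefixes, can be sketched in a few lines. The threshold value, the blacklist contents, and the query log below are all invented assumptions (the “purples” suggestions are the ones mentioned above):

```python
# Sketch of the two policies: suppress rare queries, and suppress
# everything under a blacklisted prefix regardless of popularity.

MIN_POPULARITY = 100            # assumed cutoff; the real value is unknown
BLACKLIST = {"blacks"}          # hypothetical policy list

query_log = {
    "purples 80s": 150,
    "purples wxsand": 120,
    "blacks example query": 500,  # popular, but the prefix is blacklisted
    "rare example query": 10,     # too rare to ever surface
}

def suggest(prefix, log=query_log):
    """Return matching suggestions, applying both suppression policies."""
    if any(prefix.startswith(b) for b in BLACKLIST):
        return []  # suppressed by policy, no matter how popular
    matches = [(q, n) for q, n in log.items()
               if q.startswith(prefix) and n >= MIN_POPULARITY]
    matches.sort(key=lambda qn: -qn[1])
    return [q for q, _ in matches]
```

The asymmetry the next paragraph complains about is exactly a blacklist like this one containing a single entry.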
A few years ago there used to be very reprehensible suggestions against black people, as one would expect given the results for the other ethnicities and nationalities, and the racism that unfortunately still exists today. A while ago, though, Google did something rather odd. They removed “blacks” from the list, possibly after receiving complaints, and left everyone else in the suggestion engine. If you search for “blacks are” you won’t find any suggestions, and I’m pretty sure it’s still as popular a query as it ever was, just like “whites are”, “Greeks are” or “Christians are”. On top of that, even if you just search for “blacks”, the engine will not suggest anything. To further convince you: even searching for an unusual term like “purples” still yields two suggestions, “purples 80s” and “purples wxsand”. If this exclusion was the right thing to do, then Google should do the same for the other groups as well. If it wasn’t, then why favor only one group?
I don’t know if we should consider this a form of “selective racism”, but it’s odd and I thought I’d point it out even if the subject is very delicate and risky. If you think about it, it’s not even a racial problem, it’s more of a question of how to make software engineering decisions that properly and equally handle potentially offensive outcomes for some of your users.