Cross-lingual query classification: a preliminary study

The non-English Web is growing at breakneck speed, but available language processing tools are mostly English-based. Taxonomies are a case in point: while there are plenty of commercial and non-commercial taxonomies for the English Web, taxonomies for other languages are either not available or of very limited quality. Given that building taxonomies in all non-English languages is prohibitively expensive, it is natural to ask whether existing English taxonomies can be leveraged, possibly via machine translation, to enable information processing tasks in other languages. Preliminary results presented in this paper indicate that the answer is affirmative with respect to query classification, a task which is essential both for understanding the user intent and thus providing better search results, and for better targeting of search-based advertising, the economic underpinning of commercial Web search engines. We propose a robust method for classifying non-English queries against an English taxonomy using widely available, off-the-shelf machine translation systems. In particular, we show that by viewing the search results in the query's original language as independent sources of information, we can alleviate the impact of poor-quality or erroneous machine translations. Empirical results for Chinese queries are remarkably encouraging.
