Can Chinese web pages be classified with English data source? (Citations: 11)
As the World Wide Web in China grows rapidly, mining knowledge in Chinese Web pages becomes more and more important. Mining Web information usually relies on the machine learning techniques which require a large amount of labeled data to train credible models. Although the number of Chinese Web pages increases quite fast, it still lacks Chinese labeled data. However, there are relatively sufficient English labeled Web pages. These labeled data, though in different linguistic representations, share a substantial amount of semantic information with Chinese ones, and can be utilized to help classify Chinese Web pages. In this paper, we propose an information bottleneck based approach to address this cross-language classification problem. Our algorithm first translates all the Chinese Web pages to English. Then, all the Web pages, including Chinese and English ones, are encoded through an information bottleneck which can allow only limited information to pass. Therefore, in order to retain as much useful information as possible, the common part between Chinese and English Web pages is inclined to be encoded to the same code (i.e., class label), which makes the cross-language classification accurate. We evaluated our approach using the Web pages collected from Open Directory Project (ODP). The experimental results show that our method significantly improves several existing supervised and semi-supervised classifiers.
Conference: World Wide Web Conference Series - WWW, pp. 969-978, 2008