Parallel corpora are a crucial resource in research fields such as cross-lingual information retrieval and statistical machine translation, but only a few high-quality parallel corpora are publicly available. In this paper, we address this problem by developing a system that automatically mines high-quality parallel corpora from the World Wide Web. The system works in three steps. First, a web spider crawls certain hosts. Then, candidate parallel web page pairs are prepared from the downloaded page set. Finally, each candidate pair is examined against multiple criteria. We develop novel strategies for the implementation of the system, which are then proved to be rather effective by the ex...
Parallel sentences are a relatively scarce but extremely useful resource for many applicatio...
We report on methods to create the largest publicly available parallel corpora by crawling the web, ...
Parallel corpora are a valuable resource for machine translation, multi-lingual text retrieval, languag...
Parallel corpora have become an essential resource for work in multilingual natural language process...
In this thesis, we propose a content-based method of mining bilingual parallel documents from websit...
PACLIC 21 / Seoul National University, Seoul, Korea / November 1-3, 2007 / conference paper
Parallel corpora have become an essential resource for work in multilingual natural language process...
Parallel corpora are indispensable resources for a variety of multilingual natural language processi...
Title: Mining Parallel Corpora from the Web Author: Bc. Jakub Kúdela Author's e-mail address: jakub....
Parallel corpora are a valuable resource for machine translation, but at present their availability ...
Documents in languages such as Chinese, Japanese and Korean sometimes annotate terms with their tran...
Documents in languages such as Chinese, Japanese and Korean sometimes annotate terms with their tran...
Discovering parallel corpora on the web is a challenging task. In this paper, we use cross-language ...
Multilingual resources are useful for linguistic studies, translation, and many other tasks. Unfortu...
This paper describes a system that automatically mines English-Chinese translation pairs from large ...