Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content-based measure of translational equivalence, and adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale. Finally, the value of these techniques is demonstrated in the construction of a significant parallel cor...
Parallel corpora are indispensable resources for a variety of multilingual natural language processi...
Multilingual resources are useful for linguistic studies, translation, and many other tasks. Unfortu...
Parallel sentences are a relatively scarce but extremely useful resource for many applicatio...
Parallel corpora have become an essential resource for work in multilingual natural language process...
Parallel corpora are a valuable resource for machine translation, but at present their availability ...
In this thesis, we propose a content-based method of mining bilingual parallel documents from websit...
STRAND (Resnik, 1998) is a language-independent system for automatic discovery of text in parallel t...
STRAND Resnik is a language independent system for automatic discovery of text in parallel transl...
PACLIC 21 / Seoul National University, Seoul, Korea / November 1-3, 2007 / conference paper
Parallel corpora are a crucial resource in research fields such as cross-lingual information retrie...
Title: Mining Parallel Corpora from the Web Author: Bc. Jakub Kúdela Author's e-mail address: jakub....
This paper describes BABYLON, a system that attempts to overcome the shortage of parallel texts in l...
Parallel text is the fuel that drives modern machine translation systems. The Web is a comprehensive...
We report on methods to create the largest publicly available parallel corpora by crawling the web, ...