Tagged corpora are essential for evaluating and training natural language processing tools. The cost of constructing large enough manually tagged corpora is high, even when the annotation level is shallow. This article describes a simple method to automatically create a partially tagged corpus, using Wikipedia hyperlinks. The resulting corpus contains information about the correct segmentation of 523,599 non-consecutive words in 363,090 sentences. We used our method to construct a corpus of Modern Hebrew (which we have made available a...
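To illustrate the general idea behind such hyperlink-based annotation (a minimal sketch, not the authors' exact pipeline), the snippet below scans raw wikitext for [[target|anchor]] links and records each anchor's surface form as a known segmentation span; the function name and sample page are hypothetical.

```python
import re

# Matches [[target]] and [[target|anchor]] wikitext hyperlinks.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

def segmentation_hints(wikitext):
    """Yield (surface_form, target) pairs for every hyperlink in a page.

    The anchor text is a span whose boundaries an editor marked
    explicitly, so it can serve as a partial gold segmentation
    for the sentence it appears in.
    """
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        surface = (match.group(2) or target).strip()
        yield surface, target

if __name__ == "__main__":
    page = "The [[Hebrew language|Hebrew]] corpus uses [[Wikipedia]] links."
    for surface, target in segmentation_hints(page):
        print(f"{surface!r} -> {target!r}")
```

Only the linked spans receive labels, which is why the resulting corpus is partially tagged: words outside anchors remain unannotated.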
The unavailability of very large corpora with semantically disambiguated words is a major limitation...
We descr...
A basic task in first language acquisition likely involves discovering the boundaries between words ...
We present in this work a method of creating high-quality corpora from collections of user generated...
The hyperlink structure of Wikipedia constitutes a key resource for many Natural Language Processing...
Wikipedia is a valuable resource, useful as a linguistic corpus or a dataset for many kinds of resear...
Dione CMB, Kuhn J, Zarrieß S. Design and Development of Part-of-Speech-Tagging Resources for Wolof (...
We present a novel paradigm for obtaining large amounts of training data for computational linguisti...
We propose the framework of a Machine Translation (MT) bootstrapping method by using multilingual Wi...
We describe an innovative computer interface designed to assist annotators in the efficient selectio...
A major architectural decision in designing a disambiguation model for segmentation and Part-of-Speech...
In this paper we propose a new methodology to exploit Wikipedia features and structure to automatically...
We present a constituency parsing system for Modern Hebrew. The system is based on the PCFG-LA parsi...
Encyclopedias, which describe general/technical terms, are valuable language resources (LRs). As wit...
Extracting hypernym relations from text is one of the key steps in the automat...