Tagged corpora are essential for evaluating and training natural language processing tools. The cost of constructing large enough manually tagged corpora is high, even when the annotation level is shallow. This article describes a simple method to automatically create a partially tagged corpus, using Wikipedia hyperlinks. The resulting corpus contains information about the correct segmentation of 523,599 non-consecutive words in 363,090 sentences. We used our method to construct a corpus of Modern Hebrew (which we have made available a...
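To illustrate the general idea behind such hyperlink-based annotation (a minimal sketch, not the authors' exact pipeline), the snippet below scans raw wikitext for [[target|anchor]] links and records each anchor's surface form as a known segmentation span; the function name and sample page are hypothetical.

```python
import re

# Matches [[target]] and [[target|anchor]] wikitext hyperlinks.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

def segmentation_hints(wikitext):
    """Yield (surface_form, target) pairs for every hyperlink in a page.

    The anchor text is a span whose boundaries an editor marked
    explicitly, so it can serve as a partial gold segmentation
    for the sentence it appears in.
    """
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        surface = (match.group(2) or target).strip()
        yield surface, target

if __name__ == "__main__":
    page = "The [[Hebrew language|Hebrew]] corpus uses [[Wikipedia]] links."
    for surface, target in segmentation_hints(page):
        print(f"{surface!r} -> {target!r}")
```

Only the linked spans receive labels, which is why the resulting corpus is partially tagged: words outside anchors remain unannotated.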
The unavailability of very large corpora with semantically disambiguated words is a major limitation...
We descr...
A basic task in first language acquisition likely involves discovering the boundaries between words ...
We present in this work a method of creating high-quality corpora from collections of user generated...
The hyperlink structure of Wikipedia constitutes a key resource for many Natural Language Processing...
Wikipedia is a valuable resource, useful as a linguistic corpus or a dataset for many kinds of resear...
Dione CMB, Kuhn J, Zarrieß S. Design and Development of Part-of-Speech-Tagging Resources for Wolof (...
We present a novel paradigm for obtaining large amounts of training data for computational linguisti...
We propose the framework of a Machine Translation (MT) bootstrapping method by using multilingual Wi...
We describe an innovative computer interface designed to assist annotators in the efficient selectio...
A major architectural decision in designing a disambiguation model for segmentation and Part-of-Speech...
In this paper we propose a new methodology to exploit Wikipedia features and structure to automatically...
We present a constituency parsing system for Modern Hebrew. The system is based on the PCFG-LA parsi...
Encyclopedias, which describe general/technical terms, are valuable language resources (LRs). As wit...
Extracting hypernym relations from text is one of the key steps in the automat...