We investigate the impact of input data scale on corpus-based learning, using an approach in the style of studies of Zipf’s law. Chinese word segmentation is chosen as the case study, and a series of experiments is conducted on it to examine two types of segmentation techniques: statistical learning and rule-based methods. The empirical results show that a linear improvement in the performance of statistical learning requires at least an exponential increase in training corpus size. For the rule-based method, an approximately negative inverse relationship between performance and the size of the input lexicon is observed.
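The two relationships stated above can be made concrete with a small curve-fitting sketch. It assumes one reading of the abstract: performance grows as `a + b*ln(N)` with training corpus size `N` (so each fixed gain costs a constant multiplicative increase in data), and as `c - d/L` with lexicon size `L`. These functional forms, the function names, and the parameters are illustrative assumptions, not the authors' models or code, and the sketch does not reproduce any data from the paper.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def fit_log_model(corpus_sizes, scores):
    """Assumed statistical-learning model: score = a + b*ln(corpus_size)."""
    return fit_line([math.log(n) for n in corpus_sizes], scores)


def fit_inverse_model(lexicon_sizes, scores):
    """Assumed rule-based model: score = c - d/lexicon_size; returns (c, d)."""
    c, slope = fit_line([1.0 / s for s in lexicon_sizes], scores)
    return c, -slope  # the fitted slope on 1/L is -d


def size_multiplier_for_gain(b, gain):
    """Under score = a + b*ln(N), the factor by which N must grow
    to raise the score by `gain`."""
    return math.exp(gain / b)
```

Under the logarithmic model, `size_multiplier_for_gain` shows why linear gains are expensive: the required factor depends only on `gain / b`, so every additional increment of the same size multiplies the corpus by the same amount again. Under the inverse model, the score approaches the ceiling `c` as the lexicon grows, with diminishing returns from each added entry.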
We present an extensive experimental study of Phrase-based Statistical Machine Translation, from the...
The quality of statistical measurements on corpora is strongly related to a strict definition of the...
Copyright © 2014 Longyue Wang et al. This is an open access article distributed under the Creative Co...
Studies of computational models of language acquisition depend to a large part on the input availabl...
Traditionally, it has been assumed that rules are necessary to explain language acquisition. Recentl...
Almost all Chinese language processing tasks involve word segmentation of the language input as thei...
The ability to discover groupings in continuous stimuli on the basis of distributional information i...
We conducted a preliminary study to examine whether Chinese readers' spontaneous word segmentati...
We report experiments on automatic learning of an English-Chinese translation lexicon, through stati...
Over the past decade, rapid technological evolution has revolutionised the study of language; we hav...
Textual information written in Chinese now represents a huge knowledge repository. The first step of...
Chinese texts are renowned for the lack of physical spaces between words in a sentence. Reading thes...
Chinese word segmentation (CWS) is a necessary step in Chinese-English statistical machine translat...
This paper is a comparative study on representing units in Chinese text categorization. Several kind...