The Web provides a large-scale corpus for researchers to study language usage in the real world. Developing a web-scale corpus requires not only substantial computational resources but also considerable effort to handle the large variation in web texts, such as character encoding in Chinese web texts. In this paper, we develop a web-scale Chinese word N-gram corpus with part-of-speech information, called the NTU PN-Gram corpus, using the ClueWeb09 dataset. We focus on character encoding and some Chinese-specific issues. Statistics about the dataset are reported. We will make the resulting corpus a publicly available resource to boost Chinese language processing.
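As a rough illustration of what a word N-gram corpus with part-of-speech information contains, the following minimal sketch counts N-grams over (word, POS) pairs. It assumes the text has already been decoded, segmented, and tagged. The function name `count_pos_ngrams`, the sample sentences, and the tag labels are hypothetical and are not taken from the NTU PN-Gram corpus or its construction pipeline.

```python
from collections import Counter


def count_pos_ngrams(sentences, n):
    """Count n-grams whose tokens are (word, pos) pairs.

    `sentences` is an iterable of token lists; each token is a
    (word, pos) tuple. N-grams never cross sentence boundaries.
    """
    counts = Counter()
    for tokens in sentences:
        # Slide a window of length n over the sentence.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Hypothetical pre-segmented, pre-tagged input (tags are illustrative only).
sentences = [
    [("我", "PN"), ("喜歡", "VV"), ("音樂", "NN")],
    [("我", "PN"), ("喜歡", "VV"), ("電影", "NN")],
]

bigrams = count_pos_ngrams(sentences, 2)
for ngram, freq in bigrams.most_common():
    print(" ".join(f"{word}/{pos}" for word, pos in ngram), freq)
```

Keeping each token as a (word, POS) pair means the same surface word with different tags is counted separately, which is one plausible way to organize such a corpus; the actual format may differ.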
In this paper, we propose an unsupervised segmentation approach, named "n-gram mutual...
In Taiwan, most people speak Mandarin, Southern Min, or Hakka. Not only are the three Chinese dialec...
It has been shown through a number of experiments that neural networks can be used for a phonetic ty...
This paper reveals some important properties of CFSs and applications in Chinese natural language pr...
Chinese script is non-alphabetic and a Chinese graph is basically syllabic which may consist of phon...
In the Chinese language, words consist of characters each of which is composed of one or more compon...
We propose cw2vec, a novel method for learning Chinese word embeddings. It is based on our observati...
Tagging as the most crucial annotation of language resources can still be challenging when the corpu...
Human labeled corpus is indispensable for the training of supervised word segmenters. However, it is...
Textual information written in Chinese now represents a huge knowledge repository. The first step of...
Corpora are excellent resources for learning, particularly considering research showing the importan...
This is a project note on the first stage of the construction of a comprehensive corpus of both Mod...
The 60-year-old dream of computational linguistics is to make computers capable of communi...
Parallel corpus is a valuable resource for cross-language information retrieval and data-driven natu...
This thesis proposes an approach to generating n-gram features for Conditional Random Fields (CRFs) ...