Tagging, the most crucial annotation of language resources, can still be challenging when the corpus is large and its data are not homogeneous. The Chinese Gigaword Corpus faces both challenges: it contains roughly 1.12 billion Chinese characters from two heterogeneous sources, news from Taiwan and news from Mainland China. In other words, beyond its sheer size, the data comprise two variants of Chinese that are known to exhibit substantial linguistic differences. We use the Chinese Sketch Engine as the corpus query tool, through which the grammatical behaviour of the two heterogeneous resources can be captured and displayed in a unified web interface. In this paper, we report our answers to these two challenges.
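To make the "two sub-corpora, one unified query interface" idea concrete, the following minimal Python sketch keeps the Taiwan and Mainland news documents distinguishable by a source label while exposing a single query entry point. It is an illustration only, not the Chinese Sketch Engine's implementation; the class and method names (UnifiedCorpusQuery, add_document, frequency) are hypothetical.

    # Minimal sketch, assuming documents arrive already word-segmented and
    # labelled with their sub-corpus of origin. All names are hypothetical.
    from collections import defaultdict

    class UnifiedCorpusQuery:
        def __init__(self):
            # token -> {subcorpus: frequency}
            self.index = defaultdict(lambda: defaultdict(int))

        def add_document(self, tokens, subcorpus):
            """Register a segmented document under its source sub-corpus."""
            for tok in tokens:
                self.index[tok][subcorpus] += 1

        def frequency(self, token, subcorpus=None):
            """Frequency of a token, overall or restricted to one sub-corpus."""
            counts = self.index.get(token, {})
            if subcorpus is None:
                return sum(counts.values())
            return counts.get(subcorpus, 0)

    # Usage: contrast a lexical item across the two variants
    # (e.g. "taxi" is 計程車 in Taiwan news but 出租车 in Mainland news).
    q = UnifiedCorpusQuery()
    q.add_document(["計程車", "司機"], subcorpus="Taiwan")
    q.add_document(["出租车", "司机"], subcorpus="Mainland")
    print(q.frequency("計程車"), q.frequency("計程車", "Mainland"))

A single index with per-source counts, rather than two separate indexes, is what allows one interface to answer both variant-specific and corpus-wide queries.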