Tagging, the most crucial annotation of language resources, can still be challenging when the corpus is large and its data are not homogeneous. The Chinese Gigaword Corpus faces both challenges: it contains roughly 1.12 billion Chinese characters from two heterogeneous sources, news from Taiwan and news from Mainland China. In other words, beyond its sheer size, the data comprise two variants of Chinese that are known to exhibit substantial linguistic differences. We use the Chinese Sketch Engine as the corpus query tool, through which the grammatical behaviour of the two heterogeneous resources can be captured and displayed in a unified web interface. In this paper, we report our answers to these two challenges.
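To make the "two sub-corpora, one unified query interface" idea concrete, the following minimal Python sketch keeps the Taiwan and Mainland news documents distinguishable by a source label while exposing a single query entry point. It is an illustration only, not the Chinese Sketch Engine's implementation; the class and method names (UnifiedCorpusQuery, add_document, frequency) are hypothetical.

    # Minimal sketch, assuming documents arrive already word-segmented and
    # labelled with their sub-corpus of origin. All names are hypothetical.
    from collections import defaultdict

    class UnifiedCorpusQuery:
        def __init__(self):
            # token -> {subcorpus: frequency}
            self.index = defaultdict(lambda: defaultdict(int))

        def add_document(self, tokens, subcorpus):
            """Register a segmented document under its source sub-corpus."""
            for tok in tokens:
                self.index[tok][subcorpus] += 1

        def frequency(self, token, subcorpus=None):
            """Frequency of a token, overall or restricted to one sub-corpus."""
            counts = self.index.get(token, {})
            if subcorpus is None:
                return sum(counts.values())
            return counts.get(subcorpus, 0)

    # Usage: contrast a lexical item across the two variants
    # (e.g. "taxi" is 計程車 in Taiwan news but 出租车 in Mainland news).
    q = UnifiedCorpusQuery()
    q.add_document(["計程車", "司機"], subcorpus="Taiwan")
    q.add_document(["出租车", "司机"], subcorpus="Mainland")
    print(q.frequency("計程車"), q.frequency("計程車", "Mainland"))

A single index with per-source counts, rather than two separate indexes, is what allows one interface to answer both variant-specific and corpus-wide queries.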