Abstract In this paper, we propose an unsupervised segmentation approach, named "n-gram mutual information" (NGMI), which segments Chinese documents into n-character words or phrases using language statistics drawn from the Chinese Wikipedia corpus. The approach alleviates the tremendous effort required to prepare and maintain manually segmented Chinese text for training purposes, and to manually maintain ever-expanding lexicons. Previously, mutual information was used to achieve automated segmentation into 2-character words. The NGMI approach extends this technique to handle longer n-character words. Experiments with heterogeneous documents from the Chinese Wikipedia collection show good results.
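To make the idea of mutual-information segmentation concrete, the following is a minimal Python sketch of the earlier 2-character baseline mentioned in the abstract, not of NGMI itself. It assumes pointwise mutual information between adjacent characters and a fixed boundary threshold. The function names (`train_counts`, `pmi`, `segment`), the threshold value, and the toy corpus are all hypothetical and chosen for illustration only; the abstract does not specify how NGMI scores longer n-grams.

```python
import math
from collections import Counter


def train_counts(corpus):
    """Count single characters and adjacent character pairs in a list of strings."""
    unigrams, bigrams = Counter(), Counter()
    for text in corpus:
        unigrams.update(text)
        bigrams.update(text[i:i + 2] for i in range(len(text) - 1))
    return unigrams, bigrams


def pmi(a, b, unigrams, bigrams):
    """Pointwise mutual information log(p(ab) / (p(a) * p(b))).

    Returns -inf for character pairs never seen in the corpus.
    """
    pair = bigrams.get(a + b, 0)
    if pair == 0:
        return float("-inf")
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    p_pair = pair / n_bi
    p_a = unigrams[a] / n_uni
    p_b = unigrams[b] / n_uni
    return math.log(p_pair / (p_a * p_b))


def segment(text, unigrams, bigrams, threshold=0.0):
    """Place a boundary between adjacent characters whose PMI is below threshold.

    The threshold is an assumed tuning parameter, not a value from the paper.
    """
    if not text:
        return []
    words, current = [], text[0]
    for prev, ch in zip(text, text[1:]):
        if pmi(prev, ch, unigrams, bigrams) >= threshold:
            current += ch  # strongly associated: keep in the same word
        else:
            words.append(current)  # weakly associated: start a new word
            current = ch
    words.append(current)
    return words


# Toy corpus standing in for corpus statistics (hypothetical data).
toy_corpus = ["中国", "中国", "人民", "人民", "国人"]
uni, bi = train_counts(toy_corpus)
print(segment("中国人民", uni, bi, threshold=1.0))
```

In this sketch a boundary is placed wherever two neighbouring characters co-occur less often than the threshold allows, so runs of strongly associated characters stay together. A real system would estimate the counts from a large corpus such as Chinese Wikipedia and would need a scoring rule for words longer than two characters, which is the gap the NGMI approach addresses.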
In this article, we assign Chinese n-gram sequences to different types by their statistical properti...
In this paper, we propose a joint model for unsupervised Chinese word segmentation (CWS). Inspired b...
This paper presents a bilingual semi-supervised Chinese word segmentation (CWS) method that leverage...
This paper describes our participation in the Chinese word segmentation task of CIPS-SIGHAN 2010. We...
Textual information written in Chinese now represents a huge knowledge repository. The first step of...
As the amount of online Chinese contents grows, there is a critical need for effective Chinese word ...
This thesis proposes an approach to generating n-gram features for Conditional Random Fields (CRFs) ...
In order to analyze security and terrorism related content in Chinese, it is important to perform wo...
A Chinese sentence is typically written as a sequence of characters. However, a word, a logical sema...
Chinese texts do not contain spaces as word separators like English and many alphabetic languages. ...
It is often assumed that Minimum Description Length (MDL) is a good criterion for unsupervised word ...