This paper presents a comparative study of representation units in Chinese text categorization. Several kinds of units were investigated: byte 3-grams, Chinese characters, Chinese words, and Chinese words with part-of-speech tags. Empirical evidence shows that when the training data is large enough, higher-level representations, or those with larger feature spaces, perform better than lower-level representations or those with smaller feature spaces, whereas when the training data is limited the conclusion may be reversed. In general, higher-level representations or those with larger feature spaces need more training data to reach their best performance. But, as to a specific representation, the size of training data and the cat...
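As a rough illustration of the two lowest-level units named above, the following minimal sketch extracts byte 3-grams and character unigrams from a short string. The function names, the sample text, and the GB2312 encoding are assumptions made for this example only; the abstract does not state which encoding or tooling the study used. Word and part-of-speech units require a segmenter and tagger and are not shown.

```python
from collections import Counter


def byte_ngrams(text: str, n: int = 3, encoding: str = "gb2312") -> Counter:
    """Count overlapping byte n-grams of the encoded text.

    The encoding is an assumption for illustration; under GB2312 each
    Chinese character occupies two bytes, so a byte 3-gram spans parts
    of two adjacent characters.
    """
    data = text.encode(encoding)
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def char_unigrams(text: str) -> Counter:
    """Count individual characters, ignoring whitespace."""
    return Counter(ch for ch in text if not ch.isspace())


# Hypothetical sample: "Chinese text categorization"
sample = "中文文本分类"
byte_features = byte_ngrams(sample)
char_features = char_unigrams(sample)
```

The byte-level feature space depends on the chosen encoding, while the character-level space depends only on the text itself, which is one reason the two representations can differ in size on the same corpus.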
Giving further consideration to linguistic features, this study proposes an algorithm of Ch...
Text pre-processing is an important component of Chinese text classification. At present, however,...
Words and n-grams are commonly used units for representing Chinese text and have proved to be good featur...
Text categorization always suffers from a high-dimensionality problem, which leads the learning syst...
The process of text categorization involves some understanding of the content of the doc...
In modern Chinese, units of meaning are predominantly (more than 80%) bi-morphemic. However, in the...
We investigate the impact of input data scale in corpus-based learning using a study style of Zipf’s...
In the processing of Chinese documents and queries in information retrieval (IR), one has to identif...
In this paper, we propose and evaluate approaches to categorizing Chinese texts, which c...
In this paper we propose a novel word representation for Chinese based on a state-of-the-art word em...
Automatic text classification (ATC) is the task of automatically assigning one or more appropriate c...