We motivate the need for dataset profiling in the context of evaluation, and show that textual datasets differ in ways that challenge assumptions about the applicability of techniques. We set out some criteria for useful profiling measures. We argue that the distribution patterns of frequent words are useful in profiling genre, and report on a series of experiments with χ²-based measures on the TIPSTER collection and on textual intranet data. Our findings show substantial differences in the distribution of very frequent terms across datasets.
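As an illustration of what a χ²-based comparison of frequent-word distributions can look like, the following is a minimal sketch and not necessarily the measure used in the experiments. It assumes two pre-tokenized samples and computes the standard χ² statistic over the most frequent words of the combined sample; the function name `chi_square_profile` and the `top_n` parameter are hypothetical.

```python
from collections import Counter


def chi_square_profile(tokens_a, tokens_b, top_n=10):
    """Chi-square statistic over the top_n most frequent words of two samples.

    Illustrative sketch only: expected counts assume each word is spread
    across the two samples in proportion to the sample sizes.
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = len(tokens_a), len(tokens_b)
    combined = counts_a + counts_b

    # Restrict the comparison to the most frequent words overall.
    frequent = [word for word, _ in combined.most_common(top_n)]

    stat = 0.0
    for word in frequent:
        joint = combined[word]
        for observed, total in ((counts_a[word], total_a),
                                (counts_b[word], total_b)):
            expected = joint * total / (total_a + total_b)
            stat += (observed - expected) ** 2 / expected
    return stat


if __name__ == "__main__":
    sample_a = ["the"] * 8 + ["of"] * 2
    sample_b = ["the"] * 2 + ["of"] * 8
    print(chi_square_profile(sample_a, sample_b))
```

A value of zero means the frequent words are distributed identically in both samples; larger values indicate greater divergence. The statistic grows with sample size, so values are only comparable across samples of similar length.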
Comparing frequency counts over texts or corpora is an important task in many application...
Formulaic sequences in language use are often studied by means of the automatic identification of fr...
The discovery of frequent patterns is a famous problem in data mining. While plenty ...
This report describes a series of exploratory experiments to establish whether terms of different se...
Predefined categories can be assigned to natural language text using text classification. It...
Corpus-level term statistics are valuable for numerous text analysis activities, such as term weight...
Comparing frequency counts over texts or corpora is an important task in many applications and scien...
This paper describes a method of comparing routine language use in different corpora, and p...
Most of the complexity of common data mining tasks is due to the unknown amount of information conta...
A number of content management tasks, including term categorization, term clustering, and automated ...
We present a system for mapping the structure of research topics in a corpus. ...
Natural language is a remarkable example of a complex dynamical system which combines variation and ...
In this paper, we review statistical techniques for the direct evaluation of d...