In this paper, we try to fathom the real impact of corpus quality on the performance of methods and on their evaluation. The task considered is topic-based text segmentation, and two highly different unsupervised algorithms are compared: C99, a word-based system augmented with LSA, and Transeg, a sentence-based system. Two main characteristics of corpora are investigated: data quality (clean vs. raw corpora) and corpus manipulation (natural vs. artificial data sets). Corpus size is also varied, and the experiments reported in this paper show that corpus characteristics strongly affect recall and precision values for both algorithms.
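As an illustration of the recall and precision values mentioned above, the following minimal sketch scores a hypothesised segmentation against a reference one. It assumes that boundaries are given as sentence indices and that only exact matches count as correct; the abstract does not state the matching criterion actually used, and the function name and sample boundaries are hypothetical.

```python
def boundary_precision_recall(reference, hypothesis):
    """Score hypothesised topic boundaries against reference boundaries.

    Both arguments are iterables of sentence indices at which a new
    segment starts. Exact-match scoring is an assumption of this sketch.
    """
    ref = set(reference)
    hyp = set(hypothesis)
    correct = len(ref & hyp)  # boundaries found at exactly the right place
    precision = correct / len(hyp) if hyp else 0.0
    recall = correct / len(ref) if ref else 0.0
    return precision, recall


# Hypothetical example: 3 reference boundaries, 4 proposed boundaries.
reference = [3, 7, 12]
hypothesis = [3, 8, 12, 15]
precision, recall = boundary_precision_recall(reference, hypothesis)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Exact matching is strict: the boundary proposed at sentence 8 earns no credit even though it is one sentence away from the reference boundary at 7, which is one reason segmentation scores can be sensitive to how a corpus was built.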
Topic segmentation classically relies on one of two criteria, either finding areas with coherent vo...
The recent explosion of available audio-visual media is the new challenge for information retrieval ...
Choi, Wiemer-Hastings, and Moore (2001) proposed to use Latent Semantic Analysis (LSA) to extract se...
The impact of corpus quality and type on topic based text segmentation evaluation
In this paper, the work done includes the extraction of information from image datasets which contai...
The quality of statistical measurements on corpora is strongly related to a strict definition of the...
Topic segmentation is essential for a lot of Natural Language Processing (NLP) applications, such as...
We investigate the impact of input data scale in corpus-based learning using a study style of Zipf’s...
Neural sentence encoders (NSE) are effective in many NLP tasks, including topic segmentation. Howeve...
Corpora, large bodies of text, are of great importance to the field of Natural Language Processing. ...
We consider here the task of linear thematic segmentation of text documents, by using features based...
Most documents are about more than one subject, but the majority of natural language processing algo...
The aim of the research presented here is to report on a corpus-based method for discourse analysis ...
The LDA topic model describes a corpus on the basis of its vocabulary. Our exp...