The project pursued in this paper is to develop, from first information-geometric principles, a general method for learning the similarity between text documents. Each individual document is modeled as a memoryless information source. Based on a latent class decomposition of the term-document matrix, a low-dimensional (curved) multinomial subfamily is learned. From this model a canonical similarity function, known as the Fisher kernel, is derived. Our approach can be applied to unsupervised and supervised learning problems alike. In particular, this covers interesting cases where both labeled and unlabeled data are available. Experiments in automated indexing and text categorization verify the advantages of the proposed method.
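As an illustration of how a Fisher kernel arises from a memoryless (multinomial) document model, the following is a minimal sketch under simplifying assumptions, not the method of the paper. It replaces the latent class model with a single smoothed unigram distribution theta estimated from the corpus, and it assumes the square-root parameterization of the multinomial, under which the Fisher information is treated as the identity. With these assumptions the kernel reduces to K(d, d') = sum_w P(w|d) P(w|d') / theta_w, where P(w|d) is the empirical term frequency. The function names, the toy corpus, and the add-one smoothing are all hypothetical choices made for this example.

```python
from collections import Counter


def term_frequencies(tokens):
    """Empirical distribution P(w|d) of a tokenized document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def unigram_model(corpus, smoothing=1.0):
    """Smoothed corpus-wide multinomial theta (assumed stand-in for the learned model)."""
    counts = Counter(w for doc in corpus for w in doc)
    vocab = sorted(counts)
    total = sum(counts.values()) + smoothing * len(vocab)
    return {w: (counts[w] + smoothing) / total for w in vocab}


def fisher_kernel(doc_a, doc_b, theta):
    """K(a, b) = sum_w P(w|a) P(w|b) / theta_w for the unigram multinomial.

    Only terms shared by both documents contribute; every term is
    assumed to be present in theta (i.e. seen in the corpus).
    """
    p_a = term_frequencies(doc_a)
    p_b = term_frequencies(doc_b)
    shared = set(p_a) & set(p_b)
    return sum(p_a[w] * p_b[w] / theta[w] for w in shared)


# Hypothetical toy corpus of pre-tokenized documents.
corpus = [
    ["fisher", "kernel", "text"],
    ["text", "retrieval", "index"],
    ["fisher", "information", "geometry"],
]
theta = unigram_model(corpus)
print(fisher_kernel(corpus[0], corpus[2], theta))
```

Dividing by theta_w means that a match on a rare term contributes more than a match on a frequent one, which gives an effect similar in spirit to inverse-document-frequency weighting.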
Large volumes of text are being generated every minute, which necessitates effective and robust tools ...
The notion of similarity between texts is fundamental for many applications of Natural Language Proc...
Abstract. Improving accuracy in Information Retrieval tasks via semantic information is a complex p...
In this paper, we propose an extension of the χ-Sim co-clustering algorithm to deal with the text ca...
Document Clustering is an issue of measuring similarity between documents and grouping similar docum...
Many kinds of texts are now available in various types of databases, and it has been requested to de...
This thesis follows up on text categorization. The first part describes several chosen algorithm...
Web-mediated access to distributed information is a complex problem. Before any learning can start...
This paper presents a novel framework for discriminatively training spoken document similarity model...
The volume of textual information that we encounter on a daily basis continues to grow at an impres...
Document similarity search aims to find documents similar to a query document in a text corpus and r...
Measuring document similarity has shown its fundamental utilization in various text mining applicati...
The multi-label text categorization is supervised learning, where a document is associated with mult...
We present an efficient algorithm called the Quadtree Heuristic for identifying a list of similar te...