We describe an innovative computer interface designed to assist annotators in the efficient selection of segmentation solutions for proper tagging of Sanskrit corpora. The proposed solution uses a compact representation of the shared forest of all segmentations. The main idea is to represent the union of all segmentations, abstracting from the sandhi rules used, and aligning with the input sentence. We show that this representation provides an exponential saving in both space and time. The segmentation methodology is lexicon-directed. When the lexicon does not have full coverage of the corpus vocabulary, some chunks of the input may fail to be recognized. We designed a lexicon acquisition facility, which remedies this incompleteness and mak...
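As a rough illustration of why a shared, input-aligned representation can be exponentially smaller than an explicit list of segmentations, the following minimal sketch stores each candidate word once, keyed by its position in the input, and counts the segmentations without enumerating them. It is not the system described in the abstract: it ignores sandhi entirely, and the toy lexicon, the input string, and the helper names `aligned_segments` and `count_segmentations` are all assumptions made for the example.

```python
from functools import lru_cache


def aligned_segments(text, lexicon):
    """Collect every (start, end) span of `text` whose substring is in the lexicon.

    Each span is stored once, aligned with the input by its offsets.
    Toy assumption: plain substring matching, no sandhi analysis.
    """
    return sorted(
        (i, j)
        for i in range(len(text))
        for j in range(i + 1, len(text) + 1)
        if text[i:j] in lexicon
    )


def count_segmentations(text, segments):
    """Count complete segmentations using the shared spans, without listing them."""
    ends_by_start = {}
    for start, end in segments:
        ends_by_start.setdefault(start, []).append(end)

    @lru_cache(maxsize=None)
    def ways(pos):
        # One way to finish once the whole input has been covered.
        if pos == len(text):
            return 1
        return sum(ways(end) for end in ends_by_start.get(pos, []))

    return ways(0)


# Hypothetical toy data: every "ab" can be read as one word or as two.
text = "ab" * 10
lexicon = {"a", "b", "ab"}

segments = aligned_segments(text, lexicon)
print(len(segments))                        # number of shared spans
print(count_segmentations(text, segments))  # number of distinct segmentations
```

In this toy input the number of stored spans grows linearly with the length of the text, while the number of segmentations doubles with every added "ab", which is the kind of gap a shared representation is meant to exploit.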
Sanskrit has a rich source of lexical resources in the form of various kinds of dictionaries, and a ...
Tagged corpora are essential for evaluating and training natural language processing tools. The cos...
The paper describes an approach to expedite the process of manual annotation of a Hindi dependency t...
This is a proof-of-concept Sanskrit corpus developed for the study of Buddhist Sanskrit lexicology. ...
The work was accepted at the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, S...
This paper describes the efforts at MILE lab, IISc, to create a 100,000-word database each in Kannad...
This is a Sanskrit corpus developed at the Mangalam Research Center (Berkeley, California) for the s...
In this paper we present our efforts, the first of their kind in the history of Sanskrit, to design...
The paper introduces a dependency annotation effort which aims to fully annotate a million word Hind...
The paper describes an approach to automatically annotate a Hindi Treebank using Paninian dependen...
A robust chunker can drastically reduce the complexity of parsing of natural language text. Chunking...
The Sanskrit WordNet is a resource currently under development, whose core was induced from a Vedic ...
Lexical datasets containing annotated concordances of words pertaining to the conceptual domains of ...
Large textual resources are the basis for a variety of applications in the field of corpus linguisti...