In this paper, we describe KOSHIK, an end-to-end framework to process the unstructured natural language content of multilingual documents. We used the Hadoop distributed computing infrastructure to build this framework as it enables KOSHIK to easily scale by adding inexpensive commodity hardware. We designed an annotation model that allows the processing algorithms to incrementally add layers of annotation without modifying the original document. We used the Avro binary format to serialize the documents. Avro is designed for Hadoop and allows other data warehousing tools to directly query the documents. This paper reports the implementation choices and details of the framework, the annotation model, the options for querying processed data, and t...
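The layered annotation idea in the abstract can be illustrated with a minimal stand-off sketch. This is not KOSHIK's actual data model or its Avro schema; the names `Annotation`, `Document`, `add_layer`, `covered_text`, and `tokenize` are hypothetical and chosen only for illustration. The assumption is that each layer stores character offsets into the unchanged source text, so processing steps can add layers one after another without editing the original document.

```python
import re
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    """A labeled span, stored as character offsets into the source text."""
    start: int
    end: int
    label: str


@dataclass
class Document:
    """Hypothetical stand-off document: the text plus named annotation layers."""
    text: str
    layers: dict = field(default_factory=dict)

    def add_layer(self, name, annotations):
        """Attach a new layer; the text and existing layers are left untouched."""
        if name in self.layers:
            raise ValueError(f"layer {name!r} already exists")
        annotations = list(annotations)
        for a in annotations:
            if not (0 <= a.start <= a.end <= len(self.text)):
                raise ValueError(f"span out of range: {a}")
        self.layers[name] = annotations

    def covered_text(self, annotation):
        """Return the substring of the original text that a span points to."""
        return self.text[annotation.start:annotation.end]


def tokenize(doc):
    """Toy whitespace tokenizer that emits a layer of token spans."""
    return [Annotation(m.start(), m.end(), "token")
            for m in re.finditer(r"\S+", doc.text)]


doc = Document("KOSHIK runs on Hadoop")
doc.add_layer("tokens", tokenize(doc))
# A later step adds its own layer on top of the same, unmodified text.
doc.add_layer("entities", [Annotation(0, 6, "SYSTEM")])
```

Because every layer only references offsets, a record of this shape (one text field plus a map of span lists) could plausibly be serialized in a binary row format such as Avro, though the schema the paper uses is not shown here.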
This paper proposes a simple mechanism for supporting multiple overlapping layers of annotations for...
We classify and review current approaches to software infrastructure for research, development and d...
In this paper, we describe a new system to extract, index, search, and visualize entities on Wikiped...
In this paper, we describe Docforia, a multilayer document model and application programming interfa...
In this paper, we describe Langforia, a multilingual processing pipeline to annotate texts with mult...
This thesis explores methods for generating proposition databases in a large-scale and multilingual ...
The Web has evolved into a huge mine of knowledge carved in different forms, the predominant one sti...
In this paper, I describe systems and prototypes we created in the natural language processing group...
The increasing diversity of languages used on the web introduces a new level of complexity to Inform...
In this paper we present a dataset of contemporary Swedish containing one billion words. The dataset...
The online encyclopedia Wikipedia is a vast, constantly evolving tapestry of interlinked articles. F...
While NLP tools are now widely available, their use can be problematic considering the lack of homog...
This paper describes SW1, the first version of a semantically annotated snapshot of the English Wikip...