In this paper we describe a flexible and portable infrastructure for setting up large monolingual language corpora. The approach is based on collecting a large amount of monolingual text from various sources. The input data is processed using a sentence-based text segmentation algorithm. We describe the entry structure of the corpus database as well as various query types and tools for information extraction. Among them, the extraction and usage of sentence-based word collocations is discussed in detail. Finally, we give an overview of different applications of this language resource. A WWW interface allows public access to most of the data and information extraction tools.
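To make the idea of sentence-based word collocations concrete, the following is a minimal sketch in Python. It is not the paper's method: the abstract does not specify the segmentation algorithm or any significance measure, so the regex-based splitter, the raw pair counts, and the function names `split_sentences` and `sentence_cooccurrences` are all assumptions chosen for illustration.

```python
import re
from collections import Counter
from itertools import combinations


def split_sentences(text):
    """Naive sentence splitter (assumption): break on whitespace that
    follows '.', '!' or '?'. Real segmenters also handle abbreviations,
    numbers, and quotations."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_cooccurrences(sentences):
    """Count unordered word pairs that occur together in the same sentence.
    Each pair is counted at most once per sentence."""
    counts = Counter()
    for sentence in sentences:
        # Lowercase, tokenize on word characters, and deduplicate per sentence.
        words = sorted(set(re.findall(r"\w+", sentence.lower())))
        counts.update(combinations(words, 2))
    return counts


text = "The cat sat. The cat ran. A dog ran."
sentences = split_sentences(text)
pairs = sentence_cooccurrences(sentences)
```

Here `pairs` maps alphabetically ordered word pairs to the number of sentences containing both words. A production system would typically replace the raw counts with a statistical significance measure so that frequent function words do not dominate the results.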
Monolingual corpora which are aligned with similar text segments (paragraphs, ...
This paper describes a system of terminological extraction capable of handling multi-word expression...
We present a web service-based environment for the use of linguistic resources and tools to address ...
The Leipzig Corpora Collection offers free online access to 136 monolingual dictionaries enriched wi...
Collected from newspaper texts, webcrawling, etc.: words (+frequency), cooccurrences (+graph), left...
This paper describes crawling and corpus processing in a distributed framework. We present new tools...
Paper presented at: EACL '06: Eleventh Conference of the European Chapter of the Association f...
The Web contains vast amounts of linguistic data. One key issue for linguists and language technolog...
The objective of the PANACEA ICT-2007.2.2 EU project is to build a platform that automates the stage...
Abstract Many translation scholars have proposed the use of corpora to allow professional translator...
This paper deals with multilingual database generation from parallel corpora. The idea is to contrib...
Corpus data have emerged as the raw data/benchmark for several NLP applications. Corpus is described...
This paper describes the Translational English Corpus (TEC) and the software tools developed in orde...
This paper presents SwissCrawl, the largest Swiss German text corpus to date. Composed of more than ...