Finding large amounts of text data for use in natural language technology is difficult for under-resourced languages such as Swahili. The corpora that are readily accessible for these languages are too small for language technologies, whose requirements can run into the hundreds of millions of words. This paper describes how search engines such as Google can be combined with crawling tools to collect Swahili text from the Web. We also share our experience of cleaning and normalising the resulting text data. Finally, we present preliminary evaluation results for language models built from our corpus and compare them with models built from the Helsinki Corpus.
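The cleaning and normalisation step mentioned above can be illustrated with a minimal sketch. It is not the procedure used in the paper, which the abstract does not detail; the function names `normalise_line` and `clean_corpus`, the `min_words` threshold, and the choice to keep only lowercase ASCII letters and apostrophes are all assumptions made for illustration.

```python
import html
import re
import unicodedata

# Hypothetical patterns: the abstract does not specify the cleaning rules.
TAG_RE = re.compile(r"<[^>]+>")          # leftover HTML tags
NON_WORD_RE = re.compile(r"[^a-z' ]+")   # anything except letters, apostrophe, space
SPACE_RE = re.compile(r"\s+")            # runs of whitespace


def normalise_line(line: str) -> str:
    """Strip markup, decode entities, lowercase, and drop non-letter characters."""
    text = TAG_RE.sub(" ", line)
    text = html.unescape(text)
    text = unicodedata.normalize("NFKC", text).lower()
    text = NON_WORD_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def clean_corpus(lines, min_words=3):
    """Normalise crawled lines, dropping very short lines and exact duplicates."""
    seen = set()
    cleaned = []
    for line in lines:
        norm = normalise_line(line)
        # Short lines are often navigation residue (assumed heuristic).
        if len(norm.split()) < min_words or norm in seen:
            continue
        seen.add(norm)
        cleaned.append(norm)
    return cleaned


if __name__ == "__main__":
    crawled = [
        "<p>Habari za asubuhi!</p>",
        "habari za asubuhi",
        "Menu",
        "Karibu Dar es Salaam",
    ]
    for sentence in clean_corpus(crawled):
        print(sentence)
```

A real pipeline would likely also need language identification to filter out non-Swahili pages, which this sketch leaves out.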
Lingala is now the most widespread language in Congo. The Internet provides a great amount o...
The paper discovers and describes generalised usage patterns meant for assisting second language Swa...
In the technical report No. 67, I described how accurate information retrieval from large corpora is...
A Project Report Submitted to the School of Science and Technology in Partial Fulfillment of the Req...
A corpus is a large collection of language data either in written form or spoken form or both. It ca...
Research in machine translation and corpus annotation has greatly benefited from the increasing avai...
In Technical Report 602, I described the process of converting printed text into machine-readable fo...
isiZulu is a Bantu language spoken by approximately 9 million people, but with very few written docu...
In this article we survey four different electronic bilingual dictionaries for the language pair Swa...
In this article the potential of the multilingual Web to function as a corpus, in addition to a sour...
Computational morphological analysis is an important first step in the automatic treatment of natura...
This paper explores the review of Swahili text and speech databases/corpus in different dimensions i...
As far as traditionally published Swahili language dictionaries are concerned, throughout the long h...