Abstract

A comprehensive corpus is crucial today, especially for linguists. To assist linguists in constructing a corpus, a tool for collecting text in a specific language from the Internet is needed. This paper describes an approach to collecting Javanese and Sundanese text from the Internet. We modified a focused crawler named WebSPHINX so that it can be used to crawl such text. To determine which pages are crawled, the focused crawler needs a language classifier. In this research, we used a dictionary algorithm to classify the text. To determine the next links to visit, we employed two crawling methods: Breadth First and By Page Length. The purpose of our research is to observe how the...