As the web keeps growing, identifying and retrieving useful information from this huge amount of data continues to be a problem. One difficulty in web retrieval is that web pages not only contain useful content, but also navigation bars, advertisements, disclaimers, and other boilerplate material. Removing this boilerplate material, i.e. extracting useful content from web documents, has been shown to improve performance of linguistic applications such as classification and clustering. However, it is not clear how content extraction influences retrieval performance. In this paper we systematically evaluate the impact of a simple content extraction method on retrieval performance on two web collections: (i) a blog collection and (ii) an i...
In this paper we discuss the possible application of new concepts in web content extraction: utility...
In this chapter we discuss the possible application of new concepts in web content extraction: utili...
The content of a webpage is usually contained within a small body of text and images, or perhaps sev...
This chapter introduces information extraction from blog texts. It argues that the classical techniq...
Currently we are facing an overburdening growth of the number of reliable information sources on the...
Social media collections are becoming increasingly important in the everyday life of Internet users....
In many domains there are specific attributes in documents that carry more weight than the general w...
Previous work on content extraction utilized various heuristics such as link to text ratio, prominen...
In this paper we present a simple, robust, accurate and language-independent solution for extracting...
As technology grows every day and the amount of research done in various fields rises exponentially t...
The demand for search engines that return precise answers to flexible information queries raises int...
We witness a growing interest and capabilities of automatic content recognition (often referred to a...
In recent years topic retrieval has become a core component in blog information retrieval....
We are experiencing an unprecedented increase of content contributed by users in forums such as blog...
This paper presents BlogBuster, a tool for extracting a corpus from the blogosphere. The topic of cl...