© 2020 Yimeng Dai. The number of webpages is growing exponentially, producing a great volume of unstructured information on the web. It takes time either to fully comprehend a webpage or to retrieve relevant information from a complex one. Analyzing unstructured webpages and automatically extracting structured information from them is therefore crucial. In this study, we aim to develop algorithms for multi-granular webpage information extraction and analysis to facilitate webpage information understanding. We investigate the problem at three levels of granularity: micro, meso, and macro. At each level, we focus on one extraction and analysis task, although the algorithms we developed are general and can be applied to many...
Wrapper induction faces a dilemma: To reach web scale, it requires automatically generated examples,...
Web pages consist of not only actual content, but also other elements such as branding banners, nav...
We consider the problem of content extraction from on-line news webpages. To explore to what extent ...
The web is recognized as the largest data source in the world. The nature of such data is characteri...
With the rapid development of Internet technology, people have more and more access to a variety of ...
Thesis (Ph.D.)--University of Washington, 2021. The World Wide Web contains countless semi-structured ...
We present a novel method for open domain named entity extraction by exploiting the collective hidde...
The explosion of data has made it crucial to analyze the data and distill important information effe...
This thesis explores information extraction (IE) in low-resource conditions, in which the q...
Part 3: Ontology-Web and Social Media AI Modeling (OWESOM). International audience. Web wrappers are sys...
In this paper we address the problem of unsupervised Web data extraction. We show that unsupervised ...
This paper presents deepKnowNet, a new fully automatic method for building highly dense and accurate...
The clustering of topic-related web pages has been recognized as a foundational work in exploiting l...
An important aspect of research for Web information extraction relates to the inference of complex r...
Large pre-trained neural networks are ubiquitous and critical to the success of many downstream task...