The agreement between relevance assessors is an important but understudied topic in the Information Retrieval literature because of the limited data available about documents assessed by multiple judges. This issue has gained even more importance recently in light of crowdsourced relevance judgments, where it is customary to gather many relevance labels for each topic-document pair. In a crowdsourcing setting, agreement is often used as a proxy for quality, although without any systematic verification of the conjecture that higher agreement corresponds to higher quality. In this paper we address this issue and study in particular: the effect of topic on assessor agreement; the relationship between assessor agreement and judgment quality; ...
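The abstract above treats agreement among crowd assessors as a measurable quantity used as a proxy for quality. As a minimal sketch of how such agreement could be quantified, the snippet below computes Fleiss' kappa over graded relevance labels; the function, the 0-3 relevance scale, and the data are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter

def fleiss_kappa(label_matrix, categories):
    """Fleiss' kappa for items each judged by the same number of raters.

    label_matrix: one list of category labels per item
                  (e.g. the crowd labels for one topic-document pair).
    categories:   the possible labels (e.g. graded relevance 0-3).
    """
    n_items = len(label_matrix)
    n_raters = len(label_matrix[0])

    # n_ij: how many raters assigned item i to category j
    counts = [Counter(labels) for labels in label_matrix]

    # Per-item observed agreement P_i, then their mean P_bar
    p_i = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_i) / n_items

    # Expected agreement P_e from the marginal category proportions
    p_j = [sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories]
    p_e = sum(p ** 2 for p in p_j)

    return (p_bar - p_e) / (1 - p_e)

# Hypothetical crowd labels: 3 topic-document pairs, 5 workers each,
# graded relevance on a 0-3 scale (illustrative data only).
labels = [
    [2, 2, 2, 3, 2],
    [0, 0, 1, 0, 0],
    [1, 2, 3, 0, 1],
]
print(round(fleiss_kappa(labels, categories=[0, 1, 2, 3]), 3))
```

Any chance-corrected coefficient (Krippendorff's alpha, Cohen's kappa for paired judges) could stand in here; the point is only that "agreement" in the crowdsourcing setting is an aggregate statistic over the many labels collected per topic-document pair.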
Magnitude estimation is a psychophysical scaling technique for the measurement of sensation, where o...
As the use of machine learning techniques in IR increases, the need for a sound empirical methodolog...
Relevance judgment of human assessors is inherently subjective and dynamic when evaluation datasets ...
In recent years, gathering relevance judgments through non-topic originators has become an increasin...
While crowdsourcing offers a low-cost, scalable way to collect relevance judgments, lack...
Crowdsourcing relevance judgments for test collection construction is attractive because ...
Information Retrieval (IR) researchers have often used existing IR evaluation collections and transf...
In Information Retrieval (IR) evaluation, preference judgments are collected by presenting to the as...
This paper investigates the agreement of relevance assessments between official TREC judgme...
The batch evaluation of information retrieval systems typically makes use of a testbed consisting of...
Crowdsourcing has become an alternative approach to collect relevance judgments at scale thanks to t...
Consistency of relevance judgments is a vital issue for the construction of test collections in info...