This article reviews five approximate statistical tests for determining whether one learning algorithm outperforms another on a particular learning task. These tests are compared experimentally to determine their probability of incorrectly detecting a difference when no difference exists (type I error). Two widely used statistical tests are shown to have high probability of type I error in certain situations and should never be used: a test for the difference of two proportions and a paired-differences t test based on taking several random train-test splits. A third test, a paired-differences t test based on 10-fold cross-validation, exhibits somewhat elevated probability of type I error. A fourth test, McNemar’s test, is shown to have lo...
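As an illustration of the fourth test named in the abstract above, the following is a minimal sketch of McNemar's test for two classifiers evaluated on the same test set. It uses the standard continuity-corrected chi-square form of the statistic; the function name `mcnemar_test` and its arguments are hypothetical and chosen here for illustration, and the sketch is not taken from the article itself.

```python
import math


def mcnemar_test(y_true, pred_a, pred_b):
    """Continuity-corrected McNemar's test (illustrative sketch).

    Returns (statistic, p_value) for the null hypothesis that the two
    classifiers have the same error rate on this test set.
    """
    # n01: examples misclassified by A but not by B.
    n01 = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a != t and b == t)
    # n10: examples misclassified by B but not by A.
    n10 = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a == t and b != t)

    # With no discordant pairs there is no evidence of a difference.
    if n01 + n10 == 0:
        return 0.0, 1.0

    # Chi-square statistic with 1 degree of freedom and continuity correction.
    stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

    # Survival function of chi-square(1): P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


if __name__ == "__main__":
    # Toy data (made up): classifier A errs on the first 10 of 20 examples,
    # classifier B makes no errors.
    y_true = [1] * 20
    pred_a = [0] * 10 + [1] * 10
    pred_b = [1] * 20
    print(mcnemar_test(y_true, pred_a, pred_b))
```

Only the discordant examples (those that exactly one classifier gets wrong) enter the statistic, so the test needs a single train-test split rather than repeated resampling.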
In this paper we explore several issues relevant to the benchmarking and comparison of machine learn...
Most research in statistical learning (SL) has focused on the mean success rates of participants in ...
This paper discusses the issue of comparing multiple classifiers, applied to the same test dataset o...
This paper reviews five statistical tests for determining whether one learning algorithm outperforms...
The mean result of machine learning models is determined by utilizing k-fold cross-validation. The a...
Consistently checking the statistical significance of experimental results is the first mandatory st...
Given a data set and a number of supervised learning algorithms, we would like to find the ...
Developing state-of-the-art approaches for specific tasks is a major driving force in our research c...
A comparison of the training performance, test accuracy, and uncertainty among classifiers in var...
The misclassification error, which is usually used in tests to compare classification algorithms, doe...
One of the greatest machine learning problems of today is an intractable number of new algorithms b...
Wide-ranging digitalization has made it possible to capture increasingly large amounts of data. In ...
The assessment of the performance of learners by means of benchmark experiments is an established ex...
The assessment of the performance of learners by means of benchmark experiments is an established ex...
Three factors are related in analyses of performance curves such as learning curves: the amount of t...