To enable research on automated alignment/interpretability evaluations, we release the experimental results of our paper "Scale Alone Does not Improve Mechanistic Interpretability in Vision Models" as a separate dataset. It is the first dataset containing interpretability measurements obtained through psychophysical experiments for multiple explanation methods and models. The dataset contains more than 120,000 anonymized human responses, each consisting of the final choice, a confidence score, and a reaction time. Of these, more than 69,000 passed all our quality assertions; they form the main data (see responses_main.csv). The remaining responses failed one or more quality assertions and might be of lower quality; they should be ...
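As a rough illustration of how the per-response fields described above (final choice, confidence score, reaction time) might be summarized, here is a minimal sketch using only the Python standard library. The column names `choice`, `confidence`, and `reaction_time` are assumptions, not the documented header of responses_main.csv, and the sample rows are made up for the demonstration.

```python
import csv
import io
from collections import Counter
from statistics import mean

# Hypothetical column names: the actual header of responses_main.csv may differ.
CHOICE_COL = "choice"
CONFIDENCE_COL = "confidence"
RT_COL = "reaction_time"


def summarize_responses(rows):
    """Summarize an iterable of response rows (dicts keyed by column name)."""
    choices = Counter()
    confidences = []
    reaction_times = []
    for row in rows:
        choices[row[CHOICE_COL]] += 1
        confidences.append(float(row[CONFIDENCE_COL]))
        reaction_times.append(float(row[RT_COL]))
    return {
        "n_responses": len(confidences),
        "choice_counts": dict(choices),
        "mean_confidence": mean(confidences) if confidences else None,
        "mean_reaction_time": mean(reaction_times) if reaction_times else None,
    }


def summarize_csv(path):
    """Read a response file from disk and summarize it."""
    with open(path, newline="") as handle:
        return summarize_responses(csv.DictReader(handle))


# Made-up rows for illustration only; these are not values from the dataset.
SAMPLE = io.StringIO(
    "choice,confidence,reaction_time\n"
    "0,3,1.5\n"
    "1,1,2.5\n"
)
summary = summarize_responses(csv.DictReader(SAMPLE))
print(summary)
```

To apply the same summary to the released file, one could call `summarize_csv("responses_main.csv")` after adjusting the three column constants to match the file's actual header.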