The race to train language models on vast, diverse, and inconsistently documented datasets has raised pressing concerns about the legal and ethical risks for practitioners. To remedy these practices threatening data transparency and understanding, we convene a multidisciplinary effort between legal and machine learning experts to systematically audit and trace 1800+ text datasets. We develop tools and standards to trace the lineage of these datasets, from their source, creators, series of license conditions, properties, and subsequent use. Our landscape analysis highlights the sharp divides in composition and focus of commercially open vs closed datasets, with closed datasets monopolizing important categories: lower resource languages, more...
Modern epidemiological analyses to understand and combat the spread of disease depend critically on ...
There is global attention on new data analytic methods. Machine learning (essentially pattern recogn...
The successful development and deployment of AI systems depends on access to data which is used to t...
This paper contributes to a project that maps the concept of ‘data provenance’ into qualitative data...
There is global attention on new data analytic methods. Artificial Intelligence (AI) is seen as a cr...
We introduce a large-scale dataset of the complete texts of free/open source software (FOSS) license...
Data management is growing in complexity as large-scale applications take advantage of the loosely c...
The impact of artificial intelligence (AI) expands relentlessly despite well-documented examples of ...
Data provenance, a record that describes the origins and processing of data, offers new promises in ...
Our research makes a contribution by exemplifying what controls the freedom-to-operate for a company...
The adoption of Artificial Intelligence (AI) in high-stakes domains such as healthcare, wildlife pre...
There is global attention on new data analytic methods. Data scraping (a typical first step for adva...
The World Wide Web evolves into a Web of Data, a huge, globally distributed dataspace that contains ...
A growing community of researchers has been investigating the equity of algorithms, advancing the un...