Datasets Features

Support remote data files #2616 (@albertvillanova)
This lets you pass URLs of remote data files to any dataset loader:
    load_dataset("csv", data_files={"train": [url_to_one_csv_file, url_to_another_csv_file...]})
This works for all of these dataset loaders: text, csv, json, parquet, pandas.

Streaming from remote text/json/csv/parquet/pandas files: when you pass URLs to a dataset loader, you can enable streaming mode with streaming=True.
Main contributions:
    Streaming for the Pandas loader #2636 (@lhoestq)
    Streaming for the CSV loader #2635 (@lhoestq)
    Streaming for the JSON loader #2608 (@albertvillanova) #2638 (@lhoestq)

Faster search_batch for ElasticsearchIndex due to threading #2581 (@mwrzali...
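As a rough illustration of the remote data files and streaming features described above, here is a minimal sketch. The helper names `remote_csv_kwargs` and `stream_first_rows` are hypothetical and not part of the library; only `load_dataset`, `data_files`, and `streaming` come from the release notes. The sketch assumes the `datasets` package is installed and that the URLs you pass point to real CSV files sharing the same columns.

```python
from itertools import islice


def remote_csv_kwargs(train_urls, streaming=False):
    """Build keyword arguments for load_dataset("csv", ...) from remote URLs.

    Hypothetical helper for illustration; it only assembles the arguments.
    """
    urls = list(train_urls)
    if not urls:
        raise ValueError("at least one URL is required")
    return {"data_files": {"train": urls}, "streaming": streaming}


def stream_first_rows(train_urls, n=3):
    """Return the first n rows of the remote CSV files without a full download.

    Hypothetical helper; needs network access and the `datasets` package.
    """
    # Imported lazily so remote_csv_kwargs works without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset("csv", **remote_csv_kwargs(train_urls, streaming=True))
    # With streaming=True each split is iterable and rows are fetched lazily.
    return list(islice(dataset["train"], n))
```

Calling `stream_first_rows([url_to_one_csv_file, url_to_another_csv_file])` with your own URLs should return a short list of row dictionaries. Keeping the argument construction separate from the network call makes the first part easy to check offline.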
Fix minimum tqdm version and import on Colab #2697 (@nateraw) Fix OSCAR Esperanto #2693 (@lhoestq
Dataset changes Update: Adapt all audio datasets #3081 (@patrickvonplaten) Bug fixes Update BibTe...
Improvements Make decoding of Audio and Image feature optional by @mariosasko in https://github.com...
Datasets Changes New: C4 #2575 #2592 (@lhoestq) New: mC4 #2576 (@lhoestq) New: MasakhaNER #2465...
Dataset changes New: CaSiNo #2867 (@kushalchawla) New: Mostly Basic Python Problems #2893 (@lvwe...
New documentation New documentation structure #2718 (@stevhliu): New: Tutorials New: How-to...
Dataset Changes New: NLU evaluation data #2238 (@dkajtoch) New: Add SLR32, SLR52, SLR53 to OpenS...
Datasets Changes New: Add Russian SuperGLUE #2668 (@slowwavesleep) New: Add Disfl-QA #2473 (@bha...
Bug fixes Prioritize module.builder_kwargs over defaults in TestCommand #3672 (@lvwerra) Fix TestCo...
Datasets fixes Fix: irc_disentangle - fix checksum and bug dataset by @albertvillanova in https://g...
Dataset changes Update: LexGLUE and MultiEURLEX README - update dataset cards #3075 (@iliaschalki...
Datasets Changes New: Microsoft CodeXGlue Datasets #2357 (@ncoop57) New: KLUE benchmark #2416 (@...
Bug fixes Fix streaming datasets that are not reset correctly by @lhoestq in https://github.com/hug...
Bug fixes Fix MP3 resampling when a dataset's audio files have different sampling rates by @lhoestq...
Datasets bug fixes Fix cnn_dailymail (dm stories were ignored) by @lhoestq in https://github.com/hu...