Extraction:
- new content hashes and default file names (#314)
- fix deprecation warning with @sdondley in #321
- fix for metadata image by @andremacola in #328
- fix potential unicode issue in third-party extraction with @Korben00 in #331
- review logging levels (#347)

Command-line interface:
- more efficient sitemap processing (#326)
- more efficient downloads (#338)
- fix for single URL processing (#324) and URL blacklisting (#339)

Navigation:
- additional safety check on domain similarity for feeds and sitemaps
- new function is_live test() using HTTP HEAD request (#327)
- code parts supported by new courlan version

Maintenance:
- allow urllib3 version 2.0+
- minor code simplification and fixes

Full Changelog: https://github.com/adbar/trafilatura/compare/v...
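For readers unfamiliar with the idea behind the HEAD-based liveness check mentioned under Navigation, the following is a minimal standard-library sketch of the general technique. It is not trafilatura's implementation: the function name `head_is_live`, the timeout default and the "status below 400 counts as live" rule are assumptions made for illustration only.

```python
import urllib.error
import urllib.request


def head_is_live(url: str, timeout: float = 10.0) -> bool:
    """Hypothetical helper: True if a HEAD request to `url` succeeds.

    A HEAD request asks the server for headers only, so no response
    body is downloaded while checking whether the page is reachable.
    """
    try:
        # Building the request can already fail on malformed URLs.
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # Assumption: any non-error status counts as "live".
            return 200 <= response.status < 400
    except (urllib.error.URLError, ValueError, OSError):
        # Covers HTTP 4xx/5xx (HTTPError), refused connections,
        # timeouts and URLs without a usable scheme.
        return False
```

A caller could use such a check to skip dead links before scheduling a full download, for example `if head_is_live(url): ...`. Some servers do not answer HEAD requests properly, so a real crawler may want a fallback.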
extraction trade-off: slightly better recall; code robustness: requests, configuration and navigation...
faster and more robust text and metadata extraction; more efficient batch processing (parallel proces...
improved author extraction (thanks @felipehertzer!); bugs fixed: HTML element handling, HTML meta att...
Extraction: fixes for metadata extraction with @felipehertzer (#295, #296), @andremacola (#282, #3...
customizable configuration file to parametrize extraction and downloads; better handling of feeds and...
optional language detector changed: langid → pycld3; helper function bare_extraction(); optional dedup...
better, faster encoding detection: replaced chardet with charset_normalizer; faster execution: update...
improved link discovery and handling; fixes in metadata extraction, feeds and sitemaps processing; bre...
added bare_extraction function returning Python variables; improved link discovery in feeds and sitem...
first precision- and recall-oriented presets defined; improvements in authorship extraction (thanks @...