customizable configuration file to parametrize extraction and downloads; better handling of feeds and sitemaps; additional CLI options: cryptographic hash for file name, use Internet Archive as backup; more precise extraction; faster downloads: requests replaced with bare urllib3 and custom decoding; consolidation: bug fixes and improvements, many thanks to the issue reporters
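The hash-based file naming mentioned in the entry above can be pictured with a short standard-library sketch. This is not the project's own code: the helper name `hashed_filename`, the choice of SHA-256, the 16-character truncation and the `.txt` extension are all assumptions made for illustration.

```python
import hashlib


def hashed_filename(content: str, extension: str = ".txt", length: int = 16) -> str:
    """Derive a stable file name from the content itself (hypothetical helper)."""
    # Hash the UTF-8 bytes so that identical content always maps to the same name.
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    # Truncate the hex digest to keep names short; this is an illustrative choice.
    return digest[:length] + extension
```

Naming files after a digest of their content means that re-running an extraction on the same page overwrites the earlier file instead of creating a second copy.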
first precision- and recall-oriented presets defined; improvements in authorship extraction (thanks @...
link discovery in sitemaps; compatibility with Python 3.9; extraction coverage improved; deduplication ...
extended and more convenient command-line options; output in JSON format; bug fixes
Extraction: new content hashes and default file names (#314); fix deprecation warning with @sdondley...
improved link discovery and handling; fixes in metadata extraction, feeds and sitemaps processing; bre...
better handling of formatting, links and images, title type as attribute in XML formats; more robust ...
extraction trade-off: slightly better recall; code robustness: requests, configuration and navigation...
faster and more robust text and metadata extraction; more efficient batch processing (parallel proces...
better metadata extraction and integration (XML & XML-TEI); more efficient processing; output director...
added bare_extraction function returning Python variables; improved link discovery in feeds and sitem...
optional language detector changed: langid → pycld3; helper function bare_extraction(); optional dedup...
better, faster encoding detection: replaced chardet with charset_normalizer; faster execution: update...
focused crawling functions including politeness rules; more efficient multi-threaded downloads + use ...
improved author extraction (thanks @felipehertzer!); bugs fixed: HTML element handling, HTML meta att...
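As a rough illustration of what "politeness rules" can mean for a crawler, the sketch below uses only Python's standard `urllib.robotparser`. It does not reproduce the project's crawling functions: the helper names `build_rules` and `seconds_to_wait`, the user agent string in the usage note and the one-second default delay are assumptions.

```python
from urllib.robotparser import RobotFileParser


def build_rules(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file into a rule set (hypothetical helper)."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def seconds_to_wait(rules: RobotFileParser, user_agent: str,
                    last_request: float, now: float,
                    default_delay: float = 1.0) -> float:
    """Return how long to pause before the next request to the same host."""
    # Use the site's Crawl-delay if it declares one, else an assumed default.
    delay = rules.crawl_delay(user_agent)
    if delay is None:
        delay = default_delay
    # Only the part of the delay that has not yet elapsed remains to be waited.
    return max(0.0, delay - (now - last_request))
```

A crawler built this way would call `rules.can_fetch("mybot", url)` before each download and sleep for `seconds_to_wait(...)` between requests to the same host, passing timestamps from `time.monotonic()`.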