better handling of formatting, links and images; title type as attribute in XML formats; more robust sitemaps and feeds processing; more accurate extraction; further consolidation: code simplified and bugs fixed
better, faster encoding detection: replaced chardet with charset_normalizer; faster execution: update...
focused crawling functions including politeness rules; more efficient multi-threaded downloads + use ...
extended and more convenient command-line options; output in JSON format; bug fixes
improved link discovery and handling; fixes in metadata extraction, feeds and sitemaps processing; bre...
better metadata extraction and integration (XML & XML-TEI); more efficient processing; output director...
extraction trade-off: slightly better recall; code robustness: requests, configuration and navigation...
faster and more robust text and metadata extraction; more efficient batch processing (parallel proces...
improved author extraction (thanks @felipehertzer!); bugs fixed: HTML element handling, HTML meta att...
customizable configuration file to parametrize extraction and downloads; better handling of feeds and...
link discovery in sitemaps; compatibility with Python 3.9; extraction coverage improved; deduplication ...
Extraction: fixes for metadata extraction with @felipehertzer (#295, #296), @andremacola (#282, #3...
added bare_extraction function returning Python variables; improved link discovery in feeds and sitem...
optional language detector changed: langid → pycld3; helper function bare_extraction(); optional dedup...
first precision- and recall-oriented presets defined; improvements in authorship extraction (thanks @...
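
The "politeness rules" mentioned for the focused crawling functions can be illustrated with a minimal sketch. This is not the library's own implementation: it relies only on the standard-library urllib.robotparser module, and the robots.txt content, the helper names (load_rules, polite_delay, filter_allowed) and the 2-second default delay are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from a string so that
# the example needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set (hypothetical helper)."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def polite_delay(rules: RobotFileParser, user_agent: str = "*",
                 default: float = 2.0) -> float:
    """Return the declared crawl delay in seconds, or an assumed default."""
    delay = rules.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


def filter_allowed(rules: RobotFileParser, urls, user_agent: str = "*"):
    """Keep only the URLs that the rules allow this user agent to fetch."""
    return [url for url in urls if rules.can_fetch(user_agent, url)]


if __name__ == "__main__":
    rules = load_rules(ROBOTS_TXT)
    candidates = [
        "https://example.org/blog/post",
        "https://example.org/private/report",
    ]
    print(filter_allowed(rules, candidates))
    print(polite_delay(rules))
```

A crawler built along these lines would wait for the returned delay between two requests to the same host and skip every URL that the filter drops.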