extended and more convenient command-line options; output in JSON format; bug fixe
improved link discovery and handling; fixes in metadata extraction, feeds and sitemaps processing; bre...
efficiency: replaced module readability-lxml by trimmed fork; bugs fixed: (#179, #180, #183, #184); im...
added bare_extraction function returning Python variables; improved link discovery in feeds and sitem...
better metadata extraction and integration (XML & XML-TEI); more efficient processing; output director...
faster and more robust text and metadata extraction; more efficient batch processing (parallel proces...
extraction trade-off: slightly better recall; code robustness: requests, configuration and navigation...
improved author extraction (thanks @felipehertzer!); bugs fixed: HTML element handling, HTML meta att...
focused crawling functions including politeness rules; more efficient multi-threaded downloads + use ...
optional language detector changed: langid → pycld3; helper function bare_extraction(); optional dedup...
First release used in production and meant to be archived on Zenodo for reproducibility and citabili...
link discovery in sitemaps; compatibility with Python 3.9; extraction coverage improved; deduplication ...
better handling of formatting, links and images, title type as attribute in XML formats; more robust ...
customizable configuration file to parametrize extraction and downloads; better handling of feeds and...
better, faster encoding detection: replaced chardet with charset_normalizer; faster execution: update...