link discovery in sitemaps; compatibility with Python 3.9; extraction coverage improved; deduplication now optional; bug fixes
improved exhaustive search; simplified code; bug fixes; removed support for Python 3.
better, faster encoding detection: replaced chardet with charset_normalizer; faster execution: update...
First release used in production and meant to be archived on Zenodo for reproducibility and citabili...
added bare_extraction function returning Python variables; improved link discovery in feeds and sitem...
better handling of formatting, links and images, title type as attribute in XML formats; more robust ...
focused crawling functions including politeness rules; more efficient multi-threaded downloads + use ...
improved link discovery and handling; fixes in metadata extraction, feeds and sitemaps processing; bre...
extended and more convenient command-line options; output in JSON format; bug fixes
improved author extraction (thanks @felipehertzer!); bugs fixed: HTML element handling, HTML meta att...
faster and more robust text and metadata extraction; more efficient batch processing (parallel proces...
customizable configuration file to parametrize extraction and downloads; better handling of feeds and...
better metadata extraction and integration (XML & XML-TEI); more efficient processing; output director...
extraction trade-off: slightly better recall; code robustness: requests, configuration and navigation...
Extraction: extraction bugs fixed (#263, #266), more robust HTML doctype parsing; XML output improve...