extraction trade-off: slightly better recall; code robustness: requests, configuration and navigation; bugfixes: image data extraction
first precision- and recall-oriented presets defined; improvements in authorship extraction (thanks @...
First release used in production and meant to be archived on Zenodo for reproducibility and citabili...
link discovery in sitemaps; compatibility with Python 3.9; extraction coverage improved; deduplication ...
better metadata extraction and integration (XML & XML-TEI); more efficient processing; output director...
faster and more robust text and metadata extraction; more efficient batch processing (parallel proces...
better handling of formatting, links and images, title type as attribute in XML formats; more robust ...
improved link discovery and handling; fixes in metadata extraction, feeds and sitemaps processing; bre...
customizable configuration file to parametrize extraction and downloads; better handling of feeds and...
extended and more convenient command-line options; output in JSON format; bug fixes
better, faster encoding detection: replaced chardet with charset_normalizer; faster execution: update...
improved author extraction (thanks @felipehertzer!); bugs fixed: HTML element handling, HTML meta att...
optional language detector changed: langid → pycld3; helper function bare_extraction(); optional dedup...
more efficient rules for extraction; metadata: further attributes used (with @felipehertzer); better b...
added bare_extraction function returning Python variables; improved link discovery in feeds and sitem...