better metadata extraction and integration (XML & XML-TEI); more efficient processing; output directory as CLI option
better, faster encoding detection: replaced chardet with charset_normalizer; faster execution: update...
first precision- and recall-oriented presets defined; improvements in authorship extraction (thanks @...
added bare_extraction function returning Python variables; improved link discovery in feeds and sitem...
faster and more robust text and metadata extraction; more efficient batch processing (parallel proces...
extraction trade-off: slightly better recall; code robustness: requests, configuration and navigation...
better handling of formatting, links and images, title type as attribute in XML formats; more robust ...
extended and more convenient command-line options; output in JSON format; bug fixes
optional language detector changed: langid → pycld3; helper function bare_extraction(); optional dedup...
improved author extraction (thanks @felipehertzer!); bugs fixed: HTML element handling, HTML meta att...
customizable configuration file to parametrize extraction and downloads; better handling of feeds and...
improved link discovery and handling; fixes in metadata extraction, feeds and sitemaps processing; bre...
focused crawling functions including politeness rules; more efficient multi-threaded downloads + use ...
efficiency: replaced module readability-lxml by trimmed fork; bugs fixed: (#179, #180, #183, #184); im...
First release used in production and meant to be archived on Zenodo for reproducibility and citabili...