Buzzbang crawler 0.0.4 alpha available

This is a minor release with small improvements to the indexer (still
without a proper JSON-LD parsing library), the crawler (avoid failure
on bad sitemaps, don't choke on blank lines in URL listing files), and
configuration (allow alternative locations for the crawl database and
Solr).  More details at [1].  Many thanks to @innovationchef
<https://github.com/innovationchef>, @aswanipranjal
<https://github.com/aswanipranjal> and @HaoPatrick
<https://github.com/haopatrick> for contributions.

For overhauling the crawler, I am now leaning strongly towards
Scrapy/Frontera, for the reasons listed at [2].

[1] https://github.com/justinccdev/bsbang-crawler/releases/tag/0.0.4
[2] https://github.com/justinccdev/bsbang-crawler/wiki/Transition-to-an-established-crawler-package

--
Justin Clark-Casey
Research Software Engineer, InterMine life sciences data integration, U of
Cambridge
http://justincc.org

Received on Monday, 26 February 2018 16:38:43 UTC