
Buzzbang crawler 0.0.4 alpha available

From: Justin Clark-Casey <justinccdev@gmail.com>
Date: Mon, 26 Feb 2018 16:38:13 +0000
Message-ID: <CAME9NR-PYE0rP74Ps_xvV8+nZV3NO5Oi1LL0_Gt0iuHMY8uU0A@mail.gmail.com>
To: public-bioschemas@w3.org
This is a minor release, with small improvements to the indexer (still in
lieu of using a proper JSON-LD parsing library), the crawler (avoid failure
on bad sitemaps, don't choke on blank lines in URL listing files), and
configuration (allow alternative locations for the crawl database and
Solr).  More details at [1].  Many thanks to @innovationchef
<https://github.com/innovationchef>, @aswanipranjal
<https://github.com/aswanipranjal> and @HaoPatrick
<https://github.com/haopatrick> for contributions.
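
To illustrate the two crawler fixes mentioned above, here is a minimal
sketch in Python. It is not the code in the release: the function names
read_url_list and sitemap_locs are hypothetical, and it assumes that a
URL listing holds one URL per line and that sitemaps use the standard
sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps (assumed here, not taken from the release).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def read_url_list(text):
    """Return the URLs in a listing, one per line, ignoring blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def sitemap_locs(xml_text):
    """Return the <loc> values from a sitemap, or [] if it cannot be parsed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # A malformed sitemap is skipped rather than aborting the whole crawl.
        return []
    return [
        el.text.strip()
        for el in root.iter(SITEMAP_NS + "loc")
        if el.text and el.text.strip()
    ]
```

Returning an empty list for a bad sitemap lets the caller carry on with
the remaining sites; a real crawler would probably also want to log the
failure.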

For overhauling the crawler, I am now leaning considerably towards
Scrapy/Frontera, for the reasons listed at [2].

[1] https://github.com/justinccdev/bsbang-crawler/releases/tag/0.0.4
[2] https://github.com/justinccdev/bsbang-crawler/wiki/Transition-to-an-established-crawler-package

--
Justin Clark-Casey
Research Software Engineer, InterMine life sciences data integration, U of
Cambridge
http://justincc.org
Received on Monday, 26 February 2018 16:38:43 UTC
