bsbang-crawler 0.0.3 and bsbang-frontend 0.0.3 released

From: Justin Clark-Casey <jc955@cam.ac.uk>
Date: Wed, 14 Feb 2018 18:22:45 +0000
To: public-bioschemas@w3.org
Message-ID: <fb2d4ab8-87d3-2a2a-6d72-5a55d0014a6b@cam.ac.uk>
This is mainly a packaging-up of work from some time ago, so the details are slightly hazy in my memory.  Highlights:

bsbang-crawler [1]

* Implemented crawling of optional schema properties
* Implemented remapping of properties so that, for example, PhysicalEntity.biologicalType is remapped to PhysicalEntity.additionalType.  I know that example is not very applicable any more, but biosamples were doing this at one stage :)  So not so useful now, but it's the kind of thing that will be needed in the future.
* Crawling and indexing are now 3 separate stages (crawl, extract, index) to make staged data processing easier.
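
As a rough illustration of the property remapping described above, here is a minimal sketch.  It is not the actual bsbang-crawler code: the PROPERTY_REMAP table, the remap_properties function and the sample record are all hypothetical names made up for this example, and it assumes extracted entities are plain dicts keyed by property name with the type under "@type".

```python
# Hypothetical remapping table: (entity type, source property) -> target property.
# Only the PhysicalEntity example comes from the release notes; the structure is assumed.
PROPERTY_REMAP = {
    ("PhysicalEntity", "biologicalType"): "additionalType",
}


def remap_properties(record, remap=PROPERTY_REMAP):
    """Return a copy of record with property names remapped for its type."""
    entity_type = record.get("@type")
    remapped = {}
    for key, value in record.items():
        target = remap.get((entity_type, key), key)
        # If the record already has the target property, keep that explicit
        # value and drop the remapped one instead of overwriting it.
        if target != key and target in record:
            continue
        remapped[target] = value
    return remapped


sample = {"@type": "PhysicalEntity", "biologicalType": "protein", "name": "BRCA1"}
print(remap_properties(sample))
```

In a staged setup like the one above, something of this shape would most naturally sit in the extract stage, so that the index stage only ever sees the remapped names.  That placement is a guess on my part rather than a description of the released code.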

More details at [2].  Still extremely alpha and early, very shallow processing of schemas, etc.

However, having now done this work, the next big item is to look at replacing much of it with a proper crawler like Apache Nutch or similar.  Arguably this is what I should have done in the first place, but I took the short-term fun of hacking something together in Python and now I might be paying for it ^_^

bsbang-frontend [3]

An even more primitive frontend for the Solr index generated by bsbang-crawler.  Very few changes, mainly:

* Displaying Thing.alternativeName (now an optional crawled property)
* Making some properties links if they are urls.
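
The "make it a link if it is a url" behaviour can be sketched roughly as below.  This is an assumption about the approach, not the bsbang-frontend implementation: render_value is a hypothetical helper, and treating only http/https values with a host as links is my own choice for the example.

```python
from html import escape
from urllib.parse import urlparse


def render_value(value):
    """Return HTML for a property value, as a link when it looks like a web URL."""
    text = str(value)
    try:
        parsed = urlparse(text)
    except ValueError:
        # Malformed input (e.g. a broken IPv6 literal) is shown as plain text.
        return escape(text)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        safe = escape(text, quote=True)
        return f'<a href="{safe}">{safe}</a>'
    return escape(text)


print(render_value("https://example.org/sample/1"))
print(render_value("BRCA1 <gene>"))
```

Escaping both branches matters here because the values come from crawled pages and so should not be trusted as markup.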

More details at [4].  Hope to entice a GSoC student to actually make it not butt ugly.

A frontend example with a crawl of a few small sites is still up at [5].

[1] https://github.com/justinccdev/bsbang-crawler
[2] https://github.com/justinccdev/bsbang-crawler/releases/tag/0.0.3
[3] https://github.com/justinccdev/bsbang-frontend
[4] https://github.com/justinccdev/bsbang-frontend/releases/tag/0.0.3
[5] bsbang.science

Justin Clark-Casey
Research Software Engineer, InterMine life sciences data integration, U of Cambridge
http://twitter.com/justincc http://justincc.org
Received on Wednesday, 14 February 2018 18:23:12 UTC
