RE: Capturing the discussion (was Re: NY Property Tax Explorer)

Phil, you wrote on 29 March:

> 
> Suppose I create a PDF and embed within that a bunch of metadata, have I
> done the job?
> 
> Well, it depends on the context. As far as Google is concerned, yes. As far as
> a less sophisticated portal or catalogue is concerned, usually no. In other
> words, that's only enough *if* there is a machine to read that embedded
> metadata. And I believe this is not (currently) true in CKAN and CKAN-like
> portals for example (dunno about Socrata).
> 

In my mind, there are indeed two broad scenarios in harvesting: (1) data harvesting and (2) metadata harvesting.

The most common scenario is data harvesting, which is what general search engines do. They access websites, copy the data, and then do some magic with it, either by analysing the data itself or by extracting whatever embedded metadata they find in 'standard' formats. Then they use the sheer volume of the collection to rank search results for the general public.
For this scenario, embedding metadata is a very sensible thing to do. It is actually what schema.org and RDFa are supposed to do for HTML pages.
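
To make the embedded case concrete, here is a rough sketch of how a harvester could pull embedded metadata out of a page it has already copied. This is my own illustration, not how any particular search engine works. I use schema.org in its JSON-LD syntax only because it is easy to parse with the Python standard library; RDFa would serve the same purpose. The sample page, the dataset name and the function name are made up.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.records = []

    def handle_starttag(self, tag, attrs):
        # Start buffering when a JSON-LD script element opens.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Parse the buffered text once the script element closes.
        if tag == "script" and self._in_jsonld:
            self.records.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


# Hypothetical page: the metadata travels inside the published resource.
PAGE = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset",
 "name": "Example property tax rates",
 "license": "https://creativecommons.org/licenses/by/4.0/"}
</script>
</head><body><p>The data itself would go here.</p></body></html>"""


def extract_embedded_metadata(html):
    """Return every JSON-LD record embedded in the given HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.records


records = extract_embedded_metadata(PAGE)
```

The point is only that the machine has to fetch and open the resource itself before it sees any metadata, which is exactly what a general search engine does anyway.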

A less common scenario is metadata harvesting, which is what aggregators like Europeana, the Datahub and the Pan-European Data Portal do. They do not grab the data, only the metadata, usually based on a data model and in a format that they specify -- EDM, CKAN, DCAT etc. -- and then build a metadata portal from the aggregated metadata, in some cases aimed at a specialised audience.
For this scenario, metadata must be published separately. 
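
As a sketch of what "published separately" could look like, the snippet below builds a standalone DCAT description of a dataset, expressed as JSON-LD. The class and property names (dcat:Dataset, dcat:distribution, dcat:downloadURL, dcat:mediaType, dct:title, dct:description) come from the DCAT and Dublin Core vocabularies; the helper function, the URL and the values are hypothetical, and a real aggregator would dictate its own profile and required fields.

```python
import json


def build_dcat_record(title, description, download_url, media_type):
    """Return a standalone DCAT description of one dataset as a JSON-LD dict."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:description": description,
        # The record points at the data; it does not contain it.
        "dcat:distribution": [
            {
                "@type": "dcat:Distribution",
                "dcat:downloadURL": {"@id": download_url},
                "dcat:mediaType": media_type,
            }
        ],
    }


# Hypothetical values, for illustration only.
record = build_dcat_record(
    title="Example property tax rates",
    description="Hypothetical dataset used only for illustration.",
    download_url="https://example.org/data/tax-rates.pdf",
    media_type="application/pdf",
)

# This serialisation is what would be exposed to a harvester.
serialized = json.dumps(record, indent=2)
```

Here the aggregator never needs to open the PDF: it harvests the record and links back to the download URL.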

The way you formulate it, it sounds as if metadata harvesting is the more common scenario. I think it is not. So my best practice advice would be: first, embed metadata in the data format, if the format allows it, to help SEO and maximise visibility for the general public; second, and only if the provider wants to play a role in an existing metadata harvesting ecosystem, create separate metadata that can be harvested by an aggregator.

Makx.

Received on Saturday, 4 April 2015 09:50:45 UTC