Re: [ANN] JSON Datasets / any HTML to RDF

On Wed, 2010-10-13 at 12:15 +0100, Martin McEvoy wrote:
> I am pleased to announce JSON Datasets[1] a way to extract RDF from
> any HTML document using JSON. 

Hi Martin,

I seem to remember reading some of your early work on this concept some
months ago. Can't remember how I stumbled upon it.

Anyhow, it's an interesting idea. It seems to me that it's quite
GRDDL-like, in that an HTML file can link to a file that contains a set
of rules which, once applied to the original file, produce RDF as
output.

I know that you're quite the XSLT transformation guru, and are probably
quite familiar with GRDDL. Are you also aware that GRDDL allows
transformations to be written using languages other than XSLT? Your
JSON-based language seems like it would make a good transformation
language for GRDDL.

Essentially all that would need to be done to make JSON Datasets
conformant to GRDDL would be to replace this method of linking:

 <link rel="dataset"
       href="http://example.com/my-dataset.json"
       type="application/json">

With the GRDDL methods of linking to a transformation. There are two
such methods that are relevant to HTML (there are another two which are
XML-based) - firstly a direct link from the document to the JSON file:

 <link rel="transformation"
       href="http://example.com/my-dataset.json"
       type="application/json">

And secondly, an indirect link from the document to a profile document:

 <head profile="http://example.com/profile">

Where the profile document contains a link to the JSON:

 <link rel="profileTransformation"
       href="http://example.com/my-dataset.json"
       type="application/json">

If those were used by JSON datasets instead of rel="dataset" then you
might find that your approach receives wider support. For example, my
Perl implementation of GRDDL supports pluggable transformation
languages; adding support for your JSON-based format would not be
especially tough. Adding support for rel="dataset" though, I would
consider to be out of scope for the project.
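
To make the two linking methods concrete, here is a minimal sketch in
Python of how a consumer might discover them in an HTML page. The class
and function names (GrddlLinkFinder, find_grddl_links) are made up for
illustration and are not taken from any GRDDL implementation. The sketch
only collects the URIs; it does not fetch the profile document or apply
any transformation.

```python
from html.parser import HTMLParser


class GrddlLinkFinder(HTMLParser):
    """Collect GRDDL-style transformation links from an HTML document."""

    def __init__(self):
        super().__init__()
        self.transformations = []  # hrefs of rel="transformation" links
        self.profiles = []         # URIs listed in <head profile="...">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "head" and attrs.get("profile"):
            # The profile attribute is a space-separated list of URIs.
            self.profiles.extend(attrs["profile"].split())
        elif tag in ("link", "a"):
            # rel is a space-separated list of tokens.
            rels = (attrs.get("rel") or "").lower().split()
            if "transformation" in rels and attrs.get("href"):
                self.transformations.append(attrs["href"])


def find_grddl_links(html):
    """Return (direct transformation hrefs, profile URIs) for a page."""
    finder = GrddlLinkFinder()
    finder.feed(html)
    return finder.transformations, finder.profiles
```

A consumer would then dereference each profile URI and look for
rel="profileTransformation" links there in much the same way.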

Some critiques of the JSON format itself:

The use of the term "where" is a little confusing. The terminology of
the query syntax seems to borrow from SQL and SPARQL, but the behaviour
of "where" seems totally different. In SQL and SPARQL, "where" is
essentially used to perform joins, and to narrow down criteria. In your
language it seems to be a mapping from one structure (a graph) to
another (RDF triples) - that seems to be more similar to SQL's "SELECT
foo AS bar". Perhaps this:

{
  "select":  {
    "from": "http://example.com/",
    "prefix": {
      "dc": "http://purl.org/dc/elements/1.1/"
    },
    "where": {
      "title": {  "label": "dc:title" }
    }
  }
}

Might be better expressed as:

{
  "prefix": {
    "dc": "http://purl.org/dc/elements/1.1/"
  },
  "select":  {
    "title": {  "label": "dc:title" }
  },
  "from": "http://example.com/"
}

And actually, "label" might be better if called "as":

{
  "prefix": {
    "dc": "http://purl.org/dc/elements/1.1/"
  },
  "select":  {
    "title": {  "as": "dc:title" }
  },
  "from": "http://example.com/"
}

Are prefixes required, or just a shortcut? Could the above be written as
the following?

{
  "select":  {
    "title": {  "as": "http://purl.org/dc/elements/1.1/title" }
  },
  "from": "http://example.com/"
}
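
If prefixes are only a shortcut, a processor could treat both forms
identically by expanding prefixed names before doing anything else. A
rough sketch in Python, assuming the restructured "select"/"as" layout
suggested above (which is a proposal, not what the current spec
defines); expand_name and expand_select are hypothetical helper names:

```python
def expand_name(name, prefixes):
    """Expand 'dc:title' via the prefix map; return other names unchanged."""
    prefix, sep, local = name.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    # No declared prefix matched, so assume it is already a full URI.
    return name


def expand_select(dataset):
    """Return the 'select' mapping with every 'as' value fully expanded."""
    prefixes = dataset.get("prefix", {})
    return {
        selector: {**rule, "as": expand_name(rule["as"], prefixes)}
        for selector, rule in dataset["select"].items()
    }
```

One wrinkle this exposes: a full URI such as "http://..." parses as
prefix "http", so a spec would need to say what happens if somebody
declares a prefix with that name.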

It's not clear whether selectors may be combined. "h1", ".example" and
"#heading" are all valid selectors, but what about "h1.example" and
"#heading h1.example"? If you're going to use a subset of CSS, you need
to be awfully clear about what subset you're specifying, otherwise
people coming to your spec, knowing CSS already, are going to say,
"well, it's like CSS, so I must be able to do foo."

You might consider switching to, or at least allowing, XPath for
selectors. It's mighty powerful, and should be able to handle useful
idioms like class=fn which is inside class=vcard, but not inside a
nested class=vcard.
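
To spell out what that idiom has to do, here is a sketch of the
selection written as a plain Python traversal rather than as an XPath
expression (the XPath subset in Python's standard ElementTree module
has no ancestor axis, so the walk is done by hand). It assumes the
input is well-formed XML, and the function names are made up for
illustration:

```python
import xml.etree.ElementTree as ET


def has_class(element, name):
    """True if `name` is one of the element's space-separated class tokens."""
    return name in element.get("class", "").split()


def own_descendants(root, name, boundary):
    """Yield descendants of root with class `name`, skipping any subtree
    rooted at an element with class `boundary`."""
    for child in root:
        if has_class(child, boundary):
            continue  # a nested vcard owns everything beneath it
        if has_class(child, name):
            yield child
        yield from own_descendants(child, name, boundary)


doc = ET.fromstring(
    '<div class="vcard">'
    '<span class="fn">Alice</span>'
    '<div class="agent vcard"><span class="fn">Bob</span></div>'
    '</div>'
)
names = [e.text for e in own_descendants(doc, "fn", "vcard")]
```

Here names holds only "Alice": the inner fn belongs to the nested vcard
and is left for that card's own pass.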

Lastly, in your spec you use a lot of XML terminology when describing
the output. Personally I found that quite confusing. You might want to
consider explaining how the output is constructed in terms of the
abstract triples, or if you want to describe it in more concrete terms,
in terms of N-Triples.

I think if you did that, it might even help clarify the format in your
own mind and further improve it - for example, you may not have noticed,
but because you've defined the "label" property in XML terms, you've
ended up with a property which sometimes ends up setting an RDF
property, and at other times an RDF class, as in the case of
<http://weborganics.co.uk/dataset/#query-rev> where it's used to set a
class of "Person". How it sometimes sets one and sometimes sets the
other seems to happen via magic (perhaps using the same rule as RDF/XML
where the same also happens, and is similarly confusing).
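
As an illustration of the difference, here is a small Python sketch that
writes the two cases out as N-Triples lines. The subject URIs, the
literal value and the Person class URI in the usage note are
placeholders, not taken from the spec, and the literal escaping is
deliberately minimal (backslash, double quote and line breaks only):

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def literal(value):
    """Quote a plain literal with minimal escaping."""
    escaped = (value.replace("\\", "\\\\").replace('"', '\\"')
                    .replace("\n", "\\n").replace("\r", "\\r"))
    return '"' + escaped + '"'


def property_triple(subject, predicate, value):
    """The name ends up in the predicate position; the object is a literal."""
    return "<%s> <%s> %s ." % (subject, predicate, literal(value))


def class_triple(subject, rdf_class):
    """The name ends up in the object position; the predicate is rdf:type."""
    return "<%s> <%s> <%s> ." % (subject, RDF_TYPE, rdf_class)
```

Calling property_triple("http://example.com/",
"http://purl.org/dc/elements/1.1/title", "Example") puts the name in the
middle of the triple, while class_triple("http://example.com/#me",
"http://example.com/vocab#Person") puts it at the end, after rdf:type.
Seen this way, it is clear that one key is doing two different jobs.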

-- 
Toby A Inkster
<mailto:mail@tobyinkster.co.uk>
<http://tobyinkster.co.uk>

Received on Wednesday, 13 October 2010 15:10:01 UTC