Re: Experimental RDFa extractor in JS from Niklas Lindström on 2012-04-20 (public-rdfa-wg@w3.org from April 2012)

From: Niklas Lindström <lindstream@gmail.com>
Date: Fri, 20 Apr 2012 12:50:56 +0200
To: Ivan Herman <ivan@w3.org>
Cc: public-rdfa-wg <public-rdfa-wg@w3.org>
Message-ID: <CADjV5jeR9tvQJ_4YM3ECAy6M=woh2-0QXeem8XWXc1n5OT82fQ@mail.gmail.com>
Hi Ivan,

2012/4/20 Ivan Herman <ivan@w3.org>:
> Niklas,
>
> I think this is a great idea and I am very excited to see it. I think that a system that returns JSON to application developers is the best possible choice for now, as we do not have any RDF API. And, maybe, that is all that WebApp developers need.

Thanks! I agree. It felt like a valuable way forward with fairly
little effort (though piggybacking on the JSON-LD design work of
course). It'll be very interesting to evaluate the usability of the
resulting data in various scenarios.


> I think the first goal should be (if you can) to cover the whole of Lite, plus possibly fully cover @about. That would be a major first step. Then it could be completed.

Definitely. I think Lite is basically covered already. It's the
interplay of many attributes in the same element that I haven't got to
yet (to e.g. fully cover @about). Right now I'm short on time, but I
hope to continue down this path some time next week.


> What kind of JSON-LD do you produce? For pyRdfa I tried to push as much as I could into @context; mainly in the case of @vocab usage that meant that the rest of the JSON part really looked very simple. That is a major plus for WebApp developers.

Yes, that's what I do too, for exactly those reasons. The shape of the
output is entirely based on the form of the input, i.e. using the same
terms and CURIEs (populating @context as needed). One thing I haven't
yet done, but plan to, is to merge descriptions about the same
resource even if they're dispersed throughout the page. While that
does deviate from the actual shape in the source page, it is so much
better for consumption, and I think it is to be expected. Another thing
I don't do is any kind of coercion. Literals with a datatype, or with a
language deviating from any given @language, are represented in
expanded JSON-LD form. I've yet to decide whether to change that or
make it configurable.
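
To make the planned merging of dispersed descriptions concrete, here is a minimal sketch in plain JavaScript. It is not the code in rdfa-lab: the function name mergeById, the example IRIs and the property names are all hypothetical, and it assumes the extractor yields a flat list of JSON objects keyed by @id. The title value shows a literal left in expanded form, using the @value and @language keywords.

```javascript
// Hypothetical helper (not from rdfa-lab): merge JSON-LD node objects
// that share an @id, keeping the first-seen order of resources.
function mergeById(nodes) {
  const idMap = new Map();
  const result = [];
  for (const node of nodes) {
    const id = node["@id"];
    if (id === undefined) {
      // No identifier to merge on: keep the description as it is.
      result.push(node);
      continue;
    }
    let target = idMap.get(id);
    if (!target) {
      target = { "@id": id };
      idMap.set(id, target);
      result.push(target);
    }
    for (const [key, value] of Object.entries(node)) {
      if (key === "@id") continue;
      const incoming = Array.isArray(value) ? value : [value];
      if (!(key in target)) {
        // First value for this property: keep a single value unwrapped.
        target[key] = incoming.length === 1 ? incoming[0] : incoming.slice();
      } else {
        // Property seen before: collect all values in an array.
        const existing = Array.isArray(target[key]) ? target[key] : [target[key]];
        target[key] = existing.concat(incoming);
      }
    }
  }
  return result;
}

// Assumed example data: the same resource is described in two places.
const extracted = [
  { "@id": "http://example.org/#me", "name": "Alice" },
  { "@id": "http://example.org/doc",
    "title": { "@value": "Hej", "@language": "sv" } },
  { "@id": "http://example.org/#me",
    "homepage": { "@id": "http://example.org/" },
    "name": "Alice A." }
];

const merged = mergeById(extracted);
console.log(JSON.stringify(merged, null, 2));
```

The sketch deliberately does no coercion and no deduplication of values; whether repeated values should be collapsed is a separate design choice.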

(You should really try out the bookmarklet [1] in Firefox, ideally
with the JSONView [2] plugin installed. :) I tried it on your own
FOAF page for instance, which is rich in data and really interesting
to examine this way. Note that @xmlns:* declarations aren't captured
yet though, so the result there isn't really correct.)

It should be noted that, of course, graph cycles aren't possible to
follow directly in a tree. So any time an already created resource
description (i.e. a JSON object whose @id is the resource IRI) is
referenced again, I just put a link there (an object with just the
@id). While I plan to expose the idMap I'll use for the
aforementioned dispersed resource merging, trying to solve this in
general means veering into the API design again. While I have many
ideas for how to get there from here eventually, I'll focus on the
basic JSON-LD tree for now. Hopefully we'll see how valuable that
becomes in itself in various scenarios. (As we know and have seen
before, there are many intricate tradeoffs possible regarding e.g.
graph vs. tree and data details vs. strings.)
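
As an illustration of the link-instead-of-cycle approach, here is a minimal sketch. It is an assumption-laden toy, not the extractor itself: the input format (a flat map from IRI to properties, with { ref: IRI } marking a reference), the function name toTree and the example IRIs are all hypothetical. Only the idea of an idMap and of emitting an object with just the @id comes from the description above.

```javascript
// Hypothetical sketch: build a JSON-LD-like tree from a flat graph.
// A resource that has already been described is not expanded again;
// a link object carrying only its @id is emitted in its place.
function toTree(graph, id, idMap = new Map()) {
  if (idMap.has(id)) {
    return { "@id": id }; // already described elsewhere: link only
  }
  const node = { "@id": id };
  idMap.set(id, node); // register before recursing so cycles terminate
  for (const [prop, value] of Object.entries(graph[id] || {})) {
    const isRef = value !== null && typeof value === "object" && "ref" in value;
    node[prop] = isRef ? toTree(graph, value.ref, idMap) : value;
  }
  return node;
}

// Assumed example data: two resources that reference each other.
const graph = {
  "http://example.org/#alice": {
    name: "Alice",
    knows: { ref: "http://example.org/#bob" }
  },
  "http://example.org/#bob": {
    name: "Bob",
    knows: { ref: "http://example.org/#alice" }
  }
};

const idMap = new Map();
const tree = toTree(graph, "http://example.org/#alice", idMap);
console.log(JSON.stringify(tree, null, 2));
```

Because the innermost reference back to the first resource is only a link, the result stays a tree and serializes with JSON.stringify. A consumer that wants to follow the link would look the @id up in the exposed idMap.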


> The only, though insignificant, issue is that you won't be able to run the official test suite directly. Nevertheless, I think it would be hugely important to have whatever you have be part of the official report (via a manually edited EARL file, for example).

Absolutely. Actually, I think I'll manage to set up an extractor
service for this eventually. I'm already using Node to run it on the
command-line against test files, so it should be straightforward. The
remaining thing then is whether the test runner accepts JSON-LD (I
actually think it might – Gregg?), or if I should plug this into
Antonio Garrote's rdfstore-js [3]. Either way it should be quite
doable.


> Niklas, this could be very important...
>
> Thanks

Thanks for the positive feedback!

Best regards,
Niklas

[1]: http://niklasl.github.com/rdfa-lab/
[2]: http://jsonview.com/
[3]: https://github.com/antoniogarrote/rdfstore-js


> Ivan
>
>
> On Apr 20, 2012, at 01:58 , Niklas Lindström wrote:
>
>> Hi all!
>>
>> The last couple of days I've been experimenting with a different kind
>> of approach to implementing an RDFa extractor. The result so far is a
>> draft with admittedly rather partial coverage. However, I hope some
>> aspects of it will be of interest even at this stage:
>>
>> 1. It is implemented in pure JavaScript. (Well, actually, in some 170
>> lines of CoffeeScript, but the generated result is the same.)
>> 2. It runs both in the browser and on Node (used with jsdom).
>> 3. It does not produce triples. It directly creates a JSON-LD extract
>> (corresponding in shape to the RDFa). This is the difference, and the
>> fun part.
>>
>> Now, it really doesn't handle anything but the most simple RDFa 1.1.
>> Possibly all of Lite, plus @datatype, @rel (including hanging),
>> @inlist, @rev and perhaps one or two more. It only copes with @about
>> if it's alone, it doesn't handle combinations of @rel and @property,
>> and so on. I'll strive to make it a lot more compliant given time of
>> course.
>>
>> - You can check out the code at: https://github.com/niklasl/rdfa-lab
>> - Or enjoy the bookmarklet (only tested in Firefox), available at:
>> http://niklasl.github.com/rdfa-lab/
>>
>> (Just add the latter to your bookmarks and apply on any page
>> containing RDFa. I recommend the JSONView [1] browser add-on for a
>> good experience.)
>>
>> I hope you'll enjoy the little things it can do. (For one, using the
>> resulting JSON-LD directly in a JS application should prove
>> interesting.)
>>
>> Best regards,
>> Niklas
>>
>> [1]: http://jsonview.com/
>>
>
>
> ----
> Ivan Herman, W3C Semantic Web Activity Lead
> Home: http://www.w3.org/People/Ivan/
> mobile: +31-641044153
> FOAF: http://www.ivan-herman.net/foaf.rdf
>
Received on Friday, 20 April 2012 10:51:57 UTC