Re: Experimental RDFa extractor in JS from Niklas Lindström on 2012-04-20 (public-rdfa-wg@w3.org from April 2012)

From: Niklas Lindström <lindstream@gmail.com>
Date: Fri, 20 Apr 2012 17:23:34 +0200
To: Gregg Kellogg <gregg@greggkellogg.net>
Cc: Ivan Herman <ivan@w3.org>, public-rdfa-wg <public-rdfa-wg@w3.org>
Message-ID: <CADjV5jcVW52GhiH-zkDu_ZUZ=x8tiPJ4Q4WUZY4FP0b3KZSYoQ@mail.gmail.com>

Hi Gregg,

2012/4/20 Gregg Kellogg <gregg@greggkellogg.net>:
> On Apr 20, 2012, at 3:52 AM, "Niklas Lindström" <lindstream@gmail.com> wrote:
[...]
>> Definitely. I think Lite is basically covered already. It's the
>> interplay of many attributes in the same element that I haven't got to
>> yet (to e.g. fully cover @about). Right now I'm short on time, but I
>> hope to continue down this path some time next week.
>
> I've made my own fork, and I might try to improve coverage as well, and possibly be able to run through test cases without a distiller.

Sounds great. However, I've done one major refactoring already, and
have added some features. So if you do any work on the current code
right now, it'll be tricky to merge. I hope to have some time tonight
to finalize and push those things. (It's still a very young code base,
so for us to collaborate on it may require a lot of synchronization.
I'd love to try though.)


>> Yes, that's what I do too, for exactly those reasons. The shape of the
>> output is entirely based on the form of the input, i.e. using the same
>> terms and CURIEs (populating @context as needed). One thing I haven't
>> yet done, but plan to, is to merge descriptions about the same
>> resource even if they're dispersed throughout the page.
>
> Note that you can leave such merging to JSON-LD framing, which does this anyway.
>
>> While that
>> does deviate from the actual shape in the source page, it is so much
>> better for consumption, and I think is to be expected. Another thing I
>> don't do is any kind of coercion. Literals with datatype or deviating
>> from any given @language are represented in expanded JSON-LD form.
>> I've yet to decide whether to change that or make it configurable.
>
> This might also be left to JSON-LD API methods. For instance, the "automatic" flag to compaction could generate the best context for you to use, and coerce your data for you. It can be expensive, though, and for any real application, a JSON-LD context matching the data could be provided to compact or frame.

At this point I'd like to stick to a strict and very simple solution,
with one predictable result tree (based on the source RDFa structure,
but merging anything dispersed). I'd like this to be lightweight and
simple, with close to no API. The fact that this solution produces
JSON-LD is a benefit, but it is basically skimmed data, mainly usable
for simple things. I think of it mostly as an RDFa equivalent to the
microdata-to-JSON approach. (And the merging I speak of is roughly
corresponding to how that handles the @itemref stuff.)
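
A minimal sketch of the merging described above, assuming the extractor
yields a flat list of JSON-LD-shaped node objects. The function name
mergeById and the sample data are hypothetical and not taken from
rdfa-lab; the idMap name follows the one mentioned elsewhere in this
thread.

```javascript
// Hypothetical helper (not from rdfa-lab): merge node objects that share
// the same @id into a single description per resource.
function mergeById(nodes) {
  const idMap = new Map(); // @id -> merged description
  const result = [];
  const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

  for (const node of nodes) {
    const id = node['@id'];
    if (id === undefined) {
      // Descriptions without an @id cannot be matched, so keep them as-is.
      result.push(node);
      continue;
    }
    let merged = idMap.get(id);
    if (!merged) {
      merged = { '@id': id };
      idMap.set(id, merged);
      result.push(merged);
    }
    for (const [key, value] of Object.entries(node)) {
      if (key === '@id') continue;
      const incoming = Array.isArray(value) ? value : [value];
      if (!has(merged, key)) {
        // First time this property is seen: keep a single value unwrapped.
        merged[key] = incoming.length === 1 ? incoming[0] : incoming.slice();
      } else {
        // Property already present: collect all values in one array.
        const existing = Array.isArray(merged[key]) ? merged[key] : [merged[key]];
        merged[key] = existing.concat(incoming);
      }
    }
  }
  return result;
}

// Two descriptions of "#me" found in different parts of a page.
const merged = mergeById([
  { '@id': '#me', name: 'Alice' },
  { '@id': '#you', name: 'Bob' },
  { '@id': '#me', knows: { '@id': '#you' }, name: 'Al' }
]);
console.log(JSON.stringify(merged, null, 2));
```

This sketch does not deduplicate repeated values or touch @context; it
only shows the by-@id grouping.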


>> (You should really try out the bookmarklet [1] in Firefox (ideally
>> with the JSONView [2] plugin installed). :) I tried it on your own
>> FOAF page for instance, which is rich in data and really interesting
>> to examine this way. (Note that @xmlns:* aren't captured yet though,
>> so the result here isn't really correct.))
>
> Running in-browser, access to xmlns* might be challenging.

Indeed. It's quite far down on my priority list; we'll see how it
fares once we get there.


>> It should be noted that, of course, graph cycles aren't possible to
>> follow directly in a tree. So any time an already created resource
>> description (i.e. a JSON object @id'd with the resource IRI) is
>> referenced, I just put a link there (an object with just the @id).
>
> Perfect! This is what framing is for, to turn such references into object embeds.
>
>> While I plan to expose the idMap I'll use for the
>> aforementioned dispersed resource merging, trying to solve this in
>> general means veering into the API design again.
>
> If you do this at all, you might just automatically create a frame matching the existing document structure.

As per above, for the time being, I'd like to stay very close to "just
JSON", but of course in the shape of JSON-LD. While certainly
interesting, I'll defer things like extracting the frame itself for
now.
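
To illustrate the "link instead of embed" behaviour discussed in this
thread, here is a rough sketch that turns a small graph into a tree. It
assumes the graph is given as a plain object keyed by resource IRI, with
each property holding an array of values. The function name toTree and
that input shape are hypothetical; they are not how rdfa-lab represents
things.

```javascript
// Hypothetical sketch: embed a resource the first time it is reached, and
// emit a bare {"@id": ...} link for any resource already described.
function toTree(graph, rootId, seen = new Set()) {
  const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);
  seen.add(rootId);
  const out = { '@id': rootId };
  const props = has(graph, rootId) ? graph[rootId] : {};

  for (const [prop, values] of Object.entries(props)) {
    out[prop] = values.map((value) => {
      const isRef = typeof value === 'object' && value !== null && has(value, '@id');
      if (!isRef) return value; // plain literal, copied through
      const id = value['@id'];
      if (seen.has(id) || !has(graph, id)) {
        return { '@id': id }; // already described (or unknown): link only
      }
      return toTree(graph, id, seen); // first visit: embed the description
    });
  }
  return out;
}

// "#a" and "#b" refer to each other, which forms a cycle.
const graph = {
  '#a': { name: ['A'], knows: [{ '@id': '#b' }] },
  '#b': { name: ['B'], knows: [{ '@id': '#a' }] }
};
console.log(JSON.stringify(toTree(graph, '#a'), null, 2));
```

Here "#b" is embedded under "#a", while the back-reference from "#b" to
"#a" stays a plain link, so the output remains a finite tree.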

.. I mostly believe that if people need to use the data as a proper
graph, a full RDF API, possibly with context-like features for compact
coding, is the way to go. Operating on semi-fixed data using frames
may be too much for the simple scenarios, and too little for the
advanced ones. Of course, nothing prevents someone from using a
JSON-LD API in conjunction with this; on the contrary. I just don't
want to combine the two at this stage of the game. (Not least
since my own ideas relating to the RDF API and the JSON-LD API are
sort of an intersection of them both, so it'd lead to a *lot* of
design discussions beyond the scope I've set for this.)


>> Absolutely. Actually, I think I'll manage to set up an extractor
>> service for this eventually. I'm already using Node to run it on the
>> command-line against test files, so it should be straightforward. The
>> remaining thing then is whether the test runner accepts JSON-LD (I
>> actually think it might – Gregg?), or if I should plug this into
>> Antonio Garrote's rdfstore-js [3]. Either way it should be quite
>> doable.
>
> The distiller does accept JSON-LD, but probably needs a small update. You could also use jsonld.js and use the toRDF method to get n-triples out of it in the page.

Nice! Yes, that's a good idea. It seems to be a simpler solution if I
just need the n-triples.


>>> Niklas, this could be very important...
>
> Agreed!

Great! We should definitely continue working on what to make of this
in the long run. It would be great to collaborate fully on it, just as
soon as I've stabilized the current code a bit; and to form a common
scope and goal. (And if for whatever reason our intents diverge, a
friendly fork is perfectly alright of course, to explore
differences in approach.)

Best regards,
Niklas


> Gregg
>
>>> Thanks
>>
>> Thanks for the positive feedback!
>>
>> Best regards,
>> Niklas
>>
>> [1]: http://niklasl.github.com/rdfa-lab/
>> [2]: http://jsonview.com/
>> [3]: https://github.com/antoniogarrote/rdfstore-js
>>
>>
>>> Ivan
>>>
>>>
>>> On Apr 20, 2012, at 01:58 , Niklas Lindström wrote:
>>>
>>>> Hi all!
>>>>
>>>> The last couple of days I've been experimenting with a different kind
>>>> of approach to implementing an RDFa extractor. The result so far is a
>>>> draft with admittedly rather partial coverage. However, I hope some
>>>> aspects of it will be of interest even at this stage:
>>>>
>>>> 1. It is implemented in pure JavaScript. (Well, actually, in some 170
>>>> lines of CoffeeScript, but the generated result is the same.)
>>>> 2. It runs both in the browser and on Node (used with jsdom).
>>>> 3. It does not produce triples. It directly creates a JSON-LD extract
>>>> (corresponding in shape to the RDFa). This is the difference, and the
>>>> fun part.
>>>>
>>>> Now, it really doesn't handle anything but the most simple RDFa 1.1.
>>>> Possibly all of Lite, plus @datatype, @rel (including hanging),
>>>> @inlist, @rev and perhaps one or two more. It only copes with @about
>>>> if it's alone, it doesn't handle combinations of @rel and @property,
>>>> and so on. I'll strive to make it a lot more compliant given time of
>>>> course.
>>>>
>>>> - You can check out the code at: https://github.com/niklasl/rdfa-lab
>>>> - Or enjoy the bookmarklet (only tested in Firefox), available at:
>>>> http://niklasl.github.com/rdfa-lab/
>>>>
>>>> (Just add the latter to your bookmarks and apply on any page
>>>> containing RDFa. I recommend the JSONView [1] browser add-on for a
>>>> good experience.)
>>>>
>>>> I hope you'll enjoy the little things it can do. (For one, using the
>>>> resulting JSON-LD directly in a JS application should prove
>>>> interesting.)
>>>>
>>>> Best regards,
>>>> Niklas
>>>>
>>>> [1]: http://jsonview.com/
>>>>
>>>
>>>
>>> ----
>>> Ivan Herman, W3C Semantic Web Activity Lead
>>> Home: http://www.w3.org/People/Ivan/
>>> mobile: +31-641044153
>>> FOAF: http://www.ivan-herman.net/foaf.rdf
>>>
>>>
>>>
>>>
>>>
>>
Received on Friday, 20 April 2012 15:24:37 UTC