Re: RDFa DOM API feedback

On 07/04/2011 09:41 AM, Philip Jägenstedt wrote:
>> We can chat more about this during our conversation tomorrow.
> 
> I'd prefer to discuss this in public email threads, so that anyone
> that has something to add can do so.

Ah, that's not what I meant - I was merely going to tell you where the
latest resources are and to ask how you plan to review and give feedback
on the documents. Yes, the discussion should happen on the mailing
lists. :)

More below...

>> == HTML ==
>>
>> The spec only references the 2008 RDFa in XHTML REC, not
>> <http://dev.w3.org/html5/rdfa/>. Is this an oversight?
>>
>> Note that just deferring to the two RDFa specs implies different
>> processing requirements depending on XHTML and HTML. Different
>> behavior of APIs depending on XHTML/HTML is not going to be well
>> received by browser implementors, as it creates all kinds of problems.
>> (Consider, for example, what happens when a DOM is created entirely by
>> script or when a subtree of a text/html document is moved by script to
>> an application/xhtml+xml document. It just doesn't make sense to switch
>> between different modes here.)
>>
>> I'm going to assume in the following that the intention is for there
>> to be a single API spec covering both serializations.
> 
> AFAICT, the issue remains.

Yes, the intent is to have a single API spec that covers as many
serializations as possible in a generic way. In general, the RDFa
processing rules are written to be as syntax-agnostic as possible. That
is, they are a set of instructions that operate on a document tree;
however, that document tree does not necessarily need to be a DOM. You
can use a SAX-based parser to implement an RDFa processor, and you could
theoretically use one to implement the RDFa API as well.
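As a rough illustration of that point (the function names and the tiny
attribute subset below are invented for this sketch, not taken from the
spec), the processing rules can be driven by a stream of
start/end-element events instead of a DOM:

```javascript
// Minimal sketch: a SAX-style RDFa extractor that never builds a DOM.
// Only @about and @property-with-@content are handled, as a toy subset
// of the real processing rules.
function createStreamingExtractor(emitTriple) {
  const subjects = []; // stack of in-scope subjects (the "local context")
  return {
    startElement(name, attrs) {
      // A new @about sets the subject for this subtree; otherwise the
      // nearest ancestor subject is inherited.
      const subject = attrs.about || subjects[subjects.length - 1] || '';
      subjects.push(subject);
      if (attrs.property && attrs.content !== undefined) {
        emitTriple({ subject, property: attrs.property, object: attrs.content });
      }
    },
    endElement() {
      subjects.pop(); // leaving the subtree restores the outer subject
    }
  };
}

// Usage: the same events could come from any streaming parser.
const triples = [];
const extractor = createStreamingExtractor(t => triples.push(t));
extractor.startElement('div', { about: 'http://example.org/#me' });
extractor.startElement('span', { property: 'foaf:name', content: 'Philip' });
extractor.endElement();
extractor.endElement();
```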

>> == getTriplesByType order ==
>>
>> Which order should triples be returned in? This must be well-defined
>> in order for the API to be interoperably implementable.
> 
> The same issue now applies to Graph.toArray, although it now says
> "Note: the order of the Triples within the returned sequence is
> arbitrary, since a Graph is an unordered set."
> 
> As an anecdote, the ECMAScript spec has always said that when
> enumerating properties of objects, the order is undefined. (Properties
> are conceptually an unordered set.) In practice, implementations do
> use a particular order (insertion order, more or less) and this has
> required reverse-engineering between browsers because scripts rely on
> that order. The same thing will happen with Graph.toArray if it
> becomes widely deployed.

That's good feedback. Graph ordering is not a simple problem. We /could/
do something like insertion order because that is a deterministic part
of the RDFa processing algorithm.
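A minimal sketch of that idea (illustrative only, not the spec's Graph
interface): since JavaScript Maps preserve insertion order, keying on a
canonical triple string gives both set semantics and a deterministic
ordering at the same time:

```javascript
// Sketch: a Graph whose toArray() order is the order in which the
// processing algorithm first produced each triple. A Map preserves
// insertion order, so a canonical string key gives a stable ordering
// plus duplicate suppression "for free".
class InsertionOrderGraph {
  constructor() { this.map = new Map(); }
  add(triple) {
    const key = `${triple.subject} ${triple.property} ${triple.object}`;
    if (!this.map.has(key)) this.map.set(key, triple); // re-adding: no-op
  }
  toArray() { return Array.from(this.map.values()); }
}

const g = new InsertionOrderGraph();
g.add({ subject: '_:a', property: 'ex:p', object: '"1"' });
g.add({ subject: '_:b', property: 'ex:p', object: '"2"' });
g.add({ subject: '_:a', property: 'ex:p', object: '"1"' }); // duplicate
```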

We do have a general mechanism for ordering triples in a graph, but it
requires graph normalization. There is currently no known graph
normalization algorithm that runs in polynomial time for degenerate
cases. The difficult part is blank node labeling. We have it working for
all real-world use cases... but there are some theoretical inputs that
cannot be handled in polynomial time. For example: a ring of 1,000 blank
nodes that all look identical, each connected to the next. This is
related to the graph isomorphism problem; an ordering can be computed,
but not always in a reasonable time frame.

We're dealing with this problem in the JSON-LD work. However,
normalization is not something that you want in a frequently used code path.

Why can't you tell script developers that they can't depend on the
order? Have you tried going the other way: shuffling the array on output
so that callers never get the same order twice? I realize that this
isn't ideal, but neither is having to guarantee that some particular
order is preserved.
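The shuffling suggestion could look something like this (a sketch of the
idea, not anything in the spec); a Fisher-Yates shuffle of the snapshot
keeps scripts from ever relying on a stable order:

```javascript
// Sketch of the suggestion above: shuffle the snapshot that toArray()
// hands out, so callers cannot come to depend on any particular order.
// (Fisher-Yates shuffle; the Graph itself stays an unordered set, and
// the caller's input array is left untouched.)
function shuffledCopy(triples) {
  const out = triples.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]]; // swap elements i and j
  }
  return out;
}
```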

>> == Merging of triples ==
>>
>> If the same triple is expressed several times in the document, are
>> they merged, or will two instances of the same triple returned by
>> getTriplesByType?
> 
> Issue still applies to DataParser.parse

DataParser.parse() extracts the information from the document and places
it into a Graph. A Graph can only contain one instance of a given triple
(storing a duplicate is a logical no-op). We say this in the latest spec
under Section 2.2.2: Graphs:

"Graphs must not contain duplicate triples."

DataParser.process() would send duplicate triples through to the
callback, but we believe that is the proper behavior. That interface is
more raw and low-level. If somebody wanted to detect duplicate triples,
for example for linting purposes, they could use .process() to do so.
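The parse()/process() split might be sketched like this (a simplified
stand-in, not the spec's DataParser interface): parse() folds triples
into a duplicate-free graph, while process() hands every raw triple to
the callback, duplicates included:

```javascript
// Sketch: parse() populates a duplicate-free graph; process() is the
// raw, low-level path where every extracted triple reaches the
// callback -- handy for linting duplicate statements.
function makeParser(rawTriples) {
  return {
    parse(graph) {                  // graph: any object with set semantics
      for (const t of rawTriples) graph.add(t);
      return graph;
    },
    process(callback) {             // duplicates pass straight through
      for (const t of rawTriples) callback(t);
    }
  };
}

const raw = ['<a> <p> "x"', '<a> <p> "x"', '<b> <p> "y"']; // one duplicate
const parser = makeParser(raw);
const graph = parser.parse(new Set()); // Set: storing a duplicate is a no-op
let seen = 0;
parser.process(() => seen++);
```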

>> == Triple.language ==
>>
>> Is the language normalized, and how? If the language is given with
>> lang="sv_FI" in markup, what does .language return? If there is any
>> merging of triples, does that happen before or after language
>> normalization?
> 
> Triple.language is no more, it seems.

Literals do have a language attribute:

http://www.w3.org/2010/02/rdfa/sources/rdf-interfaces/#literals

I don't quite understand what you mean by "language normalization". If
somebody specifies lang="sv_FI" and a triple is generated like this:

<http://blog.foolip.org/about/philip>
   foaf:name
      "Philip Jägenstedt"@sv_FI .

Then the .language attribute would contain "sv_FI".

>> == Dynamic changes ==
>>
>> getTriplesByType returns a static array of Triples. In contrast, a lot
>> of HTML APIs return a live NodeList, so that changes to the document
>> are reflected in that NodeList. This is the case with e.g.
>> getElementsByTagName and the Microdata getItems API. If returning a
>> static array is intentional, can you make that more explicit by saying
>> that it is the triples that are in the document at the time of
>> invocation that are returned?
> 
> It appears that DataParser.parse is used to generate a Graph from a
> Document, which would "solve" the problem of dynamic changes. I don't
> think that it is an acceptable solution because it is extremely expensive
> to have to re-parse the entire document after any change, but it does
> answer my original question.

Doesn't Microdata have the same issue? If the DOM changes in a way that
an item is added or deleted, the NodeList returned by getItems() becomes
outdated. You must re-process the entire DOM in order to ensure that
you're working with the most up-to-date information. What am I missing?
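The static-versus-live distinction under discussion can be illustrated
with a plain array standing in for the document (hypothetical helpers,
not either spec's API): a static query copies the current matches once,
while a live view recomputes them on every access:

```javascript
// Sketch of static vs. live results. A static query is a snapshot:
// later document edits are not reflected. A live view recomputes the
// matches every time it is read, like a live NodeList does.
function staticQuery(doc, pred) {
  return doc.filter(pred); // copied once, frozen in time
}
function liveQuery(doc, pred) {
  return { get items() { return doc.filter(pred); } }; // recomputed per read
}

const doc = [{ type: 'person' }, { type: 'event' }];
const isPerson = n => n.type === 'person';
const snapshot = staticQuery(doc, isPerson);
const live = liveQuery(doc, isPerson);
doc.push({ type: 'person' }); // mutate the "document" afterwards
```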

>> == RDFa Profiles ==
>>
>> 3. If a profile actually becomes widely used, aren't you worried about
>> the DDoS that will result? Compare to the problems of DTD outlined in
>> <http://lists.w3.org/Archives/Public/public-html/2008Jul/0269.html>.

There was quite a bit of discussion on this issue in the RDFa WG. Yes,
we were concerned about DDoS on RDFa Profiles. After discussing it for a
while, though, a few things became clear.

There is no "one authoritative source" for an RDFa Profile - profiles
are just like CSS, JavaScript files, images, and video. They are not
like a DTD. Web authors are free to copy a profile to their own servers
and serve it from there. Profile authors are free to publish their
profile on a CDN, much like jQuery is distributed by Google.

We even allow the default RDFa Profiles to be hard-coded by
implementers. Section 9 (RDFa Profiles) of the RDFa Core spec states:

"RDFa Processor developers are permitted and encouraged to cache the
relevant triples retrieved via this mechanism, including embedding
definitions for well known vocabularies in the implementation if
appropriate."

Lastly, if Web vocabulary developers are concerned about a DDoS attack,
they probably shouldn't create a profile for their vocabulary or
application.
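The caching permission quoted above might be exercised like this (the
function shape, profile URI, and profile contents are invented for
illustration): a processor ships with well-known profiles embedded and
only falls back to the network for unknown ones:

```javascript
// Sketch: a processor with well-known profiles baked in. Only unknown
// profile URIs trigger the caller-supplied fetch, so popular profiles
// generate no traffic to their origin at all.
const BUILTIN_PROFILES = new Map([
  ['http://www.w3.org/profile/rdfa-1.1',           // hypothetical entry
   { prefixes: { dc: 'http://purl.org/dc/terms/' } }]
]);

function loadProfile(uri, fetchProfile) {
  if (BUILTIN_PROFILES.has(uri)) {
    return BUILTIN_PROFILES.get(uri); // no network hit for embedded profiles
  }
  return fetchProfile(uri);           // everything else goes to the network
}

let fetched = 0;
const p = loadProfile('http://www.w3.org/profile/rdfa-1.1',
                      () => { fetched++; return null; });
```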

>> 4. Should browsers have Turtle and RDF/XML parsers to handle the case
>> where the profile is using those syntaxes? MAY is a keyword for
>> interoperability disaster, at least in the context of web browsers...

Browsers don't need Turtle and RDF/XML parsers to handle profiles in
those syntaxes. The only requirement is an XHTML+RDFa processor to
process RDFa Profiles. The spec states:

"RDFa Profiles are optional external documents that define collections
of terms and/or prefix mappings. These documents must be defined in an
approved RDFa Host Language (currently XHTML+RDFa [XHTML-RDFA]). They
may also be defined in other RDF serializations as well (e.g., RDF/XML
[RDF-SYNTAX-GRAMMAR] or Turtle [TURTLE])."

The second sentence is the important one here - RDFa Profiles MUST be
provided in XHTML+RDFa at a minimum. Web developers may optionally
provide them in other serializations (such as Turtle, RDF/XML, or even
JSON-LD).
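One way an implementation might reflect that requirement (an
illustrative sketch, not a specified interface): the XHTML+RDFa handler
is always registered, while other serializations are optional extras
that a particular implementation may or may not provide:

```javascript
// Sketch: parser dispatch by media type. The XHTML+RDFa handler is
// mandatory and always present; Turtle, RDF/XML, etc. are optional and
// only available if the implementation chose to register them.
function makeProfileLoader(optionalParsers = {}) {
  const parsers = {
    'application/xhtml+xml': body => parseXhtmlRdfa(body), // REQUIRED
    ...optionalParsers                            // MAY: Turtle, RDF/XML, ...
  };
  return contentType => parsers[contentType] || null;
}

// Stand-in for a real XHTML+RDFa profile parser.
function parseXhtmlRdfa(body) { return { source: 'xhtml+rdfa', body }; }

const lookup = makeProfileLoader(); // no optional parsers registered
```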

-- manu

-- 
Manu Sporny (skype: msporny, twitter: manusporny)
President/CEO - Digital Bazaar, Inc.
blog: PaySwarm Developer Tools and Demo Released
http://digitalbazaar.com/2011/05/05/payswarm-sandbox/

Received on Sunday, 10 July 2011 20:22:14 UTC