Re: HTML datatype proposal (ISSUE-63) from Richard Cyganiak on 2012-05-10 (public-rdf-wg@w3.org from May 2012)

From: Richard Cyganiak <richard@cyganiak.de>
Date: Thu, 10 May 2012 16:55:07 +0100
To: Steve Harris <steve.harris@garlik.com>
Cc: Ivan Herman <ivan@w3.org>, RDF Working Group WG <public-rdf-wg@w3.org>
Message-Id: <BD48C0FF-11D5-482D-8078-D04A433154AD@cyganiak.de>

Hi Steve,

On 10 May 2012, at 15:23, Steve Harris wrote:
> I'm unsure about what the use case is for (semi-)canonicalised equality.
> 
> Can someone give me an example?

Some parsers may not have access to the original HTML5 string but only to a DOM, and thus may not be able to reproduce *exactly* the HTML5 string as it was in the input file.

If we define a more forgiving value space, where two literals are considered to have the same value if they encode the same normalized DOM tree, then parser implementers can declare that their parser isn't broken even if it re-orders attributes or replaces single quotes with double quotes. That's all, really.
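To make that concrete, here is a rough sketch in Python. It is only an illustration under loud assumptions: it uses the standard library's html.parser as a stand-in for the HTML5 fragment parsing algorithm (which it is not), and the names TreeBuilder, parse and same_value are made up for this example. Attributes are kept as an unordered set and adjacent text is merged, which roughly mimics comparing normalized DOM trees.

```python
from html.parser import HTMLParser

# Elements that never have children or an end tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TreeBuilder(HTMLParser):
    """Builds a tiny DOM-like tree (hypothetical helper, not a real DOM).

    An element is [tag, frozenset_of_attributes, children];
    a text node is a plain str.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ["#fragment", frozenset(), []]
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # frozenset: attribute order does not matter for equality.
        node = [tag, frozenset(attrs), []]
        self.stack[-1][2].append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        children = self.stack[-1][2]
        # Merge adjacent text nodes, similar in spirit to normalize().
        if children and isinstance(children[-1], str):
            children[-1] += data
        else:
            children.append(data)


def parse(lexical_form):
    """Map a lexical form (any string) to a simplified tree value."""
    builder = TreeBuilder()
    builder.feed(lexical_form)
    builder.close()  # flush any buffered trailing text
    return builder.root


def same_value(a, b):
    """True if both lexical forms yield the same simplified tree."""
    return parse(a) == parse(b)
```

With this, '<a href="x" class="y">hi</a>' and "<a class='y' href='x'>hi</a>" are different strings but come out as the same value, while changing an attribute value or the text still makes them differ.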

Best,
Richard



> 
> Without that it's hard to know if we should be trying for full C14N, lexical comparison, or something in between.
> 
> - Steve
> 
> On 10 May 2012, at 01:28, Ivan Herman wrote:
> 
>> Richard,
>> 
>> I think this is the right approach, and I am in favour of doing it. However...
>> 
>> http://www.w3.org/TR/html5/the-end.html#serializing-html-fragments
>> 
>> does not seem to be a complete canonicalization algorithm. What I see right away is that it does not say anything about the order in which attributes should appear in the element (C14N requires them to be in alphabetical order). I have not checked all the details, but that alone is enough to say that comparing the canonical forms as strings would not be a reliable equality test. If so, what is the purpose of having it?
>> 
>> Because, at the moment, I do not know of any work happening in HTML5 land in the direction of HTML5 signatures, I do not expect exact canonicalization to be on the agenda for the months or years to come. As a consequence, I would propose not to define any canonical form at all in this case, maybe adding a note that when that issue is solved by the HTML5 community, this datatype might adopt the solution.
>> 
>> Ivan
>> 
>> 
>> On May 10, 2012, at 02:32 , Richard Cyganiak wrote:
>> 
>>> See below for a proposal for an HTML datatype. The lexical space is all Unicode strings (HTML5 explains how to parse any gunk into a DOM tree). The value space is normalized DOM DocumentFragments, like in the new rdf:XMLLiteral. The L2V mapping is HTML5's “fragment parsing algorithm”. The canonical mapping (from values to canonical lexical forms) is HTML5's “fragment serialization algorithm”.
>>> 
>>> Like the new rdf:XMLLiteral (and all other datatypes), this datatype is entirely optional.
>>> 
>>> Best,
>>> Richard
>>> 
>>> 
>>> == The rdf:HTML Datatype ==
>>> 
>>> RDF provides for HTML content as a possible literal value. This allows markup in literal values. Such content is indicated in an RDF graph using a literal whose datatype is a special built-in datatype rdf:HTML.
>>> 
>>> rdf:HTML is defined as follows.
>>> 
>>> === An IRI denoting this datatype ===
>>> is http://www.w3.org/1999/02/22-rdf-syntax-ns#HTML.
>>> 
>>> === The lexical space ===
>>> is the set of Unicode strings.
>>> 
>>> === The value space ===
>>> is a set of DOM DocumentFragment nodes [DOM4:1]. Two DocumentFragment nodes A and B are considered equal if and only if the DOM method A.isEqualNode(B) [DOM4:2] returns true.
>>> 
>>> === The lexical-to-value mapping ===
>>> is defined as:
>>> 
>>> 1. Let domnodes be the list of DOM nodes [DOM4:3] that result from applying the HTML fragment parsing algorithm [HTML5:1] to the literal's lexical form, without a context element.
>>> 2. Let domfrag be a DOM DocumentFragment [DOM4:1] whose childNodes attribute is equal to domnodes.
>>> 3. Return domfrag.normalize() [DOM4:4]
>>> 
>>> === The canonical mapping ===
>>> defines a canonical lexical form [XMLSCHEMA11-2:1] for each member of the value space. The rdf:HTML canonical mapping is the HTML fragment serialization algorithm [HTML5:2].
>>> 
>>> NOTE: Any language annotation desired in the HTML content must be included explicitly in the HTML literal (@lang="…").
>>> 
>>> NOTE: RDF applications may use additional equivalence relations, such as one that relates an xsd:string literal to an rdf:HTML literal corresponding to a single text node of the same string.
>>> 
>>> == References ==
>>> [DOM4:1] http://www.w3.org/TR/dom/#interface-documentfragment
>>> [DOM4:2] http://www.w3.org/TR/dom/#dom-node-isequalnode
>>> [DOM4:3] http://www.w3.org/TR/dom/#node
>>> [HTML5:1] http://www.w3.org/TR/html5/the-end.html#parsing-html-fragments
>>> [DOM4:4] http://www.w3.org/TR/dom/#dom-node-normalize
>>> [HTML5:2] http://www.w3.org/TR/html5/the-end.html#serializing-html-fragments
>>> [XMLSCHEMA11-2:1] http://www.w3.org/TR/xmlschema11-2/#dt-canonical-mapping
>> 
>> 
>> ----
>> Ivan Herman, W3C Semantic Web Activity Lead
>> Home: http://www.w3.org/People/Ivan/
>> mobile: +31-641044153
>> FOAF: http://www.ivan-herman.net/foaf.rdf
>> 
>> 
>> 
>> 
>> 
>> 
> 
> -- 
> Steve Harris, CTO
> Garlik, a part of Experian 
> 1-3 Halford Road, Richmond, TW10 6AW, UK
> +44 20 8439 8203  http://www.garlik.com/
> Registered in England and Wales 653331 VAT # 887 1335 93
> Registered office: Landmark House, Experian Way, NG2 Business Park, Nottingham, Nottinghamshire, England NG80 1ZZ
> 
>
Received on Thursday, 10 May 2012 15:55:46 UTC