Re: HTML datatype proposal (ISSUE-63)

Hi Nathan,

On 10 May 2012, at 17:28, Nathan wrote:
> Also perhaps worth noting that attribute values (not just attributes) may need to be canonicalised, if you consider trailing semicolons and white space in style attribute values, or the order of white-space-separated tokens in class and rel attributes.

But with @style, @class and @rel, I can still get the exact string value as present in the source HTML using Element.getAttribute()?

That should be sufficient for the use cases that we have considered. They don't require re-serialization to a canonical string.
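
To make the getAttribute() point concrete, here is a rough sketch. It uses Python's standard html.parser module as a stand-in for a real HTML5 parser and DOM (my assumption for illustration; it does not implement the HTML5 parsing algorithm), and the names AttributeCollector and attributes_of are made up for this example.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Records (tag name, attribute dict) for every start tag seen."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # Attribute values arrive as written in the source, minus the quotes.
        self.elements.append((tag, dict(attrs)))


def attributes_of(markup):
    """Hypothetical helper: parse markup and return the collected elements."""
    collector = AttributeCollector()
    collector.feed(markup)
    collector.close()
    return collector.elements


a = attributes_of('<p class="b  a" style="color: red; ">x</p>')
b = attributes_of("<p style='color: red; ' class='b  a'>x</p>")
c = attributes_of('<p class="a b" style="color: red">x</p>')

# The class and style strings come back exactly as written, including the
# double space, the token order and the trailing "; ".
print(repr(a[0][1]["class"]), repr(a[0][1]["style"]))
# Quote style and attribute order are not observable; token order is.
print(a == b, a == c)
```

So a consumer that wants the @class tokens as a set, or a parsed @style, can still get there from the exact source string; nothing in the datatype has to canonicalize attribute values for that.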

Best,
Richard


> Thus it would appear to me that C14N may be impossible, or at least impractical given that every possible attribute value would need canonicalization rules too.
> 
> Best,
> 
> Nathan
> 
>> If we define a more forgiving value space, where two literals are considered to have the same value if they encode the same normalized DOM tree, then parser implementers can declare that their parser isn't broken even if it re-orders attributes or replaces single quotes with double quotes. That's all, really.
>> Best,
>> Richard
>>> Without that it's hard to know if we should be trying for full C14N, lexical comparison, or something in between.
>>> 
>>> - Steve
>>> 
>>> On 10 May 2012, at 01:28, Ivan Herman wrote:
>>> 
>>>> Richard,
>>>> 
>>>> I think this is the right approach, and I am in favour of doing it. However...
>>>> 
>>>> http://www.w3.org/TR/html5/the-end.html#serializing-html-fragments
>>>> 
>>>> does not seem to be a 100% canonicalization algorithm. What I see right away is that it does not say anything about the order in which attributes should appear in the element (C14N requires them to be in alphabetical order). I have not checked all the details, but that is enough to say that the canonical forms could not be relied on for testing equality by string comparison. If so, what is the purpose of having a canonical mapping?
>>>> 
>>>> Since, at the moment, I do not know of any work in HTML5 land in the direction of HTML5 signatures, I do not expect exact canonicalization to be on the agenda in the coming months or years. As a consequence, I would propose not defining any canonical form at all in this case, perhaps adding a note that once the HTML5 community has solved that issue, this datatype might adopt the solution.
>>>> 
>>>> Ivan
>>>> 
>>>> 
>>>> On May 10, 2012, at 02:32 , Richard Cyganiak wrote:
>>>> 
>>>>> See below for a proposal for an HTML datatype. The lexical space is all Unicode strings (HTML5 explains how to parse any gunk into a DOM tree). The value space is normalized DOM DocumentFragments, like in the new rdf:XMLLiteral. The L2V mapping is HTML5's “fragment parsing algorithm”. The canonical mapping (from values to canonical lexical forms) is HTML5's “fragment serialization algorithm”.
>>>>> 
>>>>> Like the new rdf:XMLLiteral (and all other datatypes), this datatype is entirely optional.
>>>>> 
>>>>> Best,
>>>>> Richard
>>>>> 
>>>>> 
>>>>> == The rdf:HTML Datatype ==
>>>>> 
>>>>> RDF provides for HTML content as a possible literal value. This allows markup in literal values. Such content is indicated in an RDF graph using a literal whose datatype is a special built-in datatype rdf:HTML.
>>>>> 
>>>>> rdf:HTML is defined as follows.
>>>>> 
>>>>> === An IRI denoting this datatype ===
>>>>> is http://www.w3.org/1999/02/22-rdf-syntax-ns#HTML.
>>>>> 
>>>>> === The lexical space ===
>>>>> is the set of Unicode strings.
>>>>> 
>>>>> === The value space ===
>>>>> is a set of DOM DocumentFragment nodes [DOM4:1]. Two DocumentFragment nodes A and B are considered equal if and only if the DOM method A.isEqualNode(B) [DOM4:2] returns true.
>>>>> 
>>>>> === The lexical-to-value mapping ===
>>>>> is defined as:
>>>>> 
>>>>> 1. Let domnodes be the list of DOM nodes [DOM4:3] that result from applying the HTML fragment parsing algorithm [HTML5:1] to the literal's lexical form, without a context element.
>>>>> 2. Let domfrag be a DOM DocumentFragment [DOM4:1] whose childNodes attribute is equal to domnodes.
>>>>> 3. Call domfrag.normalize() [DOM4:4] and return domfrag.
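
For illustration only, the three steps above could be sketched roughly as follows. This is not the HTML5 fragment parsing algorithm: Python's html.parser is used as a simplified stand-in, and Text, Element, FragmentBuilder, normalize and lexical_to_value are hypothetical names chosen for this sketch. The dataclass equality here plays the role of isEqualNode(), comparing attributes without regard to order.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Elements that never have children or an end tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


@dataclass
class Text:
    data: str


@dataclass
class Element:
    name: str
    attrs: dict
    children: list = field(default_factory=list)


class FragmentBuilder(HTMLParser):
    """Step 1 (simplified): turn a lexical form into a list of nodes."""

    def __init__(self):
        super().__init__()
        self.roots = []   # top-level nodes of the fragment
        self.stack = []   # currently open elements

    def _append(self, node):
        (self.stack[-1].children if self.stack else self.roots).append(node)

    def handle_starttag(self, tag, attrs):
        node = Element(tag, dict(attrs))
        self._append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].name == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self._append(Text(data))


def normalize(nodes):
    """Step 3: merge adjacent text nodes and drop empty ones, recursively."""
    out = []
    for node in nodes:
        if isinstance(node, Text):
            if not node.data:
                continue
            if out and isinstance(out[-1], Text):
                out[-1] = Text(out[-1].data + node.data)
            else:
                out.append(Text(node.data))
        else:
            out.append(Element(node.name, dict(node.attrs),
                               normalize(node.children)))
    return out


def lexical_to_value(lexical_form):
    """Steps 1-3: parse, collect the nodes as a fragment, normalize."""
    builder = FragmentBuilder()
    builder.feed(lexical_form)
    builder.close()
    return normalize(builder.roots)


v1 = lexical_to_value('<p class="x" id="y">Hi <b>there</b></p>')
v2 = lexical_to_value("<p id='y' class='x'>Hi <b>there</b></p>")
print(v1 == v2)  # same value despite different quoting and attribute order
```

Under this sketch, two lexical forms denote the same value whenever their normalized trees compare equal, which is the "more forgiving value space" discussed elsewhere in this thread.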
>>>>> 
>>>>> === The canonical mapping ===
>>>>> defines a canonical lexical form [XMLSCHEMA11-2:1] for each member of the value space. The rdf:HTML canonical mapping is the HTML fragment serialization algorithm [HTML5:2].
>>>>> 
>>>>> NOTE: Any language annotation desired in the HTML content must be included explicitly in the HTML literal (@lang="…").
>>>>> 
>>>>> NOTE: RDF applications may use additional equivalence relations, such as one that relates an xsd:string literal to an rdf:HTML literal whose value is a single text node containing the same string.
>>>>> 
>>>>> == References ==
>>>>> [DOM4:1] http://www.w3.org/TR/dom/#interface-documentfragment
>>>>> [DOM4:2] http://www.w3.org/TR/dom/#dom-node-isequalnode
>>>>> [DOM4:3] http://www.w3.org/TR/dom/#node
>>>>> [HTML5:1] http://www.w3.org/TR/html5/the-end.html#parsing-html-fragments
>>>>> [DOM4:4] http://www.w3.org/TR/dom/#dom-node-normalize
>>>>> [HTML5:2] http://www.w3.org/TR/html5/the-end.html#serializing-html-fragments
>>>>> [XMLSCHEMA11-2:1] http://www.w3.org/TR/xmlschema11-2/#dt-canonical-mapping
>>>> 
>>>> ----
>>>> Ivan Herman, W3C Semantic Web Activity Lead
>>>> Home: http://www.w3.org/People/Ivan/
>>>> mobile: +31-641044153
>>>> FOAF: http://www.ivan-herman.net/foaf.rdf
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> -- 
>>> Steve Harris, CTO
>>> Garlik, a part of Experian 1-3 Halford Road, Richmond, TW10 6AW, UK
>>> +44 20 8439 8203  http://www.garlik.com/
>>> Registered in England and Wales 653331 VAT # 887 1335 93
>>> Registered office: Landmark House, Experian Way, NG2 Business Park, Nottingham, Nottinghamshire, England NG80 1ZZ
>>> 
>>> 
> 
> 

Received on Thursday, 10 May 2012 18:18:12 UTC