Re: [Action Item] ITS 2.0 and html5 from Robin Berjon on 2013-07-10 (public-multilingualweb-lt-comments@w3.org from July 2013)

From: Robin Berjon <robin@w3.org>
Date: Wed, 10 Jul 2013 15:36:41 +0200
To: Daniel Glazman <daniel.glazman@disruptive-innovations.com>
CC: HTML WG <public-html@w3.org>, Richard Ishida <ishida@w3.org>, Philippe Le Hegaret <plh@w3.org>, "public-multilingualweb-lt-comments@w3.org" <public-multilingualweb-lt-comments@w3.org>
Message-ID: <51DD6369.1090707@w3.org>

On 10/07/2013 14:59 , Daniel Glazman wrote:
> The issue we are hitting is related to the parsing of such a document.
> In the html serialization of html5, the DOM will show a text
> node inside the script element, that node containing the textual
> representation of the whole contents of the script element; on the other
> hand, the parsing of the xml serialization of the same document will
> generate a script element containing an its-namespaced subtree...
>
> I see this as problematic for two reasons:
>
> 1. I don't think the OM should change depending on the serialization
>     used
> 2. this has an impact on implementations forced to use html-flavor
>     switches for creation/editing/manipulation/serialization of inline
>     ITS rules....

That ship has sailed. There is code relying on the content of scripts 
being text (that requires parsing) in HTML. The only way of aligning the 
two that *might* work would be to require XHTML processors to keep 
markup inside <script> as text in the DOM. I'm not sure anyone wants to 
go there.

What you're actually looking for is XML data islands. It's something 
that IE supports (supported?) using an <xml> element inside of which it 
switches to XML parsing. I don't believe that there's overwhelming 
interest in supporting that.

> We would like to have your opinion on the above. Do you think the OM
> for both html and xml serialization of a html5 document containing
> inline ITS 2.0 rules should be the same or you don't see it as an
> issue?

I won't dispute that it's unpleasant; but the alternatives are worse.

One possible way of aligning everything would be to have a JSON 
serialisation for ITS. Given the language, it might not be all that hard.

<script type='application/its+json'>
{
   "namespaces": {
     "tei": "http://blah/tei"
   }
, "rules": [
     { "selector": "//tei:term", "translate": "yes" }
   ]
}
</script>

That pretty much just works, and you can define the ITS-JSON spec as a 
simple mapping from JSON to XML. I realise that might not be practical, 
just saying it's actually a viable option.
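
To make the "simple mapping" idea concrete, here is a rough Python
sketch. It is only an illustration under assumptions: the JSON shape is
the one from the example above, the function name its_json_to_xml is
made up, and mapping a "translate" key to an its:translateRule element
is a guess at what an ITS-JSON spec might say, not something defined
anywhere.

```python
import json
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

ITS_NS = "http://www.w3.org/2005/11/its"

EXAMPLE = """
{
   "namespaces": {
     "tei": "http://blah/tei"
   }
, "rules": [
     { "selector": "//tei:term", "translate": "yes" }
   ]
}
"""

def its_json_to_xml(text):
    """Turn the hypothetical ITS-JSON form into an its:rules document."""
    data = json.loads(text)
    attrs = ["xmlns:its=" + quoteattr(ITS_NS), 'version="2.0"']
    # Assumption: prefixes in "namespaces" are valid XML name prefixes.
    for prefix, uri in data.get("namespaces", {}).items():
        attrs.append("xmlns:" + prefix + "=" + quoteattr(uri))
    lines = ["<its:rules " + " ".join(attrs) + ">"]
    for rule in data.get("rules", []):
        # Only the "translate" data category is handled in this sketch.
        if "translate" in rule:
            lines.append(
                "  <its:translateRule selector=" + quoteattr(rule["selector"])
                + " translate=" + quoteattr(rule["translate"]) + "/>"
            )
    lines.append("</its:rules>")
    return "\n".join(lines)

xml_text = its_json_to_xml(EXAMPLE)
root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
print(xml_text)
```

The point is that the JSON is plain text to both the HTML and the XML
parser, so both serialisations would hand the same string to the
converter.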

> If you think it should be the same, do you think encapsulating
> inline 2.0 rules inside a CDATA section is a workable solution or do
> you have another suggestion?

That's one option, but you have to keep in mind that you still won't get 
the same result in both serialisations. For <![CDATA[foo]]> XML parsing 
will give you a node containing "foo", whereas for HTML parsing you'll 
get "<![CDATA[foo]]>". Easy to strip, but still requires special-casing 
(at which point I reckon you're no better off than you are now).
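
A small Python sketch of that special-casing, for illustration only.
The XML side uses the standard library parser; the HTML side is
represented by the literal string an HTML parser would leave in the
script's text node. The helper name strip_cdata_wrapper is made up.

```python
import re
import xml.etree.ElementTree as ET

# Matches a text node that consists of exactly one CDATA wrapper,
# optionally surrounded by whitespace.
_CDATA_WRAPPER = re.compile(r"\A\s*<!\[CDATA\[(.*?)\]\]>\s*\Z", re.DOTALL)

def strip_cdata_wrapper(text):
    """Return the wrapped content, or the text unchanged if not wrapped."""
    match = _CDATA_WRAPPER.match(text)
    return match.group(1) if match else text

# XML parsing: the CDATA markers are consumed by the parser.
xml_text = ET.fromstring("<script><![CDATA[foo]]></script>").text

# HTML parsing: script content is raw text, so the markers survive.
html_text = "<![CDATA[foo]]>"

print(repr(xml_text))
print(repr(strip_cdata_wrapper(html_text)))
```

After stripping, both paths yield the same string, but only because
the HTML path needed its own extra step.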

Also keep in mind that CDATA sections don't nest. ITS isn't text-heavy 
so the risks are low, but if someone uses <its:param 
name='whevs'><![CDATA[foo]]></its:param> then it won't embed.
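
The nesting problem is easy to reproduce with any XML parser. A quick
Python check (the helper name parses_as_xml is just for this sketch):

```python
import xml.etree.ElementTree as ET

def parses_as_xml(doc):
    """True if doc is well-formed XML, False otherwise."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

# Rules wrapped in one CDATA section, no CDATA inside: fine.
FLAT = "<script><![CDATA[<its:param name='whevs'>foo</its:param>]]></script>"

# Same thing, but the wrapped content has its own CDATA section. The
# outer section ends at the first "]]>", so the parser then sees a stray
# </its:param> end tag and gives up.
NESTED = ("<script><![CDATA[<its:param name='whevs'>"
          "<![CDATA[foo]]></its:param>]]></script>")

print(parses_as_xml(FLAT))
print(parses_as_xml(NESTED))
```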

Another option is comments. But they don't nest either.

For interop, I reckon that the best option you have (short of a JSON 
serialisation) would be to keep things as they are today, but to write a 
clear algorithm that processes the content properly in all cases.
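
As a very rough idea of what such an algorithm could look like, here
is a Python sketch that accepts a script element from either kind of
parse. Everything in it is an assumption for illustration: the
function name, the use of ElementTree elements as a stand-in for DOM
nodes, and the requirement that, in the text case, the its:rules
markup declares its own namespaces.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
RULES_TAG = "{" + ITS_NS + "}rules"

def extract_its_rules(script):
    """Return the its:rules element for a script element, or None."""
    # Case 1 (XML parse): the rules are already a child subtree.
    child = script.find(RULES_TAG)
    if child is not None:
        return child
    # Case 2 (HTML parse): the rules are text that still needs parsing.
    text = (script.text or "").strip()
    if not text:
        return None
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return None
    return root if root.tag == RULES_TAG else None

RULES = ('<its:rules xmlns:its="' + ITS_NS + '" version="2.0">'
         '<its:translateRule selector="//p" translate="no"/>'
         '</its:rules>')

# What an XML parser produces: a subtree inside the script element.
from_xml = ET.fromstring("<script>" + RULES + "</script>")

# What an HTML parser produces: a script element holding only text.
from_html = ET.Element("script")
from_html.text = RULES

for script in (from_xml, from_html):
    rules = extract_its_rules(script)
    print(rules.tag, len(rules))
```

Both inputs end up as the same its:rules element, which is the
property a spec'd algorithm would need to guarantee.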

-- 
Robin Berjon - http://berjon.com/ - @robinberjon

Received on Wednesday, 10 July 2013 13:36:54 UTC