Re: 12. Are C1 controls Unicode non-characters disallowed? from Michael Sokolov on 2012-09-11 (public-microxml@w3.org from September 2012)

From: Michael Sokolov <sokolov@falutin.net>
Date: Mon, 10 Sep 2012 21:39:50 -0400
To: David Lee <David.Lee@marklogic.com>
Cc: Uche Ogbuji <uche@ogbuji.net>, "public-microxml@w3.org" <public-microxml@w3.org>
Message-id: <504E9666.6040101@falutin.net>

On 9/10/2012 9:27 PM, David Lee wrote:
>
> I can buy that ... it's splitting hairs at this point, and consistency 
> with both Unicode and legacy XML wins.
>
> What is at the back of my mind is the pain point of JSON allowing a 
> greater range of codepoints, which really bites the big one sometimes 
> even if they are invalid.  For example, in my tests of sucking Twitter 
> feeds, I get about 1 in 1000 documents with an invalid XML character (but 
> "valid" in JSON ... well, "valid" as in: the Twitter feed produces 
> "JSON" and the character got in there ...).
>
> That character itself is usually bogus and not "useful", but what is 
> painful is typical bulk-processing XML tools dying a flaming death at 
> that point ...   But I digress. MicroXML must suffer/benefit from the 
> same decision wrt Unicode as XML ... although, to put a camel's nose 
> in the tent, we might want to open the issue 'wafer thin' to allow 
> processors to toss or substitute invalid characters instead of dropping dead.
>
>
Yes - some kind of recovery process would be a boon; +1 for allowing 
parsers to replace these disallowed codepoints with the special Unicode 
character reserved to mean "unknown or unrepresentable character": U+FFFD.
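
A minimal sketch of that recovery step in Python, assuming the disallowed set is everything outside the XML 1.0 Char production (MicroXML may end up excluding more, e.g. C1 controls and non-characters, which is what this thread is about). The names `replace_disallowed` and `REPLACEMENT` are made up for illustration:

```python
import re

# Assumption: "disallowed" means outside the XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# A MicroXML processor might use a stricter set.
_DISALLOWED = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

# U+FFFD REPLACEMENT CHARACTER
REPLACEMENT = "\uFFFD"


def replace_disallowed(text: str) -> str:
    """Substitute each disallowed code point with U+FFFD instead of failing."""
    return _DISALLOWED.sub(REPLACEMENT, text)


if __name__ == "__main__":
    # A NUL and a vertical tab, the sort of thing that sneaks into feeds.
    sample = "tweet\x00 text\x0b here"
    print(repr(replace_disallowed(sample)))
```

A lenient processor could run something like this over the decoded text before parsing; a strict one would still be free to reject the input.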

-Mike

Received on Tuesday, 11 September 2012 01:40:41 UTC