RE: 12. Are C1 controls Unicode non-characters disallowed? from David Lee on 2012-09-11 (public-microxml@w3.org from September 2012)

From: David Lee <David.Lee@marklogic.com>
Date: Mon, 10 Sep 2012 18:27:02 -0700
To: Uche Ogbuji <uche@ogbuji.net>, "public-microxml@w3.org" <public-microxml@w3.org>
Message-ID: <EB42045A1F00224E93B82E949EC6675E16B09B28E8@EXCHG-BE.marklogic.com>

I can buy that ... it's splitting hairs at this point, and consistency with both Unicode and legacy XML wins.
What is at the back of my mind is the pain point of JSON allowing a greater range of code points, which really bites the big one sometimes even if they are invalid.  For example, in my tests of pulling Twitter feeds, I get about 1 in 1000 documents with an invalid XML character (but "valid" in JSON ... well, "valid" as in: the Twitter feed produces "JSON" and the character got in there ...).
That character itself is usually bogus and not "useful", but what is painful is typical bulk-processing XML tools dying a flaming death at that point ...   But I digress. MicroXML must suffer/benefit from the same decision with respect to Unicode as XML ... although, to put a camel's nose in the tent, we might want to open the issue 'wafer thin' to allow processors to toss or substitute invalid characters instead of dropping dead.
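
To make the "toss or substitute" idea concrete, here is a minimal Python sketch, not a proposal for spec text. It assumes the XML 1.0 Char production as the definition of "valid", and the names sanitize_for_xml and _INVALID_XML_CHARS are hypothetical, chosen only for illustration.

```python
import re

# Complement of the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# Anything matched by this pattern is not a legal XML 1.0 character.
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def sanitize_for_xml(text, replacement="\uFFFD"):
    """Substitute every character XML 1.0 forbids.

    Pass replacement="" to toss the characters instead of substituting
    U+FFFD REPLACEMENT CHARACTER (the default here is an assumption).
    """
    return _INVALID_XML_CHARS.sub(replacement, text)
```

For example, sanitize_for_xml("tweet\x00text") substitutes U+FFFD for the NUL, so a bulk job can carry on instead of aborting on one bad document out of a thousand.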



-----------------------------------------------------------------------------
David Lee
Lead Engineer
MarkLogic Corporation
dlee@marklogic.com
Phone: +1 650-287-2531
Cell:  +1 812-630-7622
www.marklogic.com


From: Uche Ogbuji [mailto:uche@ogbuji.net]
Sent: Monday, September 10, 2012 6:22 PM
To: public-microxml@w3.org
Subject: Re: 12. Are C1 controls and Unicode non-characters disallowed?

On Mon, Sep 10, 2012 at 3:24 PM, David Lee <David.Lee@marklogic.com> wrote:
How does adding to the list of characters a parser can handle simplify the language ?

But that's not what it's doing.  It's actually reducing the valid characters.


To my reading that makes the spec, and the language more complex (it has higher information count because it takes more rules to define what not to do ...

No, actually.  The MicroXML production has the same number of rules.  The proposed change is to make the MicroXML rules consistent with those of Unicode, which is the basis of both XML's and MicroXML's character rules.  That means defining fewer exceptions from the Unicode rules in MicroXML, which is simpler.  It's also simpler in practical implementation, because MicroXML would become consistent with most of the Unicode tools used alongside it, including those used to write parsers.
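
As an illustration of how few rules the two categories in the subject line need, here is a small Python sketch. It assumes the ranges as Unicode defines them (C1 controls are U+0080 to U+009F; the 66 noncharacters are U+FDD0 to U+FDEF plus the last two code points of each of the 17 planes); the thread itself does not spell these ranges out, and the function names are hypothetical.

```python
def is_c1_control(cp):
    """True if code point cp is a C1 control (U+0080..U+009F)."""
    return 0x80 <= cp <= 0x9F


def is_noncharacter(cp):
    """True if code point cp is one of the 66 Unicode noncharacters.

    These are U+FDD0..U+FDEF, plus every code point whose low 16 bits
    are FFFE or FFFF (U+FFFE, U+FFFF, U+1FFFE, ..., U+10FFFF).
    """
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_excluded(cp):
    """Hypothetical combined check for the characters under discussion."""
    return is_c1_control(cp) or is_noncharacter(cp)
```

Both predicates are plain range tests on the code point, so a parser built on ordinary Unicode tooling can apply them without extra tables.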


--
Uche Ogbuji                       http://uche.ogbuji.net
Founding Partner, Zepheira        http://zepheira.com
http://wearekin.org
http://www.thenervousbreakdown.com/author/uogbuji/
http://copia.ogbuji.net
http://www.linkedin.com/in/ucheogbuji
http://twitter.com/uogbuji

Received on Tuesday, 11 September 2012 01:27:27 UTC