Re: 12. Are C1 controls and Unicode non-characters disallowed? from James Clark on 2012-09-11 (public-microxml@w3.org from September 2012)

From: James Clark <jjc@jclark.com>
Date: Tue, 11 Sep 2012 08:37:32 +0700
To: David Lee <David.Lee@marklogic.com>
Cc: Uche Ogbuji <uche@ogbuji.net>, "public-microxml@w3.org" <public-microxml@w3.org>
Message-ID: <CANz3_EY3FR6ZbJr5J3DCyJUqUHCVNmZ3Mpo=fM6vkrmsoQX7Yg@mail.gmail.com>

I am hopeful that non-draconian error handling in MicroXML parsers will
ease the pain here.

James
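
As a rough illustration of the kind of non-draconian handling discussed in
this thread (tossing or substituting invalid characters rather than
aborting), here is a minimal Python sketch. The function name `sanitize`,
its parameters, and the `strict` option are hypothetical and not taken from
any MicroXML draft or parser. The default check follows the XML 1.0 Char
production; `strict` additionally treats C1 controls and Unicode
noncharacters as invalid, which is only an assumption modeled on the
question in the subject line.

```python
import re

# Anything outside the XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def _is_c1_or_nonchar(ch):
    """True for C1 controls (U+0080-U+009F) and Unicode noncharacters."""
    cp = ord(ch)
    return (
        0x80 <= cp <= 0x9F
        or 0xFDD0 <= cp <= 0xFDEF
        or (cp & 0xFFFE) == 0xFFFE  # last two code points of every plane
    )


def sanitize(text, replacement="\uFFFD", strict=False):
    """Hypothetical helper: substitute invalid characters instead of failing.

    Pass replacement="" to drop the characters entirely. With strict=True
    (an assumption, see above), C1 controls and noncharacters are also
    replaced.
    """
    cleaned = _INVALID_XML_CHARS.sub(replacement, text)
    if strict:
        cleaned = "".join(
            replacement if _is_c1_or_nonchar(ch) else ch for ch in cleaned
        )
    return cleaned
```

For example, `sanitize("ok\x00")` returns `"ok\ufffd"`, and
`sanitize("ok\x00", replacement="")` returns `"ok"`. Whether a processor
should substitute, drop, or report such characters is exactly the policy
question left open above.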

On Tue, Sep 11, 2012 at 8:27 AM, David Lee <David.Lee@marklogic.com> wrote:

> I can buy that … it's splitting hairs at this point, and consistency with
> both Unicode and legacy XML wins.
>
> What is at the back of my mind is the pain point of JSON allowing a greater
> range of code points, which really bites the big one sometimes even if they
> are invalid.  For example, in my tests of sucking in Twitter feeds, I get
> about 1 in 1000 documents with an invalid XML character (but “valid” in
> JSON … well, “valid” in the sense that the Twitter feed produces “JSON” and
> the character got in there …).
>
> That character itself is usually bogus and not “useful”, but what is
> painful is typical bulk-processing XML tools dying a flaming death at that
> point …   But I digress. MicroXML must suffer/benefit from the same
> decision with respect to Unicode as XML … although, to put a camel's nose
> in the tent, we might want to open the issue “wafer thin” to allow
> processors to toss or substitute invalid characters instead of dropping
> dead.
>
> -----------------------------------------------------------------------------
>
> David Lee
> Lead Engineer
> MarkLogic Corporation
> dlee@marklogic.com
> Phone: +1 650-287-2531
> Cell:  +1 812-630-7622
> www.marklogic.com
>
> From: Uche Ogbuji [mailto:uche@ogbuji.net]
> Sent: Monday, September 10, 2012 6:22 PM
> To: public-microxml@w3.org
> Subject: Re: 12. Are C1 controls and Unicode non-characters disallowed?
>
> On Mon, Sep 10, 2012 at 3:24 PM, David Lee <David.Lee@marklogic.com>
> wrote:
>
> How does adding to the list of characters a parser can handle simplify the
> language?
>
> But that's not what it's doing.  It's actually reducing the valid
> characters.
>
> To my reading that makes the spec, and the language, more complex (it has
> a higher information count because it takes more rules to define what not
> to do ...
>
> No, actually.  The MicroXML production has the same number of rules.  The
> proposed change is to make the MicroXML rules consistent with those of
> Unicode, which is the basis of XML's and MicroXML's.  That means defining
> fewer exceptions from the Unicode rules in MicroXML, which is simpler.
> It's also simpler in practical implementation because it would mean
> MicroXML becomes consistent with most Unicode tools to be used in
> association with MicroXML, including those used to write parsers.
>
> --
> Uche Ogbuji                       http://uche.ogbuji.net
> Founding Partner, Zepheira        http://zepheira.com
> http://wearekin.org
> http://www.thenervousbreakdown.com/author/uogbuji/
> http://copia.ogbuji.net
> http://www.linkedin.com/in/ucheogbuji
> http://twitter.com/uogbuji
>

Received on Tuesday, 11 September 2012 01:38:20 UTC