- From: James Clark <jjc@jclark.com>
- Date: Tue, 11 Sep 2012 08:37:32 +0700
- To: David Lee <David.Lee@marklogic.com>
- Cc: Uche Ogbuji <uche@ogbuji.net>, "public-microxml@w3.org" <public-microxml@w3.org>
- Message-ID: <CANz3_EY3FR6ZbJr5J3DCyJUqUHCVNmZ3Mpo=fM6vkrmsoQX7Yg@mail.gmail.com>
I am hopeful that non-draconian error handling in MicroXML parsers will ease the pain here.

James

On Tue, Sep 11, 2012 at 8:27 AM, David Lee <David.Lee@marklogic.com> wrote:

> I can buy that … it's splitting hairs at this point, and consistency with
> both Unicode and legacy XML wins.
>
> What is at the back of my mind is the pain point of JSON allowing a greater
> range of codepoints, which really bites the big one sometimes even if they
> are invalid. For example, in my tests of sucking Twitter feeds, I get about
> 1:1000 documents with an invalid XML character (but “valid” in JSON … well,
> “valid” as in: the Twitter feed produces “JSON” and the character got in
> there …).
>
> That character itself is usually bogus and not “useful”, but what is
> painful is typical bulk-processing XML tools dying a flaming death at that
> point … But I digress. MicroXML must suffer/benefit from the same decision
> with respect to Unicode as XML … although, to put a camel's nose in the
> tent, we might want to open the issue ‘wafer thin’ to allow processors to
> toss or substitute invalid characters instead of dropping dead.
>
> -----------------------------------------------------------------------------
> David Lee
> Lead Engineer
> MarkLogic Corporation
> dlee@marklogic.com
> Phone: +1 650-287-2531
> Cell: +1 812-630-7622
> www.marklogic.com
>
> This e-mail and any accompanying attachments are confidential. The
> information is intended solely for the use of the individual to whom it is
> addressed. Any review, disclosure, copying, distribution, or use of this
> e-mail communication by others is strictly prohibited. If you are not the
> intended recipient, please notify us immediately by returning this message
> to the sender and delete all copies. Thank you for your cooperation.
>
> *From:* Uche Ogbuji [mailto:uche@ogbuji.net]
> *Sent:* Monday, September 10, 2012 6:22 PM
> *To:* public-microxml@w3.org
> *Subject:* Re: 12. Are C1 controls and Unicode non-characters disallowed?
>
> On Mon, Sep 10, 2012 at 3:24 PM, David Lee <David.Lee@marklogic.com>
> wrote:
>
> > How does adding to the list of characters a parser can handle simplify
> > the language?
>
> But that's not what it's doing. It's actually reducing the valid
> characters.
>
> > To my reading that makes the spec, and the language, more complex (it has
> > higher information count because it takes more rules to define what not
> > to do ...
>
> No, actually. The MicroXML production has the same number of rules. The
> proposed change is to make the MicroXML rules consistent with those of
> Unicode, which is the basis of XML's and MicroXML's. That means defining
> fewer exceptions from the Unicode rules in MicroXML, which is simpler.
> It's also simpler in practical implementation because it would mean
> MicroXML becomes consistent with most Unicode tools to be used in
> association with MicroXML, including those used to write parsers.
>
> --
> Uche Ogbuji http://uche.ogbuji.net
> Founding Partner, Zepheira http://zepheira.com
> http://wearekin.org
> http://www.thenervousbreakdown.com/author/uogbuji/
> http://copia.ogbuji.net
> http://www.linkedin.com/in/ucheogbuji
> http://twitter.com/uogbuji
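For illustration only, the "toss or substitute" idea raised in the quoted message might look roughly like the sketch below. It is not from the thread: the function name `sanitize_for_xml` and the choice of U+FFFD as the substitute are assumptions, and the character class follows the XML 1.0 Char production rather than whatever set MicroXML ends up allowing (XML 1.0 still permits C1 controls and most non-characters, which is the very question under discussion).

```python
import re

# Anything NOT matched by the XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# Assumption: MicroXML's final rule may be stricter than this.
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def sanitize_for_xml(text, replacement="\uFFFD"):
    """Substitute characters XML 1.0 forbids; pass replacement="" to toss them."""
    return _INVALID_XML_CHARS.sub(replacement, text)
```

A bulk-processing pipeline could run text pulled from a JSON feed through such a function before handing it to an XML tool, so that one stray control character does not abort the whole batch.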
Received on Tuesday, 11 September 2012 01:38:20 UTC