Re: 12. Are C1 controls and Unicode non-characters disallowed?

On 09/10/2012 12:11 PM, Uche Ogbuji wrote:
> On Mon, Sep 10, 2012 at 10:02 AM, Stephen D Green 
> <stephengreenubl@gmail.com <mailto:stephengreenubl@gmail.com>> wrote:
>
>
>     On 10 September 2012 14:26, Uche Ogbuji <uche@ogbuji.net
>     <mailto:uche@ogbuji.net>> wrote:
>
>         ...
>         Even if it were an added burden on Web devs I have no idea how
>         it breaks the whole point of having MicroXML.  Certainly it
>         would contradict nothing in the stated goals.
>
>         ...
>
>     If this is a key factor putting off developers (or if they find
>     they are
>     using tools which don't do it all for them and that gives them hell)
>     then if MicroXML doesn't solve that, how does it hope to get take
>     up from such developers? Maybe though, these aren't the type
>     of developers being targeted by MicroXML, in which case I stand
>     corrected.
>
>
> Writing language parsers is a skill.  Sadly not all developers gain 
> that skill.  If you do not have that skill you probably shouldn't even 
> be trying to write a JSON parser.  Heck, you probably shouldn't even 
> be trying to write a config file parser ;)
>
> Luckily there are plenty of people for every language and platform who 
> can write parsers, and most other developers just do the simple, sane 
> thing: they reuse libraries created by the experts. It would be crazy 
> to make it a goal of MicroXML that every odd developer could write his 
> own parser.
>
> As to the matter at hand, anyone with those basic parser development 
> skills will find dealing with the proposed character production pretty 
> much trivial.
>
Also, ahem, if the hackish parser developer doesn't enforce these 
character restrictions, what, really, is the damage? It's hard to know 
for sure, but it's quite possible said developer will be able to go 
merrily on using his own "MicroXML lite" and never be the wiser.

When using a compliant parser, on the other hand, I think the most 
commonly noticed effect of this kind of character restriction is that 
developers get alerted to content whose character encoding is 
mislabeled.  Windows-1252 content that gets interpreted as UTF-8 will 
often end up with an illegal character.  Irritating as it is, I think 
this can be seen as a real benefit to developers, since they get an 
early failure.
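
To make the early-fail point concrete, here is a minimal Python sketch 
of how mislabeled Windows-1252 content surfaces.  The helper name 
find_c1_controls and the sample string are made up for illustration, 
and the sketch assumes the restriction under discussion rejects the 
whole C1 range U+0080..U+009F.  In this sample the UTF-8 decoder 
rejects the bytes outright, while the ISO-8859-1 label decodes 
silently and turns the curly quotes into C1 controls, which a 
restricting parser would then flag.

```python
def find_c1_controls(text):
    """Return (index, code point) pairs for C1 controls (U+0080..U+009F).

    Hypothetical helper; assumes the whole C1 range is disallowed.
    """
    return [(i, ord(ch)) for i, ch in enumerate(text) if 0x80 <= ord(ch) <= 0x9F]


# Sample content actually encoded as Windows-1252: an e-acute plus curly quotes.
raw = "caf\u00e9 \u201cquoted\u201d".encode("cp1252")

# Case 1: the bytes are labeled UTF-8. The decoder itself rejects them.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Case 2: the bytes are labeled ISO-8859-1. Decoding succeeds, but the
# curly quotes (bytes 0x93 and 0x94) come out as C1 control characters.
mislabeled = raw.decode("latin-1")
hits = find_c1_controls(mislabeled)

for index, cp in hits:
    print(f"disallowed C1 control U+{cp:04X} at offset {index}")
```

Without the C1 restriction, the second case would pass through 
unnoticed and the bad characters would only show up much later.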

In any case, some character restrictions are imposed transitively by 
XML's own restrictions, given the goal that uXML documents be XML 
documents.  Since the character set is restricted anyway, the only 
topic under discussion is which characters to restrict, and I think 
the idea here is to provide a coherent basis for that set by appealing 
to another widely accepted standard that defines the set of valid 
characters, i.e. Unicode.  Which makes sense to me, anyway.
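
As a rough sketch of what such a Unicode-based rule could look like, 
the Python below starts from the XML 1.0 Char production and then 
subtracts the C1 controls and the Unicode noncharacters.  The function 
names are invented for illustration, and the exact set is an 
assumption about the proposal being discussed, not a quotation of any 
MicroXML draft.

```python
def is_xml_char(cp):
    """XML 1.0 Char production: tab, LF, CR, and the non-surrogate ranges."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_c1_control(cp):
    """C1 control range U+0080..U+009F."""
    return 0x80 <= cp <= 0x9F


def is_noncharacter(cp):
    """Unicode noncharacters: U+FDD0..U+FDEF and U+nFFFE / U+nFFFF in every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_allowed(cp):
    """Assumed rule: an XML Char that is neither a C1 control nor a noncharacter."""
    return is_xml_char(cp) and not is_c1_control(cp) and not is_noncharacter(cp)


def first_disallowed(text):
    """Return the offset of the first disallowed character, or None if all pass."""
    for i, ch in enumerate(text):
        if not is_allowed(ord(ch)):
            return i
    return None
```

For example, first_disallowed("ok\u0085") returns 2, since U+0085 is a 
C1 control, while ordinary text returns None.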

-Mike

Received on Monday, 10 September 2012 16:26:38 UTC