Re: Request/Question: XML specification - unclear character data definition

From: John Cowan <cowan@mercury.ccil.org>
Date: Fri, 11 Apr 2014 12:34:07 -0400
To: Jan.Petvalsky@tieto.com
Cc: xml-editor@w3.org, tbray@textuality.com, jeanpa@microsoft.com, cmsmcq@w3.org, elm@east.sun.com, cowan@ccil.org
Message-ID: <20140411163407.GJ18618@mercury.ccil.org>

Note: I am speaking for myself only, and not for the XML Core WG, never
mind anyone else.

Jan.Petvalsky@tieto.com scripsit:

> I can see that there are differences in XML versions specification
> regarding to character data:
> http://www.w3.org/TR/xml11/#NT-Char
> http://www.w3.org/TR/REC-xml/#NT-Char
> This unclear definition make that issue that one XML document could
> be valid for one XML processor, but not for others.

Rather, there are two different kinds of XML documents, XML 1.0 and XML
1.1.  An XML processor may accept XML 1.0 only, or XML 1.1 only, or both.
(For that matter, it might accept JSON or any other format as well.)

> It should be fixed that at least from specification definition that
> any UNICODE character is valid.

The character U+0000 was intentionally rejected for XML 1.1 character
content.  This is unlikely to change in future.

> It is possible to rewrite that by “&#” for some processors, but
> this not accepted by others.

It is acceptable in XML 1.1 documents, but not in XML 1.0 documents.
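
To make the difference concrete, here is a minimal Python sketch of the
two Char productions linked above, plus the RestrictedChar production
of XML 1.1.  The function names are hypothetical, chosen only for this
illustration; this is not the API of any XML library, and it is not a
validator.

```python
# Code point ranges transcribed from the Char productions of XML 1.0 and
# XML 1.1, and from the RestrictedChar production of XML 1.1.
XML10_CHAR = [(0x9, 0xA), (0xD, 0xD), (0x20, 0xD7FF),
              (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]
XML11_CHAR = [(0x1, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]
XML11_RESTRICTED = [(0x1, 0x8), (0xB, 0xC), (0xE, 0x1F),
                    (0x7F, 0x84), (0x86, 0x9F)]


def _in_ranges(cp, ranges):
    """Return True if code point cp falls in any inclusive (lo, hi) range."""
    return any(lo <= cp <= hi for lo, hi in ranges)


def allowed_as_reference(cp, version):
    """A character reference must refer to a character matching Char."""
    return _in_ranges(cp, XML11_CHAR if version == "1.1" else XML10_CHAR)


def allowed_literally(cp, version):
    """In XML 1.1, restricted characters may not appear literally."""
    if version == "1.1":
        return (_in_ranges(cp, XML11_CHAR)
                and not _in_ranges(cp, XML11_RESTRICTED))
    return _in_ranges(cp, XML10_CHAR)


if __name__ == "__main__":
    for cp in (0x0, 0x1, 0x9, 0x85):
        for version in ("1.0", "1.1"):
            print(f"U+{cp:04X} XML {version}: "
                  f"reference={allowed_as_reference(cp, version)} "
                  f"literal={allowed_literally(cp, version)}")
```

Under these productions, U+0001 is acceptable only as a character
reference in an XML 1.1 document, and U+0000 is acceptable in neither
version, in either form.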

> I hope that you read and not put to bin. I hope that you also mark
> that XML version that is obsolete as obsolete.

XML 1.0 is not obsolete.  XML 1.1 is intended only for specific use
cases that XML 1.0 cannot handle.

> See also: http://stackoverflow.com/questions/9526951/xml-and-unicode-specifications-whats-a-legal-character

There is indeed a problem with Section 2.2, which reads "Legal characters
are tab, carriage return, line feed, and the legal characters of
Unicode and ISO/IEC 10646."  Obviously, TAB, CR, and LF are already
legal characters.  As I just noted on that page, the First Edition of
XML (1998) read "the legal graphic characters of Unicode." For whatever
reason, the word "graphic" was removed from the Second Edition of 2000,
perhaps because it is inaccurate: XML allows many characters that are
not graphic characters.  A correct, if not necessarily clear, revision
would be to add the words "except those in general category Cc".
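
For anyone who wants to check how general category Cc lines up with
the XML 1.0 Char production, the following Python sketch compares the
two.  It relies on the standard unicodedata module, so the result
reflects the Unicode data shipped with the interpreter; the helper name
is_xml10_char is hypothetical.

```python
import unicodedata


def is_xml10_char(cp):
    """XML 1.0 Char production, transcribed as code point ranges."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


# Every code point whose general category is Cc (control).
cc = [cp for cp in range(0x110000)
      if unicodedata.category(chr(cp)) == "Cc"]

# Split them by whether the XML 1.0 Char production admits them.
cc_allowed = [cp for cp in cc if is_xml10_char(cp)]
cc_rejected = [cp for cp in cc if not is_xml10_char(cp)]

if __name__ == "__main__":
    print("Cc and allowed: ", " ".join(f"U+{cp:04X}" for cp in cc_allowed))
    print("Cc and rejected:", " ".join(f"U+{cp:04X}" for cp in cc_rejected))
```

By this check, the Cc characters that the XML 1.0 Char production admits
are TAB, LF, CR, U+007F, and the C1 controls U+0080 through U+009F; the
ones it rejects are the remaining C0 controls.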

> PS: Frankly speaking I would like to have XML 2.0 that it will be
> called short-xml, so pair tag will be possible to write in short form
> (e.g. <tag>…</tag> is same as <tag>…</>).

This also is unlikely to happen.  The failure of XML 1.1 has made us
very unwilling to work on any successor format that does not have *major*
advantages over XML 1.0.

John Cowan          http://www.ccil.org/~cowan        cowan@ccil.org
At times of peril or dubitation,
Perform swift circular ambulation,
With loud and high-pitched ululation.
Received on Friday, 11 April 2014 16:34:40 UTC