- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Sat, 11 Nov 2006 15:35:56 +0200
On Nov 11, 2006, at 01:13, François Yergeau wrote:

> Henri Sivonen wrote:
>> Does C003 in Charmod outlaw bdo?
>
> Nope. bdo is simply an assertion by the author that the
> presentation order is not the usual one for the script. The text
> is still stored, interchanged and processed in logical order.

OK.

>> I think C073 shouldn't render a document non-conforming.
>
> Disagree. C073 is a SHOULD NOT and it should carry over to HTML
> conformance stricto sensu (i.e. as per RFC 2119).

I agree that, in general, PUA characters aren't suitable for public
interchange. However, I don't think it is necessarily a good idea to
make a conformance checker proclaim documents that contain them
non-conforming. I do think that a warning is called for. See also
C040.

There are cases when PUA characters are the best available way to
communicate something:
http://www.evertype.com/standards/csur/

I have tried hard to avoid marketing the would-be conformance
checking service the same way fanboys market the W3C Validator. I
intend the conformance checking service to be a tool that helps
authors, not a graven image that needs to be satisfied at all cost.

Regardless, I need to consider what kind of behavior the conformance
checking service could induce among those who don't see the big
picture but want their documents to have zero errors reported.

If the use of PUA characters were an error, the people who want zero
errors from a conformance checker at all cost could move from
violating C073 to violating C076, which would be much worse but not
detectable by a conformance checker. (I'm not suggesting that Everson
& Cowan would do this, but, you know, others. :-)

>> Would it be too annoying to emit a warning? Perhaps one warning
>> per document rather than per character?
>
> No more than one per doc, please!

OK.

>> I think authors wouldn't like warnings on C047 and C048.
>
> Perhaps, perhaps not. Some authors want their apps to keep them as
> close to spec as possible.
> Authoring tools should certainly abide by C047 and C048 when
> generating escapes on behalf of the author.
>
>> Moreover, I think it should be concluded that Charmod SHOULD
>> violations don't make an (X)HTML5 document non-conforming. Correct?
>
> Totally incorrect, IMHO. RFC2119 SHOULD's are real conformance
> requirements that a spec admits can be disobeyed in some cases,
> given good enough reasons. Absent such good reasons, they are
> requirements, period.

C047 does not have a hard machine-checkable definition. It does not
cite testing particular Unicode character properties, for example.
Moreover, numeric escapes of characters of any kind are expanded by
the parser and are, therefore, totally harmless in the parsed
document tree, because you can't even detect them there.

C048, as far as text/html goes, is even bad advice in terms of really
backward backwards compatibility. In the case of XML, both decimal
and hexadecimal have been supported from day 1. However, both decimal
and hexadecimal are equally right as far as the XML 1.0 spec is
concerned and neither causes any technical trouble over the other in
conforming XML processors. Making the Charmod SHOULD an error would
mean proclaiming documents non-conforming over an issue that causes
absolutely no technical trouble in processing with conforming parsers
but is about the view source convenience preference of Charmod
authors! (Besides, there are lookup interfaces that support decimal:
http://www.eki.ee/letter/ )

I think it would be unwise to make an (X)HTML5 conformance checking
service cry wolf on C047 and C048. It would only undermine the
usefulness of a conformance checking service for authors and would
dilute the perceived seriousness of errors.

But let's look at all the [C] SHOULDs (quoting from Charmod):

> C022 [S] [I] [C] Character encodings that are not in the IANA
> registry SHOULD NOT be used, except by private agreement.

I guess I could make that an error.
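To illustrate the earlier point that numeric escapes are expanded by
the parser and cannot be told apart afterwards, here is a small
sketch in Python. It is not code from the conformance checking
service; the choice of Python's standard xml.etree.ElementTree and
html modules, and the sample character U+00E9, are assumptions made
only for illustration.

```python
import xml.etree.ElementTree as ET
from html import unescape

# The same character (U+00E9) written literally, as a decimal escape,
# and as a hexadecimal escape. The sample strings are illustrative.
xml_variants = ["<p>\u00e9</p>", "<p>&#233;</p>", "<p>&#xE9;</p>"]
html_variants = ["\u00e9", "&#233;", "&#xE9;"]

# Parse each XML variant and keep only the resulting text node.
xml_texts = [ET.fromstring(doc).text for doc in xml_variants]

# Expand HTML character references the way a text/html consumer would.
html_texts = [unescape(s) for s in html_variants]

# After parsing, nothing records which spelling the author used.
all_xml_equal = len(set(xml_texts)) == 1
all_html_equal = len(set(html_texts)) == 1
print(all_xml_equal, all_html_equal)
```

A checker working on the parsed tree therefore has nothing to flag
for C047 or C048; only a source-level pass could see the difference.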
> C049 [I] [C] The character encoding of content SHOULD be chosen so
> that it maximizes the opportunity to directly represent characters
> (ie. minimizes the need to represent characters by markup means
> such as character escapes) while avoiding obscure encodings that
> are unlikely to be understood by recipients.

First, Charmod doesn't define a conclusive list of non-obscure
encodings.

The XML side warns if the encoding is not US-ASCII, ISO-8859-1, UTF-8
or UTF-16. (The XML spec only requires UTF-8 and UTF-16 to be
supported, so it follows that anything else is optional and,
therefore, unsafe. However, I don't warn on US-ASCII or ISO-8859-1,
because I don't want to cry wolf and I've never seen evidence of XML
parsers that didn't also support US-ASCII and ISO-8859-1. I do have
evidence of a popular parser that only supports those four by
default: expat. And there's a lot of ASCII-only XML out there that is
declared ISO-8859-1, which is harmless in practice.)

As much as I'd like to be able to force everyone to use UTF-8, I am
uncomfortable about making the use of an optionally-supported
encoding an error, since the XML 1.0 spec intentionally leaves
encoding support open-ended. Of course, I could deviously disable a
host of decoders and claim implementation limitations. :-)

On the text/html side, it wouldn't be useful, considering the
practical backwards-compatibility goals of the WHAT WG, to complain
about encodings that "everyone" supports. A passable practical
definition could be the intersection of the IANA-registered encodings
supported by IE6, Opera 9, Firefox 2.0, Safari 2.0.x, Sun JDK 1.4.2
and Python 2.4. (Make that Python 2.3 if you want to take a point
against the CJK encoding soup.)

Also, when an encoding is de facto supported, it is rather useless,
in my opinion, to analyze if it is optimal in terms of byte count and
to proclaim the document non-conforming if it isn't.
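The XML-side policy described above (no warning for US-ASCII,
ISO-8859-1, UTF-8 or UTF-16, a warning for anything else) could be
sketched roughly as follows. This is a hypothetical illustration in
Python, not the service's actual code: the function name
encoding_warnings is made up, and alias matching leans on Python's
own codec registry, which is only an approximation of the IANA
registry.

```python
import codecs

# Labels the XML side accepts without a warning, per the text above.
PREFERRED = ("US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16")

# Map Python's normalized codec name to the preferred label.
# Assumption: Python's alias table stands in for the IANA registry.
SAFE = {codecs.lookup(name).name: name for name in PREFERRED}


def encoding_warnings(label):
    """Return warning strings for a declared encoding label (sketch)."""
    try:
        codec_name = codecs.lookup(label).name
    except LookupError:
        return ["unknown encoding %r" % label]

    preferred = SAFE.get(codec_name)
    if preferred is None:
        # Optional in XML 1.0: a warning, deliberately not an error.
        return ["%r is not guaranteed to be supported by XML parsers" % label]
    if label.upper() != preferred:
        # Known encoding, but not labelled with its preferred name.
        return ["%r is an alias; the preferred name is %r" % (label, preferred)]
    return []


for declared in ("UTF-8", "latin1", "windows-1252", "x-no-such-charset"):
    print(declared, encoding_warnings(declared))
```

Returning a list of warnings rather than raising keeps the policy
question (warning versus error) separate from detection.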
> C024 [I] [C] Content and software that label text data MUST use
> one of the names required by the appropriate specification (e.g.
> the XML specification when editing XML text) and SHOULD use the
> MIME preferred name of a character encoding to label data in that
> character encoding.

I already warn if the preferred name isn't used, but I guess I could
make it an error.

> C073 [C] Publicly interchanged content SHOULD NOT use codepoints
> in the private use area.

> C047 [I] [C] Escapes SHOULD only be used when the characters to be
> expressed are not directly representable in the format or the
> character encoding of the document, or when the visual
> representation of the character is unclear.

> C048 [I] [C] Content SHOULD use the hexadecimal form of character
> escapes rather than the decimal form when there are both.

Already discussed above.

> C054 [I] [C] Users of specifications (software developers, content
> developers) SHOULD whenever possible prefer ways other than string
> indexing to identify substrings or point within a string.

Not machine-checkable.

--
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/
Received on Saturday, 11 November 2006 05:35:56 UTC