- From: Martin Duerst <duerst@it.aoyama.ac.jp>
- Date: Tue, 06 Nov 2007 18:33:21 +0900
- To: www-validator@w3.org
Frank Ellermann wrote:
>Olivier Thereaux wrote:
>
>> Now, if today the HTML 4.01 and XHTML 1.0 specs and above were
>> updated to say "IRIs" instead of "URIs", what would you do?
>
>Maybe ditch the W3C and post the reasons in an Internet Draft.
>I'd certainly consider it as unethical.

I can understand your feelings, but they'd only clarify what they
meant when they recommended, in
http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1,
to convert URIs with non-ASCII characters to UTF-8 and then to use
percent-encoding, or what they meant when they used %URI; but
declared it as being CDATA in the DTD.

>RFC 3987 does not "update" 3986.

Very correct. It was never intended as that; it was intended to serve
as a stable specification for all those specs (including HTML4) that
wanted to use the concept but had to describe it in roundabout
language.

>The specs should be updated
>with s/2396/3986/g, s/3066/4646/g, and similar clerical tasks,
>e.g. explaining why xml:lang is forced to be still an NMTOKEN
>wrt these document types.
>
>But for incompatible modifications we need new document types.

The new document type would not differ at all in functionality from
the old one. The only changes might be comments and the names of
parameter entities, but as with programs, that doesn't change the
functionality at all.

>Not worldwide "upgrade your browser" campaigns, some users
>can't, and besides it's completely unnecessary, all IRIs by
>definition have an equivalent URI working with "any browser".

Yes, but for people who can actually read and understand "испытание"
better than "testing", испытание may be very helpful, whereas
%D0%B8%D1%81%D0%BF%D1%8B%D1%82%D0%B0%D0%BD%D0%B8%D0%B5
would be just garbage for them.

>> Saying that IRIs should not be used because they break in
>> legacy software, is an argument I have sympathy for, but
>> have trouble accepting.
>
>I'm not surprised if folks active in the W3C don't care much
>about "backwards compatibility".
>But admittedly I was very
>surprised when you introduced "let's not care about formally
>valid" as new concept.

Formally valid means valid according to the DTD, I guess.
In this respect, IRIs have always been valid.

>A user armed with an old text mode
>browser could take out the ICANN IDN test, AFAIK "formally
>invalid" is a FAIL in any accessibility test, isn't it ?

I'm not sure what you are after here, but if you want to claim
that IRIs are somehow anti-accessibility, then I think you should
consider the following two points:

a) There are temporary accessibility issues and long-term
   accessibility issues. Temporary accessibility issues are
   issues of the kind "The current screen readers/audio browsers/...
   only support foo, so in order to be accessible, use foo, not
   bar". Once the technology has caught up (and accessibility
   technology improves in the same way other technology improves),
   such a requirement may no longer apply.

b) A Russian screen reader will definitely do a better job with
   испытание than with
   %D0%B8%D1%81%D0%BF%D1%8B%D1%82%D0%B0%D0%BD%D0%B8%D0%B5.
   Other screen readers may have problems, but then, испытание
   will mostly appear in Russian documents.

>> This reminds me of the situation whereby, in Japan, one
>> still can't safely use unicode in mails, because so many
>> MUAs or webmails just don't support it.
>
>Maybe they have plausible reasons why they don't need or
>don't like it. BTW, the (formally valid) IDN test page
>I've created last week was the first XHTML page where I
>actually needed UTF-8. Now I'm curious what browsers do
>with an IRI in a legacy charset. RFC 3987 allows this.

For some tests, please see
http://www.sw.it.aoyama.ac.jp/2005/iritest/, in particular
the "Legacy Human" section at
http://www.sw.it.aoyama.ac.jp/2005/iritest/HTML/index.html.

>>> Sooner or later validators will be fixed to validate
>>> URIs, what with all those "URI exploits" we've seen in
>>> the last weeks for XP after the installation of IE7.
>
>> This is irrelevant to the discussion about IRIs. Please
>> don't use internationalization as a scapegoat for bad
>> coding.
>
>It's relevant for the discussion of bug 4916 submitted
>by you 2007-08-07. If that bug is fixed it might also
>detect IRIs where only URIs are allowed.

In the original mail, Olivier actually wrote:

>>>> 1) a parser to check that a given string is a proper URI/IRI
>>>> This surely already exists, hopefully as open source code, or even
>>>> better, as a perl module. Does anyone want to investigate this?

That then got shortened in the bug report. But there is a rather
fundamental reason why this actually may be a bad idea: URIs/IRIs
are supposed to be very flexible. If somebody came along tomorrow
with a very great idea for an extension to the URI syntax, and the
community agreed with that extension, even if it wouldn't fit the
current syntax definition, then this would lead to an update of
the URI spec.

Let me illustrate the idea behind the above with a somewhat more
concrete example: If you want to create some software that tries
to spot potential mistakes in an HTML document, I'd guess you'd
surely flag something like
   <a href='htpp://www.w3.org'...
But even after reading the URI spec in detail, you will find nothing
there that says it's illegal. Somebody could register the "htpp:"
scheme at any time.

As another example, consider the following:
   <img src='http://example.org/top.html'>
Again, this clearly looks like a mistake; one wouldn't use a link
to a Web page in an src attribute. But you never know when some
browsers might actually implement something like a thumbnail view
of a web page in such a case (apart from the fact that you also
don't know that top.html is a Web page and not some image).

Again,
   <img src='mailto:abc@example.com'>
looks like nonsense, but again, it may make sense in the future.
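To make the distinction between "invalid" and "merely suspicious"
concrete, here is a minimal sketch in Python. It assumes a checker
that only looks at the scheme of an absolute reference, using the
scheme production from RFC 3986, section 3.1. The names
check_reference and WELL_KNOWN_SCHEMES are hypothetical, and the
scheme list is a made-up illustration, not the IANA registry.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )  (RFC 3986, section 3.1)
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")

# Hypothetical short list, for illustration only; not the IANA registry.
WELL_KNOWN_SCHEMES = {"http", "https", "ftp", "mailto"}


def check_reference(ref):
    """Return (syntax_ok, warning) for the scheme of an absolute reference.

    syntax_ok says whether the scheme matches the generic syntax.
    warning is advisory only: an unfamiliar scheme is not an error.
    """
    scheme, _, _ = ref.partition(":")
    syntax_ok = bool(SCHEME_RE.fullmatch(scheme))
    warning = None
    if syntax_ok and scheme.lower() not in WELL_KNOWN_SCHEMES:
        warning = "unfamiliar scheme '%s:' (possible typo)" % scheme
    return syntax_ok, warning


# 'htpp:' is syntactically fine, so the most a tool can do is warn.
print(check_reference("htpp://www.w3.org"))
print(check_reference("mailto:abc@example.com"))
```

With such a split, a lint-style tool can still point out 'htpp:' to
the author, while a validator has no grounds to reject it.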
The point is that URIs and IRIs are intended to be a very general
mechanism to connect resources on the Web, and that any restriction
has to be considered very carefully.

>Admittedly almost impossible for a validator based on
>DTDs, maybe you end up with a clumsy hack working only
>for a few very important document types.

Yes, trying to restrict the syntax in a field declared as
CDATA (for attributes) or PCDATA (for elements) based only
on the name of the field, the name of a parameter entity,
or a comment found nearby would be difficult.

>>> I can still tell you the day when the W3C validator
>>> started to flag &#128; as invalid on a windows-1252
>>> page. I was working on this page, it was stunning.
>
>> There once was a bug, and IIRC it was fixed in a few
>> hours.
>
>Two days after 911, it's good if it only took you a few
>hours. But it took me several months to figure out why
>I need octet 128 instead of NCR &#128;.

Sorry to be a bit direct here, but if it took you several
months to figure out why you need octet 128 rather than
NCR &#128;, then at least at that point in time, you
didn't really know much about the fundamentals of Web
internationalization. If you look at the SGML declaration
and at the DTD, it's very clear that &#128; is illegal
and non-valid.

>> Now, how is that relevant to the discussion at hand?
>
>If everybody and his dog start to use IRIs in document
>types where it's not permitted, and some time later an
>improved validator informs them that this was invalid,
>the disturbed users will be annoyed.

Well, this is a circular argument. "Let's annoy users now
so that we don't need to annoy them later." doesn't make
sense if "Let's not annoy them at all." is the best option
anyway.

>> The current XHTML DTD says that URIs are CDATA
>
>Sure, the details are specified in the prose, the DTD
>only uses an entity name %URI; It could also use %FOO;
>or %IRI; as name. Likewise the RFC 2396 in the DTD is
>only a comment.

Exactly.
Validation means validation according to the DTD, and
parameter entity names and comments don't affect this process.

Regards,    Martin.

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
Received on Tuesday, 6 November 2007 09:34:54 UTC