- From: Martin Duerst <duerst@w3.org>
- Date: Thu, 18 Mar 2004 16:47:55 -0500
- To: public-webarch-comments@w3.org
- Cc: w3c-i18n-ig@w3.org
Dear TAG,

Below please find the first part (apart from our comment on dependencies between the WebArch document, the revision of the URI spec, and the IRI draft) of the I18N WG (Core TF) comments on the WebArch document. Please send replies to these comments to the I18N IG mailing list (w3c-i18n-ig@w3.org), not only to me.

[1] The term 'language' is used both for natural language (in the Abstract) and for (document/data) format. In the latter case, 'format' should be used everywhere, to avoid confusion. This would be the same as in Charmod.

[2] 'Oaxaca' is used in many examples. We are glad to see a non-US example, but we are afraid that this may lead to questions on how to pronounce it in large parts of the world. We suggest replacing it with something simpler; one idea might be 'Lima' (although the weather is not as good there as in Oaxaca :-( ).

[3] Section 1, figure: please show charset in Content-Type.

[4] 1.2.1, first bullet: This and http://www.w3.org/2001/tag/doc/whenToUseGet.html#i18n are basically okay. However, the word 'limitations' (related to i18n) may give the wrong impression; it is not clear what the i18n concerns are. We suggest that you describe the issue more clearly, e.g. as "The design works reasonably well, although there are issues related to the transmission of non-ASCII characters." (Please note the use of the word 'issues' rather than 'limitations'; although there are indeed some limitations as to the combinations of encodings in form pages and in requests, due to well-established practices based on HTML 4, there is no fundamental limitation to the basic use of non-ASCII characters.) Also, please make sure the reader can go directly to the relevant section in the finding. You may also want to point to the FAQ on "What is the best way to deal with encoding issues in forms that may use multiple languages and scripts?"
http://www.w3.org/International/questions/qa-forms-utf-8.html

[5] 1.2.1, third bullet: "Some authors use the META/http-equiv approach to declare the character encoding scheme of an HTML document. By design, this is a hint that an HTTP server should emit a corresponding "Content-Type" header field. In practice, the use of the hint in servers is not widely deployed. Furthermore, many user agents use this information to override the "Content-Type" header sent by the server. This works against the principle of authoritative representation metadata."

This is rather misleading on several points:

- "By design, this is a hint that an HTTP server should emit a corresponding "Content-Type" header field.": It is correct that, by design, this WAS a hint to the server. Practice has shown that this wasn't such a good idea, and practice has found a better use for it. The WebArch doc should mainly look forward; mentioning misguided or reused designs can help, but they should not be presented as if it would be better to go back to them. So, e.g., reword to "this was (originally) intended..."

- "In practice, the use of the hint in servers is not widely deployed.": Is it actually deployed at all? Any pointers would be appreciated.

- "Furthermore, many user agents use this information to override the "Content-Type" header sent by the server.": Tests we have done recently with reasonably new browsers have shown that none of the major browsers do this. In particular, if 'many user agents' is a euphemism for IE6win, this is actually wrong (in case you do your own tests, please clear your cache if you change encodings or encoding labels for files; encodings are very 'sticky' in IE6, but for fresh pages, it gets things right). We suggest rewording e.g. to "user agents use this information if there is no 'charset' information in the "Content-Type" header" and/or "some have in the past..."
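
To make the rewording suggested in [5] concrete, here is a minimal Python sketch of the precedence it describes: the 'charset' parameter of the HTTP "Content-Type" header is used when present, and the META/http-equiv declaration is consulted only as a fallback. The function names and the simplified regular expression are assumptions made for illustration; this is not a description of how any particular browser is implemented.

```python
import re
from email.message import Message

def charset_from_content_type(value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()

# Simplified pattern (an assumption for this sketch): http-equiv must
# precede content, and the content value must be quoted.
META_RE = re.compile(
    rb'<meta[^>]+http-equiv\s*=\s*["\']?content-type["\']?'
    rb'[^>]*content\s*=\s*["\']([^"\']+)["\']',
    re.IGNORECASE,
)

def effective_charset(header_value, body):
    """Outer information (the HTTP header) has priority over inner (META)."""
    if header_value:
        charset = charset_from_content_type(header_value)
        if charset:
            return charset
    match = META_RE.search(body[:1024])  # only look near the start of the document
    if match:
        return charset_from_content_type(match.group(1).decode("ascii", "replace"))
    return None  # unknown: no arbitrary default is applied
```

With this sketch, a response whose header says "text/html; charset=UTF-8" is treated as UTF-8 even if the body carries a conflicting META declaration; the META value is used only when the header has no 'charset' parameter.
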

[6] 1.2.3 Error Handling: It might be very good to say something about character encoding/labeling errors.

[7] End of 2. (just before 2.1): "Of course, what an agent does with a URI may vary." It would be better to mention more explicitly that this can include, e.g., language negotiation,...

[8] Section 2.1 (URI comparison), 2nd para. The first sentence establishes that character-by-character inequality doesn't mean that the resource referred to is different. But the subsequent sentences say basically the opposite (that this is the most straightforward way to find resource equality). Break into two paragraphs, or otherwise improve the wording so that it confuses the reader less.

[9] S2.1, 3rd para. The casing example for weather.example.com/Oaxaca is a bit obscure. Perhaps spell out the fact that case sensitivity matters to some systems?

[10] 2.1: "For instance, one might reasonably create URIs that begin with "http://www.example.com/tempo" and "http://www.example.com/tiempo" to provide access to resources by users who speak Italian and Spanish.": It is nice to see an i18n-related example. However, there are all kinds of issues with this. This is not necessarily a good way to organize information in different languages on a server, in particular if the information is highly parallel. It may be better to find another example, for example with two English words. Also, 'tempo' is an English word with a different meaning. Perhaps German "Wetter" is better?

[11] Section 2.1, 4th para. "Likewise, URI consumers should ensure URI consistency. For instance, when transcribing a URI, agents should not gratuitously escape characters. The term "character" refers to URI characters as defined in section 2 of [URI]". The meaning of 'character' in the second sentence is not clarified by section 2 of the URI draft, which deals with details such as percent escaping of characters. Section 1 of the URI draft *points to* a definition of 'character'.
This is an area where the presence of IRI would be welcome. It might be more useful to describe what "gratuitous" means in this context (there is currently no definition; we *think* it means "don't escape characters unless it breaks usability", i.e. I would expect to see %20 instead of space, because space breaks the URI semantically).

[12] 'The term "character"...': please say instead: 'The term "character" in the foregoing sentence...' (character takes on other meanings later...)

[13] S2.3, para 2. Shouldn't there be a "but..." at the end of this paragraph? Yes, URI ambiguity is not the same thing as natural language ambiguity... but what is it? Please make the example more direct. The current text reads: "URI ambiguity should not be confused with ambiguity in natural language. The English statement "'http://www.example.com/moby' identifies 'Moby Dick'" is ambiguous because one could understand the phrase "Moby Dick" to refer to distinct resources: a particular printing of this work,..." This is highly ambiguous (sic!). Is "'http://www.example.com/moby' identifies 'Moby Dick'" a statement in natural language, showing how natural language can be ambiguous? Or is it a statement about a URI, showing how URIs can be ambiguous? Better to change it to say: "The URI http://www.example.com/moby is used ambiguously if it is used for more than one of the following: a particular printing of this work,..."

[14] S2.3 URI ambiguity. This may imply or suggest that natural language differences in the representation of a resource are considered bad. There should be examples of both good and bad ambiguity (or, in WebArch terminology, different but consistent representations of the same resource as opposed to the use of a single URI for different resources), with language negotiation being a good example and wholly different resources being a bad example.

[15] Section 2.3.1. There is a word missing in "URI ambiguity arises >>when<< (or 'as') a URI is used to identify two different Web resources."
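
As an illustration of the "gratuitous escaping" point in [11], the following Python sketch shows how escaping a character that needs no escaping makes two URIs differ under character-by-character comparison, and how undoing only such escapes restores the match. The helper name is hypothetical, and the set of characters treated as never needing escaping is an assumption: the sketch uses the 'unreserved' set of RFC 3986 (letters, digits, '-', '.', '_', '~'), which may differ from the set in the URI draft under discussion.

```python
import re
import string

# Assumption: the RFC 3986 'unreserved' set; escaping these is never required.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def unescape_unreserved(uri):
    """Undo percent-escapes of unreserved characters; keep all other escapes."""
    def replace(match):
        char = chr(int(match.group(1), 16))
        # Escapes that are kept are only case-normalized (%2f becomes %2F).
        return char if char in UNRESERVED else match.group(0).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", replace, uri)

plain = "http://weather.example.com/oaxaca"
gratuitous = "http://weather.example.com/%6Faxaca"  # 'o' escaped for no reason

differs_as_strings = plain != gratuitous
matches_after_cleanup = unescape_unreserved(gratuitous) == plain
```

An escape such as %20 is left untouched by this sketch, since a literal space cannot appear in a URI and the escape is therefore not gratuitous.
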

[16] Good practice: URI opacity: This says "Agents making use of URIs MUST NOT attempt to infer properties of the referenced resource except as licensed by relevant specifications." Earlier, the document defines 'agent' as covering both humans and machines. This good practice is not too difficult to follow for software agents (although it seems to disallow e.g. Google from considering pieces of a URI in its algorithms, e.g. the 'weather' and 'oaxaca' in 'http://weather.example.com/oaxaca'; we're not sure disallowing this is intended or makes sense). However, this practice is *impossible* to follow for humans: it is just completely impossible to look at http://weather.example.com/oaxaca and NOT infer that this may be about 'weather' or 'oaxaca'. The WebArch document itself is using this connection all the time. This is important in connection with IRIs.

[17] 3.1, list item 2: "The XLink 1.0 [XLink10] specification, which defines the href attribute in section 5.4, states that "The value of the href attribute must be a URI reference as defined in [IETF RFC 2396], or must result in a URI reference after the escaping procedure described below is applied."" This refers to the conversion from an IRI to a URI. It would be a good occasion to mention IRIs.

[18] 3.4.1: Maybe mention that editing tools may be more strict than simple user agents.

[19] 3.4.1: We believe that charset handling, the way it is currently specified in various specs (i.e. outer information has priority over inner information), is basically okay (with the exception of the iso-8859-1 default given in the HTTP spec, which is irrelevant in practice, and the us-ascii default for text/foo+xml, which makes text/foo+xml rather useless). It might be good to reach some consensus about this, and document it.

[20] 3.4.1 (and later): "Furthermore, server managers can help reduce the risk of error through careful assignment of representation metadata (especially that which applies across representations).
The section on media types for XML presents an example of reducing the risk of error by providing no metadata about character encoding when serving XML.": This seems to pick out a somewhat arbitrary detail, without stating the much more important underlying principles, such as:

- Always make sure you know what the character encoding of a document or message is.
- Make sure that it is easy for server managers and authors to configure and test metadata on the server, to make sure it is correct.
- No arbitrary defaults in specs.
- No out-of-the-box installations with arbitrary settings.

The description of the example is also too general, because there are ways to implement or operate a server that make it much easier and more appropriate to put the 'charset' into the header than into the body, e.g. when producing content in a pipeline from a database.

Regards,
Martin.
Received on Thursday, 18 March 2004 17:34:41 UTC