- From: Martin Duerst <duerst@w3.org>
- Date: Thu, 18 Mar 2004 16:47:55 -0500
- To: public-webarch-comments@w3.org
- Cc: w3c-i18n-ig@w3.org
Dear TAG,
Please find below the first part (apart from our comment on dependencies
between the WebArch document, the revision of the URI spec, and the
IRI draft) of the I18N WG (Core TF) comments on the WebArch document.
Please send replies to these comments to the I18N IG mailing list
(w3c-i18n-ig@w3.org), not only to me.
[1] The term 'language' is used both for natural language (in the Abstract)
and for (document/data) format. In the latter case, 'format' should
be used everywhere, to avoid confusion. This would be the same as
in Charmod.
[2] 'Oaxaca' is used in many examples. Glad to see a non-US example, but
we are afraid that this may lead to questions on how to pronounce it
in large parts of the world. We suggest replacing it with something
simpler; one idea might be 'Lima' (although the weather is not as
good there as in Oaxaca :-( ).
[3] section 1, figure: please show charset in Content-type.
[4] 1.2.1, first bullet: This and
http://www.w3.org/2001/tag/doc/whenToUseGet.html#i18n
are basically okay. However the word 'limitations' (related to i18n)
may give the wrong impression; it is not clear what the i18n concerns are.
We suggest that you describe the issue more clearly, e.g. as
"The design works reasonably well, although there are issues related
to the transmission of non-ASCII characters." (please note the
use of the word 'issues' rather than 'limitations'; although there
are indeed some limitations as to the combinations of encodings
in form pages and in requests, due to well-established practices
based on HTML 4, there is no fundamental limitation to the basic
use of non-ASCII characters.)
Also, please make sure the reader can directly go to the relevant
section in the finding. Also, you may want to point to the FAQ
on "What is the best way to deal with encoding issues in forms that
may use multiple languages and scripts?"
http://www.w3.org/International/questions/qa-forms-utf-8.html
[5] 1.2.1, third bullet: "Some authors use the META/http-equiv approach
to declare the character encoding scheme of an HTML document. By
design, this is a hint that an HTTP server should emit a corresponding
"Content-Type" header field. In practice, the use of the hint in servers
is not widely deployed. Furthermore, many user agents use this information
to override the "Content-Type" header sent by the server. This works
against the principle of authoritative representation metadata."
This is rather misleading on several points:
- "By design, this is a hint that an HTTP server should emit a
corresponding "Content-Type" header field.": It's correct that
by design, this WAS a hint to the server. Practice has shown
that this wasn't such a good idea, and practice has found a
better use for it. The WebArch doc should mainly look forward;
mentioning misguided/reused designs can help, but it should not
be presented as if it would be better to go back to that design.
So e.g. reword to "this was (originally) intended..."
- "In practice, the use of the hint in servers is not widely deployed.":
Is it actually deployed at all? Any pointers would be appreciated.
- "Furthermore, many user agents use this information to override the
"Content-Type" header sent by the server.": Tests we have done
recently with reasonably new browsers have shown that none of the
major browsers do this. In particular if 'many user agents' is
shorthand for IE6win, this is actually wrong (in case you do your
own tests, please clean your cache if you change encodings or
encoding labels for files; encodings are very 'sticky' in IE6,
but for fresh pages, it gets things right).
We suggest rewording e.g. to "user agents use this information
if there is no 'charset' information in the "Content-Type" header"
and/or "some have in the past..."
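To make the suggested rewording concrete, here is a minimal sketch in Python of the behaviour we have in mind: the 'charset' in the HTTP "Content-Type" header wins, and the META/http-equiv declaration is consulted only when the header carries no 'charset'. The function names and the simplified regular expression are our own assumptions for illustration; this is not how any particular browser implements it.

```python
import re
from email.message import Message

# Simplified pattern for <meta http-equiv="Content-Type" content="...">.
# Real HTML parsing is more involved; this is an illustrative assumption.
META_RE = re.compile(
    r"""<meta[^>]+http-equiv=["']?content-type["']?[^>]*content=["']([^"']+)["']""",
    re.IGNORECASE,
)


def charset_from_content_type(value):
    """Return the 'charset' parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_param("charset")


def effective_charset(http_content_type, html_text):
    """Prefer the HTTP header; fall back to META only if the header has no charset."""
    if http_content_type:
        header_charset = charset_from_content_type(http_content_type)
        if header_charset:
            return header_charset.lower()
    match = META_RE.search(html_text)
    if match:
        meta_charset = charset_from_content_type(match.group(1))
        if meta_charset:
            return meta_charset.lower()
    return None  # no label found; the caller has to decide what to do
```

With a page that declares iso-8859-1 in META, effective_charset("text/html; charset=utf-8", page) yields "utf-8", while effective_charset("text/html", page) falls back to "iso-8859-1".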
[6] 1.2.3 Error Handling: It might be very good to say something
about character encoding/labeling errors.
[7] End of 2. (just before 2.1): "Of course, what an agent does with a URI
may vary." It would be better to mention more explicitly that this
e.g. can include language negotiation,...
[8] Section 2.1 (URI comparison). 2nd para. The first sentence establishes
that character-by-character inequality doesn't mean that the resource
referred is different. But the subsequent sentences say basically the
opposite (that this is the most straightforward way to find resource
equality). Break into two paragraphs, or otherwise improve wording
to confuse the reader less.
[9] S2.1. 3rd para. The casing example for weather.example.com/Oaxaca is a
bit obscure. Perhaps spell out the fact that case sensitivity matters
to some systems?
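As an illustration of what spelling this out could look like, the following Python sketch contrasts plain character-by-character comparison with a comparison that ignores case only where the URI syntax is case-insensitive (scheme and host). The function names are hypothetical, and the sketch deliberately ignores userinfo and percent-escape normalization.

```python
from urllib.parse import urlsplit


def same_by_string(a, b):
    """Character-by-character comparison: cheap, never gives a false positive."""
    return a == b


def same_after_case_normalization(a, b):
    """Ignore case in scheme and host only; the path stays case-sensitive.

    Simplification (our assumption): userinfo is ignored and percent-escapes
    are not normalized.
    """
    pa, pb = urlsplit(a), urlsplit(b)
    key_a = (pa.scheme.lower(), pa.hostname, pa.port, pa.path, pa.query, pa.fragment)
    key_b = (pb.scheme.lower(), pb.hostname, pb.port, pb.path, pb.query, pb.fragment)
    return key_a == key_b
```

Under this sketch, "http://Weather.Example.com/Oaxaca" and "http://weather.example.com/Oaxaca" compare equal after normalization, whereas ".../Oaxaca" and ".../oaxaca" do not, because only the server knows whether it treats paths case-sensitively.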
[10] 2.1: " For instance, one might reasonably create URIs that begin with
"http://www.example.com/tempo" and "http://www.example.com/tiempo" to
provide access to resources by users who speak Italian and Spanish.":
It is nice to see an i18n-related example. However, there are all kinds
of issues with this. This is not necessarily a good way to organize
information in different languages on a server, in particular if the
information is highly parallel. It may be better to find another example,
for example with two English words. Also, 'tempo' is an English word
with a different meaning. Perhaps German "Wetter" is better?
[11] Section 2.1. 4th para. "Likewise, URI consumers should ensure URI
consistency. For instance, when transcribing a URI, agents should not
gratuitously escape characters. The term "character" refers to URI
characters as defined in section 2 of [URI]". The definition of
'character' in the first sentence is not clarified by section 2 of
the URI draft, which deals with details such as percent escaping of
characters. Section 1 of the URI draft *points to* a definition of
'character'.
This is an area where the presence of IRI would be welcome.
It might be more useful to describe what "gratuitous" means in
this context (there is currently no definition; we *think* it
means "don't escape characters unless leaving them unescaped breaks
usability", i.e. I would expect to see %20 instead of space (because
space breaks the URI semantically)).
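To show the distinction we *think* is intended, here is a small Python sketch. Both function names and the sample path are our own assumptions; the first escapes only what cannot appear literally in a path (such as a space), the second escapes every octet, which is still decodable but needlessly obscures the URI.

```python
from urllib.parse import quote, unquote


def escape_minimally(path):
    """Escape only characters that are not allowed literally; keep '/' as is."""
    return quote(path, safe="/")


def escape_gratuitously(path):
    """Percent-escape every octet, including ones that never needed it."""
    return "".join(f"%{byte:02X}" for byte in path.encode("utf-8"))


sample = "/weather report/oaxaca"  # hypothetical path containing a space
minimal = escape_minimally(sample)        # "/weather%20report/oaxaca"
gratuitous = escape_gratuitously(sample)  # every character becomes %XX

# Both forms decode to the same string, but they are different URIs
# under character-by-character comparison.
assert unquote(minimal) == unquote(gratuitous) == sample
assert minimal != gratuitous
```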
[12] 'The term "character"...': please say instead: 'The term "character"
in the foregoing sentence...' (character takes on other meanings
later...)
[13] S2.3, para 2. Shouldn't there be a "but..." at the end of this
paragraph? Yes, URI ambiguity is not the same thing as natural
language ambiguity... but what is it? Please make the example more
direct:
"URI ambiguity should not be confused with ambiguity
in natural language. The English statement
"'http://www.example.com/moby' identifies 'Moby Dick'" is
ambiguous because one could understand the phrase "Moby Dick" to
refer to distinct resources: a particular printing of this work,..."
This is highly ambiguous (sic!). Is
'http://www.example.com/moby' identifies 'Moby Dick' a statement
in natural language, showing how natural language can be ambiguous?
Or is it a statement about a URI, showing how URIs can be ambiguous?
Better change to say: "The URI http://www.example.com/moby is used
ambiguously if it is used for more than one of the following:
a particular printing of this work,..."
[14] S2.3 URI ambiguity. This may imply or suggest that natural language
differences in the representation of a resource are considered
bad. There should be examples of both good and bad ambiguity (or in
WebArch terminology, different but consistent representations of the
same resource as opposed to the use of a single URI for different
resources), with language negotiation being a good example and wholly
different resources being a bad example.
[15] Section 2.3.1. Missing a word in "URI ambiguity arises >>when<<
(or 'as') a URI is used to identify two different Web resources."
[16] Good practice: URI opacity: This says
"Agents making use of URIs MUST NOT attempt to infer properties of the
referenced resource except as licensed by relevant specifications."
Earlier, the document defines 'agent' as both humans and machines.
This good practice is not too difficult to follow for agents
(although this seems to disallow e.g. Google to consider pieces
of a URI in their algorithms, e.g. the 'weather' and
'oaxaca' in 'http://weather.example.com/oaxaca'; we're not sure
disallowing this is intended or makes sense).
However, this practice is *impossible* to follow for humans: It's
just completely impossible to look at http://weather.example.com/oaxaca
and NOT infer that this may be about 'weather' or 'oaxaca'.
The WebArch document itself is using this connection all the time.
This is important in connection with IRIs.
[17] 3.1, list item 2: "The XLink 1.0 [XLink10] specification, which defines
the href attribute in section 5.4, states that "The value of the href
attribute must be a URI reference as defined in [IETF RFC 2396], or must
result in a URI reference after the escaping procedure described below
is applied.""
This refers to the conversion from an IRI to a URI. It would be
a good occasion to mention IRIs.
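For readers unfamiliar with that escaping procedure, the core step can be sketched in Python as follows: each non-ASCII character is converted to UTF-8 and every resulting octet is written as %HH. This is only a rough sketch under our own assumptions; the function name and the sample IRI are hypothetical, and host names (IDN) and disallowed ASCII characters such as spaces are not handled.

```python
def iri_to_uri_sketch(iri):
    """Percent-escape the UTF-8 octets of every non-ASCII character.

    ASCII characters are copied unchanged. Not handled here (assumption):
    internationalized host names and disallowed ASCII characters.
    """
    pieces = []
    for ch in iri:
        if ord(ch) < 0x80:
            pieces.append(ch)
        else:
            pieces.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(pieces)


# Hypothetical example: 'é' (U+00E9) is C3 A9 in UTF-8.
print(iri_to_uri_sketch("http://weather.example.com/m\u00e9xico"))
# prints http://weather.example.com/m%C3%A9xico
```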
[18] 3.4.1: Maybe mention that editing tools may be more strict than
simple user agents.
[19] 3.4.1: We believe that charset handling, the way it is currently
specified in various specs (i.e. outer information has priority over
inner information), is basically okay (with the exception
of the (irrelevant in practice) iso-8859-1 default given in the
HTTP spec, and the us-ascii default for text/foo+xml, which makes
text/foo+xml rather useless). It might be good to reach some consensus
about this, and document it.
[20] 3.4.1 (and later): "Furthermore, server managers can help reduce the
risk of error through careful assignment of representation metadata
(especially that which applies across representations). The section
on media types for XML presents an example of reducing the risk of
error by providing no metadata about character encoding when serving
XML.": This seems to pick out a somewhat arbitrary detail, without
stating the much more important underlying principles, such as:
- Always make sure you know what the character encoding of a
document or message is.
- Make sure that it's easy for server managers and authors to configure
and test metadata on the server, to make sure it's correct.
- No arbitrary defaults in specs
- No out-of-the-box server configurations with arbitrary settings
The description of the example also is too general, because there
are ways to implement/operate a server that make it much
easier/more appropriate to put the 'charset' into the header than into
the body, e.g. when producing content in a pipeline from a database.
Regards, Martin.
Received on Thursday, 18 March 2004 17:34:41 UTC