I18N WG comments on WebArch Doc (first part) from Martin Duerst on 2004-03-18 (public-webarch-comments@w3.org from March 2004)

From: Martin Duerst <duerst@w3.org>
Date: Thu, 18 Mar 2004 16:47:55 -0500
To: public-webarch-comments@w3.org
Cc: w3c-i18n-ig@w3.org
Message-Id: <4.2.0.58.J.20040318113831.055dc060@localhost>
Dear TAG,

Below please find the first part (apart from our comment on dependencies
between the WebArch document, the revision of the URI spec, and the
IRI draft) of the I18N WG (Core TF) comments on the WebArch document.

Please send replies to these comments to the I18N IG mailing list
(w3c-i18n-ig@w3.org), not only to me.

[1] The term 'language' is used both for natural language (in the Abstract)
     and for (document/data) format. In the latter case, 'format' should
     be used everywhere, to avoid confusion. This would be the same as
     in Charmod.

[2] 'Oaxaca' is used in many examples. Glad to see a non-US example, but
     we are afraid that this may lead to questions on how to pronounce it
     in large parts of the world. We suggest replacing it with something
     simpler; one idea might be 'Lima' (although the weather is not as
     good there as in Oaxaca :-( ).

[3] section 1, figure: please show charset in Content-type.

[4] 1.2.1, first bullet: This and
     http://www.w3.org/2001/tag/doc/whenToUseGet.html#i18n
     are basically okay. However the word 'limitations' (related to i18n)
     may give the wrong impression; it is not clear what the i18n concerns are.
     We suggest that you describe the issue more clearly, e.g. as
     "The design works reasonably well, although there are issues related
      to the transmission of non-ASCII characters." (please note the
     use of the word 'issues' rather than 'limitations'; although there
     are indeed some limitations as to the combinations of encodings
     in form pages and in requests, due to well-established practices
     based on HTML 4, there is no fundamental limitation to the basic
     use of non-ASCII characters).
     Also, please make sure the reader can directly go to the relevant
     section in the finding. Also, you may want to point to the FAQ
     on "What is the best way to deal with encoding issues in forms that
     may use multiple languages and scripts?"
     http://www.w3.org/International/questions/qa-forms-utf-8.html
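
To illustrate the kind of issue meant in [4], here is a minimal sketch in
Python. It assumes that a form value is percent-escaped using the bytes of
the character encoding of the form page; the helper name
`escape_form_value` is hypothetical and used only for illustration.

```python
from urllib.parse import quote, unquote


def escape_form_value(value: str, page_encoding: str) -> str:
    """Percent-escape a form value using the bytes of the given encoding."""
    return quote(value, safe="", encoding=page_encoding)


# U+00E9 (e with acute accent) gives different octets per encoding.
utf8_form = escape_form_value("\u00e9", "utf-8")         # "%C3%A9"
latin1_form = escape_form_value("\u00e9", "iso-8859-1")  # "%E9"

# A receiver that guesses the wrong encoding gets no error here,
# only the wrong characters (U+00C3 followed by U+00A9).
misread = unquote(utf8_form, encoding="iso-8859-1")
```

The request itself carries no label for the encoding used, which is why
the combination of page encoding and request encoding matters in practice.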

[5] 1.2.1, third bullet: "Some authors use the META/http-equiv approach
     to declare the character encoding scheme of an HTML document. By
     design, this is a hint that an HTTP server should emit a corresponding
     "Content-Type" header field. In practice, the use of the hint in servers
    is not widely deployed. Furthermore, many user agents use this information
    to override the "Content-Type" header sent by the server. This works
    against the principle of authoritative representation metadata."

    This is rather misleading on several points:
    - "By design, this is a hint that an HTTP server should emit a
       corresponding "Content-Type" header field.": It's correct that
       by design, this WAS a hint to the server. Practice has shown
       that this wasn't such a good idea, and practice has found a
       better use for it. The WebArch doc should mainly look forward;
       mentioning misguided/reused designs can help, but it should not
       be presented as if it would be better to go back to that design.
       So e.g. reword to "this was (originally) intended..."
    - "In practice, the use of the hint in servers is not widely deployed.":
      Is it actually deployed at all? Any pointers would be appreciated.
    - "Furthermore, many user agents use this information to override the
      "Content-Type" header sent by the server.": Tests we have done
      recently with reasonably new browsers have shown that none of the
      major browsers do this. In particular, if 'many user agents' is
      shorthand for IE6win, this is actually wrong (in case you do your
      own tests, please clean your cache if you change encodings or
      encoding labels for files; encodings are very 'sticky' in IE6,
      but for fresh pages, it gets things right).
      We suggest rewording e.g. to "user agents use this information
      if there is no 'charset' information in the "Content-Type" header"
      and/or "some have in the past..."
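
The behavior suggested in the rewording for [5] can be sketched as follows.
This is only an illustration under the assumption that the 'charset'
parameter of the HTTP header wins and the META/http-equiv value is a
fallback; the function names are hypothetical, and the parameter parsing
is deliberately simplistic.

```python
import re
from typing import Optional


def charset_from_content_type(value: Optional[str]) -> Optional[str]:
    """Extract the 'charset' parameter from a Content-Type value, if any."""
    if not value:
        return None
    match = re.search(r'charset\s*=\s*"?([^\s;"]+)', value, re.IGNORECASE)
    return match.group(1).lower() if match else None


def effective_charset(http_content_type: Optional[str],
                      meta_content_type: Optional[str]) -> Optional[str]:
    """Prefer the HTTP header; use the META value only if the header has no charset."""
    return (charset_from_content_type(http_content_type)
            or charset_from_content_type(meta_content_type))
```

For example, `effective_charset("text/html", "text/html; charset=Shift_JIS")`
falls back to the META value, while a header that carries its own
'charset' is never overridden.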

[6] 1.2.3 Error Handling: It might be very good to say something
     about character encoding/labeling errors.

[7] End of 2. (just before 2.1): "Of course, what an agent does with a URI
     may vary." It would be better to mention more explicitly that this
     e.g. can include language negotiation,...

[8] Section 2.1 (URI comparison). 2nd para. The first sentence establishes
     that character-by-character inequality doesn't mean that the resources
     referred to are different. But the subsequent sentences say basically the
     opposite (that this is the most straightforward way to find resource
     equality). Break into two paragraphs, or otherwise improve the wording
     to confuse the reader less.

[9] S2.1. 3rd para. The casing example for weather.example.com/Oaxaca is a
     bit obscure. Perhaps spell out the fact that case sensitivity matters
     to some systems?
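
The points in [8] and [9] could be spelled out with a small sketch like the
one below. It assumes generic URI syntax, where scheme and host are
case-insensitive while the path may be case-sensitive on some servers; the
function names are hypothetical.

```python
from urllib.parse import urlsplit


def same_by_characters(a: str, b: str) -> bool:
    """Character-by-character comparison: cheap, but may report false negatives."""
    return a == b


def same_after_case_normalization(a: str, b: str) -> bool:
    """Compare with scheme and host lowercased; path and query are kept as-is."""
    def key(uri: str):
        parts = urlsplit(uri)
        # urlsplit() lowercases the scheme; .hostname is lowercased as well.
        return (parts.scheme, parts.hostname, parts.port, parts.path, parts.query)
    return key(a) == key(b)
```

With this, "HTTP://Weather.Example.COM/Oaxaca" and
"http://weather.example.com/Oaxaca" compare as equal after normalization,
while ".../Oaxaca" and ".../oaxaca" do not. A negative result still does
not prove that the two URIs identify different resources.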

[10] 2.1: " For instance, one might reasonably create URIs that begin with
      "http://www.example.com/tempo" and "http://www.example.com/tiempo" to
      provide access to resources by users who speak Italian and Spanish.":
      It is nice to see an i18n-related example. However, there are all kinds
      of issues with this. This is not necessarily a good way to organize
      information in different languages on a server, in particular if the
      information is highly parallel. It may be better to find another example,
      for example with two English words. Also, 'tempo' is an English word
      with a different meaning. Perhaps German "Wetter" is better?

[11] Section 2.1. 4th para. "Likewise, URI consumers should ensure URI
      consistency. For instance, when transcribing a URI, agents should not
      gratuitously escape characters. The term "character" refers to URI
      characters as defined in section 2 of [URI]". The definition of
      'character' in the first sentence is not clarified by section 2 of
      the URI draft, which deals with details such as percent escaping of
      characters. Section 1 of the URI draft *points to* a definition of
      'character'.
      This is an area where the presence of IRI would be welcome.
      It might be more useful to describe what "gratuitous" means in
      this context (there is currently no definition; we *think* it
      means "don't escape characters unless leaving them unescaped breaks
      usability", i.e. I would expect to see %20 instead of space, because
      space breaks the URI semantically).
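
A possible way to make "gratuitous" concrete, under our reading above, is
sketched here. The helper name `escape_path_segment` is hypothetical; it
relies on Python's urllib.parse.quote, which leaves letters and digits
unescaped.

```python
from urllib.parse import quote, unquote


def escape_path_segment(segment: str) -> str:
    """Escape only what has to be escaped; letters and digits stay as they are."""
    return quote(segment, safe="")


needed = escape_path_segment("weather report")  # space must become %20
untouched = escape_path_segment("Oaxaca")        # nothing to escape

# Gratuitous escaping: "%4F" is just "O". The two strings unescape to the
# same text but are no longer equal character by character.
gratuitous = "%4Faxaca"
```

Here `unquote(gratuitous)` gives back "Oaxaca", yet a simple string
comparison of the two forms fails, which is the consistency problem the
quoted paragraph warns about.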

[12] 'The term "character"...': please say instead: 'The term "character"
      in the foregoing sentence...' (character takes on other meanings
      later...)

[13] S2.3, para 2. Shouldn't there be a "but..." at the end of this
      paragraph? Yes, URI ambiguity is not the same thing as natural
      language ambiguity... but what is it? Please make the example more
      direct:

      "URI ambiguity should not be confused with ambiguity
      in natural language. The English statement
      "'http://www.example.com/moby' identifies 'Moby Dick'" is
      ambiguous because one could understand the phrase "Moby Dick" to
      refer to distinct resources: a particular printing of this work,..."

      This is highly ambiguous (sic!). Is
      "'http://www.example.com/moby' identifies 'Moby Dick'" a statement
      in natural language, showing how natural language can be ambiguous?
      Or is it a statement about a URI, showing how URIs can be ambiguous?
      Better change to say: "The URI http://www.example.com/moby is used
      ambiguously if it is used for more than one of the following:
      a particular printing of this work,..."

[14] S2.3 URI ambiguity. This may imply or suggest that natural language
      differences in the representation of a resource are considered
      bad. There should be examples of both good and bad ambiguity (or in
      WebArch terminology, different but consistent representations of the
      same resource as opposed to the use of a single URI for different
      resources), with language negotiation being a good example and wholly
      different resources being a bad example.
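
As an illustration of the "good" case in [14], the following sketch serves
one resource under one URI in several languages, chosen from an
Accept-Language header. It is a simplification: only exact tag matches are
considered (no prefix matching such as "es-MX" to "es"), and all names and
sample texts are hypothetical.

```python
from typing import Dict, List, Tuple


def parse_accept_language(header: str) -> List[Tuple[str, float]]:
    """Return (language tag, q value) pairs, highest preference first."""
    prefs = []
    for item in header.split(","):
        parts = item.strip().split(";")
        tag = parts[0].strip().lower()
        if not tag:
            continue
        q = 1.0
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((tag, q))
    # sorted() is stable, so equal q values keep their header order.
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def choose_representation(header: str, available: Dict[str, str], default: str) -> str:
    """Pick the representation for the most preferred language that is available."""
    for tag, q in parse_accept_language(header):
        if q > 0 and tag in available:
            return available[tag]
    return default


representations = {"es": "El tiempo en Oaxaca", "en": "Weather in Oaxaca"}
```

Both texts are consistent representations of the same weather report, so
the URI stays unambiguous even though the bytes served differ.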

[15] Section 2.3.1. Missing a word in "URI ambiguity arises >>when<<
      (or 'as') a URI is used to identify two different Web resources."

[16] Good practice: URI opacity: This says
      "Agents making use of URIs MUST NOT attempt to infer properties of the
      referenced resource except as licensed by relevant specifications."
      Earlier, the document defines 'agent' as both humans and machines.

      This good practice is not too difficult to follow for agents
      (although this seems to disallow e.g. Google from considering pieces
      of a URI in their algorithms, e.g. the 'weather' and
      'oaxaca' in 'http://weather.example.com/oaxaca'; we're not sure
      disallowing this is intended or makes sense).

      However, this practice is *impossible* to follow for humans: It's
      just completely impossible to look at http://weather.example.com/oaxaca
      and NOT infer that this may be about 'weather' or 'oaxaca'.
      The WebArch document itself is using this connection all the time.
      This is important in connection with IRIs.

[17] 3.1, list item 2: "The XLink 1.0 [XLink10] specification, which defines
      the href attribute in section 5.4, states that "The value of the href
      attribute must be a URI reference as defined in [IETF RFC 2396], or must
      result in a URI reference after the escaping procedure described below
      is applied.""
      This refers to the conversion from an IRI to a URI. It would be
      a good occasion to mention IRIs.
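
The core of such a conversion can be sketched as below, assuming that each
non-ASCII character is converted to its UTF-8 octets and each octet is
percent-escaped. This sketch leaves all ASCII characters untouched, so it
does not cover disallowed ASCII characters such as space, nor any special
handling of host names; the function name is hypothetical.

```python
from urllib.parse import quote


def escape_non_ascii(reference: str) -> str:
    """Percent-escape each non-ASCII character as its UTF-8 octets."""
    return "".join(
        ch if ord(ch) < 128 else quote(ch, safe="", encoding="utf-8")
        for ch in reference
    )


# "caf\u00e9" ends in U+00E9, which is two octets (C3 A9) in UTF-8.
converted = escape_non_ascii("http://www.example.com/caf\u00e9")
```

A reference that is already pure ASCII passes through unchanged, so
applying the function to an existing URI reference does no harm.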

[18] 3.4.1: Maybe mention that editing tools may be more strict than
      simple user agents.

[19] 3.4.1: We believe that charset handling, the way it is currently
      specified in various specs (i.e. outer information has priority over
      inner information), is basically okay (with the exception
      of the (irrelevant in practice) iso-8859-1 default given in the
      HTTP spec, and the us-ascii default for text/foo+xml, which makes
      text/foo+xml rather useless). It might be good to reach some consensus
      about this, and document it.

[20] 3.4.1 (and later): "Furthermore, server managers can help reduce the
      risk of error through careful assignment of representation metadata
      (especially that which applies across representations). The section
      on media types for XML presents an example of reducing the risk of
      error by providing no metadata about character encoding when serving
      XML.": This seems to pick out a somewhat arbitrary detail, without
      stating the much more important underlying principles, such as:
      - Always make sure you know what the character encoding of a
        document or message is.
      - Make sure that it's easy for server managers and authors to configure
        and test metadata on the server, to make sure it's correct.
      - No arbitrary defaults in specs
      - No out-of-the-box server setups with arbitrary settings
      The description of the example also is too general, because there
      are ways to implement/operate a server that make it much
      easier/more appropriate to put the 'charset' into the header than into
      the body, e.g. when producing content in a pipeline from a database.


Regards,    Martin.
Received on Thursday, 18 March 2004 17:34:41 UTC