Last Call comments on P3P from I18N WG/IG from Martin J. Duerst on 2000-03-22 (www-p3p-public-comments@w3.org from March 2000)

From: Martin J. Duerst <duerst@w3.org>
Date: Wed, 22 Mar 2000 18:01:05 +0900
To: www-p3p-public-comments@w3.org
Message-Id: <4.2.0.58.J.20000317144032.03637320@sh.w3.mag.keio.ac.jp>

Dear P3P Working Group(s),

Below please find the last call comments regarding
internationalization that the i18n WG approved at
its last teleconference. (A few minor comments were
submitted later, and the editor of the comments has
added a few more explanations.)

To follow up on these comments, please use cross-posting
between the two groups in question (on our side the
I18N Interest Group), without copying this public list.


Character Encoding
------------------

- The spec says (http://www.w3.org/TR/P3P/#Policies) that policies
   must be encoded in UTF-8.

   From an i18n point of view, this looks very nice, because UTF-8
   covers the widest range of languages, and this clear single choice
   will avoid various interoperability problems such as negotiation
   on HTTP Accept-Charset and decoding of unknown 'charset's.

   However, we would also like to make you aware of the fact that
   this spec is one of the first that specifies 'UTF-8 only',
   and in some environments, this will need some effort from
   implementors. It is also somewhat at odds with the XML spec,
   which requires that an XML processor accept both UTF-8
   and UTF-16.

   We have given you the arguments for both sides, but in this
   case, we cannot decide for you. We request you to check this
   requirement with people working on implementations, e.g.
   in Japan. If they feel fine with it or have implemented it,
   this can be left as is. If not, it should be changed to say
   that policies are transmitted as any other XML documents,
   with encoding according to RFC 2376. In that case, the potential
   of the same policy being served in different character encodings
   also has to be considered; this is easier than policies being
   served in different languages because conversion is mechanical.
   [Should you need technical assistance to verify implementations,
    please feel free to contact us.]
   [The 'UTF-8 only' rule may be a leftover from the time when
    the plan was to stuff policies into a HTTP header. In that
    case, using only a single encoding would definitely have been
    very important to avoid protocol complications.]
   [wording detail: UTF-8 is not a 'syntax', it is a character encoding]

- Besides the point just above, the P3P spec uses 'UTF-8' in other
   places. In all these cases, using 'UTF-8' is inappropriate, and
   has to be corrected.

   In many cases, this should be done by replacing 'UTF-8' with
   'PCDATA'. PCDATA is already used in many places to indicate
   'arbitrary text', and this is the right way to do this. PCDATA
   should be introduced in Section 1.2, and not just on the fly.
   [please check whether PCDATA can be used to describe attribute
    content; if not, please use CDATA where appropriate]

   In three cases (Section 3.3.6 The DATA element:
       "no more than 127 [UTF-8] characters"
       "no more than 1023 [UTF-8] characters"
       "the maximum number of [UTF-8] characters"),
   '[UTF-8]' should be removed, and replaced with language such
   as that in XPath (http://www.w3.org/TR/xpath#strings; "where a
   character is defined as in the XML Recommendation [XML]. A single
   character in XPath thus corresponds to a single Unicode abstract
   character with a single corresponding Unicode scalar value (see
   [Unicode]);").
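
   To illustrate why '[UTF-8] characters' is misleading in these length
   limits, here is a minimal sketch in Python. The function names and the
   sample strings are our own assumptions, not taken from the P3P spec.
   It counts characters in the XPath sense (one per Unicode code point)
   and compares that with the number of UTF-8 octets and UTF-16 code
   units for the same text.

```python
def char_count(text: str) -> int:
    """Count characters as Unicode code points (the XPath-style definition)."""
    return len(text)


def utf8_octet_count(text: str) -> int:
    """Count the octets the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def utf16_code_unit_count(text: str) -> int:
    """Count 16-bit code units when the text is encoded as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def within_limit(text: str, limit: int = 127) -> bool:
    """Hypothetical length check: the limit applies to characters, not octets."""
    return char_count(text) <= limit


# Sample strings: ASCII, Latin with umlaut, Japanese, and a character
# outside the Basic Multilingual Plane (U+20BB7).
for sample in ["Duerst", "D\u00fcrst", "\u6771\u4eac\u90fd", "\U00020BB7"]:
    print(repr(sample), char_count(sample),
          utf8_octet_count(sample), utf16_code_unit_count(sample))
```

   The three counts differ as soon as the text leaves ASCII, so a limit
   stated in 'UTF-8 characters' leaves it open which of them is meant.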


Language Variants
-----------------

The February 11, 2000 version contains some new ideas in this area.
However, a lot of work still needs to be done.

First, the WD only considers language alternatives for Policies.
Language alternatives for Data schemas also have to be considered.

Also, it should be said that if the -Prefix or -Extension
headers are used, they apply to all variants (e.g. language
variants) negotiated/served from the same URI.


The P3P spec describes two mechanisms for dealing with language
variants:

1) Policy variants content-negotiated using Accept-Language and
    Content-Language.
2) Multiple occurrences of elements such as <CONSEQUENCE> tagged
    with xml:lang.
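
As a rough illustration of mechanism 2), the following Python sketch
selects one of several <CONSEQUENCE> elements by its xml:lang value.
The policy fragment, its wording, and the function name
pick_consequence are assumptions made for this example only; the
sketch also matches language tags exactly and does not attempt
prefix matching (e.g. 'en-US' against 'en').

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical policy fragment with two language variants inline.
POLICY = """<POLICY>
  <CONSEQUENCE xml:lang="en">We use your address to ship your order.</CONSEQUENCE>
  <CONSEQUENCE xml:lang="ja">ご注文の発送のために住所を使用します。</CONSEQUENCE>
</POLICY>"""


def pick_consequence(policy_xml, preferred, fallback="en"):
    """Return (language, text) for the first preferred language present."""
    root = ET.fromstring(policy_xml)
    by_lang = {el.get(XML_LANG): el.text for el in root.iter("CONSEQUENCE")}
    for lang in preferred:
        if lang in by_lang:
            return lang, by_lang[lang]
    # None of the preferred languages is available: use the fallback.
    return fallback, by_lang.get(fallback)


print(pick_consequence(POLICY, ["fr", "ja"]))
print(pick_consequence(POLICY, ["de"]))
```

Adding a language under this mechanism means editing POLICY itself,
whereas under mechanism 1) the server would add a further variant
behind the same URI.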

Both mechanisms have their advantages and disadvantages:

- 1) makes it easier to add new languages. With 2), adding a new
   language means that you have to change the policy. Using the
   new policy only for the new language is possible, but may
   give the impression of policy inconsistency, and may complicate
   site management.

- 2) makes the management of a small number of languages easier.

- 2) may get into scalability problems for a large number of
   languages.

- Policies can easily be replaced, but Dataschemas can't. Using
   solution 2) alone is inappropriate for Dataschemas. In general,
   one could create a separate schema for a different language,
   but for the base data schema, this must not be done, because
   it would otherwise disadvantage non-English users significantly.
   [The WG and the W3C should give some thought to how to handle
    translations of the base data schema.]


These arguments lead to the following conclusions:

- For dataschemas, 1) is necessary.
- Assuming that for simplicity, the design should be the same
   for policies and dataschemas, 1) should be used for policies, too.
- Whether to also use 2) or not depends on how the P3P WG
   judges protocol complexity vs. server management convenience.
- Choosing a single, simple solution has a better chance of achieving
   interoperability, which is the top goal.


The following points should in addition be considered:

- What it means for two policies or two dataschemas to be identical
   except for differences in language should be more clearly defined.
   The definition should contain two parts:
   - All formal (not natural language) protocol elements
     have to be semantically identical. (For example, attribute
     order does not matter, and the presence or absence of a default
     value does not matter, but if attribute values differ, this
     matters.)
   - All natural language protocol elements have to correspond
     one-to-one, and for each correspondence, one has to be a
     careful translation of the other (maybe indirectly via third
     languages). [Please note that requiring 'the same meaning'
     is not adequate; it is NEVER possible for a translation to
     be guaranteed 100% the same as an original; natural languages
     are just too rich and subtle.]

- It should be mentioned that the fact that each language version
   is immutable means that a new policy URI has to be used
   if a change is necessary because a translation error
   has been found, which implies that translations should
   be done very carefully. For Dataschemas, even more care is
   necessary.

- It should be mentioned that user agents may see different language
   versions despite sending the same 'Accept-Language' request header
   if a new language version of a policy or dataschema has been added.

- The various considerations regarding external (and internal)
   language variants should be concentrated in a single section.

- If solution 2) is specified, some more attributes have to be moved
   to elements, and xml:lang added (see below, this move has to be
   done anyhow).

- 'xml:lang' should also be specified correctly in the ABNF
   (or the ABNF should be removed altogether, because it gives
   the impression you can use it to construct a policy parser,
   which is not the case). In

       "<CONSEQUENCE" [" xml:lang=" LanguageID ] ">" 
       PCDATA
       "</CONSEQUENCE>"
   (Section 3.3.2 The CONSEQUENCE element, production 11)
   the LanguageID is missing its quotes. This also occurs
   elsewhere in the document.
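
Returning to the first point in this list, the formal half of the
'identical except for language' definition could be checked along
the following lines. This is only a sketch in Python under several
assumptions: the set of natural-language elements is reduced to
CONSEQUENCE, the sample policies and their attributes are invented,
and DTD default values are not expanded (so that part of the
definition is not covered). Whether one text is a careful
translation of the other cannot be checked mechanically at all.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Assumption: elements whose content is natural language.
NATURAL_LANGUAGE_ELEMENTS = {"CONSEQUENCE"}


def formal_shape(el):
    """Reduce an element to its formal content, ignoring attribute order,
    xml:lang, and the text of natural-language elements."""
    attrs = tuple(sorted((k, v) for k, v in el.attrib.items() if k != XML_LANG))
    if el.tag in NATURAL_LANGUAGE_ELEMENTS:
        text = None  # the element must be present, its wording may differ
    else:
        text = (el.text or "").strip()
    children = tuple(formal_shape(child) for child in el)
    return (el.tag, attrs, text, children)


def same_formal_content(a_xml, b_xml):
    """True if both documents agree in all formal protocol elements."""
    return formal_shape(ET.fromstring(a_xml)) == formal_shape(ET.fromstring(b_xml))


# Hypothetical language variants of one policy.
EN = ('<POLICY><DATA name="user.name" optional="no"/>'
      '<CONSEQUENCE xml:lang="en">We ship your order.</CONSEQUENCE></POLICY>')
DE = ('<POLICY><DATA optional="no" name="user.name"/>'
      '<CONSEQUENCE xml:lang="de">Wir versenden Ihre Bestellung.</CONSEQUENCE></POLICY>')
CHANGED = ('<POLICY><DATA name="user.name" optional="yes"/>'
           '<CONSEQUENCE xml:lang="de">Wir versenden Ihre Bestellung.</CONSEQUENCE></POLICY>')

print(same_formal_content(EN, DE))       # attribute order differs only
print(same_formal_content(EN, CHANGED))  # an attribute value differs
```

Because the natural-language elements stay in the compared shape
(without their text), the one-to-one correspondence between them is
checked as well.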


Wording problems:

- 'a particular language encoding': A language is a language, not an encoding.

- 'Content-Language' is not a tag, but an HTTP entity header field.


Language negotiation also leads to some privacy concerns:

Section 2.5.1 on the 'Safe Zone' should say something about
detection of identity from the 'Accept-Language' header,
similar to the HTTP 1.1 spec (see http://www.ietf.org/rfc/rfc2616.txt,
15.1.4 Privacy Issues Connected to Accept Headers).
Maybe there could be some text saying that HEAD requests may be issued
without 'Accept-Language' to get the machine-readable part of the
policy, and that, if that part is reasonably satisfactory, the policy
can then be fetched in the appropriate language if necessary.



Attributes to Elements
----------------------

Various parts of the policy DTD use free text in attributes.
These should all be changed to elements:
   - "description" on <DISPUTES>
   - "entity" on <POLICY>
   - "short" and "long" on <DATA>
   - "other" in "category" on <DATA> (this is currently just a single
     attribute value, but mentions human readable explanations).
     [it also doesn't appear in production [18]]
   - "alt"
Besides the reasons given above, this is also important
because some languages and scripts need additional markup
(e.g. for bidirectional text). 




Data schemes
------------

- The syntax for 'name' on DATA has to be specified more
   exactly. Currently, it is completely unclear what is
   allowed and what not.

- The name/address fields still have the same US-centric
   structure and field names as when the I18N WG/IG had a
   look at them for the first time (see Review section
   on i18n WG page). This is a very important point, and
   must not be ignored. In particular, the following is needed
   for name.first and name.last:

   - Change the short display names to something much more culturally
     neutral. See e.g.
https://members.icann.org/cgi-bin/atlarge/join.cgi?mode=displayApplicationForm.

   - Add explanations about what each field means. A sentence or
     two per field should be enough.

   - Change the actual field names.


- The spec says that the short display name is defined as the
   concatenation of the various short names, separated by
   *commas*. This is inappropriate in general. It should say
   something like 'concatenated by a separator appropriate
   for the language/script in question, e.g. a comma for English'.


- Section 4.4.1 Dates, and Appendix 2

   - "All the fields in the date. type must be in the same format
     as those in the most informative profile of the time standard
     ISO8601."

     What is the 'most informative profile'? Please be much more
     specific.

   - fractionsecond is defined as a 'number'. How is this supposed
     to work? What does e.g. '34' mean? .34? .034? .0034? Why is
     fractionsecond of size 6? (ISO 8601 has no restrictions)

   - "timezone" is defined as being of type "text", and of length
     10. This is not in any way in accordance with ISO 8601.

   - 'year' is defined as a number of size 6. Why  6? This is not
     in accordance with ISO 8601.

-  Section 4.4.4 Telephones
    Is there some implied relationship between "number" and "ext"?
    Please (as mentioned above) document the meaning of each field.
    Please change occurrences of 'Phone' to 'Telephone' or 'Tel'.

-  References

    There is no reference for ISO 8601. Please add it.
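
To make the fractionsecond question from Section 4.4.1 above more
concrete, here is a small Python sketch. It assumes one possible
reading, namely that the field holds the digits following the decimal
sign, as in an ISO 8601 decimal fraction; the function name is ours.
Under that reading the value has to be kept as a digit string,
because a 'number' cannot distinguish '34' from '034'.

```python
from decimal import Decimal


def fraction_from_digits(digits: str) -> Decimal:
    """Interpret a digit string as the fractional part of a second.

    Assumption: the digits are those after the decimal sign, so leading
    zeros are significant ("034" means 0.034 seconds).
    """
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("fraction must consist of ASCII digits only")
    return Decimal("0." + digits)


for digits in ["34", "034", "0034"]:
    # As a plain number all three collapse to 34; as digits they differ.
    print(digits, int(digits), fraction_from_digits(digits))
```

If the spec intends a different reading (for example a fixed unit
such as microseconds), it should say so explicitly.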


Various
-------

- The spec has to say that URIs that appear within XML
   or HTML (e.g. discuri, meta content, disputes service,
   disputes image, data dataschema, data typeschema,...)
   have to be treated as specified in
   http://www.w3.org/TR/charmod/#URIs.
   Please make sure that comparison of URIs happens after
   this, not before.

   Please also explicitly say that this does not apply to
   URIs appearing in HTTP header fields; the URIs there
   should always be fully escaped.


- Some examples of domain names are wrong, e.g.
   www.thecoolcatalog.com.jp should be www.thecoolcatalog.co.jp.
   Please verify that all the examples make sense.


- The term 'disclosure' is used in a very specific way,
   and it takes the reader some time to (hopefully) get what
   it means. I expected it to mean publication of data
   that should be protected, not, as the spec seems to do,
   publication of privacy policies (for which the word
   'disclosure' seems to be strange, because companies,...
   should publish such policies anyway).
   Such a usage will lead to problems when the spec is
   being translated. The spec writers should make
   sure the very specific meaning of this term in this
   context is clearly explained early on.
   [the WG should also make a careful analysis about
    where 'disclosure' is used vs. other terms such
    as 'assert',..., and should explain the differences
    where there are differences]

- The abbreviation 'SSN' is not known to most people outside
   the US. Please expand.
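
Regarding the URI point at the start of this section, the treatment we
have in mind can be sketched in Python as follows. This is our reading
of the convention (encode non-ASCII characters as UTF-8, then
%-escape the resulting octets), with hypothetical function names; it
is not a complete URI normalizer and, for instance, does not
normalize the case of hex digits in existing escapes.

```python
from urllib.parse import quote

# Characters left untouched: URI delimiters plus '%', so that
# escapes already present in the URI are not escaped a second time.
URI_SAFE = ":/?#[]@!$&'()*+,;=%"


def escape_uri(uri: str) -> str:
    """Encode disallowed characters as UTF-8 and %-escape the octets."""
    return quote(uri, safe=URI_SAFE, encoding="utf-8")


def same_uri(a: str, b: str) -> bool:
    """Compare two URIs after escaping, not before."""
    return escape_uri(a) == escape_uri(b)


as_written = "http://www.w3.org/People/D\u00fcrst"   # as it may appear in XML
as_escaped = "http://www.w3.org/People/D%C3%BCrst"   # as it appears in HTTP

print(escape_uri(as_written))
print(as_written == as_escaped, same_uri(as_written, as_escaped))
```

Comparing the raw strings reports a mismatch, while comparing after
escaping treats the two forms as the same URI, which is why the order
of the two steps matters.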



For the I18N WG/IG,

Regards,   Martin.





#-#-#  Martin J. Du"rst, I18N Activity Lead, World Wide Web Consortium
#-#-#  mailto:duerst@w3.org   http://www.w3.org/People/D%C3%BCrst
Received on Wednesday, 22 March 2000 04:00:46 UTC