- From: Martin J. Duerst <duerst@w3.org>
- Date: Wed, 22 Mar 2000 18:01:05 +0900
- To: www-p3p-public-comments@w3.org
Dear P3P Working Group(s),

Below please find the last call comments regarding internationalization
that the i18n WG approved at its last teleconference. (A few minor
comments were submitted later, and a few more explanations have been
added by the editor of the comments.)

To follow up on these comments, please use cross-posting between the
two groups in question (on our side the I18N Interest Group), without
copying this public list.


Character Encoding
------------------

- The spec says (http://www.w3.org/TR/P3P/#Policies) that policies
  must be encoded in UTF-8. From an i18n point of view, this looks
  very nice, because UTF-8 covers the widest range of languages,
  and this clear single choice will avoid various interoperability
  problems such as negotiation on HTTP Accept-Charset and decoding
  of unknown 'charset's.

  However, we would also like to make you aware of the fact that
  this spec is one of the first that specifies 'UTF-8 only', and
  in some environments, this will need some effort from implementors.
  It is also in some way against the XML spec because that requires
  that an XML processor accept both UTF-8 and UTF-16.

  We have given you the arguments for both sides, but in this case,
  we cannot decide for you. We request you to check this requirement
  with people working on implementations, e.g. in Japan. If they
  feel fine with it or have implemented it, this can be left as is.
  If not, it should be changed to say that policies are transmitted
  as any other XML documents, with encoding according to RFC 2376.
  In that case, the potential of the same policy being served in
  different character encodings also has to be considered; this is
  easier than policies being served in different languages because
  conversion is mechanical.

  [Should you need technical assistance to verify implementations,
  please feel free to contact us.]

  [The 'UTF-8 only' rule may be a leftover from the time when the
  plan was to stuff policies into an HTTP header.
  In that case, using only a single encoding would definitely have
  been very important to avoid protocol complications.]

  [wording detail: UTF-8 is not a 'syntax', it is a character encoding]

- Besides the point just above, the P3P spec uses 'UTF-8' in other
  places. In all these cases, using 'UTF-8' is inappropriate, and
  has to be corrected.

  In many cases, this should be done by replacing 'UTF-8' with
  'PCDATA'. PCDATA is already used in many places to indicate
  'arbitrary text', and this is the right way to do this. PCDATA
  should be introduced in Section 1.2, and not just on the fly.
  [please check whether PCDATA can be used to describe attribute
  content; if not, please use CDATA where appropriate]

  In three cases (Section 3.3.6 The DATA element:
    "no more than 127 [UTF-8] characters"
    "no more than 1023 [UTF-8] characters"
    "the maximum number of [UTF-8] characters"),
  '[UTF-8]' should be removed, and replaced with language such as
  that in XPath (http://www.w3.org/TR/xpath#strings; "where a
  character is defined as in the XML Recommendation [XML]. A single
  character in XPath thus corresponds to a single Unicode abstract
  character with a single corresponding Unicode scalar value
  (see [Unicode]);").


Language Variants
-----------------

The February 11, 2000 Version contains some new ideas in this area.
However, a lot of work still needs to be done.

First, the WD only considers language alternatives for Policies.
Language alternatives for Data schemas also have to be considered.

Also, it should be said that if the -Prefix or -Extension headers
are used, they apply to all variants (e.g. language variants)
negotiated/served from the same URI.

The P3P spec describes two mechanisms for dealing with language
variants:

1) Policy variants content-negotiated using Accept-Language and
   Content-Language.
2) Multiple occurrences of elements such as <CONSEQUENCE> tagged
   with xml:lang.

Both mechanisms have their advantages and disadvantages:

- 1) makes it easier to add new languages.
  With 2), adding a new language means that you have to change the
  policy. Using the new policy only for the new language is possible,
  but may give the impression of policy inconsistency, and may
  complicate site management.
- 2) makes the management of a small number of languages easier.
- 2) may get into scalability problems for a large number of
  languages.
- Policies can easily be replaced, but Dataschemas can't. Solution
  2) alone is inappropriate for Dataschemas. In general, one could
  create a separate schema for a different language, but for the
  base data schema, this must not be done, because it would otherwise
  disadvantage non-English users significantly. [The WG and the W3C
  should give some thought to how to handle translations of the
  base data schema.]

These arguments lead to the following conclusions:

- For dataschemas, 1) is necessary.
- Assuming that, for simplicity, the design should be the same for
  policies and dataschemas, 1) should be used for policies, too.
- Whether to also use 2) or not depends on how the P3P WG judges
  protocol complexity vs. server management convenience.
- Choosing a single, simple solution has a better chance of achieving
  interoperability, which is the top goal.

The following points should in addition be considered:

- What it means for two policies or two dataschemas to be identical
  except for differences in language should be more clearly defined.
  The definition should contain two parts:
  - All formal (not natural language) protocol elements have to be
    semantically identical (e.g. attribute order does not matter,
    the presence or absence of a default value does not matter,
    but if attribute values differ, this matters).
  - All natural language protocol elements have to correspond
    one-to-one, and for each correspondence, one has to be a careful
    translation of the other (maybe indirectly via third languages).
  [Please note that requiring 'the same meaning' is not adequate;
  it is NEVER possible for a translation to be guaranteed 100% the
  same as an original; natural languages are just too rich and
  subtle.]
- It should be mentioned that the fact that each language version is
  immutable means that a new policy URI has to be used if a change
  is necessary because a translation error has been found, which
  implies that translations should be done very carefully. For
  Dataschemas, even more care is necessary.
- It should be mentioned that user agents may see different language
  versions despite sending the same 'Accept-Language' request header
  if a new language version of a policy or dataschema has been added.
- The various considerations regarding external (and internal)
  language variants should be concentrated in a single section.
- If solution 2) is specified, some more attributes have to be moved
  to elements, and xml:lang added (see below, this move has to be
  done anyhow).
- 'xml:lang' should also be specified correctly in the ABNF (or the
  ABNF should be removed altogether, because it gives the impression
  you can use it to construct a policy parser, which is not the
  case). In
    "<CONSEQUENCE" [" xml:lang=" LanguageID ] ">" PCDATA "</CONSEQUENCE>"
  (Section 3.3.2 The CONSEQUENCE element, production 11) the
  LanguageID is missing its quotes. This also occurs elsewhere in
  the document.

Wording problems:

- 'a particular language encoding': A language is a language, not an
  encoding.
- 'Content-Language' is not a tag, but a request header field.

Language negotiation also leads to some privacy concerns: 2.5.1 on
the 'Safe Zone' should say something about detection of identity
from the 'Accept-Language' header, similar to the HTTP 1.1 spec
(see http://www.ietf.org/rfc/rfc2616.txt, 15.1.4 Privacy Issues
Connected to Accept Headers).
Maybe there could be some text saying that HEAD requests may be
issued without 'Accept-Language' to get the machine-readable part
of the policy, and if that is reasonably satisfactory, to fetch the
policy in the appropriate language if necessary.


Attributes to Elements
----------------------

Various parts of the policy DTD use free text in attributes.
These should all be changed to elements:

- "description" on <DISPUTES>
- "entity" on <POLICY>
- "short" and "long" on <DATA>
- "other" in "category" on <DATA> (this is currently just a single
  attribute value, but mentions human readable explanations).
  [it also doesn't appear in production [18]]
- "alt"

Besides the reasons given above, this is also important because
some languages and scripts need additional markup (e.g. for
bidirectional text).


Data schemes
------------

- The syntax for 'name' on DATA has to be specified more exactly.
  Currently, it is completely unclear what is allowed and what not.

- The name/address fields still have the same US-centric structure
  and field names as when the I18N WG/IG had a look at them for the
  first time (see Review section on i18n WG page). This is a very
  important point, and must not be ignored. In particular, the
  following is needed for name.first and name.last:
  - Change the short display names to something much more culturally
    neutral. See e.g.
    https://members.icann.org/cgi-bin/atlarge/join.cgi?mode=displayApplicationForm.
  - Add explanations about what each field means. A sentence or two
    per field should be enough.
  - Change the actual field names.

- The spec says that the short display name is defined as the
  concatenation of the various short names, concatenated by *commas*.
  This is inappropriate in general. It should say something like
  'concatenated by a separator appropriate for the language/script
  in question, e.g. a comma for English'.

- Section 4.4.1 Dates, and Appendix 2
  - "All the fields in the date.
    type must be in the same format as those in the most informative
    profile of the time standard ISO8601." What is the 'most
    informative profile'? Please be much more specific.
  - fractionsecond is defined as a 'number'. How is this supposed to
    work? What does e.g. '34' mean? .34? .034? .0034? Why is
    fractionsecond of size 6? (ISO 8601 has no restrictions)
  - "timezone" is defined as being of type "text", and of length 10.
    This is not in any way in accordance with ISO 8601.
  - 'year' is defined as a number of size 6. Why 6? This is not in
    accordance with ISO 8601.

- Section 4.4.4 Telephones
  Is there some implied relationship between "number" and "ext"?
  Please (as mentioned above) document the meaning of each field.
  Please change occurrences of 'Phone' to 'Telephone' or 'Tel'.

- References
  There is no reference for ISO 8601. Please add it.


Various
-------

- The spec has to say that URIs that appear within XML or HTML
  (e.g. discuri, meta content, disputes service, disputes image,
  data dataschema, data typeschema,...) have to be treated as
  specified in http://www.w3.org/TR/charmod/#URIs. Please make sure
  that comparison of URIs happens after this, not before. Please
  also explicitly say that this does not apply to URIs appearing in
  HTTP header fields; the URIs there should always be fully escaped.

- Some examples of domain names are wrong, e.g.
  www.thecoolcatalog.com.jp should be www.thecoolcatalog.co.jp.
  Please verify that all the examples make sense.

- The term 'disclosure' is used in a very specific way, and it takes
  the reader some time to hopefully get what it means. I expected it
  to mean publication of data that should be protected, not, as the
  spec seems to do, publication of privacy policies (for which the
  word 'disclosure' seems to be strange, because companies,...
  should publish such policies anyway). Such a usage will lead to
  problems when the spec is being translated.
  The spec writers should make sure the very specific meaning of
  this term in this context is clearly explained early on.
  [the WG should also make a careful analysis about where
  'disclosure' is used vs. other terms such as 'assert',..., and
  should explain the differences where there are differences]

- The abbreviation 'SSN' is not known to most people outside the US.
  Please expand.


For the I18N WG/IG,
Regards, Martin.

#-#-# Martin J. Du"rst, I18N Activity Lead, World Wide Web Consortium
#-#-# mailto:duerst@w3.org http://www.w3.org/People/D%C3%BCrst
Received on Wednesday, 22 March 2000 04:00:46 UTC
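
Illustration for the comment on Section 3.3.6 (limits stated in
'[UTF-8] characters'): a minimal sketch, in Python, of why a limit
counted in characters differs from one counted in UTF-8 bytes. The
function names and the sample string are hypothetical and do not come
from the P3P spec; only the limit of 127 is taken from the text quoted
in the message. The sketch assumes that a Python str, whose length is
counted in Unicode code points, is an acceptable stand-in for
"characters as defined in the XML Recommendation".

```python
def length_in_characters(text: str) -> int:
    """Count Unicode code points (len() of a Python str does this)."""
    return len(text)


def length_in_utf8_bytes(text: str) -> int:
    """Count the bytes the same text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def fits_limit(text: str, max_chars: int = 127) -> bool:
    """Check a limit expressed in characters, independent of encoding."""
    return length_in_characters(text) <= max_chars


# Hypothetical sample: "Dürst", a space, and three CJK characters.
sample = "D\u00fcrst \u65e5\u672c\u8a9e"

if __name__ == "__main__":
    print(length_in_characters(sample))   # 9 code points
    print(length_in_utf8_bytes(sample))   # 16 bytes in UTF-8
    # 127 accented letters satisfy a 127-character limit,
    # although they occupy 254 bytes in UTF-8.
    print(fits_limit("\u00e9" * 127))
```

A limit phrased in characters gives the same answer whether the
document is served as UTF-8 or UTF-16, which is the point of borrowing
the XPath wording. Note that this sketch counts code points, so a base
letter followed by a combining mark counts as two.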