- From: Eric Brunner-Williams in Portland Maine <brunner@nic-naa.net>
- Date: Thu, 11 Apr 2002 23:44:59 -0400
- To: Misha.Wolf@reuters.com
- cc: www-i18n-workshop@w3.org
Misha, Workshop participants,

I've commented on draft-hollenbeck-ietf-xml-guide.txt, written by Scott Hollenbeck, Marshall Rose, and Larry Masinter.

This memo deprecates the use of any encoding other than UTF-8 for use with XML, which I think is inconsistent with rfc2277.

It also deprecates mechanisms for language indication that do not rely upon iso639 or iso3166, and as neither of my two non-European languages has a 639 code, and neither of my non-State polities has a 3166 code, I must "non-hum".

It also continues an abuse of language that confuses "i18n" with Unicode (or encodings generally), and fails to state the collation issue, which arises when objects are "named" using strings, and names are matched, searched, or sorted.

The text of the memo is available at:
http://www.imc.org/ietf-xml-use/draft-hollenbeck-ietf-xml-guide.{html,txt}

Comments to me, or this list, or to that list, if so inclined. I'll get them either way.

Eric

------- Forwarded Message

Message-Id: <200204111713.g3BHD4X74227@nic-naa.net>
To: ietf-xml-use@imc.org
Subject: Comments on Section 5
Date: Thu, 11 Apr 2002 13:13:04 -0400
From: Eric Brunner-Williams in Portland Maine <brunner@nic-naa.net>
Sender: owner-ietf-xml-use@mail.imc.org
Precedence: bulk
List-Archive: <http://www.imc.org/ietf-xml-use/mail-archive/>
List-Unsubscribe: <mailto:ietf-xml-use-request@imc.org?body=unsubscribe>
List-ID: <ietf-xml-use.imc.org>

5. Internationalization Considerations

This section describes internationalization considerations for the use of XML to represent data in IETF protocols. Readers should be familiar with IETF policy on the use of character sets and languages as described in RFC 2277 [3].

Suggestion:

This section describes character set and language attribute declarations available to authors of protocols using XML, and the text directionality attribute declarations available using XHTML.
Readers are encouraged to be familiar with RFC 2277, which requires that protocols MUST identify which charset is used and suggests protocols contain a mechanism for charset negotiation, and additionally requires that UTF-8 support MUST be possible. RFC 2277 also requires that protocols MUST provide a mechanism capable of carrying information about the language of that text, and also suggests protocols contain a mechanism for language naming, and for language negotiation, and additionally requires a default value for language, which MUST be understandable by an English-speaking person.

This section does not describe considerations for the use of locales in XML to represent character properties, such as collation orderings, word breaking or formats for dates, numbers, or currency.

[Meta-Comment: I doubt the wisdom of leaving locales out of IETF i18n boiler-plate, and it's my experience that most IETF contributors who encounter i18n casually read 2277 as an issue-free license cum requirement to use Unicode.]

5.1 Character Sets

XML provides native support for encoding information using the Unicode character set and its more compact representations including UTF-8 [4] and UTF-16 [26]. Other encodings are also supported and can be specified using an "encoding" attribute in a document's XML declaration. It is strongly recommended that UTF-8 be mandated for protocols that represent data using XML.

Suggestion:

... UTF-8 [4] and UTF-16 [26]. Other encodings are also supported and may be specified using the encoding pseudo-attribute in the xml declaration at the start of a document or the text declaration at the start of an entity.

Examples:

    <?xml version="1.0" encoding='iso-8859-1' ?>
    <?xml version="1.0" encoding='iso-8859-2' ?>
    ...
    <?xml version="1.0" encoding='iso-2022-JP' ?>
    <?xml version="1.0" encoding='Shift-JIS' ?>
    <?xml version="1.0" encoding='EUC-JP' ?>
    ...
    <?xml version="1.0" encoding='i-mingo' ?>

[Comment: Even if I agreed with the last sentence of the original paragraph, real examples are a good thing.]

Guidelines for the use of XML declarations can be found in Section 4.1.

[Comment: I don't see the import of this back-reference. How does sec. 4.1 provide guidelines for use, and meaningfully for charsets?]

... If an XML declaration is omitted, it is strongly urged to require use of a consistent character set, and to require UTF-8 as the most appropriate character set. If an XML declaration is allowed, it is again strongly urged to require use of a consistent character set, to require UTF-8 as the most appropriate character set, and to recommend inclusion of an "encoding" attribute that explicitly notes use of UTF-8 encoding.

Suggestion:

... and to require UTF-8 as the most appropriate character set, if it is in fact the most appropriate character set. ...

[Comment: The original text is over-reaching. Either it is repeating, and removing the conditional applicability from, 2277, or it is promoting a universal structured data over protocol-specific data. Now the W3C can discard the encoding pseudo-attribute, and mandate UTF-8, in XML, but that's their business. Ours is interoperability, legacy systems included. This shouldn't seem to be an end-run on the non-UTF-8 bits of 2277 and 3066. Remove the UTF-8 theocracy, and retain secular data exchange. Thx.]

5.2 Language Declaration

[Comment: The reference to http://www.w3.org/TR/2000/REC-xml-20001006, sec. 2.12, refers to rfc1766, which in turn refers to iso639 and iso3166. The substitution of International Treaty Organization normative references and Nation State identifiers for the purpose of interoperable text concerns me. My concerns of course may be misplaced, I'm just an ignorant Indian.
I suggest that we invite more comments, e.g.,

To: golla@ssila.org <SSILA LIST>, endangered-languages-l@cleo.murdoch.edu.au
Subject: Request for Comments: Proposed IETF Guidelines for Language Identifiers in protocols using XML
Body

The IETF is considering a proposal for the authors of protocols using XML to limit the identifiers for human languages to the set defined in iso639-1 [1] and iso3166 [2]. This would have the effect of discouraging the development of internet protocols which use XML for structured data exchange, and which use identifiers not defined in these references.

For reference, using North America as an example:

The set of Indigenous Languages of North America for which an iso639-1 identifier exists is:
ik (inupiaq), iu (inuktitut), nv (navajo), qu (quechua)

The set of additional identifiers for which a value exists in iso3166 for North America is:
gl (greenland), ca (canada), us (united states of america), mx (united states of mexico)

Comments, particularly those that describe use cases for which the above is unsuitable, and any alternatives, should be sent to the ietf-xml-use@imc.org mailing list. To join the list, send a message to "ietf-xml-use-request@imc.org" with the word "subscribe" in the body of the message. There is a web site for the list archives at http://www.imc.org/ietf-xml-use/.

[1] http://linux.infoterm.org/infoterm-e/i-infoterm.htm?raiso639-1_start.htm~Mitte
[2] http://www.din.de/gremien/nas/nabd/iso3166ma/codlstp1/en_listp1.html

EOT

SSILA == Society for the Study of the Indigenous Languages of the Americas. Both scholarly and affected community input would be useful.

The alternative is an inapplicability statement, something of the form "the use of minority, rare, infrequently taught, endangered, or extinct languages is deprecated in IETF standards-track protocols that use XML for the delivery of structured text." ]

Suggestion:

language used to represent data in an XML document.
The xml:lang attribute is defined in section 2.12 of [8], and has values defined in [ISO 639]. [Add to References]

It is strongly recommended that protocols representing data in a human language make use of an xml:lang attribute if the XML instance might be interpreted in language-dependent contexts, and if the language identifier is defined in [ISO 639].

Meta-Nit: 2277, sec 6, lines 325 - 328, suggests that "Internationalization considerations" be placed next to the Security Considerations section.

Suggestion: Reorder sections 6 and 7.

Eric

------- End of Forwarded Message
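
As a small illustration of the two mechanisms discussed in sections 5.1 and 5.2 of the forwarded comments (the encoding pseudo-attribute and xml:lang), here is a minimal sketch using Python's standard xml.etree.ElementTree. The function name parse_greeting and the sample document are hypothetical and appear in neither the draft nor the message. The sketch assumes a parser that accepts iso-8859-1; the XML specification only obliges processors to read UTF-8 and UTF-16, so other encodings in the examples above may not be available in a given parser.

```python
import xml.etree.ElementTree as ET

# Namespace bound to the reserved "xml" prefix; ElementTree exposes
# xml:lang under this qualified name.
XML_NS = "http://www.w3.org/XML/1998/namespace"


def parse_greeting(text, encoding, lang):
    """Build a one-element document in `encoding`, parse it, return (text, lang).

    Hypothetical helper for illustration only.
    """
    doc = (
        f'<?xml version="1.0" encoding="{encoding}"?>'
        f'<greeting xml:lang="{lang}">{text}</greeting>'
    ).encode(encoding)
    # The parser reads the encoding pseudo-attribute to decode the bytes.
    root = ET.fromstring(doc)
    # The parser does not validate the language identifier; it is returned as written.
    return root.text, root.get(f"{{{XML_NS}}}lang")


if __name__ == "__main__":
    print(parse_greeting("café", "iso-8859-1", "fr"))
    print(parse_greeting("café", "utf-8", "i-mingo"))
```

The same text round-trips under either declared encoding, and the xml:lang value is carried through unchanged whether or not it is an ISO 639 code. Whether a protocol should accept such values is the policy question raised above, not something the parser decides.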
Received on Thursday, 11 April 2002 23:52:01 UTC