Summary: xml:lang validity and RFC 1766 refs to outdated codes [long] from Mike Brown on 2000-08-07 (xml-editor@w3.org from July to September 2000)

From: Mike Brown <mbrown@corp.webb.net>
Date: Mon, 7 Aug 2000 15:42:33 -0600
To: "'unicode@unicode.org'" <unicode@unicode.org>
Cc: "'xml-dev@lists.xml.org'" <xml-dev@lists.xml.org>, "'xml-editor@w3.org'" <xml-editor@w3.org>
Message-ID: <8D96EDA0AC04D31197B400A0C96C1480F7076B@OSSEX1.webb.net>

SUMMARY
=======

It has been argued that strict interpretations of RFC 1766 would have the
effect of requiring values for XML 1.0's xml:lang attribute and HTML 4.01's
lang attribute to be created from outdated language and country code lists.

In a post to the ietf-languages list on 02 Aug 2000, Harald Alvestrand,
author of RFC 1766, stated that his normative references to language and
country code lists (e.g. ISO 639:1988 and ISO 3166:1988) are intended to be
to those specific lists *and their future revisions*. This clarification
allows for leniency in determination of both RFC 1766 conformance and
external validity of language identifiers.

This message is intended as a followup for the benefit of people who are not
on the ietf-languages list and didn't see Harald's post. Bail out now if you
never thought there was any ambiguity.


BACKGROUND
==========

In a post to the Unicode list on 25 Apr 2000, Elliotte Rusty Harold wrote,
regarding Official ISO 639 changes:
> Has anybody noticed that XML 1.0 requires 2-letter and
> forbids three-letter language codes? [...]

His post was cc'd to xml-dev@xml.org and xml-editor@w3.org. The bulk of the
thread continued on xml-dev, but also spilled over to the ietf-languages
list. Relevant archives:

http://lists.w3.org/Archives/Public/xml-editor/2000AprJun/thread.html
http://lists.xml.org/archives/xml-dev/200004/threads.html
http://www.alvestrand.no/archives/ietf-languages/ietf-languages.0004
http://www.unicode.org/Public/MailArchive/B0015.txt.Z
   [large file; nothing here that isn't in the other archives]

The discussion that ensued more or less left the issue at "wait until the
successor to RFC 1766 is finished", and Tim Bray's recollection was that
3-letter codes were excluded because at the time (1997) ISO was dragging
their heels on related matters.

In the discussion, a separate but related issue was raised:

XML 1.0 says that xml:lang attributes must match production 33 for
well-formedness -- on that all seem to agree. But XML 1.0's normative
reference to RFC 1766 and the language of that RFC together *could* imply
that the 2-letter language code portion of xml:lang values must not only be
2 ASCII characters, but must also match ISO 639 2-letter language codes in
order to be valid.
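
As a rough illustration of the well-formedness half of that distinction, here
is a minimal Python sketch of a purely syntactic check. It assumes my reading
of productions 33-38 in the first edition of XML 1.0 (a 2-letter code, or an
"i-"/"x-" prefixed code, followed by optional alphabetic subcodes). The name
matches_production_33 is made up for this example and is not taken from any
parser.

```python
import re

# Productions 33-38 of XML 1.0 (first edition), as I read them:
#   LanguageID ::= Langcode ('-' Subcode)*
#   Langcode   ::= ISO639Code | IanaCode | UserCode
#   ISO639Code ::= two ASCII letters
#   IanaCode   ::= ('i' | 'I') '-' letters+
#   UserCode   ::= ('x' | 'X') '-' letters+
#   Subcode    ::= letters+
_LETTER = r"[A-Za-z]"
_LANGCODE = rf"(?:{_LETTER}{{2}}|[iI]-{_LETTER}+|[xX]-{_LETTER}+)"
_LANGUAGE_ID = re.compile(rf"{_LANGCODE}(?:-{_LETTER}+)*")


def matches_production_33(value: str) -> bool:
    """Return True if value has the *shape* of a LanguageID.

    This is a syntax check only: it never consults ISO 639 or any
    other code list, so "zz" passes just as "en" does.
    """
    return _LANGUAGE_ID.fullmatch(value) is not None
```

Note that a 3-letter primary code such as "eng" fails this check, which is the
restriction Elliotte's original post was about, while an unassigned 2-letter
code passes. Whether the latter should also be rejected is the validity
question discussed here.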

Although I cannot find a reference in any of the list archives, someone (not
me) pointed out that if the language of RFC 1766 is subjected to strict
interpretation, then RFC 1766 conforming language identifiers that use
2-letter language codes must get their codes from ISO 639:1988, now known as
ISO 639-1:1988. 

That was the first edition of ISO 639. The language codes listed in it are
old and have been superseded many times since then. However, the strict
interpretation of the specs as they are currently written would require that
this old list be used, and even Tim Bray's annotated XML specification
references it (see the circle-U link in Appendix A.1, next to ISO 639).

Opinions were put forth on the issue of whether such validity was required
-- see posts by G. Ken Holman and David Brownell -- but apparently, no
decisive clarifications were issued.


NEW NEWS
========

I brought the issue of xml:lang validity possibly being bound to an original
version of ISO 639 to the attention of Martin J. Duerst, in reference to his
page at http://www.w3.org/International/O-HTML-tags.html. Martin was kind
enough to pass my concerns along to the ietf-languages list for comment.

I am pleased that in response, RFC 1766's author, Harald Tveit Alvestrand,
stated in an ietf-languages post on 02 Aug 2000: "The intent of RFC 1766 and
the current draft is that the lists referred to are the published versions +
any later changes. I refuse to put in references to unpublished documents,
but that's my only religion on the matter; replacement text is welcome." It
may not be a formal statement in an RFC, but this is good enough for me.

There still remains the unclear issue of whether xml:lang validity really
should correlate to strict RFC 1766 conformance, down to the selection of
language codes from ISO 639-1.

In either case, it does not seem unreasonable, especially in
light of Harald's clarification, to expect that if a validating XML parser
checks the 2-letter language code portion of an xml:lang value against an
ISO 639 list, then it will use the most current list available to it.
Granted, the impracticality of keeping up with changes may mean that no
parser will actually do this, but for the pedantic XML document author or
authoring tool, it's good information to know.
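
For the authoring-tool case, a check of this kind might look like the
following Python sketch. It is only an illustration: the function names are
hypothetical, and the two code sets are tiny hand-picked samples, not the real
ISO 639 lists. The iw/he pair reflects, to my knowledge, the commonly cited
change of the code for Hebrew. The point is that the list is supplied by the
caller, so the tool can be handed whatever revision is most current.

```python
from typing import Iterable, Optional

# ASSUMPTION: illustrative samples only, NOT the actual ISO 639 lists.
SAMPLE_ORIGINAL_CODES = {"en", "fr", "iw"}
SAMPLE_CURRENT_CODES = {"en", "fr", "he"}


def primary_code(lang_value: str) -> str:
    """Return the part before the first hyphen, lowercased.

    RFC 1766 treats tags as case-insensitive, so "EN-us" yields "en".
    """
    return lang_value.split("-", 1)[0].lower()


def code_is_listed(lang_value: str, listed_codes: Iterable[str]) -> Optional[bool]:
    """Check the 2-letter primary code against a caller-supplied list.

    Returns None when the primary code is not two letters long
    (for example the "i" or "x" prefixes), because the 2-letter
    list has nothing to say about those values.
    """
    code = primary_code(lang_value)
    if len(code) != 2:
        return None
    return code in set(listed_codes)
```

With these samples, "he" is rejected against the original list but accepted
against the current one, which is the practical difference Harald's
clarification makes.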

   - Mike
____________________________________________________________________
Mike J. Brown, software engineer at         My XML/XSL resources:
webb.net in Denver, Colorado, USA           http://www.skew.org/xml/
Received on Monday, 7 August 2000 17:42:10 UTC