[Bug 14709] New: lang tag validation is insufficiently specified from bugzilla@jessica.w3.org on 2011-11-06 (public-html@w3.org from November 2011)

From: <bugzilla@jessica.w3.org>
Date: Sun, 06 Nov 2011 19:52:38 +0000
To: public-html@w3.org
Message-ID: <bug-14709-2495@http.www.w3.org/Bugs/Public/>
http://www.w3.org/Bugs/Public/show_bug.cgi?id=14709

           Summary: lang tag validation is insufficiently specified
           Product: HTML WG
           Version: unspecified
          Platform: PC
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: HTML5 spec (editor: Ian Hickson)
        AssignedTo: ian@hixie.ch
        ReportedBy: jdaggett@mozilla.com
         QAContact: public-html-bugzilla@w3.org
                CC: mike@w3.org, public-html-wg-issue-tracking@w3.org,
                    public-html@w3.org


In the section "The lang and xml:lang attributes", which describes the
behavior of language tags on HTML elements, the wording makes it difficult
to determine exactly if or when some form of language tag validation should
occur.

The spec currently contains this wording:

  If the resulting value is not a recognized language tag, then
  it must be treated as an unknown language having the given
  language tag, distinct from all other languages. For the
  purposes of round-tripping or communicating with other services
  that expect language tags, user agents should pass unknown
  language tags through unmodified.

  Thus, for instance, an element with lang="xyzzy" would be
  matched by the selector :lang(xyzzy) (e.g. in CSS), but it
  would not be matched by :lang(abcde), even though both are
  equally invalid. Similarly, if a Web browser and screen reader
  working in unison communicated about the language of the
  element, the browser would tell the screen reader that the
  language was "xyzzy", even if it knew it was invalid, just in
  case the screen reader actually supported a language with that
  tag after all.

To give a concrete example of where this leads to fuzzy interpretation
in implementations, consider the language tag 'mya', the ISO 639-3
language code for Burmese.  ISO 639-1 defines a two-letter code, 'my',
for the same language, so the valid BCP47 language tag is 'my'.  So what's
the exact behavior for user agents that call APIs that make use of language
tag information, for example OpenType APIs that use OpenType language
tags?  Should the language tag be validated and a default used if no valid
tag exists?  Or should 'mya' be passed through to these APIs just in case
it might be a supported OpenType tag?  The spec can be read either way,
especially given the example of a screen reader which "actually supported
a language with that tag after all".

I think the wording needs to be stronger than this.  The spec
specifically needs to say that when the language is used, if it
doesn't match a BCP47 language tag (as is the case for 'mya'), then the
only interpretation is that it's the equivalent of an unknown language
when passed along to an API.  As is, the spec merely defines the
*expectation* that the language code is a BCP47 code but allows for an
entirely different language tag format to be used in its place.
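
The two readings of the 'mya' case, and the stricter one proposed here,
can be sketched roughly as follows.  This is only an illustration, not
anything taken from the spec or from an implementation: the function names,
the tiny tag tables, and the use of None to mean "unknown language" are
all assumptions made for the example.

```python
# Hypothetical, deliberately tiny tables; real registries are far larger.
RECOGNIZED_BCP47 = {"en", "ja", "my"}   # 'my' is Burmese; 'mya' is absent
OPENTYPE_LANGSYS = {"my": "BRM "}       # illustrative OpenType language tag


def pass_through(lang):
    """Reading 1: hand the author's value to the API unmodified."""
    return lang


def validate(lang):
    """Reading 2: an unrecognized primary subtag means 'unknown' (None)."""
    primary = lang.lower().split("-")[0]
    return lang if primary in RECOGNIZED_BCP47 else None


def opentype_langsys(lang):
    """Look up an OpenType tag; None stands for the default language system."""
    if lang is None:
        return None
    return OPENTYPE_LANGSYS.get(lang.lower().split("-")[0])


# Compare what each reading would hand to a downstream API.
for tag in ("my", "mya"):
    print(tag, repr(pass_through(tag)), repr(validate(tag)))
```

Under the first reading the API receives 'mya' and is free to interpret it
however its own tag table happens to; under the second it always receives
the "unknown" marker, so the result no longer depends on the API's private
tag format.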

Received on Sunday, 6 November 2011 19:54:44 UTC