Re: Testing RFC 4646 values in markup languages

Henri Sivonen wrote:
> On Mar 17, 2008, at 07:21, Karl Dubost wrote:
>
>> In Henri Sivonen's thesis, he said he [partially implemented][3] it 
>> in the HTML 5 conformance checker (Java).
>
>
> I have improved the implementation since then. Validator.nu should now 
> contain a language tag validator that supports the features that are 
> actually used by the actual registry. (I have a vague recollection of 
> not implementing some bit of the RFC that was never used by the actual 
> registry.)

Cool! I added a bogus language tag in xml:lang at
http://www.w3.org/People/fsasaki/
and validated with
http://validator.nu/?doc=http%3A%2F%2Fwww.w3.org%2FPeople%2Ffsasaki%2F&schema=http%3A%2F%2Fs.validator.nu%2Fxhtml10%2Fxhtml-strict.rnc+http%3A%2F%2Fs.validator.nu%2Fxhtml10%2Fxhtml.sch+http%3A%2F%2Fc.validator.nu%2Fall-html4%2F&parser=xmldtd&laxtype=yes

I got an error message saying
Error: Bad value bla-xmlangggggg-test for attribute xml:lang on XHTML 
element html: Subtags must next exceed 8 characters in length.
 From line 2, column 1; to line 2, column 75
ict.dtd">↩<html xmlns="http://www.w3.org/1999/xhtml" 
xml:lang="bla-xmlangggggg-test">↩<head
Syntax of language tag:
An RFC 4646 language tag consists of hyphen-separated ASCII-alphanumeric 
subtags. There is a primary tag identifying a natural language by its 
shortest ISO 639 language code (e.g. en for English) and zero or more 
additional subtags adding precision. The most common additional subtag 
type is a region subtag which most commonly is a two-letter ISO 3166 
country code (e.g. GB for the United Kingdom). IANA maintains a registry 
of permissible subtags.


I think this should be "Subtags must *not* exceed 8 characters in length."
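
For illustration, here is a minimal sketch of the purely syntactic part of 
the check described in the message above: hyphen-separated 
ASCII-alphanumeric subtags, none longer than 8 characters. It is not 
Validator.nu's actual code; the function name `check_subtag_syntax` and the 
message wording are my own assumptions.

```python
import re

# Length limit quoted in the validator's message.
MAX_SUBTAG_LENGTH = 8


def check_subtag_syntax(tag):
    """Return a list of problems; an empty list means the basic shape is fine.

    Hypothetical helper: it checks only the shape of each subtag, not
    whether the subtag is in the IANA registry or in a legal position.
    """
    problems = []
    for subtag in tag.split("-"):
        if not subtag:
            problems.append("Empty subtag.")
        elif not re.fullmatch(r"[A-Za-z0-9]+", subtag):
            problems.append(f"Subtag {subtag!r} is not ASCII-alphanumeric.")
        elif len(subtag) > MAX_SUBTAG_LENGTH:
            problems.append(
                f"Subtag {subtag!r} must not exceed "
                f"{MAX_SUBTAG_LENGTH} characters in length."
            )
    return problems


print(check_subtag_syntax("bla-xmlangggggg-test"))  # one problem: the long subtag
print(check_subtag_syntax("en-GB"))                 # no problems
```

A check like this only catches malformed tags; whether a subtag is actually 
registered is a separate question.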

I added another language tag which is well-formed but not valid, and got 
the following message:
Error: Bad value en-1yz for attribute xml:lang on XHTML element body: 
Found reserved language extension subtag.
 From line 10, column 1; to line 10, column 24
↩</head>↩↩<body xml:lang="en-1yz">↩ <p><
Syntax of language tag:
An RFC 4646 language tag consists of hyphen-separated ASCII-alphanumeric 
subtags. There is a primary tag identifying a natural language by its 
shortest ISO 639 language code (e.g. en for English) and zero or more 
additional subtags adding precision. The most common additional subtag 
type is a region subtag which most commonly is a two-letter ISO 3166 
country code (e.g. GB for the United Kingdom). IANA maintains a registry 
of permissible subtags.

It looks like values from the language subtag registry are actually 
checked, though again the error message is a bit confusing: "Found 
reserved language extension subtag." Maybe "Found language subtag 
which is not registered"?
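
To make the suggestion concrete, here is a rough sketch of a registry 
lookup that reports unregistered subtags by name. It is only an 
illustration, not how Validator.nu works: `SAMPLE_REGISTRY` is a 
hypothetical, hand-picked set rather than the IANA registry, and the 
helper `unregistered_subtags` ignores subtag position and type, extension 
singletons, and private-use subtags.

```python
# Hypothetical, hand-picked sample; the real IANA registry is far larger
# and records a type (language, script, region, variant, ...) per subtag.
SAMPLE_REGISTRY = {"en", "de", "fr", "gb", "us", "latn"}


def unregistered_subtags(tag, registry=SAMPLE_REGISTRY):
    """Return the subtags of `tag` that are missing from `registry`.

    Comparison is case-insensitive, since language tags are not
    case-sensitive.
    """
    return [s for s in tag.split("-") if s.lower() not in registry]


for tag in ("en-GB", "en-1yz"):
    missing = unregistered_subtags(tag)
    if missing:
        print(
            f"Bad value {tag}: found language subtag which is not "
            f"registered: {', '.join(missing)}"
        )
    else:
        print(f"{tag}: all subtags found in the sample registry")
```

Naming the offending subtag in the message would make it easier to see 
which part of the tag was rejected.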

I think it is great to see this application of RFC 4646 and you should 
make the LTRU WG (IETF) aware of this.

Felix

Received on Monday, 17 March 2008 22:31:08 UTC