(wrong string) ‘this is not in any language’ in XHTML/HTML

From: CE Whitehead <cewcathar@hotmail.com>
Date: Sat, 17 Mar 2007 14:41:09 -0400
Message-ID: <BAY114-F217281754F0DB44208546CB3700@phx.gbl>
To: asmusf@ix.netcom.com
Cc: www-international@w3.org

>On 3/13/2007 12:12 PM, Richard Ishida wrote:
>>This is an attempt to summarise and move forward some ideas in a thread on 
>>www-international@w3.org by Christophe Strobbe, Martin Duerst, Bjoern 
>>Hoermann and Tex Texin.
>You have the classic problem of content tagging here. When applying tagging 
>to content you can have these cases:
>1) you have content that has not been classified
>2) you have content for which classification has failed
>3) you have content that is known to not fit any of the classifications
>(would not this be 2?)
>4) you have content to which the classification cannot apply
>5) you have content that fits multiple classifications
>6) you have content for which the classification depends on context
(there are some words that cross language boundaries)
>7) you have content that has been incorrectly classified
(and if you know this, why is it still tagged that way? I guess I can make 
sense of your note below; my question is whether it is a good idea to leave 
in the old tag if it was wrong.)
>8) you have content that has possibly been correctly classified
>9) you have content that has been correctly classified
>in the case of tagging natural language content, the label "zxx" is clearly 
>the correct one for case 4. When there is no linguistic content, the 
>classification cannot apply.
>"und" seems  a fine label when you want to convey that tagging has not 
>happened (case 1 or 2 - the distinction between these is not necessarily of 
>sufficient interest to carry it forward). But so would the empty tag if it 
>had been allowed.
und would be o.k. if there were some language but it has not been determined 
which, or maybe even how many.
(I think Addison's comment, that und was not recommended by the RFC when 
there was no real language, was helpful for und.)
>Case 3 could be handled with any form or label that says "no tag assigned 
>yet", but failing that, if available, a private tag might be useful.
>A single string like "OK" is an example that could fit category 5.

("OK" is sort of English/American, but I think a lot of people borrow it; I 
do not know whether I'd classify it as English.
Incidentally, 'oc' [pronounced /) k/, a little like /ak/; I do not have the 
backwards c here] means 'yes' in Oc, or Occitan (southern France, northern 
Spain, western Italy, etc.), and in the Middle Ages the Oc people did have 
close ties to the English and even intermarried, but people I've talked to 
today argue for a different evolution of o.k.
But anyway, I think other classifications would be o.k. too besides mul, but
>Case 6 is not something that I would expect for language tagging, but it's 
>a concept that shows up when assigning script tags to runs of text.
>Case 7 is something that the tagging systems rarely handle, but for 
>archiving and scholarly purposes it is conceivable that there is a need to 
>express the concept that the content tagging implies a re-tagging of 
>existing content because of errors or disagreement with the previously 
>assigned tags.

>Case 8 is where you've successfully classified, but there's a margin of 
>error (perhaps machine classification).
>Finally, 9 is when you have assigned any of the existing tags with 
>confidence. (And case 10 could be where this assignment has been reviewed 
>and verified, but that's again something that belongs more in the scholarly 
>With any tagging system you need to decide which of the distinctions here 
>you need to convey. The usual problem is that tagging systems get designed 
>with 99% of the attention focused on case 9, and there on issues such as 
>how fine grained the tags need to be and what features of the content to 
>base the classification on.
>Just some thoughts,
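
The case-to-tag mapping discussed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Case` names and the `tag_for` helper are hypothetical, "mul" for case 5 is one possible choice rather than a settled one, and "x-untagged" is an invented private-use tag standing in for whatever private tag a project might pick for case 3.

```python
from enum import Enum
from typing import Optional


class Case(Enum):
    """Hypothetical names for cases 1-5 in the list above."""
    NOT_CLASSIFIED = 1          # case 1: tagging has not happened
    CLASSIFICATION_FAILED = 2   # case 2: tagging was tried and failed
    FITS_NO_CLASSIFICATION = 3  # case 3: known to fit none of the tags
    NOT_APPLICABLE = 4          # case 4: no linguistic content at all
    MULTIPLE = 5                # case 5: fits several classifications


def tag_for(case: Case, private_tag: Optional[str] = None) -> Optional[str]:
    """Return a language tag for a case, or None if nothing suitable exists."""
    if case is Case.NOT_APPLICABLE:
        return "zxx"            # 'No linguistic content'
    if case in (Case.NOT_CLASSIFIED, Case.CLASSIFICATION_FAILED):
        return "und"            # 'undetermined'; cases 1 and 2 are not distinguished
    if case is Case.MULTIPLE:
        return "mul"            # assumption: one possible label for strings like "OK"
    # Case 3: fall back to a caller-supplied private-use tag, if any.
    return private_tag


print(tag_for(Case.NOT_APPLICABLE))
print(tag_for(Case.FITS_NO_CLASSIFICATION, private_tag="x-untagged"))
```

Cases 6 to 9 are left out on purpose, since they describe confidence or correctness of an assigned tag rather than which tag to assign.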
>>You should always use the lang and/or xml:lang attributes in HTML or XHTML 
>>to identify the human language of the content so that applications such as 
>>voice browsers, style sheets, and the like can process that text. (See 
>>Declaring Language in XHTML and HTML[1] for the details.)
>>You can override that language setting for a part of the document that is 
>>in a different language, e.g. some French quotation in an English document, 
>>by using the same attribute(s) around the relevant bit of text.
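
To make the inheritance and override behaviour described above concrete, here is a minimal Python sketch that walks a small HTML fragment and reports the language in effect for each run of text. The `LangResolver` class, the `resolve` helper, and the sample markup are all made up for illustration; the sketch assumes the markup contains no void elements such as `<br>`, and it is not how any particular browser or voice browser is implemented.

```python
from html.parser import HTMLParser


class LangResolver(HTMLParser):
    """Record each text run with the language inherited from its ancestors."""

    def __init__(self):
        super().__init__()
        self.stack = [None]  # language in effect for each open element
        self.runs = []       # (language, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # Prefer xml:lang, then lang, else inherit from the parent element.
        lang = attr_map.get("xml:lang", attr_map.get("lang", self.stack[-1]))
        self.stack.append(lang)

    def handle_endtag(self, tag):
        # Assumes well-nested markup without void elements.
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.runs.append((self.stack[-1], text))


def resolve(markup):
    """Return the (language, text) runs found in an HTML fragment."""
    parser = LangResolver()
    parser.feed(markup)
    parser.close()
    return parser.runs


SAMPLE = (
    '<html lang="en"><body>'
    '<p>She said <q lang="fr">bonjour</q> and left.</p>'
    '</body></html>'
)

for lang, text in resolve(SAMPLE):
    print(lang, repr(text))
```

With this sample, the text inside the `<q>` element is reported as "fr" and the text around it as "en", which is the override effect the paragraph describes.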
>>Suppose you have some text that is not in any language, such as type 
>>samples, part numbers, perhaps program code. How would you say that this 
>>was no language in particular?
>>There are a number of possible approaches:
>>    1. A few years ago we introduced into the XML spec the idea that 
>>xml:lang="" conveys that ‘there is no language information 
>>available’. (See 2.12 Language Identification[2])
>>    2. An alternative is to use the value ‘und’, for ‘undetermined’.
>>    3. In the IANA Subtag Registry[3] there is another tag, ‘zxx’, 
>>that means ‘No linguistic content’. Perhaps this is a better choice. 
>>It has my vote at the moment.
>>Is ‘no language information available’ suitable to express ‘this is 
>>not a language’? My feeling is not.
>>If it were appropriate, there are some other questions to be answered 
>>here. With HTML an empty string value for the lang or xml:lang attribute 
>>produces a validation error.
>>It seems to me that the validator should not produce an error for 
>>xml:lang="". It needs to be fixed.
>>I’m not clear whether the HTML DTD supports an empty string value for 
>>lang. If so, then presumably the validator needs to be fixed. If not, then 
>>this is not a viable option, since you’d really want both lang and 
>>xml:lang to have the same values.
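
As a small check on the XML side of this question, the following Python sketch parses fragments with the standard library's `xml.etree.ElementTree` and reads back the `xml:lang` value. It only shows that an empty `xml:lang` is accepted by a non-validating XML parser and is distinguishable from a missing attribute; it performs no DTD validation, so it says nothing about what the HTML validator does. The `xml_lang_of` helper and the part-number sample are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predeclared "xml:" prefix under this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def xml_lang_of(markup):
    """Return the root element's xml:lang value, or None if it is absent."""
    return ET.fromstring(markup).get(XML_LANG)


# Hypothetical part number marked up in three different ways.
print(repr(xml_lang_of('<span xml:lang="">AB-1234-X</span>')))     # empty value
print(repr(xml_lang_of('<span xml:lang="zxx">AB-1234-X</span>')))  # explicit tag
print(repr(xml_lang_of('<span>AB-1234-X</span>')))                 # no attribute
```

The empty string and the absent attribute come back as different values, so a consumer could in principle tell "no language information available" apart from "nothing declared here".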
>>Would the description ‘undetermined’ fit this case, given that it is 
>>not a language at all? Again, it doesn’t seem right to me, since 
>>‘undetermined’ seems to suggest that it is a language of some sort, 
>>but we’re not sure which.
>>‘zxx’ seems to be the right choice to me. It would produce no validation 
>>issues. The only issue is perhaps that it’s not terribly memorable.
>>[1] http://www.w3.org/International/tutorials/language-decl/
>>[2] http://www.w3.org/TR/REC-xml/#sec-lang-tag
>>[3] http://www.iana.org/assignments/language-subtag-registry
>>Richard Ishida
>>Internationalization Lead
>>W3C (World Wide Web Consortium)
>>  http://www.w3.org/People/Ishida/

Received on Saturday, 17 March 2007 18:41:24 UTC