
Re: How do I say ‘this is not in any language’ in XHTML/HTML

From: Asmus Freytag <asmusf@ix.netcom.com>
Date: Sat, 17 Mar 2007 20:47:10 -0700
Message-ID: <45FCB63E.3090502@ix.netcom.com>
To: CE Whitehead <cewcathar@hotmail.com>
CC: www-international@w3.org

On 3/17/2007 11:41 AM, CE Whitehead replied to my observation on the 
possible meta-states of tagging. What prompted me to write them up is 
that I see so many different classification schemes struggle with the 
same types of situations, and the way the systems deal with them is 
rather ad hoc, as nobody seems to have had all the cases in mind when 
designing each system. That leaves it to practical experience to unearth 
them one by one.

But perhaps such a systematic list already exists elsewhere.
>> You have the classic problem of content tagging here. When applying 
>> tagging to content you can have these cases:
>> 1) you have content that has not been classified
>> 2) you have content for which classification has failed
>> 3) you have content that is known to not fit any of the 
>> classifications (would not this be 2?)
>> 4) you have content to which the classification cannot apply
>> 5) you have content that fits multiple classifications
>> 6) you have content for which the classification depends on context
> (there are some words that cross language boundaries)
>> 7) you have content that has been incorrectly classified
> (and if you know this, then why is it still so? I guess I can make 
> sense of your note below; my question is whether it is a good idea to 
> leave in the old tag if it was wrong.)
>> 8) you have content that has possibly been correctly classified
>> 9) you have content that has been correctly classified
> (o.k.)
>> in the case of tagging natural language content, the label "zxx" is 
>> clearly the correct one for case 4. When there is no linguistic 
>> content, the classification cannot apply.
>> "und" seems  a fine label when you want to convey that tagging has 
>> not happened (case 1 or 2 - the distinction between these is not 
>> necessarily of sufficient interest to carry it forward). But so would 
>> the empty tag if it had been allowed.
> und would be o.k. if there were some language but it has not been 
> determined which or maybe even how many  (I think Addison's comment, 
> that und was not recommended by the rfc when there was no real 
> language, was helpful for und)
I read this as implying that we are in agreement.
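
To make the agreed part concrete, here is a minimal sketch of how the 
cases discussed so far could map onto a language tag value. It is only 
an illustration under stated assumptions: the names TagState and 
lang_attribute are hypothetical, and only cases 1, 2, 4 and 9 are 
modelled, since those are the ones with an obvious label ("und", "zxx", 
or the assigned tag itself).

```python
from enum import Enum


class TagState(Enum):
    """Hypothetical names for some of the numbered cases in the list above."""
    NOT_CLASSIFIED = 1         # case 1: nobody has tried to tag the content
    CLASSIFICATION_FAILED = 2  # case 2: tagging was attempted but failed
    NOT_APPLICABLE = 4         # case 4: no linguistic content at all
    CLASSIFIED = 9             # case 9: a tag was assigned with confidence


def lang_attribute(state, tag=None):
    """Return the language tag to emit for a piece of content.

    Assumption: cases 1 and 2 are not distinguished and both yield "und".
    """
    if state is TagState.NOT_APPLICABLE:
        return "zxx"  # the classification cannot apply
    if state in (TagState.NOT_CLASSIFIED, TagState.CLASSIFICATION_FAILED):
        return "und"  # some language may be present, but it is undetermined
    if tag is None:
        raise ValueError("a classified item needs an explicit language tag")
    return tag


print(lang_attribute(TagState.NOT_APPLICABLE))
print(lang_attribute(TagState.CLASSIFICATION_FAILED))
print(lang_attribute(TagState.CLASSIFIED, "fr"))
```

Folding cases 1 and 2 into the same "und" result mirrors the remark 
above that the distinction between them is not necessarily worth 
carrying forward. A system that did care would need a second field 
next to the tag, since the tag alone cannot express it.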
>> Case 3 could be handled with any form or label that says "no tag 
>> assigned yet", but failing that, if available, a private tag might be 
>> useful.
>> A single string like "OK" is an example that could fit category 5.
> John Cowan did address the off-topic remarks on the word I'd chosen as 
> an example sufficiently.
>> Case 6 is not something that I would expect for language tagging, but 
>> it's a concept that shows up when assigning script tags to runs of text.
>> Case 7 is something that the tagging systems rarely handle, but for 
>> archiving and scholarly purposes it is conceivable that there is a 
>> need to express the concept that the content tagging implies a 
>> re-tagging of existing content because of errors or disagreement with 
>> the previously assigned tags.
> O.k.
>> Case 8 is where you've successfully classified, but there's a margin 
>> of error (perhaps machine classification).
>> Finally, 9 is when you have assigned any of the existing tags with 
>> confidence. (And case 10 could be where this assignment has been 
>> reviewed and verified, but that's again something that belongs more 
>> in the scholarly realm).
>> With any tagging system you need to decide which of the distinctions 
>> here you need to convey. The usual problem is that tagging systems 
>> get designed with 99% of the attention focused on case 9, and there 
>> on issues such as how fine grained the tags need to be and what 
>> features of the content to base the classification on.
>> Just some thoughts,
>> A./
Received on Sunday, 18 March 2007 03:47:35 UTC
