- From: Asmus Freytag <asmusf@ix.netcom.com>
- Date: Sat, 17 Mar 2007 20:47:10 -0700
- To: CE Whitehead <cewcathar@hotmail.com>
- CC: www-international@w3.org
On 3/17/2007 11:41 AM, CE Whitehead replied to my observation on the
possible meta-states of tagging.

What prompted me to write them up is that I see so many different
classification schemes struggle with the same types of situations, and
the way the systems deal with them is rather ad hoc, as nobody seems to
have had all the cases in mind when designing each system. That leaves
it to practical experience to unearth them one by one. But perhaps such
a systematic list already exists elsewhere.

>> You have the classic problem of content tagging here. When applying
>> tagging to content you can have these cases:
>>
>> 1) you have content that has not been classified
>> 2) you have content for which classification has failed
>> 3) you have content that is known to not fit any of the
>> classifications (would not this be 2?)
>> 4) you have content to which the classification cannot apply
>> 5) you have content that fits multiple classifications
>> 6) you have content for which the classification depends on context
> (there are some words that cross language boundaries)
>> 7) you have content that has been incorrectly classified
> (and you know this, then why is it still so ???? I guess I can make
> sense of your note below, my question is is it a good idea to leave in
> the old tag if it was wrong??? )
>> 8) you have content that has possibly been correctly classified
>> 9) you have content that has been correctly classified
> (o.k.)
>>
>> in the case of tagging natural language content, the label "zxx" is
>> clearly the correct one for case 4. When there is no linguistic
>> content, the classification cannot apply.
>>
>> "und" seems a fine label when you want to convey that tagging has
>> not happened (case 1 or 2 - the distinction between these is not
>> necessarily of sufficient interest to carry it forward). But so would
>> the empty tag if it had been allowed.
> und would be o.k. if there were some language but it has not been
> determined which or maybe even how many (I think Addison's comment,
> that und was not recommended by the rfc when there was no real
> language, was helpful for und)

I read this as implying that we are in agreement.

>>
>> Case 3 could be handled with any form or label that says "no tag
>> assigned yet", but failing that, if available, a private tag might be
>> useful.
>>
>> A single string like "OK" is an example that could fit category 5.
>
> John Cowan did address the off-topic remarks on the word I'd chosen as
> an example sufficiently,
>>
>> Case 6 is not something that I would expect for language tagging, but
>> it's a concept that shows up when assigning script tags to runs of text.
>>
>> Case 7 is something that the tagging systems rarely handle, but for
>> archiving and scholarly purposes it is conceivable that there is a
>> need to express the concept that the content tagging implies a
>> re-tagging of existing content because of errors or disagreement with
>> the previously assigned tags.
>
> O.k.
>>
>> Case 8 is where you've successfully classified, but there's a margin
>> of error (perhaps machine classification).
>> Finally, 9 is when you have assigned any of the existing tags with
>> confidence. (And case 10 could be where this assignment has been
>> reviewed and verified, but that's again something that belongs more
>> in the scholarly realm).
>>
>> With any tagging system you need to decide which of the distinctions
>> here you need to convey. The usual problem is that tagging systems
>> get designed with 99% of the attention focused on case 9, and there
>> on issues such as how fine grained the tags need to be and what
>> features of the content to base the classification on.
>>
>> Just some thoughts,
>> A./
>>>
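
For illustration only, the nine cases and the special labels discussed
in this thread could be sketched in Python roughly as follows. The names
TagState, PRIVATE_NO_FIT and fallback_language_tag are hypothetical, as
is the private-use tag "x-nofit"; only "zxx" and "und" come from the
discussion itself. Cases 5 through 9 return None here, meaning that an
ordinary tag (or some richer annotation) would be needed.

```python
from enum import Enum
from typing import Optional


class TagState(Enum):
    """The nine cases from the quoted list (case 10, 'verified', is omitted)."""
    NOT_CLASSIFIED = 1
    CLASSIFICATION_FAILED = 2
    FITS_NO_CLASSIFICATION = 3
    NOT_APPLICABLE = 4
    MULTIPLE = 5
    CONTEXT_DEPENDENT = 6
    INCORRECT = 7
    POSSIBLY_CORRECT = 8
    CORRECT = 9


# Assumption: a made-up private-use tag for case 3; not taken from any registry.
PRIVATE_NO_FIT = "x-nofit"


def fallback_language_tag(state: TagState) -> Optional[str]:
    """Return the special label suggested in the thread, or None if no
    special label was proposed for that state."""
    if state is TagState.NOT_APPLICABLE:
        # Case 4: no linguistic content, so the classification cannot apply.
        return "zxx"
    if state in (TagState.NOT_CLASSIFIED, TagState.CLASSIFICATION_FAILED):
        # Cases 1 and 2: tagging has not happened; the two are not distinguished.
        return "und"
    if state is TagState.FITS_NO_CLASSIFICATION:
        # Case 3: a private tag might be useful, if available.
        return PRIVATE_NO_FIT
    return None
```

A system that also needs to convey cases 7 or 8 would have to carry the
state alongside the tag, since a single label cannot express both.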
Received on Sunday, 18 March 2007 03:47:35 UTC