- From: Dan Brickley <danbri@w3.org>
- Date: Sat, 12 Jun 2004 11:10:37 -0400
- To: public-swbp-wg@w3.org, public-esw-thes@w3.org
- Cc: em@w3.org
[resent with public-swbp-wg@w3.org spelled correctly] * Dan Brickley <danbri@w3.org> [2004-06-12 11:09-0400] > > (crossposting to SW Best Practices list and the public-esw-thes list) > > Forward from RSS and Syndication lists. Bill Kearney has been doing some > digging into the way the dc:subject property is being deployed in > RSS feeds. Short version: it's a mess. > > Part of the problem here, I think, is that there has been a vague > expectation floating around that RDFS/OWL class and property hierarchies > are the W3C SW stack's last word w.r.t. classififying things. RDF and DC > people haven't really finished off a good, clear and intuitive model for > using dc:subject with controlled vocabularies. I'm hoping our work here > can help get things back on track. > > Characterising the topic(s) of document-like content is helped by RDF > and by OWL, but there's much that can be done that doesn't naturally > find expression in a classes-and-properties model. Something like SKOS, > and an agreed model for expressing thesaurus-like content within > RDF (including via dc:subject) should go some way towards these > problems. But only so long as convenient utilities for authoring better > dc:subject data finds its way into mass-market tools (for blogging and > HTML editors). > > Dan > > ----- Forwarded message from Bill Kearney <ml_yahoo@ideaspace.net> ----- > > From: Bill Kearney <ml_yahoo@ideaspace.net> > Date: Sat, 12 Jun 2004 10:50:24 -0400 > To: syndic8@yahoogroups.com, rss-dev@yahoogroups.com > Subject: [syndic8] Messy use of subject in item data > Message-ID: <016901c4508c$98581750$200ca8c0@wkearney.com> > Reply-To: syndic8@yahoogroups.com > Organization: http://www.ideaspace.net/users/wkearney/foaf.xrdf > > Hi all, > > I've done a whole bunch of digging into how feeds are using the dc:subject > element. It was ugly. In the latest poll of over 52k feeds there were 9761 > that used an item dc:subject at least once. > > Sadly, it's little more than a mish-mash of string data. > > Of those 52k feeds a subtotal of 24k unique strings were used. It's at this > point that the data gets messy. > > Those 24 thousand unique subject descriptions are all over the map. Some are > what would be considered usefully simple keywords. Some are long string > statements, sentences and partial sentences. A bunch are just utter gibberish. > > Several are using what appears to be a quasi-delimited strings. Some in an > apparent attempt to make hierarchical categorizations while others as multiple > dichotomies. About 1100 are trying to use the comma as a sort of multiple > keyword delimiter. The comma is also being (ab)used for formal names (eg > "Smith, John Q.") > > Another thousand are trying to use the forward slash as a hierarchy delimiter. > Sometimes with or without leaders and/or space padded. Some are even trying to > use what look like DMOZ hierarchies (yay!). Most, however, are not. They're > just making it up. <sigh/> > > Suffice to say, case sensitivity is equally random. > > One feed is trying to use several subjects per item but uses some sort of > numeric identifier: God knows what the numbers correspond with; the data > doesn't say. > > I stared in utter horror upon seeing some feeds trying to use HTML markup in the > dc:subject element. Aiiieeeee, run way! > > I hope and pray there's a special place in Hell reserved for the authors of > feeds trying to use an HTML img as a subject. Get me a big cluestick as someone > deserves a thump or two. While the DC is ambiguous on the contents of the > subject element, it's not THAT ambiguous. Well, maybe not Hell; perhaps just > New Jersey. > > Oh, and don't get me started on the number of times someone misspelled a subject > word. > > In short, item subjects in feeds are an unholy mess. > > I mean, don't get me wrong, it's apparent that people are /trying/ to use some > sort of subject identifiers. This is a good sign. But at this point it's not > looking like it'd be very practical to attempt to do much with it. The data > just seems far too messy to make any predictable, let alone consistent, use of > it. > > If anything, stuff like XFML might be a good place to start. Or perhaps using > some sort of cross-referencing between a 'human readable' label used inside the > dc:subject element and an element in a Topic Map? Maybe simple stuff like > mapping use of the string "Jokes" as a cross-ref to > http://dmoz.org/Recreation/Humor/Jokes/ or some other ontology noted in the > feed's channel header section. > > We've long suggested that folks might want to use DMOZ strings. Either as the > hierarchy string in it's entirety or as part of some sort of rosetta stone cross > referencing document in XTM, xfml or whatever. Just as long as we get > /something/ in place that will help let the machines make educated guesses about > what the heck we're talking about. > > Is there some way reasonably painless way to introduce some sort of discipline > into the process? I'd welcome an open discussion on the matter. > > -Bill Kearney > Syndic8.com > > > > ------------------------ Yahoo! Groups Sponsor --------------------~--> > Yahoo! Domains - Claim yours for only $14.70 > http://us.click.yahoo.com/Z1wmxD/DREIAA/yQLSAA/IRislB/TM > --------------------------------------------------------------------~-> > > To find more info about Syndicated XML newsfeeds visit http://www.syndic8.com > Yahoo! Groups Links > > <*> To visit your group on the web, go to: > http://groups.yahoo.com/group/syndic8/ > > <*> To unsubscribe from this group, send an email to: > syndic8-unsubscribe@yahoogroups.com > > <*> Your use of Yahoo! Groups is subject to: > http://docs.yahoo.com/info/terms/ > > > ----- End forwarded message -----
Received on Saturday, 12 June 2004 11:10:37 UTC