- From: Dan Brickley <danbri@w3.org>
- Date: Sat, 12 Jun 2004 11:09:26 -0400
- To: public-swbpd-wg@w3.org, public-esw-thes@w3.org
- Cc: em@w3.org
(crossposting to SW Best Practices list and the public-esw-thes list)

Forward from RSS and Syndication lists. Bill Kearney has been doing some digging into the way the dc:subject property is being deployed in RSS feeds. Short version: it's a mess.

Part of the problem here, I think, is that there has been a vague expectation floating around that RDFS/OWL class and property hierarchies are the W3C SW stack's last word w.r.t. classifying things. RDF and DC people haven't really finished off a good, clear and intuitive model for using dc:subject with controlled vocabularies. I'm hoping our work here can help get things back on track.

Characterising the topic(s) of document-like content is helped by RDF and by OWL, but there's much that can be done that doesn't naturally find expression in a classes-and-properties model. Something like SKOS, and an agreed model for expressing thesaurus-like content within RDF (including via dc:subject), should go some way towards solving these problems. But only so long as convenient utilities for authoring better dc:subject data find their way into mass-market tools (for blogging and HTML editors).

Dan

----- Forwarded message from Bill Kearney <ml_yahoo@ideaspace.net> -----

From: Bill Kearney <ml_yahoo@ideaspace.net>
Date: Sat, 12 Jun 2004 10:50:24 -0400
To: syndic8@yahoogroups.com, rss-dev@yahoogroups.com
Subject: [syndic8] Messy use of subject in item data
Message-ID: <016901c4508c$98581750$200ca8c0@wkearney.com>
Reply-To: syndic8@yahoogroups.com
Organization: http://www.ideaspace.net/users/wkearney/foaf.xrdf

Hi all,

I've done a whole bunch of digging into how feeds are using the dc:subject element. It was ugly.

In the latest poll of over 52k feeds there were 9761 that used an item dc:subject at least once. Sadly, it's little more than a mish-mash of string data. Of those 52k feeds, a total of 24k unique strings were used. It's at this point that the data gets messy. Those 24 thousand unique subject descriptions are all over the map.
Some are what would be considered usefully simple keywords. Some are long string statements, sentences and partial sentences. A bunch are just utter gibberish.

Several use what appear to be quasi-delimited strings, some in an apparent attempt to make hierarchical categorizations and others as multiple dichotomies.

About 1100 are trying to use the comma as a sort of multiple keyword delimiter. The comma is also being (ab)used for formal names (e.g. "Smith, John Q.").

Another thousand are trying to use the forward slash as a hierarchy delimiter, sometimes with or without leaders and/or space padding. Some are even trying to use what look like DMOZ hierarchies (yay!). Most, however, are not. They're just making it up. <sigh/>

Suffice to say, case sensitivity is equally random.

One feed is trying to use several subjects per item but uses some sort of numeric identifier. God knows what the numbers correspond to; the data doesn't say.

I stared in utter horror upon seeing some feeds trying to use HTML markup in the dc:subject element. Aiiieeeee, run away! I hope and pray there's a special place in Hell reserved for the authors of feeds trying to use an HTML img as a subject. Get me a big cluestick as someone deserves a thump or two. While the DC is ambiguous on the contents of the subject element, it's not THAT ambiguous. Well, maybe not Hell; perhaps just New Jersey.

Oh, and don't get me started on the number of times someone misspelled a subject word.

In short, item subjects in feeds are an unholy mess. I mean, don't get me wrong, it's apparent that people are /trying/ to use some sort of subject identifiers. This is a good sign. But at this point it's not looking like it'd be very practical to attempt to do much with it. The data just seems far too messy to make any predictable, let alone consistent, use of it.

If anything, stuff like XFML might be a good place to start.
Or perhaps using some sort of cross-referencing between a 'human readable' label used inside the dc:subject element and an element in a Topic Map? Maybe simple stuff like mapping use of the string "Jokes" as a cross-ref to http://dmoz.org/Recreation/Humor/Jokes/ or some other ontology noted in the feed's channel header section.

We've long suggested that folks might want to use DMOZ strings, either as the hierarchy string in its entirety or as part of some sort of rosetta stone cross-referencing document in XTM, XFML or whatever. Just as long as we get /something/ in place that will help let the machines make educated guesses about what the heck we're talking about.

Is there some reasonably painless way to introduce some sort of discipline into the process? I'd welcome an open discussion on the matter.

-Bill Kearney
Syndic8.com

----- End forwarded message -----
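
For illustration only, the cross-referencing idea in the forwarded message (a human-readable dc:subject label such as "Jokes" pointing at a DMOZ category URL) might be sketched roughly as below. Everything here is an assumption rather than anything proposed on either list: the table SUBJECT_XREF and the helpers normalize_label, split_hierarchy and resolve_subject are hypothetical names, and only the "Jokes" mapping comes from the message itself.

```python
from typing import Optional

# Hypothetical cross-reference table: normalized label -> category URI.
# Only the "Jokes" entry is taken from the message; a real table would
# have to be published by the feed or by a shared vocabulary.
SUBJECT_XREF = {
    "jokes": "http://dmoz.org/Recreation/Humor/Jokes/",
}


def normalize_label(raw: str) -> str:
    """Collapse whitespace and case so '  JOKES ' and 'Jokes' compare equal."""
    return " ".join(raw.split()).casefold()


def split_hierarchy(raw: str) -> list[str]:
    """Split a slash-delimited subject into levels.

    Leading or trailing slashes and space padding are dropped.
    """
    return [part.strip() for part in raw.split("/") if part.strip()]


def resolve_subject(raw: str) -> Optional[str]:
    """Return the URI mapped to a dc:subject string, or None if unmapped."""
    return SUBJECT_XREF.get(normalize_label(raw))


if __name__ == "__main__":
    print(resolve_subject("  JOKES "))
    print(split_hierarchy("/ Recreation / Humor/Jokes/"))
```

The sketch deliberately leaves comma-delimited subjects alone, since the message notes the comma is used both as a keyword separator and inside formal names, so splitting on it blindly would be unsafe.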
Received on Saturday, 12 June 2004 11:09:26 UTC