[PORT] fwd: [syndic8] Messy use of subject in item data

(crossposting to SW Best Practices list and the public-esw-thes list)

Forward from RSS and Syndication lists. Bill Kearney has been doing some 
digging into the way the dc:subject property is being deployed in 
RSS feeds. Short version: it's a mess.

Part of the problem here, I think, is that there has been a vague 
expectation floating around that RDFS/OWL class and property hierarchies
are the W3C SW stack's last word w.r.t. classififying things. RDF and DC 
people haven't really finished off a good, clear and intuitive model for
using dc:subject with controlled vocabularies. I'm hoping our work here
can help get things back on track.

Characterising the topic(s) of document-like content is helped by RDF
and by OWL, but there's much that can be done that doesn't naturally 
find expression in a classes-and-properties model. Something like SKOS, 
and an agreed model for expressing thesaurus-like content within 
RDF (including via dc:subject) should go some way towards these
problems. But only so long as convenient utilities for authoring better 
dc:subject data  finds its way into mass-market tools (for blogging and
HTML editors).
 
Dan

----- Forwarded message from Bill Kearney <ml_yahoo@ideaspace.net> -----

From: Bill Kearney <ml_yahoo@ideaspace.net>
Date: Sat, 12 Jun 2004 10:50:24 -0400
To: syndic8@yahoogroups.com, rss-dev@yahoogroups.com
Subject: [syndic8] Messy use of subject in item data
Message-ID: <016901c4508c$98581750$200ca8c0@wkearney.com>
Reply-To: syndic8@yahoogroups.com
Organization: http://www.ideaspace.net/users/wkearney/foaf.xrdf

Hi all,

I've done a whole bunch of digging into how feeds are using the dc:subject
element.  It was ugly.  In the latest poll of over 52k feeds there were 9761
that used an item dc:subject at least once.

Sadly, it's little more than a mish-mash of string data.

Of those 52k feeds a subtotal of 24k unique strings were used.  It's at this
point that the data gets messy.

Those 24 thousand unique subject descriptions are all over the map.  Some are
what would be considered usefully simple keywords.  Some are long string
statements, sentences and partial sentences.  A bunch are just utter gibberish.

Several are using what appears to be a quasi-delimited strings.  Some in an
apparent attempt to make hierarchical categorizations while others as multiple
dichotomies.  About 1100 are trying to use the comma as a sort of multiple
keyword delimiter.  The comma is also being (ab)used for formal names (eg
"Smith, John Q.")

Another thousand are trying to use the forward slash as a hierarchy delimiter.
Sometimes with or without leaders and/or space padded.  Some are even trying to
use what look like DMOZ hierarchies (yay!).  Most, however, are not.  They're
just making it up.  <sigh/>

Suffice to say, case sensitivity is equally random.

One feed is trying to use several subjects per item but uses some sort of
numeric identifier:  God knows what the numbers correspond with; the data
doesn't say.

I stared in utter horror upon seeing some feeds trying to use HTML markup in the
dc:subject element.  Aiiieeeee, run way!

I hope and pray there's a special place in Hell reserved for the authors of
feeds trying to use an HTML img as a subject.  Get me a big cluestick as someone
deserves a thump or two.  While the DC is ambiguous on the contents of the
subject element, it's not THAT ambiguous.  Well, maybe not Hell; perhaps just
New Jersey.

Oh, and don't get me started on the number of times someone misspelled a subject
word.

In short, item subjects in feeds are an unholy mess.

I mean, don't get me wrong, it's apparent that people are /trying/ to use some
sort of subject identifiers.  This is a good sign.  But at this point it's not
looking like it'd be very practical to attempt to do much with it.  The data
just seems far too messy to make any predictable, let alone consistent, use of
it.

If anything, stuff like XFML might be a good place to start.  Or perhaps using
some sort of cross-referencing between a 'human readable' label used inside the
dc:subject element and an element in a Topic Map?  Maybe simple stuff like
mapping use of the string "Jokes" as a cross-ref to
http://dmoz.org/Recreation/Humor/Jokes/ or some other ontology noted in the
feed's channel header section.

We've long suggested that folks might want to use DMOZ strings.  Either as the
hierarchy string in it's entirety or as part of some sort of rosetta stone cross
referencing document in XTM, xfml or whatever.  Just as long as we get
/something/ in place that will help let the machines make educated guesses about
what the heck we're talking about.

Is there some way reasonably painless way to introduce some sort of discipline
into the process?  I'd welcome an open discussion on the matter.

-Bill Kearney
Syndic8.com



------------------------ Yahoo! Groups Sponsor --------------------~--> 
Yahoo! Domains - Claim yours for only $14.70
http://us.click.yahoo.com/Z1wmxD/DREIAA/yQLSAA/IRislB/TM
--------------------------------------------------------------------~-> 

To find more info about Syndicated XML newsfeeds visit http://www.syndic8.com 
Yahoo! Groups Links

<*> To visit your group on the web, go to:
     http://groups.yahoo.com/group/syndic8/

<*> To unsubscribe from this group, send an email to:
     syndic8-unsubscribe@yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
     http://docs.yahoo.com/info/terms/
 

----- End forwarded message -----

Received on Saturday, 12 June 2004 11:09:26 UTC