W3C home > Mailing lists > Public > public-esw-thes@w3.org > June 2004

Re: [PORT] fwd: [syndic8] Messy use of subject in item data

From: Dan Brickley <danbri@w3.org>
Date: Sat, 12 Jun 2004 11:10:37 -0400
To: public-swbp-wg@w3.org, public-esw-thes@w3.org
Cc: em@w3.org
Message-ID: <20040612151037.GE15925@homer.w3.org>

[resent with public-swbp-wg@w3.org spelled correctly]

* Dan Brickley <danbri@w3.org> [2004-06-12 11:09-0400]
> (crossposting to SW Best Practices list and the public-esw-thes list)
> Forward from RSS and Syndication lists. Bill Kearney has been doing some 
> digging into the way the dc:subject property is being deployed in 
> RSS feeds. Short version: it's a mess.
> Part of the problem here, I think, is that there has been a vague 
> expectation floating around that RDFS/OWL class and property hierarchies
> are the W3C SW stack's last word w.r.t. classififying things. RDF and DC 
> people haven't really finished off a good, clear and intuitive model for
> using dc:subject with controlled vocabularies. I'm hoping our work here
> can help get things back on track.
> Characterising the topic(s) of document-like content is helped by RDF
> and by OWL, but there's much that can be done that doesn't naturally 
> find expression in a classes-and-properties model. Something like SKOS, 
> and an agreed model for expressing thesaurus-like content within 
> RDF (including via dc:subject) should go some way towards these
> problems. But only so long as convenient utilities for authoring better 
> dc:subject data  finds its way into mass-market tools (for blogging and
> HTML editors).
> Dan
> ----- Forwarded message from Bill Kearney <ml_yahoo@ideaspace.net> -----
> From: Bill Kearney <ml_yahoo@ideaspace.net>
> Date: Sat, 12 Jun 2004 10:50:24 -0400
> To: syndic8@yahoogroups.com, rss-dev@yahoogroups.com
> Subject: [syndic8] Messy use of subject in item data
> Message-ID: <016901c4508c$98581750$200ca8c0@wkearney.com>
> Reply-To: syndic8@yahoogroups.com
> Organization: http://www.ideaspace.net/users/wkearney/foaf.xrdf
> Hi all,
> I've done a whole bunch of digging into how feeds are using the dc:subject
> element.  It was ugly.  In the latest poll of over 52k feeds there were 9761
> that used an item dc:subject at least once.
> Sadly, it's little more than a mish-mash of string data.
> Of those 52k feeds a subtotal of 24k unique strings were used.  It's at this
> point that the data gets messy.
> Those 24 thousand unique subject descriptions are all over the map.  Some are
> what would be considered usefully simple keywords.  Some are long string
> statements, sentences and partial sentences.  A bunch are just utter gibberish.
> Several are using what appears to be a quasi-delimited strings.  Some in an
> apparent attempt to make hierarchical categorizations while others as multiple
> dichotomies.  About 1100 are trying to use the comma as a sort of multiple
> keyword delimiter.  The comma is also being (ab)used for formal names (eg
> "Smith, John Q.")
> Another thousand are trying to use the forward slash as a hierarchy delimiter.
> Sometimes with or without leaders and/or space padded.  Some are even trying to
> use what look like DMOZ hierarchies (yay!).  Most, however, are not.  They're
> just making it up.  <sigh/>
> Suffice to say, case sensitivity is equally random.
> One feed is trying to use several subjects per item but uses some sort of
> numeric identifier:  God knows what the numbers correspond with; the data
> doesn't say.
> I stared in utter horror upon seeing some feeds trying to use HTML markup in the
> dc:subject element.  Aiiieeeee, run way!
> I hope and pray there's a special place in Hell reserved for the authors of
> feeds trying to use an HTML img as a subject.  Get me a big cluestick as someone
> deserves a thump or two.  While the DC is ambiguous on the contents of the
> subject element, it's not THAT ambiguous.  Well, maybe not Hell; perhaps just
> New Jersey.
> Oh, and don't get me started on the number of times someone misspelled a subject
> word.
> In short, item subjects in feeds are an unholy mess.
> I mean, don't get me wrong, it's apparent that people are /trying/ to use some
> sort of subject identifiers.  This is a good sign.  But at this point it's not
> looking like it'd be very practical to attempt to do much with it.  The data
> just seems far too messy to make any predictable, let alone consistent, use of
> it.
> If anything, stuff like XFML might be a good place to start.  Or perhaps using
> some sort of cross-referencing between a 'human readable' label used inside the
> dc:subject element and an element in a Topic Map?  Maybe simple stuff like
> mapping use of the string "Jokes" as a cross-ref to
> http://dmoz.org/Recreation/Humor/Jokes/ or some other ontology noted in the
> feed's channel header section.
> We've long suggested that folks might want to use DMOZ strings.  Either as the
> hierarchy string in it's entirety or as part of some sort of rosetta stone cross
> referencing document in XTM, xfml or whatever.  Just as long as we get
> /something/ in place that will help let the machines make educated guesses about
> what the heck we're talking about.
> Is there some way reasonably painless way to introduce some sort of discipline
> into the process?  I'd welcome an open discussion on the matter.
> -Bill Kearney
> Syndic8.com
> ------------------------ Yahoo! Groups Sponsor --------------------~--> 
> Yahoo! Domains - Claim yours for only $14.70
> http://us.click.yahoo.com/Z1wmxD/DREIAA/yQLSAA/IRislB/TM
> --------------------------------------------------------------------~-> 
> To find more info about Syndicated XML newsfeeds visit http://www.syndic8.com 
> Yahoo! Groups Links
> <*> To visit your group on the web, go to:
>      http://groups.yahoo.com/group/syndic8/
> <*> To unsubscribe from this group, send an email to:
>      syndic8-unsubscribe@yahoogroups.com
> <*> Your use of Yahoo! Groups is subject to:
>      http://docs.yahoo.com/info/terms/
> ----- End forwarded message -----
Received on Saturday, 12 June 2004 11:10:37 UTC

This archive was generated by hypermail 2.4.0 : Friday, 17 January 2020 19:45:13 UTC