Multi-namespace architecture, cost etc (was re: Time to move along?)

Ooops, this turned into a big long msg! Sorry for the verbosity.  --danbri

short version:

 - not using namespaces is very expensive (hence RSS 1.0)
 - inventing our own architecture for combining namespaces is expensive
 - rss's core is not much more interesting than HTML's <ul><li>...
 - except that we can decorate it with other XML/RDF vocabularies
 - RSS benefits from RDF vocabularies designed without RSS in mind, and
   even without each other in mind
 - descriptive tasks don't map tidily onto descriptive vocabularies
 - without something like RDF for principled combination of independent
   namespaces, the coordination cost of making sure XML vocabs can work
   together is higher

I believe the 'in your face' cost of using RDF syntax is massively
outweighed by the hidden costs I outline. eg Flying people to boring
meetings to debate how different overlapping XML namespaces can be used
together is a real, but non-obvious, cost that we risk if we don't adopt
some principles for namespace composition and design. RDF is the only set
of such principles I've seen proposed for this in the RSS community. "Just
use namespaces" doesn't address the problem of one task, multiple
namespaces: people, events, music, documents, concerts, prices,
locations... If we're interested in applying a variety of descriptive
vocabularies to a single task, we'll need to use vocabularies developed
outside of RSS-DEV. RDF apps focus on just this, whereas many XML apps
focus on a single monolithic DTD or Schema that captures a specific task.
Since we're deploying RSS in a general purpose, pluralist, wide-area
context, I reckon the namespace-mixing style adopted by RDF is well suited
to RSS goals... A rough cut at a motivating scenario is sketched below...


On Fri, 6 Sep 2002, Bill Kearney wrote:

(hmm, lost original attribution; was this David G.?)
> > The problem (perhaps to strong a word) we had at Moreover with implementing

Yup, too strong imho. I don't want to seem dismissive about the genuine
experiences folk have with the spec: the rdf:Seq table of contents does
create extra work. But not much. I'd rather create problems for software
developers than for consumers of newsfeeds (it's their job!), and the lack
of namespaces in pre-1.0 RSS was a big problem: it forced people to
overload the modest representational facilities offered by RSS. RDF was
one of the driving forces that motivated the creation of XML Namespaces,
and remains to this day (imho :) the best mainstream architecture for
deploying, aggregating and merging mixed-namespace XML documents. Sure,
you _can_ cut loose from the discipline RDF imposes and deploy a mix of
XML namespaces without specifying how they are written and interpreted,
but that path leads to TagSoupHell, with each RSS extension vocab created
without adopting common representational conventions shared with other
extensions.

That too can lead to big practical problems: the world doesn't carve up
nicely into discrete, separable descriptive tasks. Often you'll want to
draw upon several specialised representational vocabularies all at once.
Such vocabularies can be designed without consideration for mixing them
together, and deployed as mixed-namespace XML. Or they can be designed to
share a basic common structure whose principles govern their interactions
and mixing. The latter flavour of XML is called RDF.

For example, consider an XML document (an RSS feed) combining several
namespaces:

	RSS 1.0 + Dublin Core + Events/iCalendar + Music(Brainz) +  FOAF/vCard
	+ Wordnet or TAP KB identifiers + a geographical vocabulary and
	an ecommerce/pricing vocabulary.

This bundle of XML/RDF vocabularies might be used as the namespaces in an
RSS feed for (in this example scenario) a music concerts Web site. Folk are
doing this (though perhaps not yet in the detail sketched here). Each
vocabulary I list above provides a piece of the puzzle: the basic document
format (a list of descriptive items) comes from RSS itself. In our example these are
documents listing concerts. Dublin Core adds properties that describe
those documents (title, subject, dates etc). Other folk are working on
systems (eg. Redland/WSE) that consume Dublin Core RDF and make it
searchable with tools based on document retrieval. Others are working on
more specific ways of representing RDF dc:subject using classifications
from thesauri or efforts such as DMoz/OpenDirectory. An events vocabulary (eg.
RDF iCalendar, or RSS Events module, with some tweaks) helps us describe
the events that those documents describe: when they happen, timezones etc. And
to do so in a way that (because other folk are building the tools) can be
imported into calendars (Mozilla calendar, iPods, Palms etc). For concerts
and music we might want to describe more information about the artists
(and perhaps even track listings, for concerts that have happened). So we
could make use of a namespace (and database of information) designed for
describing artists, tracks etc. Fortunately MusicBrainz.org have done just
this. Which is great, it means we don't have to. And because they use not
just XML Namespaces, but RDF's conventions for using XML Namespaces, we
can plug their work right in alongside the RSS, the Dublin Core, the event
vocabs, the subject taxonomies...  We could use the FOAF or vCard RDF/XML
vocabs to describe information about the people mentioned in our document;
the performers, the contact info for the concert; the homepage or mugshot
or insideLegMeasurement of the lead singer. Our (still fictional) RSS
concert listing feed could use still other RDF/XML vocabs: eg. performer or
band IDs from the TAP Knowledge base (see
http://tap.stanford.edu/tapkb/ http://tap.stanford.edu/cgi-bin/kb.pl),
since these provide shareable IDs for many of the things the RSS feed will
mention, including places, people, record companies etc. We might also use
geographical markup (there are some RDF vocabs in progress for this), or
markup to represent ticket prices (@@your namespace here). The list goes
on.
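
A minimal sketch of what one item in such a feed might look like, built
with Python's standard library. The example.org URL, the band and singer
names, and the choice of properties (dc:subject, ev:startdate,
dc:contributor wrapping a foaf:Person) are assumptions made for
illustration, not taken from any real feed; the ev: URI is assumed to be
that of the RSS 1.0 events module. The point is only that properties from
independently designed vocabularies sit side by side on one item.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for the vocabularies being mixed. The "ev" URI is assumed
# to be the RSS 1.0 events module; the others are the usual published URIs.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss": "http://purl.org/rss/1.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "ev": "http://purl.org/rss/1.0/modules/event/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix, local):
    """Return ElementTree's '{uri}local' form for a prefixed name."""
    return "{%s}%s" % (NS[prefix], local)


def concert_item(url, title, start, performer):
    """Build one RSS 1.0 item decorated with DC, event and FOAF properties."""
    item = ET.Element(q("rss", "item"), {q("rdf", "about"): url})
    ET.SubElement(item, q("rss", "title")).text = title
    ET.SubElement(item, q("rss", "link")).text = url
    ET.SubElement(item, q("dc", "subject")).text = "Concerts"
    ET.SubElement(item, q("ev", "startdate")).text = start
    # Property element (dc:contributor) wrapping a typed node (foaf:Person).
    contributor = ET.SubElement(item, q("dc", "contributor"))
    person = ET.SubElement(contributor, q("foaf", "Person"))
    ET.SubElement(person, q("foaf", "name")).text = performer
    return item


# Hypothetical concert, purely for illustration.
item = concert_item(
    "http://example.org/concerts/42",
    "An Example Band, live",
    "2002-09-20T20:00:00Z",
    "An Example Singer",
)
xml_text = ET.tostring(item, encoding="unicode")
print(xml_text)
```

A MusicBrainz, TAP or pricing property would be added the same way: one
more namespace URI and one more child element, with no change to the rest.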

So what's my verbose point?

There are several.

Cost is a subtle thing. There are costs associated with trying to squeeze
a rich description of something (eg. docs and the concerts/events/bands
they describe) into a format not designed for the task. Pre-1.0 RSS was
great; but then so were HTML bulleted lists. Trying to share structured
information by squeezing it into a list of 3-field records is costly. With
RSS 1.0, RDF and the Semantic Web, we are trying something tricky: we want
to make it easy to do simple things (such as share bulleted lists of new
documents), and possible to do very challenging things (such as augment our
descriptions of those docs with increasingly specific information about
their content: the dates of concerts, the names of bands, the price of
tickets...). When talking about costs, problems and annoyances with the
XML-syntax we're using (XML + Namespaces + RDF) we need to think about the
money that will be saved and spent in the world through using these data
feeds, as well as the money that will be spent by programmers adding
another while() loop into their RSS generation code.

There are also costs associated with throwing mixed-namespace XML
documents into the Web when the namespaces they draw upon were designed
independently without the expectation of their being (unexpectedly :)
combined. Notice how annoying it is that the things we're describing (and
the levels of descriptive detail we care about) are all overlapping: there
is no simple mapping from descriptive task to stand-alone XML vocabulary.
There are lots of namespaces we might use to describe people;  some
applicable to all people (FOAF/vCard), some especially good at our chosen
problem domain (music/performance and musical content), eg. MusicBrainz.
Also, MusicBrainz's RDF vocab has other useful content: it
describes songs as well as artists. And there are other RDF vocabularies
(such as TAP) that can be used to pick out precisely which artist we're
talking about, since the creators of TAP took the time to do so. TAP also
lists a lot of places (many major cities at least) but doesn't go into
huge detail about them. Still further RDF datasources and vocabs do a
better job at Geography.

Descriptive tasks aren't nicely dividable: 'oh, I want to do a concert
listing vocab, I just use the concert-listings DTD' isn't how XML
namespaces will work. There'll always be several relevant vocabularies, and
which ones are chosen and combined will take some thought. The scenario
above lists some of the raw materials available from RDF vocabularies
relevant to the concert-listings scenario. By using them, information
could become available to apps designed for other purposes (calendars,
address books, document search tools...). This interop isn't guaranteed by
using RDF, but it at least becomes possible.  Without a design for mixing
namespaces, such re-use of data and tools, to my mind, looks a lot less
feasible.
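
A rough sketch of the kind of re-use meant here, assuming the feed's
content has already been parsed into (subject, property, value) triples.
The example.org identifiers and all the data are hypothetical, and
treating a graph as a plain Python set of tuples is a simplification
(it ignores blank nodes, datatypes and so on). It shows two independently
produced descriptions being merged by simple union, and two "apps" that
each understand only one vocabulary picking out what they need.

```python
# Property URIs from three independently designed vocabularies.
DC = "http://purl.org/dc/elements/1.1/"
EV = "http://purl.org/rss/1.0/modules/event/"  # assumed events module URI
FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical identifiers, for illustration only.
CONCERT = "http://example.org/concerts/42"
SINGER = "http://example.org/people/singer"

# Triples that might come from the RSS feed itself.
feed_triples = {
    (CONCERT, DC + "title", "An Example Band, live"),
    (CONCERT, EV + "startdate", "2002-09-20T20:00:00Z"),
}

# Triples that might come from a separate description of people.
people_triples = {
    (CONCERT, DC + "contributor", SINGER),
    (SINGER, FOAF + "name", "An Example Singer"),
}


def merge(*graphs):
    """Merge any number of triple sets; duplicates collapse automatically."""
    return set().union(*graphs)


def values(graph, prop):
    """Return sorted (subject, value) pairs for one property URI."""
    return sorted((s, o) for s, p, o in graph if p == prop)


merged = merge(feed_triples, people_triples)

# A calendar tool only needs to understand the events vocabulary...
calendar_view = values(merged, EV + "startdate")
# ...and an address book only needs FOAF. Neither knows about RSS.
address_book_view = values(merged, FOAF + "name")

print(calendar_view)
print(address_book_view)
```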

Just thinking about this simple scenario -- concert listings -- throws up
all sorts of problems and opportunities. For those who haven't been
engrossed in RDF for years, the RDF 'value added' might not be clear. This
mail won't make it clear either, but might flag up some of the concerns
that were a priority in RDF's design. With the Resource Description
Framework, we have (funnily enough) a Framework for Describing Resources.
It is as much a social thing as a technical one: a minimalistic set of
conventions for carving up the work of creating XML Namespaces in a way
that allows them to be subsequently combined in unexpected ways.

The XML Namespaces spec offers no guidance on how each namespace is
written. The only spec I know that offers a compelling story about how
independent namespaces can be designed for successful combination is RDF.
It was built with this goal in mind. That doesn't make it magically
effective at solving complex descriptive problems, or make it cheap to
produce and consume such data in the Web. But it does help.

It helps in several ways: by providing a layer of tools that take
namespace mixing and data merging for granted, as a common task for modern
Web tools. By allowing the task of creating lots of complementary XML/RDF
vocabs to be parcelled up, so that the participants in these efforts have
to do a minimum of coordination. The vocabs I listed above can all be
used to good effect in a single RSS RDF document. But they weren't even
designed with RSS in mind, let alone with the other vocabs in mind. RDF
was built so that these folks could attend fewer coordination meetings,
and get on with the interesting work. If we abandon RDF, and use XML
Namespaces with no rules on how our markup is written, we might save some
RSS folk some time, but we surely create work elsewhere: the Music Vocab
people will have to talk to the Events vocab people  who'll have to talk
more often to the Dublin Core people. That takes time and money...

Dan


ps. rest of this msg I might've sent separately; it responds to the specific
claim that rdf:Seq creation provides a problematic burden w.r.t. server
load, cpu usage etc.

> > RSS 1.0 was not to do with outputting the format but the extra step it took
> > to add the RDF bag which produced extra load on dynamically created RSS 1.0
> > from a search return (a trivial load but nonetheless greater than that for
> > RSS 0.9x). Using permalinks as the equivalent of RDF-about attributes deals
> > with syntax at the item level. Those that wish to convert to RDF would have
> > to do the secondary process of adding the list items.
>
> If you want to output in a stream then yes you're iterating over two loops.  But
> since RSS is supposed to be a limited number of items (~15) wouldn't it be
> better to build the items in memory and then stream the parts?  No extra DB load
> involved.  Just build both fragments simultaneously and the build the stream.
>
> But yes, it does involve additional cycles to do this.  How much is arguable
> depending on programming style.  Do we need a bake-off to compare massively
> large dataset performance issues here?

If the need to cache a list of ~15 URIs when generating RSS 1.0 is really
a performance/resource problem (rather than a minor nuisance for coders
who could be focussing on charsets, entities and the other gotchas
associated with *any* such XML format), this is a cause for celebration.

Either because these other challenges of deploying XML aren't proving too
painful, or because demand for the generated RSS 1.0 content is so high
that such a minor overhead risks unacceptable server load. If the latter
is true, and the server is being pestered for RSS (especially custom-case
/ personalised RSS that isn't usefully cached), then we've really made it
to the big time.
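
For a sense of scale, here is a minimal sketch of the approach Bill
describes: build the rdf:Seq entries and the item fragments in one pass
over the results, then stream the two parts in order. The function name
render_feed, its parameters and the example.org data are hypothetical,
and the channel description is simply a copy of the title to keep the
sketch short; this is one way to do it, not a reference implementation.

```python
from xml.sax.saxutils import escape, quoteattr


def render_feed(channel_url, channel_title, rows):
    """Render a bare-bones RSS 1.0 document from (url, title) rows.

    The table of contents (rdf:Seq) and the items are built in the same
    loop, so the result set is only walked once.
    """
    toc, items = [], []
    for url, title in rows:
        toc.append("        <rdf:li rdf:resource=%s/>" % quoteattr(url))
        items.append(
            "  <item rdf:about=%s>\n"
            "    <title>%s</title>\n"
            "    <link>%s</link>\n"
            "  </item>" % (quoteattr(url), escape(title), escape(url))
        )
    lines = [
        '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"',
        '         xmlns="http://purl.org/rss/1.0/">',
        "  <channel rdf:about=%s>" % quoteattr(channel_url),
        "    <title>%s</title>" % escape(channel_title),
        "    <link>%s</link>" % escape(channel_url),
        "    <description>%s</description>" % escape(channel_title),
        "    <items>",
        "      <rdf:Seq>",
    ]
    lines += toc
    lines += ["      </rdf:Seq>", "    </items>", "  </channel>"]
    lines += items
    lines.append("</rdf:RDF>")
    return "\n".join(lines)


# Hypothetical search results, for illustration only.
rows = [
    ("http://example.org/news/1?a=1&b=2", "First story & more"),
    ("http://example.org/news/2", "Second story"),
]
feed = render_feed("http://example.org/feed.rdf", "Example feed", rows)
print(feed)
```

The extra state held per request is one short string per item, which for
a feed of around 15 items is a few kilobytes at most.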

Received on Saturday, 7 September 2002 15:24:26 UTC