Re: XML Schema vs DAML/RDF/RDFS

On Wed, 17 Apr 2002, R.V.Guha wrote:

Hi Guha,

Interesting questions. Excuse the somewhat rambling response and
telling you stuff you've yourself said at various points in the past...

> I was talking yesterday to a friend who is working with
> some geologists who want to share data. They are of

Much depends on the kind of data they have, and how broadly they hope to
share the data, and what their goals are for using Web data formats
instead of CSV etc.

> course planning on using xml and are in the process
> of writing up their xml schemas.
>
> They have applications that do all kinds of sophisticated analysis
> on this data. They have no need of doing the kinds of inferences
> that rdfs/daml enables. Their apps do computations that are far
> more complex and it would be easy for them to modify their
> apps to make it do the few (if any) inferential facilities rdfs/daml
> offers, if the need arises.
>
> I tried to make a case for rdf/rdfs/daml, but given the
> substantially more tools available for xml/xml schema and their
> lack of interest in simple inferences, I couldn't in good faith push
> too hard for rdf/rdfs/daml.

A well made point. I've been hearing variations of this from members of
the digital library community too. The kinds of queries and inferences
licensed by Description Logic-based systems (DAML+OIL etc) are a world
away from the sorts of end-user queries traditionally encountered in the
Digital Library world. Simple phrase, substring and regex searching, or
searching based on datatypes, go beyond the facilities innate to
RDF, RDFS and DAML+OIL (and, perhaps, WebOnt's new language).

RDF is a pretty handy intermediate representation for data exchange. But
I get the impression some folk feel it has been (to be blunt) side-tracked
into the AI/KR world, and that implementors are now expected to implement
everything in a logic programming / KR environment. I think that's a
mistaken view, and there are plenty of other deployment strategies, but
we've not been that clear on the various options and tradeoffs available
for implementors.

We can ship stuff around in RDF, make use of RDF (and DAML etc) for basic
inferences, and for many apps push the data into more specialised
environments. For example, into a Z39.50 server (eg. Zebra, Cheshire) for
doing substring queries over Dublin Core / bibliographic data. Into LDAP or
IMAP tools for white-pages or email-oriented queries. Into KR tools for more
sophisticated inference, or MySQL / PostgreSQL for classic database-backed
Web stuff. Or, oftentimes, we can get on fine with pure RDF tools.
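
To make that concrete, here's the sort of record I have in mind: a
minimal Dublin Core description in RDF/XML (the example.org URIs are
invented for illustration). The same bytes can be shipped between RDF
tools, or pushed into a Z39.50 server and indexed for substring search:

  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
    <rdf:Description rdf:about="http://example.org/reports/survey-7">
      <dc:title>Seismic survey of site 7</dc:title>
      <dc:creator>A. Geologist</dc:creator>
      <dc:date>2002-04-01</dc:date>
    </rdf:Description>
  </rdf:RDF>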

> So, should they be using rdfs/daml? Why?

Personal view: I use very little of DAML+OIL. The 'UnambiguousProperty'
construct is useful for picking out those properties that uniquely
identify things without those things having well known URIs. If your
friends care about merging data from multiple sources, there are some
tricks in that vein made easier by having a common data model and some
shared conventions for identifying things.
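
For instance (a minimal sketch; the eg: vocabulary and the addresses are
invented), declaring a mailbox property Unambiguous lets a DAML-aware
merging tool conclude that two descriptions sharing a mailbox describe
one and the same person:

  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:daml="http://www.daml.org/2001/03/daml+oil#"
           xmlns:eg="http://example.org/geo#">
    <!-- a mailbox uniquely identifies whoever it belongs to -->
    <daml:UnambiguousProperty rdf:about="http://example.org/geo#mbox"/>
    <!-- description from source A -->
    <rdf:Description>
      <eg:mbox rdf:resource="mailto:alice@example.org"/>
      <eg:name>Alice</eg:name>
    </rdf:Description>
    <!-- description from source B: same mailbox, so same person -->
    <rdf:Description>
      <eg:mbox rdf:resource="mailto:alice@example.org"/>
      <eg:affiliation>Dept of Geology</eg:affiliation>
    </rdf:Description>
  </rdf:RDF>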

I don't do that much fancy modelling in RDF Schema either. It lets me
define shallow, pragmatic and often ad-hoc categories (er, classes) and
relationship types. And that lets me create and exchange data in a format
defined in terms of things that matter to me (categories and relationship
types) instead of things that are utterly unrelated to my area of interest
(ie. XML Elements and Attributes are boring because they are artifacts of
an encoding, I mostly don't want to think about these any more than
character sets). App developers shouldn't have to spend much of their time
thinking about XML Elements and Attributes; by contrast, it is often
productive to think about one's data in terms of categories of things, and
types of relationships and attributes that describe them.
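
In RDF Schema, that shallow modelling might look like this (again a
sketch, with an invented eg: geology vocabulary):

  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
    <rdfs:Class rdf:about="http://example.org/geo#RockSample"/>
    <rdfs:Class rdf:about="http://example.org/geo#Geologist"/>
    <rdf:Property rdf:about="http://example.org/geo#collectedBy">
      <rdfs:domain rdf:resource="http://example.org/geo#RockSample"/>
      <rdfs:range rdf:resource="http://example.org/geo#Geologist"/>
    </rdf:Property>
  </rdf:RDF>

Nothing fancy: just enough vocabulary to say what kinds of things the
data describes, and how they relate.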

RDF is all about doing things in the Web. If a group of consenting adults
agree on an exact data structure they want to exchange, they could use
RDF, XML, UML, ASN.1 or comma separated files. But if they (for whatever
reason) hope for their data to be intermixed with other related data, or
their data structuring conventions to be adopted in other related data
formats, having some shared representational conventions (like RDF) is a
good bet.

There aren't many well worked out conventions for using (storing,
querying) highly mixed-namespace XML, and there is no general procedure
for merging the contents of two XML documents. RDF, DAML etc are pretty
good at such things (though could get better). Since the Web is the one
big melting pot for data sharing, one might hope that groups preparing
data formats for Web use would care about making it easier for their data
to be mixed and merged with other related information sources.
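
The RDF recipe for merging is almost embarrassingly simple: each
document boils down to a set of triples about URI-identified things, so
merging two documents is just taking the union of their triples. A
sketch (example.org URIs invented):

  <!-- document A -->
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
    <rdf:Description rdf:about="http://example.org/samples/42">
      <dc:title>Basalt core, site 7</dc:title>
    </rdf:Description>
  </rdf:RDF>

  <!-- document B, written independently -->
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:eg="http://example.org/geo#">
    <rdf:Description rdf:about="http://example.org/samples/42">
      <eg:collectedBy rdf:resource="http://example.org/people/alice"/>
    </rdf:Description>
  </rdf:RDF>

Because both documents use the same URI for the sample, the merged graph
simply holds two statements about one thing; no schema negotiation
needed. There's no comparably general recipe for two arbitrary XML
documents.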

It'd be nice if we made it easy for them to do this while still using
simple XML tools. Right now, sadly, we haven't clearly explained how one
can create content that works with both XML and RDF tools. The RSS 1.0
format (and other RDF-based formats that constrain their syntax with a
DTD or XML Schema) are good examples to build on. There is *nothing* in
RDF or DAML that tells an app what triples to expect in a specific kind
of XML/RDF document; meanwhile XML Schema and DTDs are good at saying
exactly what arrangements of angle brackets a particular XML format
contains.
Using the two levels together is possible but poorly documented and not
widely understood.
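
For a flavour of what 'both levels at once' looks like, here is a
cut-down RSS 1.0 channel. It parses as plain XML (and its syntax can be
pinned down with a DTD or Schema), while RDF tools read the very same
bytes as triples:

  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns="http://purl.org/rss/1.0/">
    <channel rdf:about="http://example.org/news/rss">
      <title>Example Geology News</title>
      <link>http://example.org/news/</link>
      <description>Field reports and data releases</description>
      <items>
        <rdf:Seq>
          <rdf:li rdf:resource="http://example.org/news/1"/>
        </rdf:Seq>
      </items>
    </channel>
    <item rdf:about="http://example.org/news/1">
      <title>New borehole data published</title>
      <link>http://example.org/news/1</link>
    </item>
  </rdf:RDF>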

Another angle... We know from experience (online and off) that problems
don't neatly decompose into crisply isolated tasks that can be managed
separately. Resource discovery metadata is tangled up with the task of
describing people, organisations, rights management, versioning etc., for
example. Metadata groups and initiatives are always bumping into one
another as their problem spaces overlap: where does Educational Resource
description shade into more general bibliographic metadata issues, or
issues about describing the creators of digital content and their
competencies and credentials? All these problems are horribly tangled up,
because... because that's the way the world is. Saying "each community /
application / task requires its own -- independently designed -- XML DTD
or Schema" doesn't on its own address this problem. RDF is good at the
overlap part of the puzzle; less good at other bits of it.

There must be 1000+ XML DTDs or (various forms of) Schema out there. These
are helping various groups get their work done, but there is massive
redundancy and overlap. Many of them will for example offer ways of
(partially) describing people, and documents. The RDF model is an attempt
at characterising what all of these various XML data formats might have
(implicitly) in common: that they can all, more or less, be conceptualised
as encoding claims about the properties, relationships and categories of
various things described in the Web.

The central dilemma: RDF (as a Webby thing) is pretty focussed on making
unexpected data re-use possible. As such, folk who know exactly what they
want to do with their data might feel it has its priorities backwards.
They know more about their data than RDF, DAML etc can capture, and feel
forced to choose between (i) using an elements'n'attributes data format
that captures more of what they care about (but bad for re-use) and (ii)
using a resources'n'properties data format that is strong on re-use but
weak on application specific constraints.

We shouldn't be making them choose. XML Schema annotations are part of the
answer. Better tutorials, worked scenarios etc are part of the answer.
As are compelling demos that show why data merging and mixing (with RDF)
can be rewarding...
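
To gesture at the XML Schema annotations idea: the xs:annotation /
xs:appinfo hooks give a schema somewhere to record how its elements map
onto RDF vocabulary. A sketch only; the mapsTo element and its namespace
are invented here, not an agreed convention:

  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="title" type="xs:string">
      <xs:annotation>
        <xs:appinfo>
          <!-- hypothetical hint: this element encodes dc:title -->
          <mapsTo xmlns="http://example.org/mapping#"
            >http://purl.org/dc/elements/1.1/title</mapsTo>
        </xs:appinfo>
      </xs:annotation>
    </xs:element>
  </xs:schema>

XML tools validate against the schema as usual and ignore the appinfo;
an RDF-aware harvester could use the hint to turn instance documents
into triples.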

Sounds like a lot of work. But then so does death by 100,000 DTDs. The
bigger XML gets, the more we'll need to defragment the information that's
being split between 100s of related but unconnected DTDs...

imho etc.,

dan



-- 
mailto:danbri@w3.org
http://www.w3.org/People/DanBri/
