RE: Documents, Cars, Hills, and Valleys

While I agree with a lot that has been said on this thread, there are a
couple of points I must take issue with. The main issue for me is the
underlying assumption that a semantic web needs the explicit assertion of
metadata by content providers for it to work. A related issue, IMHO, is the
idea of 'polluted' metadata. The expectation that the great proportion of
content providers will provide metadata is unrealistic, particularly when
you consider the material already on the web. Must we wait for Dreamweaver &
FrontPage to make it obligatory to insert RDF before FTPing, before we can
have a semantic web? Talking of 'polluted' metadata is about as useful as
talking of the English language as polluted because of dialectal variation.
If the English-speaking world is only those people who speak pure "Queen's
English", then we're looking at a quaint handful of people around London.

The Semantic Web, in the sense of one in which logical inference can be made
with the material on the web, certainly requires metadata that is at least
locally consistent. Granted, people inserting metadata in their output will
be an aid to this, especially if they stick precisely to agreed schemas. This
should certainly be encouraged, especially for automatically-produced
content where such conformance is easier (per page) to implement.
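
To make that concrete, here's a rough sketch of what sticking to an agreed
schema might look like - Python with the rdflib library, asserting Dublin
Core properties against a page's URL (the URL and values here are invented):

  from rdflib import Graph, Literal, URIRef
  from rdflib.namespace import DC

  page = URIRef("http://example.org/some-page")  # hypothetical page URL
  g = Graph()
  g.bind("dc", DC)
  # Stick precisely to the agreed schema: plain Dublin Core, nothing ad hoc
  g.add((page, DC.title, Literal("Documents, Cars, Hills, and Valleys")))
  g.add((page, DC.creator, Literal("A. N. Author")))
  g.add((page, DC.date, Literal("2002-04-23")))
  print(g.serialize(format="xml"))  # RDF/XML for the tools to chew on

For automatically-produced content, a template system could emit something
like this for every page, which is exactly where conformance is cheapest.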

The 'what is an identifier' etc. discussion is rather angels-on-pinheads.
What we have in the wild is a great mass of information, full of semantic
hooks, identifiers in the form of URLs. These may not be URIs in a form we
might prefer, but Pandora's box has already been opened. A proportion of
these identifiers will have associated with them explicit metadata, but even
this is likely to be 'polluted'.
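
Even before any explicit metadata enters the picture, those in-the-wild URLs
can be knocked into shape as identifiers. A toy sketch in Python (standard
library only; the normalisation rules are illustrative, not any kind of
spec):

  from urllib.parse import urlsplit, urlunsplit

  def normalise(url):
      """Collapse trivially-different URLs onto one identifier."""
      parts = urlsplit(url)
      scheme = parts.scheme.lower()
      netloc = parts.netloc.lower()
      if scheme == "http" and netloc.endswith(":80"):
          netloc = netloc[:-3]  # the default port adds nothing
      # Fragments rarely change which resource is being identified
      return urlunsplit((scheme, netloc, parts.path or "/", parts.query, ""))

  assert normalise("HTTP://Example.COM:80/a#frag") == "http://example.com/a"

It's lossy and imperfect - which is rather the point: we work with the
identifiers we've got.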

The world is largely analog, but digital computers are still useful with
real-world data because we can extract discrete approximations. The web is a
semantic continuum, so why shouldn't it be digitised too? I would suggest that
to provide the metadata to feed a Semantic Web, we need to look more to
other techniques in the (somewhat taboo) machine learning domain.
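
As a toy example of the kind of digitisation I mean: take a continuous
similarity score between two documents and collapse it into a discrete
assertion (Python, standard library; the threshold and texts are of course
made up):

  from collections import Counter
  from math import sqrt

  def similarity(a, b):
      """Cosine similarity of crude word-count vectors."""
      va, vb = Counter(a.lower().split()), Counter(b.lower().split())
      dot = sum(va[w] * vb[w] for w in va)
      norm = (sqrt(sum(c * c for c in va.values()))
              * sqrt(sum(c * c for c in vb.values())))
      return dot / norm if norm else 0.0

  THRESHOLD = 0.5  # arbitrary: where the analog becomes digital

  doc1 = "the semantic web needs metadata"
  doc2 = "metadata makes the semantic web work"
  related = similarity(doc1, doc2) > THRESHOLD  # now a discrete fact
  print(related)  # True, with this pair and this threshold

The approximation is crude, but that has never stopped digital computers
being useful with analog data.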

For humans to interface with the SW, decent NLU is desirable, and the same
technology can be used to generate metadata - yes, through scraping and
statistical/neural text analysis. There are many more sources of data that
could go into the mix as well, such as browser behaviour analysis. The
existence of URLs within the dataset gives this approach far more potential
than single-document analysis. What I'm talking about is systems like
Google, but instead of producing material for immediate human consumption,
producing metadata for machines.
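
A rough sketch of the statistical end of this, deliberately primitive: pull
the most frequent non-stopword terms from a scraped page and emit them as
dc:subject assertions (Python; the URL, text, and stopword list are
invented, and a real system would use far better statistics):

  from collections import Counter

  STOPWORDS = {"the", "a", "an", "of", "and", "to", "is", "in", "that"}

  def extract_subjects(text, n=3):
      words = [w for w in text.lower().split() if w not in STOPWORDS]
      return [w for w, _ in Counter(words).most_common(n)]

  page_url = "http://example.org/some-page"  # hypothetical
  page_text = "metadata metadata semantic web semantic hooks identifiers"
  for term in extract_subjects(page_text):
      # one N-Triples-style assertion per line, for machines not humans
      print('<%s> <http://purl.org/dc/elements/1.1/subject> "%s" .'
            % (page_url, term))

Swap the frequency count for tf-idf over a whole crawl, or a trained
classifier, and you start to get somewhere.
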
The metadata generated by one such system may be completely at odds with
that generated by another, but this can be sorted out at the logical layer,
using the same methods that would, for example, lead us to trust the opinion
of expert A over that of expert B (I've had my wrists slapped too many times
to mention putting fuzzy/statistical/neural techniques on this layer).
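
At that layer, resolution can be as plain as preferring the more trusted
source when two extractors disagree. A minimal sketch (Python; the trust
ranks and assertions are invented, and a single-valued property is assumed
for simplicity):

  TRUST = {"extractor-A": 2, "extractor-B": 1}  # higher = more trusted

  assertions = [
      ("http://example.org/page", "dc:title", "Cars",  "extractor-A"),
      ("http://example.org/page", "dc:title", "Hills", "extractor-B"),
  ]

  resolved = {}
  for subj, prop, value, source in assertions:
      key = (subj, prop)
      # keep the claim from whichever source we trust more
      if key not in resolved or TRUST[source] > TRUST[resolved[key][1]]:
          resolved[key] = (value, source)

  for (subj, prop), (value, source) in resolved.items():
      print(subj, prop, value, "(per %s)" % source)

Nothing fuzzy, statistical or neural about it - just a preference ordering
applied mechanically, which is the sort of thing the logical layer is for.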

So what I'm basically saying is that the web is, and will continue to be,
'polluted', so any system that doesn't take this into account risks excluding
a large proportion of the available information, and that the extraction of
implicit metadata can significantly help circumvent the lack of explicit
metadata. Oh yes, and that at the end of the day, when definitions of
identifiers have been agreed on universally within the RDF community, the
world outside will by and large ignore those definitions.

Cheers,
Danny.
---
Danny Ayers
<stuff> http://www.isacat.net </stuff>


>-----Original Message-----
>From: www-rdf-interest-request@w3.org
>[mailto:www-rdf-interest-request@w3.org]On Behalf Of Joshua Allen
>Sent: 23 April 2002 06:38
>To: msabin@interx.com; www-rdf-interest@w3.org
>Subject: RE: Documents, Cars, Hills, and Valleys
>
>
>> There already _are_ thousands of such assertions. Either people are
>
>> Well, this is the status quo, and the prospects of changing it strike
>> me as fairly slim. So if you're right that this renders metadata
>> useless, we may as well pack up and go home.
>
>Now you see my point.  The status quo is that there are a few people
>publishing assertions that very few other people ever use, and are
>impossible to aggregate globally in any meaningful way.
>
>In other words, the status quo is that we do NOT have a semantic web; we
>have a bunch of people rolling their own hypercard systems and claiming
>that they are building a world-wide-web.
>
>In 1989, you could have argued that "there are thousands of hypertext
>pages that use hyperlinks which are only meaningful within context of
>their particular system -- this is the status quo, and dreaming about
>universal identifiers so that all hyperlink systems interoperate is a
>pipe-dream, bub."
>
>But this was as wrong about the WWW then as it is about the "semantic
>web" now.  A true semantic "web" uses universal identifiers, period.
>Saying that there are lots of fragmented systems that use identifiers
>which are not truly universal is not the same as saying that a system
>which *does* use universal identifiers is not possible or desirable.
>
>Hypercard didn't stop the WWW from being deployed -- in fact the WWW
>made closed-world hypertext systems seem rather insignificant in short
>order.  Maybe closed-world semantic systems are interesting to you, but
>I believe that a semantic web has potential to make the "status quo"
>insignificant.
>
>> largely untroubled by ambiguity, or, in practice, ambiguity isn't the
>> disastrous problem you're making it out to be.
>
>In practice, there is no semantic web yet.  And in practice, people
>using identifiers in gratuitously ambiguous ways will never be a part of
>a global semantic web.  We all agree that these people will probably be
>able to do interesting things with their polluted metadata, and perhaps
>even build bridges to the global semantic web through lots of manual
>conversion.  But that's about as relevant to "the semantic web" as
>hypercard was to the WWW.
>

Received on Tuesday, 23 April 2002 05:39:21 UTC