Re: discussion about Semantic Web realization

Maciej Gawinecki wrote:

> Thank you for your help,

A suggestion/question: how might your analysis differ if you took the 
perspective that there is only one Web, ... the World Wide one; and that 
'Semantic Web' is the name of a (world-wide) project to help improve it? 
Just as the 'Mobile Web' initiative aims to progress the state of the 
Web art relating to use from mobile devices. If you talk about the 
Semantic Web as a new replacement Web, you're bound to be disappointed. 
If you think of it as a collaborative project, hopefully you'll find a 
way to get involved and help with it.

Noun phrases can mislead us sometimes. A phrase such as '[the] Semantic 
Web' (or 'Mobile ...') can have the unfortunate side-effect that it 
slips us into thinking that there are a countable number of "Webs". And 
we then look around and see that these apparently-new "Webs" look like 
peas in orbit around the Jupiter of the "classic Web".

If we focus instead on the notion of there being just one Web, we can 
still ask why the proportion of it with an RDF representation is 
relatively tiny. But we don't take the absence of a new all-replacing 
'thing' as a measure of failure.

Thinking about RDF:

 > - decentralization, no central database of content and links

RDF also has this characteristic. The Web itself is our distributed 
database of schemas. We're all free to use shared schemas, or our own 
application-specific schemas. And by using Web identifiers for our 
descriptive terms, we set things up so that mappings (often lossy, 
pragmatic mappings) can be created days, months, or years later, either 
in procedural code or using technologies like SPARQL, OWL, or RIF. The 
important thing is that these agreements and mappings can be documented 
later, if at all. People can get on with their immediate business 
without asking for permission or forgiveness. There is more 
bottlenecking and centralisation in a traditional 'enterprise' SQL-based 
environment than in the entire planet-wide Semantic Web.
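
To make that concrete, here's a minimal sketch in Python with rdflib 
(the shop: and lib: vocabularies, URIs and data are all invented for 
illustration). Two parties coin their own 'title' properties 
independently; a lossy mapping is documented months later as a SPARQL 
CONSTRUCT:

  from rdflib import Graph

  g = Graph()
  g.parse(data="""
      @prefix shop: <http://shop.example.org/terms#> .
      <http://shop.example.org/item/42> shop:productName "Weaving the Web" .
  """, format="turtle")

  # Months later, someone documents a (lossy, pragmatic) mapping:
  mapping = """
      PREFIX shop: <http://shop.example.org/terms#>
      PREFIX lib:  <http://library.example.org/ns#>
      CONSTRUCT { ?item lib:title ?name }
      WHERE     { ?item shop:productName ?name }
  """
  for triple in g.query(mapping):  # a CONSTRUCT result iterates as triples
      g.add(triple)

Neither party had to be consulted before the mapping was written.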


 > - one-way links, requiring no cooperation, approval from the link target

This corresponds to RDF's claim-based design, where anything that can be 
read as RDF is free to encode claims about anything else. eg. (for 
better or worse) I can talk about you in my FOAF file whether you like 
it or not.
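
Sketched with rdflib (the URIs here are hypothetical):

  from rdflib import Graph, Namespace, URIRef, Literal

  FOAF = Namespace("http://xmlns.com/foaf/0.1/")
  g = Graph()
  me = URIRef("http://example.org/danbri#me")    # hypothetical URIs
  you = URIRef("http://example.org/maciej#me")

  # No permission needed from the subject of these claims:
  g.add((me, FOAF.knows, you))
  g.add((you, FOAF.name, Literal("Maciej Gawinecki")))
  print(g.serialize(format="turtle"))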


 > - a simple protocol (HTTP) and markup format (HTTP*) that anyone could
adapt and copy

(I assume you meant HTML for the latter (*).)

RDF/SemWeb uses HTTP heavily too (but doesn't require it; we can eg. 
query SPARQL over the XMPP protocol). For formats, a system designed for 
improved machine processing is by necessity going to be harder for 
humans to create at the byte or character level. But there are various 
efforts in play to ensure that we can create RDF views of as much data 
as possible: by mapping from SQL (which humans have GUIs for, Web-based 
and desktop); from well-formed or annotated HTML (GRDDL/RDFa); from 
Wikis; etc. Semantic Web people are pragmatists, and will pull data in 
from wherever it can be found...
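
To illustrate the SQL-mapping pattern (a hand-rolled sketch, not D2RQ 
itself; the table layout and URIs are invented):

  import sqlite3
  from rdflib import Graph, Namespace, URIRef, Literal

  FOAF = Namespace("http://xmlns.com/foaf/0.1/")
  g = Graph()
  db = sqlite3.connect("people.db")  # assumed table: people(id, name, homepage)
  for pid, name, homepage in db.execute(
          "SELECT id, name, homepage FROM people"):
      person = URIRef("http://data.example.org/person/%s" % pid)
      g.add((person, FOAF.name, Literal(name)))
      g.add((person, FOAF.homepage, URIRef(homepage)))
  print(g.serialize(format="turtle"))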

 > - no established competitors serving the same need

Depending on the level of analysis, Gopher was an early competitor; 
however, the Web was a unifying abstraction that embraced Gopher, FTP, 
telnet, etc. as components of our information universe; it embraces RDF too.

 > - significant commercial interest in selling more PCs, online 
services, net access, etc.

It's the same single Web... if RDF can drive traffic to commercial sites 
[yes, a work in progress] then the same business benefits can kick in.

 > - no critical mass required to make the Web interesting and useful

I don't see a fundamental difference here. RDF could be used on a single 
site quite happily, eg. to provide faceted browse into a collection of 
things. For example, 
http://www.w3.org/2001/sw/Europe/showcase/sem-portal.html
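
In a faceted browser each facet is just a property, and each facet 
selection adds one more constraint to a query. A rough sketch, assuming 
rdflib and an invented single-site dataset:

  from rdflib import Graph

  g = Graph()
  g.parse("http://site.example.org/collection.rdf")  # hypothetical dataset

  q = """
      PREFIX dc: <http://purl.org/dc/elements/1.1/>
      SELECT ?item ?title
      WHERE {
          ?item dc:creator "Lagoze" ;   # facet 1: creator
                dc:date    "1996" ;     # facet 2: date
                dc:title   ?title .
      }
  """
  for item, title in g.query(q):
      print(item, title)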

The Semantic Web isn't a new replacement Web; it's a project, part of 
the wider Web project. You can poke around in its origins; eg. see 
http://www.w3.org/Talks/WWW94Tim/ or 
http://www.w3.org/1999/11/11-WWWProposal/thenandnow

But yes [see below], RDF is at its best when cross-domain data is being 
merged; and more data makes this case more compelling than only having a 
few files. If the Web were only a single page, we'd probably search it 
with 'grep' rather than Google's server farm, after all.



As to your critical points:

 > - requires a measure of centralization in order to make sense of 
schemas, i.e. the semantics cannot be built in to every client as the
semantics of HTML and HTTP were built in to browsers

RDF is an exercise in 'agreeing to disagree'. By buying into a common 
data model (the nodes-and-arcs thing), decentralised parties can invent 
whatever classes, properties, and URIs they like, benefiting from 
shared infrastructure (APIs, data stores, query languages) that is 
utterly domain-neutral. Furthermore (and in contrast to typical XML 
usage), the descriptive vocabularies created by these different parties 
can be freely combined *without* prior or centralised agreement by those 
parties.

For example, look at 
http://search.cpan.org/src/ASCOPE/Net-Flickr-Backup-2.6/README and the 
list of namespaces used. I'll repeat them here:
      <rdf:RDF
       xmlns:dc="http://purl.org/dc/elements/1.1/"
       xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
       xmlns:a="http://www.w3.org/2000/10/annotation-ns"
       xmlns:acl="http://www.w3.org/2001/02/acls#"
       xmlns:exif="http://nwalsh.com/rdf/exif#"
       xmlns:skos="http://www.w3.org/2004/02/skos/core#"
       xmlns:cc="http://web.resource.org/cc/"
       xmlns:foaf="http://xmlns.com/foaf/0.1/"
       xmlns:exifi="http://nwalsh.com/rdf/exif-intrinsic#"
       xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
       xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
       xmlns:flickr="x-urn:flickr:"
       xmlns:dcterms="http://purl.org/dc/terms/"
       xmlns:i="http://www.w3.org/2004/02/image-regions#">

Now, OK, some familiar faces show up when you look behind the scenes at 
who created those schemas (well, we're a small but growing community!). 
However, the development of these schemas did not *need* pairwise or 
central coordination, and the author of the Perl Net::Flickr::Backup 
module (Aaron Straup Cope) absolutely did not need anyone's permission 
to recombine these schemas to create image descriptions which integrate 
data by using them all.
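
The same no-coordination recombination, sketched with rdflib (the photo 
URI and values are invented; the namespaces are real ones from the 
README above):

  from rdflib import Graph

  g = Graph()
  g.parse(data="""
      @prefix dc:  <http://purl.org/dc/elements/1.1/> .
      @prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
      @prefix cc:  <http://web.resource.org/cc/> .

      <http://flickr.example.com/photo/1>
          dc:title   "Harbourside at dusk" ;
          geo:lat    "51.4498" ;
          geo:long   "-2.5983" ;
          cc:license <http://creativecommons.org/licenses/by/2.0/> .
  """, format="turtle")
  # Three vocabularies, three communities, zero coordination meetings.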

I'll dwell on this point a bit longer as it is a key one, and at risk of 
being lost in the social history of the Semantic Web. The push for RDF 
came in good part from people who were sick of sitting in metadata 
standardisation meetings, and of dealing with scoping overlaps. The RDF 
design is heavily decentralistic compared to some other approaches that 
could have been taken.

In earlier RDFS drafts we made some of this heritage more explicit; see eg.
http://www.w3.org/TR/2000/CR-rdf-schema-20000327/
[[
RDF and the RDF Schema language were also based on metadata research in 
the Digital Library community. In particular, RDF adopts a modular 
approach to metadata that can be considered an implementation of the 
Warwick Framework [WF]. RDF represents an evolution of the Warwick 
Framework model in that the Warwick Framework allowed each metadata 
vocabulary to be represented in a different syntax. In RDF, all 
vocabularies are expressed within a single well defined model. This 
allows for a finer grained mixing of machine-processable vocabularies, 
and addresses the need [EXTWEB] to create metadata in which statements 
can draw upon multiple vocabularies that are managed in a decentralized 
fashion by independent communities of expertise.
]]

The Warwick Framework was a conceptualisation of the metadata problem 
space from the Dublin Core community in 1996; see 
http://www.dlib.org/dlib/july96/lagoze/07lagoze.html ... it proposed 
a way of breaking descriptive tasks down into scoped 'packages'.

Quoting from the 1996 dlib paper,
[[
  The result of the Warwick Workshop is a container architecture, known 
as the Warwick Framework. The framework is a mechanism for aggregating 
logically, and perhaps physically, distinct packages of metadata. This 
is a modularization of the metadata issue with a number of notable 
characteristics.

     * It allows the designers of individual metadata sets to focus on 
their specific requirements, without concerns for generalization to 
ultimately unbounded scope.
     * It allows the syntax of metadata sets to vary in conformance with 
semantic requirements, community practices, and functional (processing) 
requirements for the kind of metadata in question.
     * It separates management of and responsibility for specific 
metadata sets among their respective "communities of expertise".
     * It promotes interoperability by allowing tools and agents to 
selectively access and manipulate individual packages and ignore others.
     * It permits access to the different metadata sets that are related 
to the same object to be separately controlled.
     * It flexibly accommodates future metadata sets by not requiring 
changes to existing sets or the programs that make use of them.

The separation of metadata sets into packages does not imply that 
packages are completely semantically distinct. In fact, it is a feature 
of the Warwick Framework that an individual container may hold packages, 
each managed and maintained by distinct parties, which have complex 
semantic overlap.
]]

In some ways RDF is a realisation of this abstract architecture. But 
with RDF we really went further in exploring the issue of semantic 
overlap amongst different metadata 'packages'. By imposing a common data 
model across all metadata packages, we make it possible for apps to 
express data and queries that combine, for example, rights, geographic, 
discovery, workflow, tagging, or any other RDF-expressed characteristics.

In this conceptualisation, we are buying more decentralisability at the 
expense of imposing a shared data model.
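
Which means one query can join across those 'packages'. A sketch, 
assuming rdflib and a hypothetical store of mixed descriptions like the 
Flickr ones above:

  from rdflib import Graph

  g = Graph()
  g.parse("http://site.example.org/photos.rdf")  # hypothetical mixed data

  q = """
      PREFIX dc:  <http://purl.org/dc/elements/1.1/>
      PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
      PREFIX cc:  <http://web.resource.org/cc/>
      SELECT ?title ?lat ?long ?lic
      WHERE {
          ?photo dc:title   ?title ;    # discovery metadata
                 geo:lat    ?lat ;      # geographic metadata
                 geo:long   ?long ;
                 cc:license ?lic .      # rights metadata
      }
  """
  for title, lat, long_, lic in g.query(q):
      print(title, lat, long_, lic)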

Regarding your point about decentralisation, I think RDF compares rather 
well with XML. Anyone can invent an XML schema and deploy it; the 
technology allows XML elements and attributes to be used in wildly 
varying ways. In RDF's XML syntax(es), the notation is always an 
encoding of a set of RDF claims about the world. We have a common set of 
rules to help interpret this, making it easier rather than harder to 
process and integrate data from unknown namespaces. If I see a new RDF 
schema, I know that it defines classes and properties and not a lot 
else. This lowers some costs (and raises some others, sure; nothing is 
free here). RDF takes expressive power away from those who define 
schemas, such that their schemas all have a lot more in common. It's a 
shared pattern for schema authors, designed to let them get on with 
their job and not have to fly to meetings with each other. By doing the 
data-modelling agreement once, up front, instead of pairwise, we save on 
a lot of airfares and a lot of teleconferences.

 > - requires much more cooperation from data sources (e.g. link targets)

I suspect some confusion about 'link targets' here. In the classic Web, 
a link target is the thing you're pointing to. In the Semantic Web 
project, we can describe anything that the classic Web might link to; 
and beyond that, we can use reference-by-description techniques to talk 
about things indirectly, via their descriptions. No consent needed. I 
can talk about 'the person whose homepage is http://john.example.com/', 
for example.
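
In RDF terms that's just a blank node plus a description; eg. (a sketch 
with rdflib; the interest claim is invented):

  from rdflib import Graph

  g = Graph()
  g.parse(data="""
      @prefix foaf: <http://xmlns.com/foaf/0.1/> .

      # 'the person whose homepage is http://john.example.com/';
      # foaf:homepage identifies its subject uniquely, so this blank
      # node picks out one person, with no consent required:
      [] a foaf:Person ;
         foaf:homepage <http://john.example.com/> ;
         foaf:interest <http://www.w3.org/2001/sw/> .
  """, format="turtle")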

 > - is based on a complex markup (RDF) that's difficult for
non-programmers to work with

Two flavours of complexity here:

1. Each RDF notation (RDF/XML, RDFa, ... custom GRDDL-ready formats) has 
some (varying) difficulty associated with learning the encoding. And an 
associated fragility risk: a single misunderstanding or error can mess 
up the entire chunk of data. RDF notations have not traditionally had 
any form of recovery from this, ie. nothing like the 'quirks mode' that 
Web browsers have, where bad HTML is still somehow converted into a 
user-facing document.

2. Merely having a distinction between abstract data model vs markup(s) 
is a level of indirection that can be confusing, especially without 
fantastic tool support, tutorial materials etc.

These are real issues. But HTML itself is also difficult for 
non-programmers to work with *well*, which is why so many sites don't 
give a reliable cross-browser experience (people code for IE; as a 
MacOSX Firefox user I suffer often enough when visiting bad HTML sites). 
Perhaps the difference here is that crappy HTML coding leads to a 
sometimes-crappy Web experience, while crappy RDF coding leads to ... no 
data at all from that document. The use of RDFa in an HTML5 context is 
where this part of the discussion goes next: it should be possible to 
mix semantic markup into environments where non-draconian error handling 
is the rule. The microformats folk do this, for example. We're all still 
figuring out exactly what the tradeoffs are here: how much mess to allow 
before things become too scruffy for our poor machines to have any idea 
what's happening?
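
That failure mode is easy to demonstrate (Python with rdflib; the broken 
Turtle is contrived):

  from rdflib import Graph

  bad = """
      @prefix dc: <http://purl.org/dc/elements/1.1/> .
      <http://example.org/doc> dc:title "unterminated literal .
  """
  g = Graph()
  try:
      g.parse(data=bad, format="turtle")
  except Exception as err:
      print("zero triples recovered:", err)  # no 'quirks mode' here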


 > - has to compete with its predecessor and many other technologies

I view this as a misunderstanding. It may be cleanest to think of the 
"Semantic Web" simply as a project. When http://www.w3.org/2001/sw/ says 
"The Semantic Web is a Web of data" it isn't talking about any other Web 
but the one we know and love. Read it as 'The-Web-made-more-Semantic is 
a Web of data', perhaps.


 > - has very little commercial interest, unclear revenue model, etc.

There may be no 'make money fast' route akin to the crazy dot-com days, 
but I see here more of a 'chicken and egg' issue (which you allude to 
above). While RDF can be used on a single site, unless there is a lot of 
it around, nobody's going to bother building a planet-scale index of it. 
And unless there's a planet-scale index and it's being used by major 
search engines, people won't have an incentive to publish a lot of RDF 
in the public Web. If things turn out well, publishing RDF should help 
drive users to classic Web sites, where they'll be parted from their 
money through various essentially timeless techniques. Some things 
change; some stay the same.

Re chicken-and-egg ... I think we've done a bit to break that cycle in 
the FOAF scene. In recent months Google's Social Graph API has been 
indexing much of the public FOAF data out there, and more recently still 
has been using a real RDF parser. While this isn't currently integrated 
into Google's main user-facing search, I am very encouraged by these 
developments, and by those at Yahoo around RDF/RDFa. It has taken a 
while but we're getting there.

My other thought re "critical mass" has been that SemWeb adoption is 
difficult because, as a fundamentally cross-domain technology, we only 
really show strong benefits, ie. the technology's key strengths, when it 
is used in several overlapping fields simultaneously. And as a 
representation system where data can always be missing, and always be 
extended/augmented, it can take a lot of data before there is enough to 
reliably index in certain ways. My answer to this (besides FOAF) is to 
suggest that SemWeb may perhaps take off in a few large cities first. 
Geographical proximity could allow a critical mass of early-adopter data 
even without things going RDF-crazy planet-wide. Some of us put in an EU 
project proposal on this a few years back, but the EU reviewers in their 
infinite wisdom chose not to fund it. Ah well :)

 > - requires a critical mass of participating sites to be interesting 
and useful

As I say above, having a mass of data isn't essential, although it's 
nice of course. And the work can be distributed: while the Semantic 
MediaWiki folk are showing how built-in RDF facilities could add value 
to MediaWiki and Wikipedia, the DBpedia team are already showing an 
externally generated RDF version of Wikipedia. Direct participation is 
welcome but not required; 3rd parties can write GRDDL transforms for 
XML formats, or D2RQ etc. adaptors for existing SQL datasets. There are 
a lot of scraper/extractor/converter scripts around, and a few lines of 
code can create a huge amount of data.
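
eg. a few-line CSV-to-RDF converter might look like this (file name, 
columns, and URIs all invented):

  import csv
  from rdflib import Graph, Namespace, URIRef, Literal

  DC = Namespace("http://purl.org/dc/elements/1.1/")
  g = Graph()
  with open("books.csv") as f:  # assumed columns: id, title, creator
      for row in csv.DictReader(f):
          book = URIRef("http://data.example.org/book/%s" % row["id"])
          g.add((book, DC.title, Literal(row["title"])))
          g.add((book, DC.creator, Literal(row["creator"])))
  g.serialize(destination="books.rdf", format="xml")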


These are good kinds of questions to ask, but I think they're all 
somehow a bit skewed by thinking of the SW as a replacement for the Web, 
or as a rival for existing search engines. It may be that some fancy new 
search engine comes along that is fundamentally RDF-oriented, but it's 
also clear that there are many folk at the existing search engines who 
are well aware that the Web is slowly offering more by way of structured 
(meta)data.

cheers,

Dan

--
http://danbri.org/

Received on Monday, 28 April 2008 15:40:11 UTC