Re: Think before you write Semantic Web crawlers

From: adasal <adam.saltiel@gmail.com>
Date: Thu, 23 Jun 2011 11:36:03 +0100
Message-ID: <BANLkTiks3z5LZqm7FF+9+2QCFmqXz5+jaA@mail.gmail.com>
To: Henry Story <henry.story@bblfish.net>
Cc: Martin Hepp <martin.hepp@ebusiness-unibw.org>, glenn mcdonald <glenn@furia.com>, Yves Raimond <yves.raimond@gmail.com>, Christopher Gutteridge <cjg@ecs.soton.ac.uk>, Daniel Herzig <herzig@kit.edu>, semantic-web@w3.org, public-lod@w3.org
>
> In the academic/business nonsense, you should look at how much IBM and co
> put into SOAP, and where that got them. Pretty much nowhere.
>

I don't agree. SOAP is quite widely adopted in those areas where the use
case (slow running - usually internal to internal - transactions) exists.
Perhaps SOAP is less successful across organisational boundaries?
If so, I would suggest that this is because organisational issues have to be
negotiated, such as complex ways of doing business that lie outside the
SOAP solution.

> The semantic web seems a lot more fruitful than SOAP to me, and has a lot
> more potential.
>
> It is not that difficult, it's just that people - in mass - are slow
> learners.
>
I take your point about diffusion.
We are talking about a huge, perhaps infinite, realm. Trying to think about
its current and future shape is very hard.
This is really my point about Cathedral and Bazaar. I know the original
essay spoke about the benefits of the Bazaar in software evolution, as has
been shown.
But I am also pointing out a contrast between structures and the processes
they give rise to.
I am using the metaphor to say that a Cathedral is a coherent structure in
contrast to the Bazaar. It is hierarchical and expresses clear priorities.
I am pointing out that this is the situation on the internet when we look at
some of its more paradoxical dimensions.
For instance, it is a free community with potentially open access to the
data of each of its individuals (or other agents), yet what has become a
prime asset is the ability to harvest the data generated when one agent
finds another. That harvesting is by no means open, although it pays for
much of what we experience as 'free' (search, infrastructure, open source s/w
...).
How things are actually working now must be taken into account when
assessing future potential.
Glenn's Needlebase is a beautiful example of this. What is wanted and needed
commercially is ITA (recently acquired by Google), and ITA's Needlebase is a
(very exciting) spin-off?
Those who make money have a huge influence on the shape of the medium and, by
extension, on how people react to, use and are influenced by it in some regards.
At the extreme, because expectations are conditioned and then met in other
ways, there may never be a demand for a more intelligent web.

The internet here is about data gathering and repurposing. I assume that
there is a tiny aggregative gain to Google in any repurposed data which is
helpful to them. This could come from any source: Google's ITA, Refine or
something external.
The economic question w.r.t. the Semantic Web is: if it also contributes a
small gain to, e.g., Google, what would it get back? How would this work?
On the other hand, if it begins to undermine the business model of Google
etc., one can expect a push back in proportion to the perceived size of that
threat. Would there be some commercial gain to be found this way? How would
this work, again?

And if it is merely neutral, does that mean it will languish and get lost in
the noise?

Best,

Adam

On 22 June 2011 21:17, Henry Story <henry.story@bblfish.net> wrote:

>
> On 22 Jun 2011, at 21:05, Martin Hepp wrote:
>
> > Glenn:
> >
> >> If there isn't, why not? We're the Semantic Web, dammit. If we aren't
> the masters of data interoperability, what are we?
> > The main question is: Is the Semantic Web an evolutionary improvement of
> the Web, the Web understood as an ecosystem comprising protocols, data
> models, people, and economics - or is it a tiny special interest branch.
> >
> > As said: I bet a bottle of champagne that the academic Semantic Web
> community's technical proposals will never gain more than 10 % market share
> among "real" site-owners, because of
>
> I worked for AltaVista and Sun Microsystems, so I am not an academic.  And
> it would be difficult to get back to academia, as salaries are so low there.
> So we should be thankful for how much good work these people are putting into
> this for love of the subject.
>
> > - unnecessary complexity (think of the simplicity of publishing an HTML
> page vs. following LOD publishing principles),
>
> Well, data manipulation is more difficult of course than simple web pages.
> But there are large benefits to be gained from more structured data. In the
> academic/business nonsense, you should look at how much IBM and co put into
> SOAP, and where that got them. Pretty much nowhere. The semantic web seems a
> lot more fruitful than SOAP to me, and has a lot more potential. It is not
> that difficult, it's just that people - in mass - are slow learners. But you
> know there is time.
>
> > - bad design decisions (e.g explicit datatyping of data instances in
> RDFa),
> > - poor documentation for non-geeks, and
> > - a lack of understanding of the economics of technology diffusion.
>
> Technology diffuses a lot slower than people think. But in aggregate it
> diffuses a lot faster than we can cope with.
>  - "history of technology adoption" http://bblfish.net/blog/page1.html#14
>  - "was moore's law inevitable"
> http://www.kk.org/thetechnium/archives/2009/07/was_moores_law.php
>
>
> In any case WebID is so mindbogglingly simple, it falsifies all the points
> above. You have a problem and there is a solution to it. Of course we need
> to stop bad crawling. But also we should start showing how the web can
> protect itself, without asking just for good will.
>
> Henry
>
>
> >
> > Never ever.
> >
> > Best
> >
> > Martin
> >
> > On Jun 22, 2011, at 3:18 PM, glenn mcdonald wrote:
> >
> >>> From my perspective as the designer of a system that both consumes and
> publishes data, the load/burden issue here is not at all particular to the
> semantic web. Needle obeys robots.txt rules, but that's a small deal
> compared to the difficulty of extracting whole data from sites set up to
> deliver it only in tiny pieces. I'd say about 98% of the time I can describe
> the data I want from a site with a single conceptual query. Indeed, once
> I've got the data into Needle I can almost always actually produce that
> query. But on the source site, I usually can't, and thus we are forced to
> waste everybody's time navigating the machines through superfluous
> presentation rendering designed for people. 10-at-a-time results lists,
> interminable AJAX refreshes, animated DIV reveals, grafting back together
> the splintered bits of tree-traversals, etc. This is all absurdly
> unnecessary. Why is anybody having to "crawl" an open semantic-web dataset?
> Isn't there a "download" link, and/or a SPARQL endpoint? If there isn't, why
> not? We're the Semantic Web, dammit. If we aren't the masters of data
> interoperability, what are we?
> >>
> >> glenn
> >> (www.needlebase.com)
> >
> >
>
> Social Web Architect
> http://bblfish.net/
>
>
>
Received on Thursday, 23 June 2011 10:36:32 UTC
