Re: DBpedia hosting burden

On Wed, Apr 14, 2010 at 8:11 PM, Kingsley Idehen <kidehen@openlinksw.com> wrote:


> Some have cleaned up their act for sure.
>
> Problem is, there are others doing the same thing, who then complain about
> the instance in very generic fashion.

They're lucky it exists at all. I'd refer them to this Louis CK sketch
- http://videosift.com/video/Louie-CK-on-Conan-Oct-1st-2008?fromdupe=We-live-in-an-amazing-amazing-world-and-we-complain
(if it stays online...).

>> While it is a
>> shame to say 'no' to people trying to use linked data, this would be
>> more saying 'yes, but not like that...'.
>>
>
> I think we have an outstanding blog post / technical note about the DBpedia
> instance that hasn't been published (possibly due to the 3.5 and
> DBpedia-Live work we are doing), said note will cover how to work with the
> instance etc..
[...]
> We do have a solution in mind, basically, we are going to have a different
> place for the descriptor resources and redirect crawlers there  via 303's
> etc..
[...]
> We'll get the guide out.


That sounds useful.

>> As you mention, DBpedia is an important and central resource, thanks
>> both to the work of the Wikipedia community, and those in the DBpedia
>> project who enrich and make available all that information. It's
>> therefore important that the SemWeb / Linked Data community takes care
>> to remember that these things don't come for free, that bills need
>> paying and that de-referencing is a privilege not a right.
>
> "Bills" the major operative word in a world where the "Bill Payer" and
> "Database Maintainer" is a footnote (at best) re. perception of what
> constitutes the DBpedia Project.

Yes, I'm sure some are thoughtless and take it for granted; but also
that others are well aware of the burdens.

(For that matter, I'm not myself so sure how Wikipedia cover their
costs or what their longer-term plan is...).


> For us, the most important thing is perspective. DBpedia is another space on
> a public network, thus it can't magically rewrite the underlying physics of
> wide area networking where access is open to the world.  Thus, we can make a
> note about proper behavior and explain how we protect the instance such that
> everyone has a chance of using it (rather than a select few resource
> guzzlers).

This I think is something others can help with, when presenting LOD
and related concepts: to encourage good habits that spread the cost of
keeping this great dataset globally available. So all those making
slides, tutorials, blog posts or software tools have a role to play
here.

>> Are there any scenarios around eg. BitTorrent that could be explored?
>> What if each of the static files in http://dbpedia.org/sitemap.xml
>> were available as torrents (or magnet: URIs)?
>
> When we set up the Descriptor Resource host, these would certainly be
> considered.

Ok, let's take care to explore that then; it would probably help
others too. There must be dozens of companies and research
organizations who could put some bandwidth resources into this, if
only there were a short guide to setting up a GUI-less BitTorrent tool
and configuring it appropriately. Are there any BitTorrent experts on
these mailing lists who could suggest next practical steps here (not
necessarily dbpedia-specific)?

(ah I see a reply from Ivan; copying it in here...)

> If I were The Emperor of LOD I'd ask all grand dukes of datasources to
> put fresh dumps at some torrent with control of UL/DL ratio :) For
> reason I can't understand this idea is proposed few times per year but
> never tried.

I suspect BitTorrent is in some ways a 'taboo' technology, since
it is most famous for being used to distribute materials that
copyright-owners often don't want distributed. I have no detailed idea
how torrent files are made, how trackers work, etc. I started poking
around magnet: a bit recently but haven't got a sense for how solid
that work is yet. Could a simple Wiki page be used for sharing
torrents? (plus published hash of files elsewhere for integrity
checks). What would it take to get started?
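
As a starting point for anyone wanting to experiment, here is a rough
sketch of how a single-file .torrent could be put together, based on the
publicly documented BitTorrent metainfo format (a bencoded dictionary
whose "info" part carries the SHA-1 of each fixed-size piece). The
function names, the 256 KiB piece size and the tracker URL are
assumptions for illustration only; a real deployment would more likely
use an existing torrent-creation tool.

```python
import hashlib
import os


def bencode(value):
    """Encode ints, bytes/str, lists and dicts in BitTorrent's bencoding."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must appear in sorted order.
        items = [(k.encode("utf-8") if isinstance(k, str) else k, v)
                 for k, v in value.items()]
        items.sort(key=lambda pair: pair[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError("cannot bencode %r" % type(value))


def build_metainfo(path, announce_url, piece_length=256 * 1024):
    """Build the metainfo dictionary for one file (piece size is an assumption)."""
    piece_hashes = []
    length = 0
    with open(path, "rb") as handle:
        for piece in iter(lambda: handle.read(piece_length), b""):
            piece_hashes.append(hashlib.sha1(piece).digest())
            length += len(piece)
    info = {
        "name": os.path.basename(path),
        "length": length,
        "piece length": piece_length,
        "pieces": b"".join(piece_hashes),  # 20 bytes per piece
    }
    return {"announce": announce_url, "info": info}


def info_hash(metainfo):
    """SHA-1 of the bencoded info dictionary, which identifies the torrent."""
    return hashlib.sha1(bencode(metainfo["info"])).hexdigest()
```

Writing bencode(build_metainfo(path, tracker_url)) to a file with a
.torrent extension would give something a client can open; the tracker
URL has to point at a tracker that someone actually runs.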

Perhaps if http://wiki.dbpedia.org/Downloads35 had the sha1 for each
download published (rdfa?), then others could experiment with torrents
and downloaders could cross-check against an authoritative description
of the file from dbpedia?
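
The cross-check itself would be simple on the downloader's side. A
minimal sketch, assuming the published value is a hex-encoded SHA-1
string (the function names here are made up for illustration, and how
the hash gets published is still the open question above):

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Hash the file in chunks so large dumps need not fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, published_sha1):
    """Compare a local file against the hash copied from the authoritative page."""
    return sha1_of_file(path) == published_sha1.strip().lower()
```

A file fetched from any mirror or torrent could then be accepted only if
matches_published_hash() returns True for the value listed by dbpedia.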

>>  I realise that would
>> only address part of the problem/cost, but it's a widely used
>> technology for distributing large files; can we bend it to our needs?
>>
>
> Also, we encourage use of gzip over HTTP  :-)

Are there any RDF toolkits in need of a patch to their default setup
in this regard? Tutorials that need fixing, etc?
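
For reference, this is roughly what a client has to do to benefit. A
minimal sketch using only the Python standard library, whose urllib does
not ask for or undo gzip by default; the helper names are my own and the
URL passed in is whatever resource the client wants:

```python
import gzip
import urllib.request


def build_request(url):
    """Ask the server for a gzip-compressed response body."""
    return urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})


def decode_body(body, content_encoding):
    """Undo gzip only if the server says it actually applied it."""
    if (content_encoding or "").lower() == "gzip":
        return gzip.decompress(body)
    return body


def fetch(url):
    """Fetch a URL, transparently handling a gzip-encoded response."""
    with urllib.request.urlopen(build_request(url)) as response:
        encoding = response.headers.get("Content-Encoding")
        return decode_body(response.read(), encoding)
```

Checking the Content-Encoding header matters because a server is free to
ignore the request header and send the body uncompressed.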

cheers,

Dan


ps. re big datasets, the Library of Congress is apparently going to have
the complete Twitter archive - see
http://twitter.com/librarycongress/status/12172217971  ->
http://blogs.loc.gov/loc/2010/04/how-tweet-it-is-library-acquires-entire-twitter-archive/

Received on Wednesday, 14 April 2010 19:05:31 UTC