Contd: Size matters -- How big is the danged thing

Hello everyone,

 Sorry for joining this discussion late, but I think the issue here is
similar to that of the "deep" (or hidden/invisible) web [1] as opposed
to the "surface" web (pages indexed by search engines). Regardless of
the copyright and license issues, the concern is not the value of data
sources generated by wrappers; in fact, a wrapper script may produce a
much higher-quality data source than a huge static linked data set.
However, just as there is a clear number for the web pages indexed by
search engines (e.g., 11.5 billion as of January 2005 [2]) and an
estimate for the deep web (e.g., 550 billion as of 2000 [1]), I believe
we should have an exact number for those linked data sources that can
be counted, and an estimate for the rest that cannot be counted (like
MySpace or the RDF Book Mashup). That way, no one can question the
accuracy of the numbers provided. And it does make sense for the
"deep" linked data web to have a much higher number (any better
suggestions for a new term here!?).

 I also think the DataSets wiki page [3] is getting too crowded with
the new table; I suggest moving the table to the bottom or creating a
new page.

My best regards,
Oktie

[1] http://en.wikipedia.org/wiki/Deep_web
[2] http://en.wikipedia.org/wiki/Surface_Web
[3] http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSets



On Fri, Nov 21, 2008 at 5:22 PM, David Wood <david@zepheira.com> wrote:
>
> Sorry to intervene here, but I think Kingsley's suggestion sets up a false
> dichotomy. REST principles (surely part of everything we stand for :) suggest
> that the source of RDF doesn't matter as long as a URL returns what we want.
> Late binding means not having to say you're sorry.
>
> Is it a good idea to set up a class system where those who publish to files
> are somehow better (or even different!) than those who publish via adapters?
>
> So, I vote for counting all of it. Isn't that what Google and Yahoo do when
> they count the number of "pages" indexed?
>
> Regards,
> Dave
> --
>
> On Nov 21, 2008, at 4:26 PM, Kingsley Idehen <kidehen@openlinksw.com> wrote:
>
>>
>> Giovanni Tummarello wrote:
>>>>
>>>> Overall, that's about 17 billion.
>>>>
>>>>
>>>
>>> IMO, counting MySpace's 12 billion triples as part of LOD is quite a
>>> stretch (same with other wrappers), unless they are provided by the
>>> entity itself. I WOULD, on the other hand, count the LiveJournal FOAF
>>> files: OK, they're not linked, but they're no less useful than the
>>> MySpace wrapper, are they? (In fact, they are linked quite well if
>>> you use the Google Social API.)
>>>
>>>
>>> Giovanni
>>>
>>>
>>>
>> Giovanni,
>>
>> Maybe we should use the following dichotomy re. the Web of Linked Data
>> (aka the Linked Data Web):
>>
>> 1. Static Linked Data, or Linked Data Warehouses -- which is really what
>> the LOD corpus is about.
>> 2. Dynamic Linked Data -- which is what RDF-ization middleware (including
>> wrapper/proxy URI generators) is about.
>>
>> Thus, I would say that Jim is currently seeking stats for the Linked
>> Data Warehouse part of the burgeoning Linked Data Web. And hopefully,
>> once we have the stats, we can get on to the more important task of
>> explaining and demonstrating the utility of this humongous Linked Data
>> corpus :-)
>>
>> The ESW Wiki should be evolving as I write this mail (i.e., the
>> tabulated presentation of the data that's already in place re. this
>> matter).
>>
>>
>> All: Could we please stop the .png- and .pdf-based dispatches of data?
>> It kinda contradicts everything we stand for :-)
>>
>> --
>>
>>
>> Regards,
>>
>> Kingsley Idehen          Weblog: http://www.openlinksw.com/blog/~kidehen
>> President & CEO OpenLink Software     Web: http://www.openlinksw.com
>>
>>
>>
>>
>>
>
>

Received on Tuesday, 25 November 2008 20:27:24 UTC