Re: Meaningless: towards a real-world web semantics observatory from Stéphane Corlosquet on 2013-03-27 (www-tag@w3.org from March 2013)

From: Stéphane Corlosquet <scorlosquet@gmail.com>
Date: Wed, 27 Mar 2013 12:01:53 -0400
To: Christian Bizer <chris@bizer.de>
Cc: nrm@arcanedomain.com, www-tag@w3.org, slightlyoff@google.com, tcelik@mozilla.com, Robert Meusel <robert@informatik.uni-mannheim.de>
Message-ID: <CAGR+nnHLCVqx-HRk3ZUT4QJZ7gfhUBJjkr-zFp0Kv7B6yjDmiQ@mail.gmail.com>
On Wed, Mar 27, 2013 at 11:48 AM, Christian Bizer <chris@bizer.de> wrote:

> Hi Noah and Alex,
>
>
> > Would it be easy/reasonable/in-the-spirit-of-the-thing to extend it
> to start gathering statistics on JSON, XML, various forms of RDF, RDFa, etc.?
>
> Some deployment statistics for RDFa are provided by
> http://www.webdatacommons.org/
>
> The project does not use a browser plug-in for gathering the statistics,
> but extracts RDFa as well as Microdata and Microformats from the
> CommonCrawl, a large publicly available web crawl.
>
> Detailed statistics from the last extraction (Mid 2012, 3 billion HTML
> pages) are found at
>
> http://www.webdatacommons.org/2012-08/stats/stats.html
>
> HTML pages are included in the CommonCrawl based on their PageRank. Thus
> the crawl is likely to cover the popular part of the Web, but coverage gets
> sparse on the less interlinked sites.
>
> As meaningless, with its browser-extension-based data collection approach,
> might also cover such pages, I think it will be interesting to see whether
> the statistics from both projects point in the same direction. As
> far as I can see, they currently do to a large extent :-)
>

One interesting aspect of meaningless not covered by
CommonCrawl/WebDataCommons is that it will include all browsed pages,
including those from the deep web locked behind passwords or firewalls and
typically not accessible to public crawlers like the ones from Google or
CommonCrawl. (Alex, correct me if I'm wrong on that one).

How does meaningless deal with URLs having different markup based on user
variables such as someone being logged in to a service or being connected via
a particular network/VPN? Imagine https://www.facebook.com/ or
https://plus.google.com/ having different markup depending on whether the
user is logged in and who is logged in.

I'm assuming that reloading the page does not skew the numbers and has no
effect on the stats, i.e. the latest reload overrides the previous page
stats.
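
To make that assumption concrete, here is a minimal Python sketch of
"latest reload overrides". It is not meaningless's actual code; the class
name, method names, and example URLs are all hypothetical. Stats are kept
per URL, and recording the same URL again replaces the earlier record
instead of adding to it.

```python
from collections import Counter


class PageStatsStore:
    """Hypothetical store: one stats record per URL, last report wins."""

    def __init__(self):
        self._by_url = {}

    def record(self, url, counts):
        # Overwrite rather than add, so a reload is not counted twice.
        self._by_url[url] = Counter(counts)

    def totals(self):
        # Sum the most recent record of every URL.
        total = Counter()
        for counts in self._by_url.values():
            total.update(counts)
        return total


store = PageStatsStore()
store.record("https://example.org/", {"itemscope": 2, "role": 5})
store.record("https://example.org/", {"itemscope": 3, "role": 5})  # reload
store.record("https://example.org/about", {"role": 1})
print(dict(store.totals()))  # {'itemscope': 3, 'role': 6}
```

Keying on the URL alone is exactly where the logged-in question above
bites: two users with different markup at the same URL would overwrite
each other unless the key also included something like an anonymous
client identifier.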

Steph.


>
> Cheers,
>
> Chris
>
>
> On 27.03.2013 15:47, Noah Mendelsohn wrote:
>
>> (Leaving off most of the cc: list to avoid cross-posted discussion.
>> Nothing sensitive here -- feel free to forward if useful.)
>>
>> This looks very cool. Would it be easy/reasonable/in-the-spirit-of-the-thing
>> to extend it to start gathering statistics on JSON, XML, various forms of RDF,
>> RDFa, etc.? For that matter, it would also be >really< interesting to watch
>> things like content that will be interpreted differently by the HTML5
>> sniffing rules than by following authoritative metadata.
>>
>> In general, you seem to be on a very nice slippery slope of building a
>> dashboard for the Web's data/content encoding. Are you interested in
>> heading further down the slope?
>>
>> Noah
>>
>> On 3/27/2013 9:58 AM, Alex Russell wrote:
>>
>>> Hi all,
>>>
>>> These lists host many debates about the semantics (or lack thereof) of
>>> HTML. Good data that bears on these questions is often hard to come by.
>>> This isn't anyone's fault per se, but it sure would be nice if we had
>>> better data to use as the baseline for discussions about what should (and
>>> shouldn't) be in HTML.next.
>>>
>>> In the interest of building such a corpus, I've created a small extension
>>> to help gather information on the real-world semantics that users encounter
>>> on the web: both semantic HTML and extensions to it like Microformats,
>>> schema.org markup, and ARIA roles and states. Crawlers miss a lot as they
>>> (generally) aren't running scripts and interacting deeply with sites, so
>>> this anonymizing system attempts to fill that gap by observing the semantic
>>> content of pages both when they load and as they change over time.
>>>
>>> Why cross-post this so broadly? Because I need your help! If you think
>>> evolving the web based on data is better than trying to do it without and
>>> you happen to use Chrome as your browser, please install the extension:
>>>
>>> https://chrome.google.com/webstore/detail/meaningless/gmmhpelpfhlofjjolcegdddjadkmincn/details
>>>
>>> If you're a developer and use another browser, I'd love your help in
>>> porting the extension to other platforms (FF, Safari, etc.):
>>>
>>> https://github.com/slightlyoff/meaningless
>>>
>>> If you're interested in the data, a sparse reporting front-end is
>>> currently in place:
>>>
>>> http://meaningless-stats.appspot.com/global
>>>
>>> Help is needed to analyze the data in more meaningful ways, visualize it,
>>> etc. Filing tickets and submitting pull requests is the easiest way to
>>> help: https://github.com/slightlyoff/meaningless/issues
>>>
>>> Thanks for your help and attention.
>>>
>>
>>
>>
>
>


-- 
Steph.
Received on Wednesday, 27 March 2013 16:02:23 UTC