Re: Meaningless: towards a real-world web semantics observatory from Alex Russell on 2013-03-27 (www-tag@w3.org from March 2013)

From: Alex Russell <slightlyoff@google.com>
Date: Wed, 27 Mar 2013 16:05:03 +0000
To: Christian Bizer <chris@bizer.de>
Cc: "nrm@arcanedomain.com" <nrm@arcanedomain.com>, "www-tag@w3.org" <www-tag@w3.org>, "tcelik@mozilla.com" <tcelik@mozilla.com>, Robert Meusel <robert@informatik.uni-mannheim.de>
Message-ID: <CANr5HFXLNXkRYSYko5=qGLxWwYaE7wd7sMj-TQYm1ntRsd07WQ@mail.gmail.com>

On Wednesday, March 27, 2013, Christian Bizer wrote:

> Hi Noah and Alex,
>
> > Would it be easy/reasonable/in-the-spirit-of-the-thing to extend it to
> > start gathering statistics on JSON, XML, various forms of RDF, RDF-a, etc?
>
> Some deployment statistics for RDFa are provided by
> http://www.webdatacommons.org/


That's exciting data that I wasn't aware of before! Glad to see it exists.
How often are these crawls scheduled?


> The project does not use a browser plug-in for gathering the statistics,
> but extracts RDFa as well as Microdata and Microformats from the
> CommonCrawl, a large publicly available web crawl.
>

Right, so it's also going to miss anything that JS adds to a page, which is
a shame.


> Detailed statistics from the last extraction (Mid 2012, 3 billion HTML
> pages) are found at
>
> http://www.webdatacommons.org/2012-08/stats/stats.html
>
> HTML pages are included in the CommonCrawl based on their PageRank. Thus
> the crawl is likely to cover the popular part of the Web, but things get
> sparse on the less interlinked sites.
>
> As meaningless, with its browser-extension-based data collection approach,
> might also cover such pages, I think it will be interesting to see whether
> the statistics from both projects point in the same direction. As
> far as I can see, they currently do to a large extent :-)
>

The stats I'm gathering are very sparse for now. We'll need to get quite a
bit more data before inferring anything from it.


> Cheers,
>
> Chris
>
>
> Am 27.03.2013 15:47, schrieb Noah Mendelsohn:
>
>> (Leaving off most of the cc: list to avoid cross-posted discussion.
>> Nothing sensitive here -- feel free to forward if useful.)
>>
>> This looks very cool. Would it be easy/reasonable/in-the-spirit-of-the-thing
>> to extend it to start gathering statistics on JSON, XML, various forms of RDF,
>> RDF-a, etc? For that matter, it would also be >really< interesting to watch
>> things like content that will be interpreted differently by the HTML5
>> sniffing rules than by following authoritative metadata.
>>
>> In general, you seem to be on a very nice slippery slope of building a
>> dashboard for the Web's data/content encoding. Are you interested in
>> heading further down the slope?
>>
>> Noah
>>
>> On 3/27/2013 9:58 AM, Alex Russell wrote:
>>
>>> Hi all,
>>>
>>> These lists host many debates about the semantics (or lack thereof) of
>>> HTML. Good data that bears on these questions is often hard to come by.
>>> This isn't anyone's fault per se, but it sure would be nice if we had
>>> better data to use as the baseline for discussions about what should (and
>>> shouldn't) be in HTML.next.
>>>
>>> In the interest of building such a corpus, I've created a small extension
>>> to help gather information on the real-world semantics that users
>>> encounter on the web: both semantic HTML and extensions to it like
>>> Microformats, schema.org markup, and ARIA roles and states. Crawlers
>>> miss a lot as they (generally) aren't running scripts and interacting
>>> deeply with sites, so this anonymizing system attempts to fill that gap
>>> by observing the semantic content of pages both when they load and as
>>> they change over time.
>>>
>>> Why cross-post this so broadly? Because I need your help! If you think
>>> evolving the web based on data is better than trying to do it without and
>>> you happen to use Chrome as your browser, please install the extension:
>>>
>>> https://chrome.google.com/webstore/detail/meaningless/gmmhpelpfhlofjjolcegdddjadkmincn/details
>>>
>>> If you're a developer and use another browser, I'd love your help in
>>> porting the extension to other platforms (FF, Safari, etc.):
>>>
>>> https://github.com/slightlyoff/meaningless
>>>
>>> If you're interested in the data, a sparse reporting front-end is
>>> currently in place:
>>>
>>> http://meaningless-stats.appspot.com/global
>>>
>>> Help is needed to analyze the data in more meaningful ways, visualize it,
>>> etc. Filing tickets and submitting pull requests are the easiest ways to
>>> help: https://github.com/slightlyoff/meaningless/issues
>>>
>>> Thanks for your help and attention.
>>>
>>
>>
>>
>

Received on Wednesday, 27 March 2013 16:05:37 UTC