Re: Meaningless: towards a real-world web semantics observatory

Hi Noah and Alex,

 > Would it be easy/reasonable/in-the-spirit-of-the-thing to extend it to
 > start gathering statistics on JSON, XML, various forms of RDF, RDF-a, etc?

Some deployment statistics for RDFa are provided by 
http://www.webdatacommons.org/

The project does not use a browser plug-in for gathering the statistics, 
but extracts RDFa as well as Microdata and Microformats from the 
CommonCrawl, a large publicly available web crawl.
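
To make the idea concrete, here is a minimal Python sketch of counting 
how many pages in a set carry each markup format. It is not the Web Data 
Commons extraction code; the function names and the short marker lists 
are assumptions chosen for illustration, and a real extractor parses the 
embedded data itself rather than only detecting its presence.

```python
from collections import Counter
from html.parser import HTMLParser

# Simplified, hypothetical marker lists (assumptions for illustration only).
MICRODATA_ATTRS = {"itemscope", "itemtype", "itemprop"}
RDFA_ATTRS = {"typeof", "property", "vocab", "resource"}
MICROFORMAT_CLASSES = {"vcard", "vevent", "hentry", "h-card", "h-entry", "h-event"}


class FormatDetector(HTMLParser):
    """Records which markup formats appear anywhere in one HTML page."""

    def __init__(self):
        super().__init__()
        self.found = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names; valueless attributes have value None.
        names = {name for name, _ in attrs}
        if names & MICRODATA_ATTRS:
            self.found.add("microdata")
        if names & RDFA_ATTRS:
            self.found.add("rdfa")
        for name, value in attrs:
            if name == "class" and value and set(value.split()) & MICROFORMAT_CLASSES:
                self.found.add("microformats")


def detect_formats(html):
    """Return the set of format names detected in a single page."""
    detector = FormatDetector()
    detector.feed(html)
    detector.close()
    return detector.found


def count_formats(pages):
    """Count, per format, the number of pages in which it was detected."""
    counts = Counter()
    for html in pages:
        counts.update(detect_formats(html))
    return counts
```

Calling count_formats() on an iterable of HTML strings returns a Counter 
keyed by format name, so each page contributes at most one count per 
format regardless of how many annotated elements it contains.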

Detailed statistics from the last extraction (mid-2012, 3 billion HTML 
pages) can be found at

http://www.webdatacommons.org/2012-08/stats/stats.html

HTML pages are included in the CommonCrawl based on their PageRank. 
Thus the crawl is likely to cover the popular part of the Web, but 
coverage gets sparse on the less interlinked sites.

As Meaningless, with its browser-extension-based data collection 
approach, might also cover such pages, I think it will be interesting to 
see whether the statistics from both projects point in the same 
direction. As far as I can see, they currently do to a large extent :-)

Cheers,

Chris


On 27.03.2013 15:47, Noah Mendelsohn wrote:
> (Leaving off most of the cc: list to avoid cross-posted discussion. 
> Nothing sensitive here -- feel free to forward if useful.)
>
> This looks very cool. Would it be 
> easy/reasonable/in-the-spirit-of-the-thing to extend it to start 
> gathering statistics on JSON, XML, various forms of RDF, RDF-a, etc? 
> For that matter, it would also be >really< interesting to watch things 
> like content that will be interpreted differently by the HTML5 
> sniffing rules than by following authoritative metadata.
>
> In general, you seem to be on a very nice slippery slope of building a 
> dashboard for the Web's data/content encoding. Are you interested in 
> heading further down the slope?
>
> Noah
>
> On 3/27/2013 9:58 AM, Alex Russell wrote:
>> Hi all,
>>
>> These lists host many debates about the semantics (or lack thereof) of
>> HTML. Good data that bears on these questions is often hard to come by.
>> This isn't anyone's fault per se, but it sure would be nice if we had
>> better data to use as the baseline for discussions about what should
>> (and shouldn't) be in HTML.next.
>>
>> In the interest of building such a corpus, I've created a small
>> extension to help gather information on the real-world semantics that
>> users encounter on the web: both semantic HTML and extensions to it
>> like Microformats, schema.org <http://schema.org> markup, and ARIA
>> roles and states. Crawlers miss a lot as they (generally) aren't
>> running scripts and interacting deeply with sites, so this anonymizing
>> system attempts to fill that gap by observing the semantic content of
>> pages both when they load and as they change over time.
>>
>> Why cross-post this so broadly? Because I need your help! If you think
>> evolving the web based on data is better than trying to do it without
>> and you happen to use Chrome as your browser, please install the
>> extension:
>>
>> https://chrome.google.com/webstore/detail/meaningless/gmmhpelpfhlofjjolcegdddjadkmincn/details 
>>
>>
>> If you're a developer and use another browser, I'd love your help in
>> porting the extension to other platforms (FF, Safari, etc.):
>>
>> https://github.com/slightlyoff/meaningless
>>
>> If you're interested in the data, a sparse reporting front-end is
>> currently in place:
>>
>> http://meaningless-stats.appspot.com/global
>>
>> Help is needed to analyze the data in more meaningful ways, visualize
>> it, etc. Filing tickets and submitting pull requests is the easiest
>> way to help: https://github.com/slightlyoff/meaningless/issues
>>
>> Thanks for your help and attention.
>
>

Received on Wednesday, 27 March 2013 15:48:47 UTC