- From: Alex Russell <slightlyoff@google.com>
- Date: Wed, 27 Mar 2013 16:05:03 +0000
- To: Christian Bizer <chris@bizer.de>
- Cc: "nrm@arcanedomain.com" <nrm@arcanedomain.com>, "www-tag@w3.org" <www-tag@w3.org>, "tcelik@mozilla.com" <tcelik@mozilla.com>, Robert Meusel <robert@informatik.uni-mannheim.de>
- Message-ID: <CANr5HFXLNXkRYSYko5=qGLxWwYaE7wd7sMj-TQYm1ntRsd07WQ@mail.gmail.com>
On Wednesday, March 27, 2013, Christian Bizer wrote:

> Hi Noah and Alex,
>
> > Would it be easy/reasonable/in-the-spirit-of-the-thing to extend it to
> > start gathering statistics on JSON, XML, various forms of RDF, RDF-a, etc?
>
> Some deployment statistics for RDFa are provided by
> http://www.webdatacommons.org/

That's exciting data that I wasn't aware of before! Glad to see it exists.
How often are these crawls scheduled?

> The project does not use a browser plug-in for gathering the statistics,
> but extracts RDFa as well as Microdata and Microformats from the
> CommonCrawl, a large publicly available web crawl.

Right, so it's also going to miss anything that JS adds to a page, which is
a shame.

> Detailed statistics from the last extraction (mid-2012, 3 billion HTML
> pages) are found at
>
> http://www.webdatacommons.org/2012-08/stats/stats.html
>
> HTML pages are included in the CommonCrawl based on their PageRank. Thus
> the crawl is likely to cover the popular part of the Web, but things get
> sparse on the less interlinked sites.
>
> As meaningless, with its browser-extension-based data collection approach,
> might also cover such pages, I think it will be interesting to see whether
> the statistics from both projects will point in the same direction. As
> far as I can see, they currently do to a large extent :-)

The stats I'm gathering are very sparse for now. We'll need to get quite a
bit more data before inferring anything from it.

> Cheers,
>
> Chris
>
> On 27.03.2013 15:47, Noah Mendelsohn wrote:
>
>> (Leaving off most of the cc: list to avoid cross-posted discussion.
>> Nothing sensitive here -- feel free to forward if useful.)
>>
>> This looks very cool. Would it be easy/reasonable/in-the-spirit-of-the-thing
>> to extend it to start gathering statistics on JSON, XML, various forms of
>> RDF, RDF-a, etc? For that matter, it would also be >really< interesting to
>> watch things like content that will be interpreted differently by the HTML5
>> sniffing rules than by following authoritative metadata.
>>
>> In general, you seem to be on a very nice slippery slope of building a
>> dashboard for the Web's data/content encoding. Are you interested in
>> heading further down the slope?
>>
>> Noah
>>
>> On 3/27/2013 9:58 AM, Alex Russell wrote:
>>
>>> Hi all,
>>>
>>> These lists host many debates about the semantics (or lack thereof) of
>>> HTML. Good data that bears on these questions is often hard to come by.
>>> This isn't anyone's fault per se, but it sure would be nice if we had
>>> better data to use as the baseline for discussions about what should (and
>>> shouldn't) be in HTML.next.
>>>
>>> In the interest of building such a corpus, I've created a small extension
>>> to help gather information on the real-world semantics that users
>>> encounter on the web: both semantic HTML and extensions to it like
>>> Microformats, schema.org markup, and ARIA roles and states. Crawlers
>>> miss a lot as they (generally) aren't running scripts and interacting
>>> deeply with sites, so this anonymizing system attempts to fill that gap
>>> by observing the semantic content of pages both when they load and as
>>> they change over time.
>>>
>>> Why cross-post this so broadly? Because I need your help! If you think
>>> evolving the web based on data is better than trying to do it without and
>>> you happen to use Chrome as your browser, please install the extension:
>>>
>>> https://chrome.google.com/webstore/detail/meaningless/gmmhpelpfhlofjjolcegdddjadkmincn/details
>>>
>>> If you're a developer and use another browser, I'd love your help in
>>> porting the extension to other platforms (FF, Safari, etc.):
>>>
>>> https://github.com/slightlyoff/meaningless
>>>
>>> If you're interested in the data, a sparse reporting front-end is
>>> currently in place:
>>>
>>> http://meaningless-stats.appspot.com/global
>>>
>>> Help is needed to analyze the data in more meaningful ways, visualize it,
>>> etc. Filing tickets and submitting pull requests is the easiest way to
>>> help: https://github.com/slightlyoff/meaningless/issues
>>>
>>> Thanks for your help and attention.
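
To make the crawl-versus-extension gap discussed in this thread concrete, here is a minimal sketch (not code from either project) of what a static, crawl-style extractor can see. It uses Python's standard html.parser to count marker attributes in raw HTML. The MARKERS table, the MarkupCounter class, and count_markup are hypothetical names chosen for illustration, and the attribute list is an assumption, not the detection rules of Web Data Commons or of the meaningless extension.

```python
from collections import Counter
from html.parser import HTMLParser

# Attribute names treated as markers of each syntax. This mapping is an
# illustrative assumption, not the rule set used by either project.
MARKERS = {
    "itemscope": "microdata",
    "itemprop": "microdata",
    "typeof": "rdfa",
    "property": "rdfa",
    "role": "aria",
}


class MarkupCounter(HTMLParser):
    """Counts marker attributes in static HTML; scripts are never executed."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names.
        for name, _value in attrs:
            kind = MARKERS.get(name)
            if kind:
                self.counts[kind] += 1


def count_markup(html):
    """Return a Counter of syntax kinds found in the raw HTML text."""
    parser = MarkupCounter()
    parser.feed(html)
    parser.close()
    return parser.counts
```

Because the parser treats the body of a script element as opaque text, any markup that a script would add at run time is not counted. That is the blind spot a browser-side observer is meant to cover.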
Received on Wednesday, 27 March 2013 16:05:37 UTC