- From: Alex Russell <slightlyoff@google.com>
- Date: Wed, 27 Mar 2013 16:07:54 +0000
- To: Stéphane Corlosquet <scorlosquet@gmail.com>
- Cc: Christian Bizer <chris@bizer.de>, "nrm@arcanedomain.com" <nrm@arcanedomain.com>, "www-tag@w3.org" <www-tag@w3.org>, "tcelik@mozilla.com" <tcelik@mozilla.com>, Robert Meusel <robert@informatik.uni-mannheim.de>
- Message-ID: <CANr5HFUYT=RHT7hQ5b5enY91AW02MZoeO5Kguf+=556_MDL2_Q@mail.gmail.com>
On Wednesday, March 27, 2013, Stéphane Corlosquet wrote:

> On Wed, Mar 27, 2013 at 11:48 AM, Christian Bizer <chris@bizer.de> wrote:
>
>> Hi Noah and Alex,
>>
>> > Would it be easy/reasonable/in-the-spirit-of-the-thing to extend it
>> > to start gathering statistics on JSON, XML, various forms of RDF,
>> > RDF-a, etc?
>>
>> Some deployment statistics for RDFa are provided by
>> http://www.webdatacommons.org/
>>
>> The project does not use a browser plug-in for gathering the statistics,
>> but extracts RDFa as well as Microdata and Microformats from the
>> CommonCrawl, a large publicly available web crawl.
>>
>> Detailed statistics from the last extraction (mid-2012, 3 billion HTML
>> pages) are found at
>>
>> http://www.webdatacommons.org/2012-08/stats/stats.html
>>
>> HTML pages are included in the CommonCrawl based on their PageRank.
>> Thus the crawl is likely to cover the popular part of the Web, but things
>> get sparse on the less interlinked sites.
>>
>> As meaningless, with its browser-extension-based data collection approach,
>> might also cover such pages, I think it will be interesting to see whether
>> the statistics from both projects will point in the same direction. As
>> far as I can see, they currently do to a large extent :-)
>
> One interesting aspect of meaningless not covered by
> CommonCrawl/WebDataCommons is that it will include all browsed pages,
> including those from the deep web locked behind passwords or firewalls and
> typically not accessible to public crawlers like the ones from Google or
> CommonCrawl. (Alex, correct me if I'm wrong on that one).

Nope, that's correct. It's why ensuring that the data is entirely anonymous
(and IPs are not logged, etc.) is such a high priority in the code.

> How does meaningless deal with URLs having different markup based on user
> variables such as someone being logged in to a service or being connected
> via a particular network/VPN? Imagine https://www.facebook.com/ or
> https://plus.google.com/ having different markup depending on whether the
> user is logged in and who is logged in.

It doesn't care. It's not logging URLs or any other sensitive data, only
characterizing the total # of elements, their types, and the ad-hoc metadata
that gets added to the elements.

> I'm assuming that reloading the page does not skew the numbers and has no
> effect on the stats, i.e. the latest reload overrides the previous page
> stats.

Nope, reloading simply adds those elements to the count. The goal isn't to
disambiguate by URL or anything like that; it's to get a sense for which
semantics are most in use, and to do that, observing counts seems
appropriate. If that's not correct for some reason, let me know = )

> Steph.
>
>> Cheers,
>>
>> Chris
>>
>> On 27.03.2013 15:47, Noah Mendelsohn wrote:
>>
>>> (Leaving off most of the cc: list to avoid cross-posted discussion.
>>> Nothing sensitive here -- feel free to forward if useful.)
>>>
>>> This looks very cool. Would it be
>>> easy/reasonable/in-the-spirit-of-the-thing to extend it to start
>>> gathering statistics on JSON, XML, various forms of RDF, RDF-a, etc?
>>> For that matter, it would also be >really< interesting to watch
>>> things like content that will be interpreted differently by the HTML5
>>> sniffing rules than by following authoritative metadata.
>>>
>>> In general, you seem to be on a very nice slippery slope of building a
>>> dashboard for the Web's data/content encoding. Are you interested in
>>> heading further down the slope?
>>>
>>> Noah
>>>
>>> On 3/27/2013 9:58 AM, Alex Russell wrote:
>>>
>>>> Hi all,
>>>>
>>>> These lists host many debates about the semantics (or lack thereof) of
>>>> HTML.
>>>> Good data that bears on these questions is often hard to come by.
>>>> This isn't anyone's fault per se, but it sure would be nice if we had
>>>> better data to use as the baseline for discussions about what should
>>>> (and shouldn't) be in HTML.next.
>>>>
>>>> In the interest of building such a corpus, I've created a small
>>>> extension to help gather information on the real-world semantics that
>>>> users encounter on the web: both semantic HTML and extensions to it
>>>> like Microformats, schema.org markup, and ARIA roles and states.
>>>> Crawlers miss a lot as they (generally) aren't running scripts and
>>>> interacting deeply with sites, so this anonymizing system attempts to
>>>> fill that gap by observing the semantic content of pages both when
>>>> they load and as they change over time.
>>>>
>>>> Why cross-post this so broadly? Because I need your help! If you think
>>>> evolving the web based on data is better than trying to do it without,
>>>> and you happen to use Chrome as your browser, please install the
>>>> extension:
>>>>
>>>> https://chrome.google.com/webstore/detail/meaningless/gmmhpelpfhlofjjolcegdddjadkmincn/details
>>>>
>>>> If you're a developer and use another browser, I'd love your help in
>>>> porting the extension to other platforms (FF, Safari, etc.):
>>>>
>>>> https://github.com/slightlyoff/meaningless
>>>>
>>>> If you're interested in the data, a sparse reporting front-end is
>>>> currently in place:
>>>>
>>>> http://meaningless-stats.appspot.com/global
>>>>
>>>> Help is needed to analyze the data in more meaningful ways, visualize
>>>> it, etc.
>>>> Filing tickets and submitting pull requests is the easiest way to
>>>> help: https://github.com/slightlyoff/meaningless/issues
>>>>
>>>> Thanks for your help and attention.
>
> --
> Steph.
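
A minimal sketch of the counting behaviour described in the reply above: elements are tallied by type and by ad-hoc metadata, no URL is kept, and a reload simply adds to the totals. This is not the extension's actual code. The `ObservedElement` shape, the `tally` and `bump` functions, and the key format are all assumptions made for illustration, and only `role` and `itemtype` are sampled as metadata.

```typescript
// Hypothetical shape for an observed element: tag name plus attributes only.
// No URL, text content, or user identifier is part of the record.
interface ObservedElement {
  tag: string;
  attrs: Record<string, string>;
}

type Counts = Map<string, number>;

// Increment the running total for one key.
function bump(counts: Counts, key: string): void {
  counts.set(key, (counts.get(key) ?? 0) + 1);
}

// Add one page view's elements to the running totals.
// Nothing is keyed by page, so repeated views accumulate.
function tally(counts: Counts, elements: ObservedElement[]): Counts {
  for (const el of elements) {
    bump(counts, `element:${el.tag.toLowerCase()}`);
    const role = el.attrs["role"];
    if (role !== undefined) bump(counts, `role:${role}`);
    const itemtype = el.attrs["itemtype"];
    if (itemtype !== undefined) bump(counts, `itemtype:${itemtype}`);
  }
  return counts;
}

// Illustrative data standing in for one page's elements.
const page: ObservedElement[] = [
  { tag: "NAV", attrs: { role: "navigation" } },
  { tag: "div", attrs: { itemtype: "http://schema.org/Person" } },
  { tag: "div", attrs: {} },
];

const totals: Counts = new Map();
tally(totals, page); // first load
tally(totals, page); // reload: the same elements are counted again
```

Because the totals are not keyed by URL, there is nothing for a reload to override; the second call to `tally` doubles every count.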
Received on Wednesday, 27 March 2013 16:08:27 UTC