Re: Meaningless: towards a real-world web semantics observatory from Stéphane Corlosquet on 2013-03-27 (www-tag@w3.org from March 2013)

From: Stéphane Corlosquet <scorlosquet@gmail.com>
Date: Wed, 27 Mar 2013 12:15:52 -0400
To: Alex Russell <slightlyoff@google.com>
Cc: Christian Bizer <chris@bizer.de>, "nrm@arcanedomain.com" <nrm@arcanedomain.com>, "www-tag@w3.org" <www-tag@w3.org>, "tcelik@mozilla.com" <tcelik@mozilla.com>, Robert Meusel <robert@informatik.uni-mannheim.de>
Message-ID: <CAGR+nnHbn-Uf2uSn_W9VzUHsC42__ZygkOpmcak=xJqm2aAeWw@mail.gmail.com>
On Wed, Mar 27, 2013 at 12:07 PM, Alex Russell <slightlyoff@google.com> wrote:

>
>
> On Wednesday, March 27, 2013, Stéphane Corlosquet wrote:
>
>>
>>
>>
>> On Wed, Mar 27, 2013 at 11:48 AM, Christian Bizer <chris@bizer.de> wrote:
>>
>>> Hi Noah and Alex,
>>>
>>>
>>> > Would it be easy/reasonable/in-the-spirit-of-the-thing to extend it
>>> to start gathering statistics on JSON, XML, various forms of RDF, RDF-a, etc?
>>>
>>> Some deployment statistics for RDFa are provided by
>>> http://www.webdatacommons.org/
>>>
>>> The project does not use a browser plug-in for gathering the statistics,
>>> but extracts RDFa as well as Microdata and Microformats from the
>>> CommonCrawl, a large publicly available web crawl.
>>>
>>> Detailed statistics from the last extraction (Mid 2012, 3 billion HTML
>>> pages) are found at
>>>
>>> http://www.webdatacommons.org/2012-08/stats/stats.html
>>>
>>> HTML pages are included in the CommonCrawl based on their PageRank.
>>> Thus the crawl is likely to cover the popular part of the Web, but things
>>> get sparse on the less interlinked sites.
>>>
>>> As meaningless, with its browser-extension-based data collection
>>> approach, might also cover such pages, I think it will be interesting to
>>> see whether the statistics from both projects point in the same
>>> direction. As far as I can see, they currently do to a large extent :-)
>>>
>>
>> One interesting aspect of meaningless not covered by
>> CommonCrawl/WebDataCommons is that it will include all browsed pages,
>> including those from the deep web locked behind passwords or firewalls and
>> typically not accessible to public crawlers like the ones from Google or
>> CommonCrawl. (Alex, correct me if I'm wrong on that one).
>>
>
> Nope, that's correct. It's why ensuring that the data is entirely
> anonymous (and IPs are not logged, etc.) is such a high priority in the
> code.
>
>
>> How does meaningless deal with URLs having different markup based on user
>> variables such as someone being logged in to a service or being connected via
>> a particular network/VPN? Imagine https://www.facebook.com/ or
>> https://plus.google.com/ having different markup depending on whether
>> the user is logged in and who is logged in.
>>
>
> It doesn't care. It's not logging URLs or any other sensitive data,
> only characterizing the total # of elements, their types, and the ad-hoc
> metadata that gets added to the elements.
>
>
>> I'm assuming that reloading the page does not skew the numbers and has no
>> effect on the stats, i.e. the latest reload overrides the previous page
>> stats.
>>
>
> Nope, reloading simply adds those elements to the count. The goal isn't to
> disambiguate by URL or anything like that, it's to get a sense for which
> semantics are most in use, and to do that, observing counts seems
> appropriate. If that's not correct for some reason, let me know = )
>

That sounds like a drawback to me. Someone could intentionally skew the stats
by reloading some page (manually or automatically). What about those sites
that reload their homepage every few seconds? Imagine if you left that page
open in a tab somewhere in the background. I guess one could argue that
someone malicious could also simply hack the extension locally to send
false stats to your reporting service...
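
To make the difference concrete, here is a minimal sketch of the two counting
policies discussed above. It is not taken from the meaningless code base: the
function names, the report shape and the "page key" are all hypothetical, and
since the extension is described as not logging URLs, the second policy would
need some local-only page identifier that the real extension may not have.

```python
from collections import Counter

def additive_totals(reports):
    """Sum every report, so a reloaded page is counted once per load."""
    totals = Counter()
    for _page_key, counts in reports:
        totals.update(counts)  # Counter.update adds counts together
    return totals

def latest_only_totals(reports):
    """Keep only the most recent report per (hypothetical) page key, then sum."""
    latest = {}
    for page_key, counts in reports:
        latest[page_key] = counts  # a later reload overrides the earlier one
    totals = Counter()
    for counts in latest.values():
        totals.update(counts)
    return totals

# Hypothetical reports: (page key, element counts observed on that load).
reports = [
    ("page-a", {"article": 2, "nav": 1}),
    ("page-a", {"article": 2, "nav": 1}),  # same page reloaded
    ("page-b", {"article": 1}),
]

print(additive_totals(reports))
print(latest_only_totals(reports))
```

With the additive policy the reloaded page contributes twice, which is the
skew I am worried about; with the latest-only policy it contributes once.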

Steph.


>
>
>> Steph.
>>
>>
>>>
>>> Cheers,
>>>
>>> Chris
>>>
>>>
>>> On 27.03.2013 15:47, Noah Mendelsohn wrote:
>>>
>>>> (Leaving off most of the cc: list to avoid cross-posted discussion.
>>>> Nothing sensitive here -- feel free to forward if useful.)
>>>>
>>>> This looks very cool. Would it be easy/reasonable/in-the-spirit-of-the-thing
>>>> to extend it to start gathering statistics on JSON, XML, various forms of RDF,
>>>> RDF-a, etc? For that matter, it would also be >really< interesting to watch
>>>> things like content that will be interpreted differently by the HTML5
>>>> sniffing rules than by following authoritative metadata.
>>>>
>>>> In general, you seem to be on a very nice slippery slope of building a
>>>> dashboard for the Web's data/content encoding. Are you interested in
>>>> heading further down the slope?
>>>>
>>>> Noah
>>>>
>>>> On 3/27/2013 9:58 AM, Alex Russell wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> These lists host many debates about the semantics (or lack thereof) of
>>>>> HTML. Good data that bears on these questions is often hard to come by.
>>>>> This isn't anyone's fault per se, but it sure would be nice if we had
>>>>> better data to use as the baseline for discussions about what should
>>>>> (and shouldn't) be in HTML.next.
>>>>>
>>>>> In the interest of building such a corpus, I've created a small
>>>>> extension to help gather information on the real-world semantics that
>>>>> users encounter on the web: both semantic HTML and extensions to it
>>>>> like Microformats, schema.org <http://schema.org> markup, and ARIA
>>>>> roles and states. Crawlers miss a lot as they (generally) aren't
>>>>> running scripts and interacting deeply with sites, so this anonymizing
>>>>> system attempts to fill that gap by observing the semantic content of
>>>>> pages both when they load and as they change over time.
>>>>>
>>>>> Why cross-post this so broadly? Because I need your help! If you think
>>>>> evolving the web based on data is better than trying to do it without,
>>>>> and you happen to use Chrome as your browser, please install the
>>>>> extension:
>>>>>
>>>>> https://chrome.google.com/webstore/detail/meaningless/gmmhpelpfhlofjjolcegdddjadkmincn/details
>>>>>
>>>>> If you're a developer and use another browser, I'd love your help in
>>>>> porting the extension to other platforms (FF, Safari, etc.):
>>>>>
>>>>> https://github.com/slightlyoff/meaningless
>>>>>
>>>>> If you're interested in the data, a sparse reporting front-end is
>>>>> currently in place:
>>>>>
>>>>> http://meaningless-stats.appspot.com/global
>>>>>
>>>>> Help is needed to analyze the data in more meaningful ways, visualize
>>>>> it, etc. Filing tickets and submitting pull requests is the easiest
>>>>> way to help: https://github.com/slightlyoff/meaningless/issues
>>>>>
>>>>> Thanks for your help and attention.
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>> --
>> Steph.
>>
>


-- 
Steph.
Received on Wednesday, 27 March 2013 16:16:24 UTC