Re: Meaningless: towards a real-world web semantics observatory

On Wednesday, March 27, 2013, Stéphane Corlosquet wrote:

>
>
>
> On Wed, Mar 27, 2013 at 11:48 AM, Christian Bizer <chris@bizer.de> wrote:
>
>> Hi Noah and Alex,
>>
>>
>> > Would it be easy/reasonable/in-the-spirit-of-the-thing to extend it to
>> start gathering statistics on JSON, XML, various forms of RDF, RDFa, etc.?
>>
>> Some deployment statistics for RDFa are provided by
>> http://www.webdatacommons.org/
>>
>> The project does not use a browser plug-in for gathering the statistics,
>> but extracts RDFa as well as Microdata and Microformats from the
>> CommonCrawl, a large publicly available web crawl.
>>
>> Detailed statistics from the last extraction (Mid 2012, 3 billion HTML
>> pages) are found at
>>
>> http://www.webdatacommons.org/2012-08/stats/stats.html
>>
>> HTML pages are included in the CommonCrawl based on their PageRank.
>> Thus the crawl is likely to cover the popular part of the Web, but things
>> get sparse on less interlinked sites.
>>
>> As meaningless, with its browser-extension-based data collection
>> approach, might also cover such pages, I think it will be interesting to
>> see whether the statistics from both projects point in the same
>> direction. As far as I can see, they currently do to a large extent :-)
>>
>
> One interesting aspect of meaningless not covered by
> CommonCrawl/WebDataCommons is that it will include all browsed pages,
> including those from the deep web locked behind passwords or firewalls and
> typically not accessible to public crawlers like the ones from Google or
> CommonCrawl. (Alex, correct me if I'm wrong on that one).
>

Nope, that's correct. It's why ensuring that the data is entirely anonymous
(and IPs are not logged, etc.) is such a high priority in the code.


> How does meaningless deal with URLs having different markup based on user
> variables such as someone being logged in to a service or being connected via
> a particular network/VPN? Imagine https://www.facebook.com/ or
> https://plus.google.com/ having different markup depending on whether the
> user is logged in and who is logged in.
>

It doesn't care. It's not logging URLs or any other sensitive data,
only characterizing the total # of elements, their types, and the ad-hoc
metadata that gets added to the elements.
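For readers curious what "characterizing the total # of elements, their types, and the ad-hoc metadata" could look like in practice, here is a rough sketch (not the extension's actual code; the function name, the attribute list, and the descriptor shape are all illustrative) of tallying tag names and semantic attributes from a page snapshot while capturing no URLs or text content:

```javascript
// Illustrative sketch only -- not meaningless' real implementation.
// Given a flat list of element descriptors (in a real extension these
// would be derived from document.querySelectorAll('*')), tally tag
// names and the "ad-hoc" semantic attributes of interest. Note that
// no URLs, attribute *values*, or text content are recorded.
function tallySemantics(elements) {
  var semanticAttrs = ['role', 'itemtype', 'itemprop', 'rel'];
  var counts = { tags: {}, attrs: {} };
  elements.forEach(function (el) {
    var tag = el.tagName.toLowerCase();
    counts.tags[tag] = (counts.tags[tag] || 0) + 1;
    semanticAttrs.forEach(function (name) {
      if (el.attributes && name in el.attributes) {
        counts.attrs[name] = (counts.attrs[name] || 0) + 1;
      }
    });
  });
  return counts;
}

// Example snapshot of three elements:
var snapshot = [
  { tagName: 'DIV', attributes: { itemtype: 'http://schema.org/Person' } },
  { tagName: 'NAV', attributes: { role: 'navigation' } },
  { tagName: 'DIV', attributes: {} }
];
var result = tallySemantics(snapshot);
// result.tags  => { div: 2, nav: 1 }
// result.attrs => { itemtype: 1, role: 1 }
```

Because only aggregate counts like these leave the browser, two users visiting the same logged-in page contribute indistinguishable, anonymous tallies.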


> I'm assuming that reloading the page does not skew the numbers and has no
> effect on the stats, i.e. the latest reload overrides the previous page
> stats.
>

Nope, reloading simply adds those elements to the count. The goal isn't to
disambiguate by URL or anything like that; it's to get a sense of which
semantics are most in use, and for that, observing counts seems
appropriate. If that's not correct for some reason, let me know = )


> Steph.
>
>
>>
>> Cheers,
>>
>> Chris
>>
>>
>> Am 27.03.2013 15:47, schrieb Noah Mendelsohn:
>>
>>> (Leaving off most of the cc: list to avoid cross-posted discussion.
>>> Nothing sensitive here -- feel free to forward if useful.)
>>>
>>> This looks very cool. Would it be easy/reasonable/in-the-spirit-of-the-thing
>>> to extend it to start gathering statistics on JSON, XML, various forms of
>>> RDF, RDFa, etc.? For that matter, it would also be >really< interesting to
>>> watch things like content that will be interpreted differently by the HTML5
>>> sniffing rules than by following authoritative metadata.
>>>
>>> In general, you seem to be on a very nice slippery slope of building a
>>> dashboard for the Web's data/content encoding. Are you interested in
>>> heading further down the slope?
>>>
>>> Noah
>>>
>>> On 3/27/2013 9:58 AM, Alex Russell wrote:
>>>
>>>> Hi all,
>>>>
>>>> These lists host many debates about the semantics (or lack thereof) of
>>>> HTML. Good data that bears on these questions is often hard to come by.
>>>> This isn't anyone's fault per se, but it sure would be nice if we had
>>>> better data to use as the baseline for discussions about what should
>>>> (and
>>>> shouldn't) be in HTML.next.
>>>>
>>>> In the interest of building such a corpus, I've created a small
>>>> extension to help gather information on the real-world semantics that
>>>> users encounter on the web: both semantic HTML and extensions to it
>>>> like Microformats, schema.org markup, and ARIA roles and states.
>>>> Crawlers miss a lot as they (generally) aren't running scripts and
>>>> interacting deeply with sites, so this anonymizing system attempts to
>>>> fill that gap by observing the semantic content of pages both when they
>>>> load and as they change over time.
>>>>
>>>> Why cross-post this so broadly? Because I need your help! If you think
>>>> evolving the web based on data is better than trying to do it without,
>>>> and you happen to use Chrome as your browser, please install the
>>>> extension:
>>>>
>>>> https://chrome.google.com/webstore/detail/meaningless/gmmhpelpfhlofjjolcegdddjadkmincn/details
>>>>
>>>> If you're a developer and use another browser, I'd love your help in
>>>> porting the extension to other platforms (FF, Safari, etc.):
>>>>
>>>> https://github.com/slightlyoff/meaningless
>>>>
>>>> If you're interested in the data, a sparse reporting front-end is
>>>> currently in place:
>>>>
>>>> http://meaningless-stats.appspot.com/global
>>>>
>>>> Help is needed to analyze the data in more meaningful ways, visualize
>>>> it, etc. Filing tickets and submitting pull requests is the easiest way
>>>> to help: https://github.com/slightlyoff/meaningless/issues
>>>>
>>>> Thanks for your help and attention.
>>>>
>>>
>>>
>>>
>>
>>
>
>
> --
> Steph.
>

Received on Wednesday, 27 March 2013 16:08:27 UTC