Re: Provbench submission request - wikipedia edits

Jun,

I believe you have two options:

1- Matt's crawler (forked from mine): https://github.com/matthewgamble/wikipedia-provenance -- I managed to build this fine, but
please do try it yourself (minimal build sketch below)
2- Tim's XSL-from-dump approach

depending on what you need to do...
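
For (1), roughly what worked for me -- a minimal sketch, assuming the repo still uses the standard Maven layout (the exact goals may differ):

    git clone https://github.com/matthewgamble/wikipedia-provenance
    cd wikipedia-provenance
    mvn clean package   # or import into Eclipse and build there, as I did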

I started documenting (1), but then Tim lost interest :-) and it went back on the back burner. Should I resume? I would love to see
this used/tested!

--Paolo

 
On 27/06/2014 11:30, Zhao, Jun wrote:
> Hi guys,
>
> I’ve been keeping an eye on this because I also want to reproduce the Wikipedia provenance.
>
> I am at a loss as to which one I should go for now. Spoiled for choice :)
>
> BTW, I am cc’ing prov-comments to see whether it is now working.
>
> Cheers,
>
> — Jun
>
>
> On 25 Jun 2014, at 20:37, Timothy Lebo <lebot@rpi.edu> wrote:
>
>> Hi, Paolo.
>>
>> I just read the paper, and I think the method of crawling is clear.
>> Interesting approach.
>>
>> My need for a crawl has passed for the moment, but I’ll let you know when I find another excuse to try out your stuff.
>>
>> Best,
>> Tim
>>
>>
>> On Jun 23, 2014, at 3:16 PM, Paolo Missier <Paolo.Missier@ncl.ac.uk> wrote:
>>
>>> Tim,
>>>
>>> On 19/06/2014 13:38, Timothy Lebo wrote:
>>>> Paolo,
>>>>
>>>> On Jun 19, 2014, at 2:32 AM, Paolo Missier <Paolo.Missier@ncl.ac.uk> wrote:
>>>>
>>>>> Hi Tim
>>>>>
>>>>> haven't looked at the source -- is this your own code or a fork of Matt's?  (or our old one?)
>>>> It’s my own home-grown code.
>>>> (I couldn’t get either of your repositories to work.)
>>> I got Matt's version to build and work with just Maven (from within Eclipse).
>>> What was the problem?
>>>
>>>>> in fact, is this a PROV-O rendering of a wiki dump (as opposed to a crawler)?
>>>> Yes, it’s an XSL transform of the wiki XML dump that produces PROV-O.
>>>> You feed it the page names to grab, so no crawling.
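>>>>
>>>> For example, something like this (with xsltproc; the stylesheet and dump file names here are illustrative, not the actual ones):
>>>>
>>>>     xsltproc --stringparam pages "Newcastle_upon_Tyne,Provenance" \
>>>>         wiki2prov.xsl enwiki-pages-meta-history.xml > wikipedia-prov.ttl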
>>>>
>>>> It’d be nice to understand how your crawler works. It’s more insightful than my approach.
>>>> Wiki page? :-)
>>> I got Matt's version to work, but I need a little time to document it -- there is a GUI (by the student) and a command line (which I remember adding myself).
>>> I shall document... but basically you can control (a sketch of a possible invocation follows the list):
>>> - the max number of revisions you traverse for each page
>>> - the max number of contributions by any editor
>>> - and a "depth" field which I forgot about :-)
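>>>
>>> Something along these lines (the flag names here are from memory / illustrative -- I'll confirm when I document it):
>>>
>>>     java -jar wikipedia-provenance.jar --seed "Newcastle_upon_Tyne" \
>>>         --revision-length 50 --max-users 10 --depth 3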
>>>
>>> have you read the short paper? https://github.com/provbench/Wikipedia-PROV/blob/master/wikipediaTraces.pdf
>>>
>>> Three parameters are used to control the extent of the user/article space visited by the crawler. Firstly, the revision length determines the max. number of wasRevisionOf relations traversed, towards the past, from a landing revision page. Secondly, the max users parameter determines the max number of wasAssociatedWith relations, i.e., the max number of contributions explored per user. Thirdly, the depth parameter determines how many times the switch-over between article space and user space may occur. For example, setting depth = 3 results in the exploration of revisions for articles that are connected to the original seed article through at most 2 intermediate users: base article → user1 → article2 → user2 → article3.
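>>>
>>> In pseudo-Java, the traversal is roughly the following -- a sketch of the behaviour described above, with illustrative names (Article, Revision, User and the methods on them are placeholders, not the actual classes):
>>>
>>>     // Explore one article, then switch to user space and back, up to `depth` times.
>>>     void crawl(Article article, int revisionLength, int maxUsers, int depth) {
>>>         if (depth == 0) return;                                // no switch-overs left
>>>         List<Revision> revs = article.pastRevisions(revisionLength); // wasRevisionOf chain
>>>         for (User u : editorsOf(revs)) {                       // wasAssociatedWith
>>>             // up to maxUsers contributions explored per user
>>>             for (Article a : u.contributions(maxUsers)) {
>>>                 crawl(a, revisionLength, maxUsers, depth - 1); // switch back to article space
>>>             }
>>>         }
>>>     }
>>>
>>> With depth = 3 this reaches article3 via user1 and user2, exactly as in the example above.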
>>>
>>>
>>>>> sorry, I am rushing to ask before I look :-)
>>>> No worries. I’d do the same ;-)
>>>>
>>>> Best,
>>>> Tim
>>>>
>>>> p.s. are we ready to “go live” with this kind of discussion on prov-comments?
>>> sure, why not --
>>>
>>> -Paolo
>>>


-- 
Paolo Missier - Paolo.Missier@newcastle.ac.uk, pmissier@acm.org 
School of Computing Science, Newcastle University,  UK
professional: http://www.cs.ncl.ac.uk/people/Paolo.Missier
photography: http://scattidistratti.smugmug.com/
PGP Public key: 0x45596549  - key servers: pool.sks-keyservers.net
=--= Tempus fugit =--=
