Re: [Wikidata-l] Provenance tracking on the Web with NIF-URIs

From: Timothy Lebo <lebot@rpi.edu> · Date: Thu, 21 Jun 2012 14:09:33 -0700

Sebastian,

On Jun 21, 2012, at 1:04 PM, Sebastian Hellmann wrote:

> Dear Provenance group,
> there was a discussion at WikiData, which led to contacting you:
> http://lists.wikimedia.org/pipermail/wikidata-l/2012-May/000475.html
> http://lists.wikimedia.org/pipermail/wikidata-l/2012-May/000478.html
> http://lists.wikimedia.org/pipermail/wikidata-l/2012-May/000566.html
> http://lists.wikimedia.org/pipermail/wikidata-l/2012-June/000751.html
> ...
> 
> You are tracking provenance on the resource level.

Are you suggesting that text snippets within a document (resource representation, really) cannot be resources themselves?

PROV provides prov:Entity, and you can choose anything that you wish to be a prov:Entity (for cases when you want to describe its provenance).

So, we could tweak your example:

<http://www.w3.org/DesignIssues/LinkedData.html#offset_717_729> 
  a str:StringInContext, prov:Entity;
  prov:value "Semantic Web";
  prov:wasQuotedFrom <http://www.w3.org/DesignIssues/LinkedData.html>;
.

If you're concerned about time, you can get more specific by saying:

<http://www.w3.org/DesignIssues/LinkedData.html#offset_717_729> 
  a str:StringInContext, prov:Entity;
  prov:wasQuotedFrom :the-page-today;
.

:the-page-today
  a prov:Entity;
  prov:specializationOf <http://www.w3.org/DesignIssues/LinkedData.html>;
  prov:generatedAtTime "2009-06-18T18:24:33"^^xsd:dateTime;
.
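
To make the offset fragment concrete, here is a minimal Python sketch (an illustration only, not part of PROV or NIF) that resolves a URI such as #offset_717_729 against a document's text and prints a description like the first one above. It assumes the two numbers are a zero-based begin index and an exclusive end index, which fits the 12-character string "Semantic Web" (729 - 717 = 12) but should be checked against the NIF spec. The function names and the sample text are hypothetical.

```python
import re
from urllib.parse import urldefrag

# Matches fragments of the form offset_<begin>_<end>.
OFFSET_RE = re.compile(r"^offset_(\d+)_(\d+)$")


def resolve_offset_fragment(uri, text):
    """Return (document_uri, quoted_string) for an offset-style URI.

    Assumption: begin is zero-based and end is exclusive.
    """
    doc_uri, fragment = urldefrag(uri)
    match = OFFSET_RE.match(fragment)
    if match is None:
        raise ValueError(f"not an offset fragment: {fragment!r}")
    begin, end = int(match.group(1)), int(match.group(2))
    if not 0 <= begin <= end <= len(text):
        raise ValueError("offsets fall outside the text")
    return doc_uri, text[begin:end]


def quote_as_turtle(uri, text):
    """Render the quoted string as Turtle; prefixes are declared elsewhere."""
    doc_uri, value = resolve_offset_fragment(uri, text)
    # Only backslashes and double quotes are escaped in this sketch.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return (
        f"<{uri}>\n"
        f"  a str:StringInContext, prov:Entity;\n"
        f'  prov:value "{escaped}";\n'
        f"  prov:wasQuotedFrom <{doc_uri}>;\n"
        f"."
    )


# Hypothetical sample text and URI, not the real LinkedData.html page.
sample = "The Semantic Web is not just about data."
sample_uri = "http://example.org/doc.html#offset_4_16"
print(quote_as_turtle(sample_uri, sample))
```

Because the offsets only make sense against one particular version of the text, pairing such a URI with a dated entity like :the-page-today keeps the quoted string reproducible.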

> In NIF, fragments of resources are used as subjects in RDF.
> Hence you could consider it for inclusion, if it is not too far a stretch and if there is enough time left.

What specifically are you proposing the PROV-WG include?

Thanks for pointing out the NIF work, it will be great to reuse existing models for the strings in documents.

Regards,
Tim Lebo

> You could read here for a start: http://lists.wikimedia.org/pipermail/wikidata-l/2012-May/000475.html
> or here http://svn.aksw.org/papers/2012/WWW_NIF/public/string_ontology.pdf

> 
> All the best,
> Sebastian
> 
> -------- Original Message --------
> Subject:	Re: [Wikidata-l] Provenance tracking on the Web with NIF-URIs
> Date:	Thu, 21 Jun 2012 20:34:14 +0100
> From:	Barry Norton <barry.norton@ontotext.com>
> To:	Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
> CC:	Discussion list for the Wikidata project. <wikidata-l@lists.wikimedia.org>
> 
> As excused I wasn't really following your discussion, but indeed if 
> you're giving URIs to these fragments...
> 
> Barry
> 
> 
> On 21/06/2012 20:29, Sebastian Hellmann wrote:
> > Hi Barry,
> >
> > On 06/21/2012 08:51 PM, Barry Norton wrote:
> >>
> >> Sorry to jump in (without really understanding the context), but you 
> >> guys saw this today, right?
> >> http://www.w3.org/TR/2012/WD-prov-aq-20120619/
> >
> > It seems to be very unrelated. That is only resource-level, right?
> > "Fundamentally, provenance information
> > <http://www.w3.org/TR/2012/WD-prov-aq-20120619/#dfn-provenance-information>
> > is /about/ resources
> > <http://www.w3.org/TR/2012/WD-prov-aq-20120619/#dfn-resource>." So
> > you would need a subject first. How do you say that the fact you just
> > added to WikiData comes from a specific fragment of a resource,
> > i.e. http://www.w3.org/DesignIssues/LinkedData.html#offset_717_729, the
> > first occurrence of "Semantic Web"?
> >
> > Do you suggest that NIF URIs might be standardized by inclusion in
> > PROV-AQ? Might work. It could be compatible.
> >
> > Sebastian
> >
> >>
> >> Barry
> >>
> >>
> >> On 21/06/2012 19:05, Sebastian Hellmann wrote:
> >>> Hello Denny,
> >>> I was traveling for the past few weeks and can finally answer your 
> >>> email.
> >>> See my comments inline.
> >>>
> >>> On 05/29/2012 05:25 PM, Denny Vrandečić wrote:
> >>>> Hello Sebastian,
> >>>>
> >>>>
> >>>> Just a few questions - as you note, it is easier if we all use the 
> >>>> same
> >>>> standards, and so I want to ask about the relation to other related
> >>>> standards:
> >>>> * I understand that you dismiss IETF RFC 5147 because it is not stable
> >>>> enough, right?
> >>> The offset scheme of NIF is built on this RFC.
> >>> So the following would hold:
> >>> @prefix ld: <http://www.w3.org/DesignIssues/LinkedData.html#> .
> >>> @prefix owl: <http://www.w3.org/2002/07/owl#> .
> >>> ld:offset_717_729  owl:sameAs ld:char=717,12 .
> >>>
> >>>
> >>> We might change the syntax and reuse the RFC syntax, but it has 
> >>> several issues:
> >>> 1.  The optional part is not easy to handle, because you would need 
> >>> to add owl:sameAs statements:
> >>>
> >>> ld:char=717,12;length=12,UTF-8 owl:sameAs ld:char=717,12;length=12 .
> >>> ld:char=717,12;length=12,UTF-8 owl:sameAs ld:char=717,12 .
> >>> ld:char=717,12;UTF-8 owl:sameAs ld:char=717,12;length=9876 .
> >>>
> >>> So theoretically ok, but annoying to implement and check.
> >>>
> >>> 2. When implementing web services, NIF allows the client to choose 
> >>> the prefix:
> >>> http://nlp2rdf.lod2.eu/demo/NIFStemmer?input-type=text&nif=true&prefix=http%3A%2F%2Fthis.is%2Fa%2Fslash%2Fprefix%2F&urirecipe=offset&input=President+Obama+is+president
> >>>
> >>> returning URIs like <http://this.is/a/slash/prefix/offset_10_15>
> >>> So RFC 5147 would look like:
> >>> <http://this.is/a/slash/prefix/char=717,12>
> >>> <http://this.is/a/slash/prefix/char=717,12;UTF-8>
> >>> or
> >>> <http://this.is/a/slash/prefix?char=717,12>
> >>> <http://this.is/a/slash/prefix?char=717,12;UTF-8>
> >>>
> >>> 3. Characters like = and , prevent the use of prefixes:
> >>> echo "@prefix ld: <http://www.w3.org/DesignIssues/LinkedData.html#> .
> >>> @prefix owl: <http://www.w3.org/2002/07/owl#> .
> >>> ld:offset_717_729  owl:sameAs ld:char=717,12 .
> >>> " > test.ttl ; rapper -i turtle  test.ttl
> >>>
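> >>> 4. Implementation is a little bit more difficult, given that parsing
> >>> NIF URIs is as simple as:
> >>> // Split the fragment on "_"; explode() replaces the deprecated split().
> >>> $arr = explode("_", "offset_717_729");
> >>> switch ($arr[0]) {
> >>>     case 'offset':
> >>>         $begin = $arr[1];
> >>>         $end = $arr[2];
> >>>         break;
> >>>     case 'hash':
> >>>         $clength = $arr[1];
> >>>         $slength = $arr[2];
> >>>         $hash = $arr[3];
> >>>         // Re-join the remaining parts, since the text may contain "_".
> >>>         $rest = implode("_", array_slice($arr, 4));
> >>>         break;
> >>> }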
> >>>
> >>> 5. The RFC assumes a certain MIME type, i.e. plain text. NIF makes a
> >>> broader assumption.
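
The parsing described in point 4 above can also be sketched in Python. This is a hedged illustration rather than a reference implementation: the function name parse_nif_fragment is hypothetical, the key names simply mirror the variables in the PHP snippet, and percent-decoding the trailing text is an assumption about how the readable suffix is meant to be used.

```python
from urllib.parse import unquote


def parse_nif_fragment(fragment):
    """Split a NIF fragment identifier into its parts, keyed by URI recipe."""
    recipe, _, remainder = fragment.partition("_")
    if recipe == "offset":
        # offset_<begin>_<end>
        begin, end = remainder.split("_")
        return {"recipe": recipe, "begin": int(begin), "end": int(end)}
    if recipe == "hash":
        # hash_<clength>_<slength>_<hash>_<rest>; split at most three times
        # so that underscores inside the trailing text are preserved.
        clength, slength, digest, rest = remainder.split("_", 3)
        return {
            "recipe": recipe,
            "clength": int(clength),
            "slength": int(slength),
            "hash": digest,
            # Assumption: the suffix is percent-encoded readable text.
            "rest": unquote(rest),
        }
    raise ValueError(f"unknown NIF URI recipe: {recipe!r}")


print(parse_nif_fragment("offset_717_729"))
print(parse_nif_fragment(
    "hash_10_12_60f02d3b96c55e137e13494cf9a02d06_Semantic%20Web"))
```

Limiting the split for the hash recipe plays the same role as re-joining the remaining parts with "_" in the PHP version: the readable suffix may itself contain underscores.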
> >>>> * what is the relation to the W3C media fragment URIs? Did not find a
> >>>> pointer there.
> >>> They are designed for media such as images, video, not strings. 
> >>> Potentially, the same principle can be applied, but it is not yet 
> >>> engineered/researched.
> >>>> * any plans of standardizing your approach?
> >>> We will do NIF 2.0 as a community standard and finish it in a
> >>> couple of months. It will be published under open licences, so
> >>> anybody, e.g. W3C or ISO, might pick it up easily. Other than that,
> >>> there are plans by several EU projects (see e.g.
> >>> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Jun/0101.html)
> >>> and a US project to use it, and there are already several third-party
> >>> implementations. We would rather have it adopted first on
> >>> a large scale and then standardized properly, i.e. by W3C. This worked
> >>> quite well for the FOAF project and for RDB2RDF mappers.
> >>> Chances for fast standardization are not so unlikely, I would assume.
> >>>> We would strongly prefer to just use a standard instead of advocating
> >>>> contenders for one -- if one exists.
> >>> You might want to look at:
> >>> http://www.w3.org/community/openannotation/wiki/TextCommentOnWebPage
> >>> and the same highlighting here:
> >>> http://pcai042.informatik.uni-leipzig.de/~swp12-9/vorprojekt/index.php?annotation_request=http%3A%2F%2Fwww.w3.org%2FDesignIssues%2FLinkedData.html%23hash_10_12_60f02d3b96c55e137e13494cf9a02d06_Semantic%2520Web
> >>>
> >>>
> >>> NIF equivalent (4 triples instead of 14 and only one generated uuid):
> >>> ld:hash_10_12_60f02d3b96c55e137e13494cf9a02d06_Semantic%20Web a str:String ;
> >>>     oa:hasBody [
> >>>         oa:annotator <mailto:Bob> ;
> >>>         cnt:chars "Hey Tim, good idea that Semantic Web!"
> >>>     ] .
> >>>
> >>> So you might not think in a "contender" way. Approaches are 
> >>> complementary. NIF is simpler and the URIs have some features that 
> >>> might be wanted (stability, uniqueness, easy to implement).
> >>> This is why I was asking for your *use case* .
> >>>
> >>> Note that: there are still some problems, when annotating DOM with 
> >>> URIs, e.g. xPointer is abandoned and was never finished. Xpath has 
> >>> its limits and is also expensive (i.e. SAX not possible).
> >>> I think there is no proper solution as of now.
> >>> All the best,
> >>> Sebastian
> >>>
> >>>> Cheers,
> >>>> Denny
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> 2012/5/18 Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
> >>>>
> >>>>> Hello again,
> >>>>> maybe the question, I asked was lost, as the text was TL;DR
> >>>>>
> >>>>> I heard that it is planned to track provenance of facts, e.g.
> >>>>> Berlin has 3,337,000 citizens, found here:
> >>>>> http://www.worldatlas.com/citypops.htm
> >>>>> Do you have a place where the use case and the requirements are
> >>>>> documented for this? Or is it out of scope?
> >>>>> Will it be coarse-grained, i.e. website level? Or fine-grained,
> >>>>> i.e. text paragraph level? See e.g. how Berlin is highlighted here:
> >>>>> http://pcai042.informatik.uni-leipzig.de/~swp12-9/vorprojekt/index.php?annotation_request=http%3A%2F%2Fwww.worldatlas.com%2Fcitypops.htm%23hash_4_30_7449e732716c8e68842289bf2e6667d5_Berlin%2C%2520Germany%2520-%25203%2C
> >>>>>
> >>>>> in this very early prototype.
> >>>>>
> >>>>> Could you give me a link where I can read more about any Wikidata
> >>>>> plans in this direction?
> >>>>> Sebastian
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 05/16/2012 09:10 AM, Sebastian Hellmann wrote:
> >>>>>
> >>>>>> Dear all,
> >>>>>> (Note: I could not find the document, where your requirements 
> >>>>>> regarding
> >>>>>> the tracking of facts on the web are written, so I am giving a 
> >>>>>> general
> >>>>>> introduction to NIF. Please send me a link to the document that 
> >>>>>> specifies
> >>>>>> your need for tracing facts on the web, thanks)
> >>>>>>
> >>>>>> I would like to draw your attention to the URIs used in the NLP
> >>>>>> Interchange Format (NIF).
> >>>>>> NIF-URIs are quite easy to use, understand and implement. NIF has a
> >>>>>> one-triple-per-annotation paradigm. The latest documentation can
> >>>>>> be found here:
> >>>>>> http://svn.aksw.org/papers/2012/WWW_NIF/public/string_ontology.pdf
> >>>>>>
> >>>>>>
> >>>>>> The basic idea is to use URIs with hash fragment ids to annotate
> >>>>>> or mark pages on the web:
> >>>>>> An example is the first occurrence of "Semantic Web" on
> >>>>>> http://www.w3.org/DesignIssues/LinkedData.html
> >>>>>> as highlighted here:
> >>>>>> http://pcai042.informatik.uni-leipzig.de/~swp12-9/vorprojekt/index.php?annotation_request=http%3A%2F%2Fwww.w3.org%2FDesignIssues%2FLinkedData.html%23hash_10_12_60f02d3b96c55e137e13494cf9a02d06_Semantic%2520Web
> >>>>>>
> >>>>>>
> >>>>>> Here is a NIF example for linking a part of the document to the
> >>>>>> DBpedia entry of the Semantic Web:
> >>>>>> <http://www.w3.org/DesignIssues/LinkedData.html#offset_717_729>
> >>>>>>       a str:StringInContext ;
> >>>>>>       sso:oen <http://dbpedia.org/resource/Semantic_Web> .
> >>>>>>
> >>>>>>
> >>>>>> We are currently preparing a new draft for the spec 2.0. The old
> >>>>>> one can be found here:
> >>>>>> http://nlp2rdf.org/nif-1-0/
> >>>>>>
> >>>>>> There are several EU projects that intend to use NIF.
> >>>>>> Furthermore, it is easier for everybody if we standardize a Web
> >>>>>> annotation format together.
> >>>>>> Please give feedback on your use cases.
> >>>>>> All the best,
> >>>>>> Sebastian
> >>>>>>
> >>>>>>
> >>>>> -- 
> >>>>> Dipl. Inf. Sebastian Hellmann
> >>>>> Department of Computer Science, University of Leipzig
> >>>>> Projects: http://nlp2rdf.org , http://dbpedia.org
> >>>>> Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
> >>>>> Research Group: http://aksw.org
> >>>>>
> >>>>>
> >>>>> _______________________________________________
> >>>>> Wikidata-l mailing list
> >>>>> Wikidata-l@lists.wikimedia.org
> >>>>> https://lists.wikimedia.org/mailman/listinfo/wikidata-l
> >>>>>
> >>>>>
> >>>>
> >>>>