Re: [Wikidata-l] Provenance tracking on the Web with NIF-URIs

From: Timothy Lebo <lebot@rpi.edu> · Date: Mon, 25 Jun 2012 16:52:04 -0400

Sebastian,

On Jun 22, 2012, at 3:02 AM, Sebastian Hellmann wrote:

> Hi Timothy,
> 
> On 06/21/2012 11:09 PM, Timothy Lebo wrote:
>> 
>>> In NIF, fragments of resources are used as subjects in RDF.
>>> Hence you could consider them for inclusion, if it is not too far a stretch and there is enough time left.
>> 
>> What specifically are you proposing the PROV-WG include?
> Well, suppose you have a (web) document and you want to express that a certain part was written by you.
> e.g. I have written (with some exceptions) the beginning of http://wole2012.eurecom.fr/call-papers
> From "This workshop envisions the Semantic..." until "Natural Language Processing and Semantic Web. "
> 
> How do you express this with the current work of your group?

If NIF-URIs provide you a way to identify that snippet of the document, then PROV and PROV-O can be used to describe its provenance.

Your authorship can be described as follows. Depending on what else you'd like to say, we can add more PROV assertions.

@prefix prov: <http://www.w3.org/ns/prov#> .

<your-nif-uri-for-that-portion-of-the-document>
   prov:wasAttributedTo <http://data.semanticweb.org/person/sebastian-hellmann>;
.

<http://data.semanticweb.org/person/sebastian-hellmann> a prov:Agent, prov:Person .
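
If it helps, here is a minimal sketch in Python of how such a description might be generated. It assumes the offset recipe (`#offset_<begin>_<end>`) mentioned in the quoted thread; the function names `nif_offset_uri` and `attribution_turtle` are hypothetical, and the offsets in the usage lines are placeholders, not the real position of your passage.

```python
PROV_NS = "http://www.w3.org/ns/prov#"


def nif_offset_uri(document_uri: str, begin: int, end: int) -> str:
    """Build an offset-recipe NIF URI: <document>#offset_<begin>_<end>."""
    if begin < 0 or end < begin:
        raise ValueError("expected 0 <= begin <= end")
    return f"{document_uri}#offset_{begin}_{end}"


def attribution_turtle(fragment_uri: str, agent_uri: str) -> str:
    """Serialize the attribution as Turtle, mirroring the triples above."""
    return (
        f"@prefix prov: <{PROV_NS}> .\n\n"
        f"<{fragment_uri}>\n"
        f"   prov:wasAttributedTo <{agent_uri}> .\n\n"
        f"<{agent_uri}> a prov:Agent, prov:Person .\n"
    )


if __name__ == "__main__":
    # Placeholder offsets (0, 100): the real begin/end of the passage are unknown here.
    fragment = nif_offset_uri("http://wole2012.eurecom.fr/call-papers", 0, 100)
    print(attribution_turtle(
        fragment, "http://data.semanticweb.org/person/sebastian-hellmann"))
```

Plain string formatting keeps the sketch dependency-free; an RDF library would be the safer choice once URIs need escaping.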

> NIF-URIs could fill this spot very well.

I agree. Since we haven't spent any effort on conventions for identifying portions of resource representations, NIF and PROV complement each other nicely.
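
To make the "identify a portion" half concrete, here is a rough Python sketch of how a consumer might split a NIF URI into its document and fragment parts and pull out the identified string. It follows the field order of the PHP snippet quoted further down (`offset_<begin>_<end>` and `hash_<context length>_<string length>_<hash>_<rest>`). The class and function names are hypothetical, and the reading of the offsets as zero-based with an exclusive end is my assumption, not something I have checked against the NIF spec.

```python
from dataclasses import dataclass
from urllib.parse import unquote, urldefrag


@dataclass
class OffsetFragment:
    begin: int
    end: int


@dataclass
class HashFragment:
    context_length: int
    string_length: int
    digest: str
    readable: str


def parse_nif_uri(uri: str):
    """Return (document URI, parsed fragment) for an offset or hash NIF URI."""
    document, fragment = urldefrag(uri)
    parts = fragment.split("_")
    if parts[0] == "offset" and len(parts) == 3:
        return document, OffsetFragment(int(parts[1]), int(parts[2]))
    if parts[0] == "hash" and len(parts) >= 5:
        # The readable tail may itself contain "_", so re-join what is left.
        readable = unquote("_".join(parts[4:]))
        return document, HashFragment(
            int(parts[1]), int(parts[2]), parts[3], readable)
    raise ValueError(f"not a recognised NIF fragment: {fragment!r}")


def select(text: str, frag: OffsetFragment) -> str:
    """Slice the identified string (assumes zero-based, end-exclusive offsets)."""
    return text[frag.begin:frag.end]
```

The subject of the `prov:wasAttributedTo` triple stays the full URI; parsing is only needed when someone wants to recover the text the URI points at.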

Regards,
Tim

> All the best,
> Sebastian
> 
>> 
>> Thanks for pointing out the NIF work, it will be great to reuse existing models for the strings in documents.
>> 
>> Regards,
>> Tim Lebo
>> 
>> 
>>> You could read here for a start: http://lists.wikimedia.org/pipermail/wikidata-l/2012-May/000475.html
>>> or here http://svn.aksw.org/papers/2012/WWW_NIF/public/string_ontology.pdf
>>> All the best,
>>> Sebastian
>>> 
>>> -------- Original Message --------
>>> Subject:	Re: [Wikidata-l] Provenance tracking on the Web with NIF-URIs
>>> Date:	Thu, 21 Jun 2012 20:34:14 +0100
>>> From:	Barry Norton<barry.norton@ontotext.com>
>>> To:	Sebastian Hellmann<hellmann@informatik.uni-leipzig.de>
>>> CC:	Discussion list for the Wikidata project.<wikidata-l@lists.wikimedia.org>
>>> 
>>> As excused I wasn't really following your discussion, but indeed if
>>> you're giving URIs to these fragments...
>>> 
>>> Barry
>>> 
>>> 
>>> On 21/06/2012 20:29, Sebastian Hellmann wrote:
>>>> Hi Barry,
>>>> 
>>>> On 06/21/2012 08:51 PM, Barry Norton wrote:
>>>>> Sorry to jump in (without really understanding the context), but you
>>>>> guys saw this today, right?
>>>>> 
>>>>> http://www.w3.org/TR/2012/WD-prov-aq-20120619/
>>>>>
>>>> It seems to be very unrelated. That is only resource-level, right?
>>>> "Fundamentally, provenance information
>>>> <http://www.w3.org/TR/2012/WD-prov-aq-20120619/#dfn-provenance-information>
>>>> is /about/ resources
>>>> <http://www.w3.org/TR/2012/WD-prov-aq-20120619/#dfn-resource>." So
>>>> you would need a subject first. How do you say that the fact you just
>>>> added to WikiData comes from a specific fragment of a resource?
>>>> i.e. http://www.w3.org/DesignIssues/LinkedData.html#offset_717_729 , the
>>>> first occurrence of "Semantic Web"
>>>> 
>>>> Do you suggest, that NIF URIs might be standardized by inclusion in
>>>> the PROV-AQ ? Might work. It could be compatible.
>>>> 
>>>> Sebastian
>>>> 
>>>>> Barry
>>>>> 
>>>>> 
>>>>> On 21/06/2012 19:05, Sebastian Hellmann wrote:
>>>>>> Hello Denny,
>>>>>> I was traveling for the past few weeks and can finally answer your
>>>>>> email.
>>>>>> See my comments inline.
>>>>>> 
>>>>>> On 05/29/2012 05:25 PM, Denny Vrandečić wrote:
>>>>>>> Hello Sebastian,
>>>>>>> 
>>>>>>> 
>>>>>>> Just a few questions - as you note, it is easier if we all use the
>>>>>>> same standards, and so I want to ask about the relation to other
>>>>>>> related standards:
>>>>>>> * I understand that you dismiss IETF RFC 5147 because it is not stable
>>>>>>> enough, right?
>>>>>> The offset scheme of NIF is built on this RFC.
>>>>>> So the following would hold:
>>>>>> @prefix ld: <http://www.w3.org/DesignIssues/LinkedData.html#> .
>>>>>> @prefix owl: <http://www.w3.org/2002/07/owl#> .
>>>>>> ld:offset_717_729  owl:sameAs ld:char=717,12 .
>>>>>> 
>>>>>> 
>>>>>> We might change the syntax and reuse the RFC syntax, but it has
>>>>>> several issues:
>>>>>> 1.  The optional part is not easy to handle, because you would need
>>>>>> to add owl:sameAs statements:
>>>>>> 
>>>>>> ld:char=717,12;length=12,UTF-8 owl:sameAs ld:char=717,12;length=12 .
>>>>>> ld:char=717,12;length=12,UTF-8 owl:sameAs ld:char=717,12 .
>>>>>> ld:char=717,12;UTF-8 owl:sameAs ld:char=717,12;length=9876 .
>>>>>> 
>>>>>> So theoretically ok, but annoying to implement and check.
>>>>>> 
>>>>>> 2. When implementing web services, NIF allows the client to choose
>>>>>> the prefix:
>>>>>> 
>>>>>> http://nlp2rdf.lod2.eu/demo/NIFStemmer?input-type=text&nif=true&prefix=http%3A%2F%2Fthis.is%2Fa%2Fslash%2Fprefix%2F&urirecipe=offset&input=President+Obama+is+president
>>>>>> returning URIs like <http://this.is/a/slash/prefix/offset_10_15>
>>>>>> So RFC 5147 would look like:
>>>>>> <http://this.is/a/slash/prefix/char=717,12>
>>>>>> <http://this.is/a/slash/prefix/char=717,12;UTF-8>
>>>>>> or
>>>>>> <http://this.is/a/slash/prefix?char=717,12>
>>>>>> <http://this.is/a/slash/prefix?char=717,12;UTF-8>
>>>>>>
>>>>>> 3. Characters like = and , prevent the use of prefixes:
>>>>>> echo "@prefix ld: <http://www.w3.org/DesignIssues/LinkedData.html#> .
>>>>>> @prefix owl: <http://www.w3.org/2002/07/owl#> .
>>>>>> ld:offset_717_729  owl:sameAs ld:char=717,12 .
>>>>>> " > test.ttl ; rapper -i turtle test.ttl
>>>>>> 
>>>>>> 4. Implementation is a little bit more difficult, given that:
>>>>>> $arr = explode("_", "offset_717_729");
>>>>>> switch ($arr[0]) {
>>>>>>     case 'offset':
>>>>>>         $begin = $arr[1];
>>>>>>         $end = $arr[2];
>>>>>>         break;
>>>>>>     case 'hash':
>>>>>>         $clength = $arr[1];
>>>>>>         $slength = $arr[2];
>>>>>>         $hash = $arr[3];
>>>>>>         // merge the remaining parts with '_'
>>>>>>         $rest = implode('_', array_slice($arr, 4));
>>>>>>         break;
>>>>>> }
>>>>>> 
>>>>>> 5. RFC assumes a certain mime type, i.e. plain text. NIF does have a
>>>>>> broader assumption.
>>>>>>> * what is the relation to the W3C media fragment URIs? Did not find a
>>>>>>> pointer there.
>>>>>> They are designed for media such as images, video, not strings.
>>>>>> Potentially, the same principle can be applied, but it is not yet
>>>>>> engineered/researched.
>>>>>>> * any plans of standardizing your approach?
>>>>>> We will do NIF 2.0  as a community standard and finish it in a
>>>>>> couple of months. It will be published under open licences, so
>>>>>> anybody W3C or ISO might pick it up, easily. Other than that there
>>>>>> are plans by several EU projects (see e.g. here
>>>>>> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Jun/0101.html )
>>>>>> and a US project to use it and there are several third party
>>>>>> implementations, already.  We would rather have it adopted first on
>>>>>> a large scale and then standardized, properly, i.e. W3C. This worked
>>>>>> quite well for the FOAF project or for RDB2RDF Mappers.
>>>>>> Chances for fast standardization are not so unlikely, I would assume.
>>>>>>> We would strongly prefer to just use a standard instead of advocating
>>>>>>> contenders for one -- if one exists.
>>>>>> You might want to look at:
>>>>>> http://www.w3.org/community/openannotation/wiki/TextCommentOnWebPage
>>>>>> and the same highlighting here:
>>>>>> http://pcai042.informatik.uni-leipzig.de/~swp12-9/vorprojekt/index.php?annotation_request=http%3A%2F%2Fwww.w3.org%2FDesignIssues%2FLinkedData.html%23hash_10_12_60f02d3b96c55e137e13494cf9a02d06_Semantic%2520Web
>>>>>>
>>>>>> NIF equivalent (4 triples instead of 14 and only one generated uuid):
>>>>>> ld:hash_10_12_60f02d3b96c55e137e13494cf9a02d06_Semantic%20Web a
>>>>>> str:String ;
>>>>>>     oa:hasBody [
>>>>>>         oa:annotator <mailto:Bob> ;
>>>>>>         cnt:chars "Hey Tim, good idea that Semantic Web!"
>>>>>>     ] .
>>>>>> 
>>>>>> So you might not think in a "contender" way. Approaches are
>>>>>> complementary. NIF is simpler and the URIs have some features that
>>>>>> might be wanted (stability, uniqueness, easy to implement).
>>>>>> This is why I was asking for your *use case* .
>>>>>> 
>>>>>> Note that there are still some problems when annotating DOM with
>>>>>> URIs, e.g. XPointer is abandoned and was never finished. XPath has
>>>>>> its limits and is also expensive (i.e. SAX not possible).
>>>>>> I think there is no proper solution as of now.
>>>>>> All the best,
>>>>>> Sebastian
>>>>>> 
>>>>>>> Cheers,
>>>>>>> Denny
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 2012/5/18 Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
>>>>>>>
>>>>>>>> Hello again,
>>>>>>>> maybe the question I asked was lost, as the text was TL;DR.
>>>>>>>>
>>>>>>>> I heard that it is planned to track provenance of facts, e.g.
>>>>>>>> Berlin has 3,337,000 citizens, found here:
>>>>>>>> http://www.worldatlas.com/citypops.htm
>>>>>>>>
>>>>>>>> Do you have a place where the use case and the requirements are
>>>>>>>> documented
>>>>>>>> for this? Or is it out of scope?
>>>>>>>> Will it be coarse-grained, i.e. website level? Or fine-grained,
>>>>>>>> i.e. text paragraph level? See e.g. how Berlin is highlighted here:
>>>>>>>> http://pcai042.informatik.uni-leipzig.de/~swp12-9/vorprojekt/index.php?annotation_request=http%3A%2F%2Fwww.worldatlas.com%2Fcitypops.htm%23hash_4_30_7449e732716c8e68842289bf2e6667d5_Berlin%2C%2520Germany%2520-%25203%2C
>>>>>>>> in this very early prototype.
>>>>>>>> 
>>>>>>>> Could you give me a link where I can read more about any Wikidata
>>>>>>>> plans in this direction?
>>>>>>>> Sebastian
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 05/16/2012 09:10 AM, Sebastian Hellmann wrote:
>>>>>>>> 
>>>>>>>>> Dear all,
>>>>>>>>> (Note: I could not find the document where your requirements
>>>>>>>>> regarding the tracking of facts on the web are written, so I am
>>>>>>>>> giving a general introduction to NIF. Please send me a link to the
>>>>>>>>> document that specifies your need for tracing facts on the web,
>>>>>>>>> thanks.)
>>>>>>>>> 
>>>>>>>>> I would like to point your attention to the URIs used in the NLP
>>>>>>>>> Interchange Format (NIF).
>>>>>>>>> NIF-URIs are quite easy to use, understand and implement. NIF has a
>>>>>>>>> one-triple-per-annotation paradigm. The latest documentation can
>>>>>>>>> be found here:
>>>>>>>>> http://svn.aksw.org/papers/2012/WWW_NIF/public/string_ontology.pdf
>>>>>>>>> 
>>>>>>>>> The basic idea is to use URIs with hash fragment ids to annotate
>>>>>>>>> or mark pages on the web:
>>>>>>>>> An example is the first occurrence of "Semantic Web" on
>>>>>>>>> http://www.w3.org/DesignIssues/LinkedData.html
>>>>>>>>> as highlighted here:
>>>>>>>>> http://pcai042.informatik.uni-leipzig.de/~swp12-9/vorprojekt/index.php?annotation_request=http%3A%2F%2Fwww.w3.org%2FDesignIssues%2FLinkedData.html%23hash_10_12_60f02d3b96c55e137e13494cf9a02d06_Semantic%2520Web
>>>>>>>>> 
>>>>>>>>> Here is a NIF example for linking a part of the document to the
>>>>>>>>> DBpedia entry of the Semantic Web:
>>>>>>>>> <http://www.w3.org/DesignIssues/LinkedData.html#offset_717_729>
>>>>>>>>>       a str:StringInContext ;
>>>>>>>>>       sso:oen <http://dbpedia.org/resource/Semantic_Web> .
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> We are currently preparing a new draft for the spec 2.0. The old
>>>>>>>>> one can be found here:
>>>>>>>>> http://nlp2rdf.org/nif-1-0/
>>>>>>>>> There are several EU projects that intend to use NIF.
>>>>>>>>> Furthermore, it is easier for everybody if we standardize a Web
>>>>>>>>> annotation format together.
>>>>>>>>> Please give feedback on your use cases.
>>>>>>>>> All the best,
>>>>>>>>> Sebastian
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> -- 
>>>>>>>> Dipl. Inf. Sebastian Hellmann
>>>>>>>> Department of Computer Science, University of Leipzig
>>>>>>>> Projects: http://nlp2rdf.org , http://dbpedia.org
>>>>>>>> Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
>>>>>>>> Research Group: http://aksw.org
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Wikidata-l mailing list
>>>>>>>> Wikidata-l@lists.wikimedia.org
>>>>>>>> https://lists.wikimedia.org/mailman/listinfo/wikidata-l
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>> 
>> 
> 
> 
> 
>