Re: [ISSUE-29][ACTION-164] ITS2NIF2ITS - RDF roundtrip from Sebastian Hellmann on 2012-08-15 (public-multilingualweb-lt@w3.org from August 2012)

From: Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
Date: Wed, 15 Aug 2012 09:44:41 +0200
To: Felix Sasaki <fsasaki@w3.org>
CC: Jirka Kosek <jirka@kosek.cz>, public-multilingualweb-lt@w3.org
Message-ID: <502B5369.8010301@informatik.uni-leipzig.de>
Hi Felix,
there are some minor issues:

- Turtle syntax: "<str:Context>" should be either "str:Context" (no 
<>) or the full <http://nlp2rdf.lod2.eu/schema/string/Context>
- the "offset_" prefix is sometimes missing, e.g. in "http://example.com/exampledoc.html#23_30"
- there is the open question of whether the fragment that covers the whole 
content of the document is equal to the document:
<http://example.com/exampledoc.html> owl:sameAs 
<http://example.com/exampledoc.html#offset_0_29>
But this might be rather philosophical.
- RDF recommends Unicode Normal Form C: 
http://www.w3.org/TR/rdf-concepts/#section-Literals
This is why we will make it mandatory. Some RDF parsers might 
complain if any literals are not in Unicode Normal Form C. Sometimes 
this is just a warning and sometimes parsing fails completely.
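
As a rough illustration of the Normal Form C point, here is a minimal Python sketch that uses only the standard library module `unicodedata`. The helper names `to_nfc_literal` and `is_nfc` are hypothetical and not part of NIF or any RDF toolkit. The sketch only shows how a literal could be normalized before serialization, and that offsets can shift when the input was not already in NFC. Note that Python counts code points, which matches UTF-16 code units only for characters in the Basic Multilingual Plane.

```python
import unicodedata


def to_nfc_literal(text: str) -> str:
    """Return the text in Unicode Normal Form C (hypothetical helper)."""
    return unicodedata.normalize("NFC", text)


def is_nfc(text: str) -> bool:
    """Check whether the text is already in Normal Form C."""
    return unicodedata.normalize("NFC", text) == text


# "e" followed by a combining acute accent: two code points, not NFC
decomposed = "Cafe\u0301 in Dublin"
composed = to_nfc_literal(decomposed)

print(is_nfc(decomposed), is_nfc(composed))
# The normalized string is one code point shorter ...
print(len(decomposed), len(composed))
# ... so the offset of "Dublin" moves by one as well
print(decomposed.index("Dublin"), composed.index("Dublin"))
```

So whichever side computes the offsets has to agree on whether counting happens before or after normalization.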


Please see below for the correct output for the string "Dublin" in the 
context "Welcome to Dublin in Ireland!" occurring in 
http://example.com/exampledoc.html
I validated it with rapper, the command line tool that comes with 
libraptor2, on Unix: http://librdf.org/raptor/rapper.html

[

@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#>.
@prefix str: <http://nlp2rdf.lod2.eu/schema/string/>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
# the reference context, i.e. the whole string that occurs in
# http://example.com/exampledoc.html
<http://example.com/exampledoc.html#offset_0_29>
# encodes some simple provenance
str:occursIn <http://example.com/exampledoc.html> ;
# includes the whole string
str:isString "Welcome to Dublin in Ireland!" ;
a str:Context.
# this is the substring "Dublin"
<http://example.com/exampledoc.html#offset_11_17>
a str:String ;
str:anchorOf "Dublin";
itsrdf:translate "no"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo> ;
# all substrings have a reference to their context
str:referenceContext <http://example.com/exampledoc.html#offset_0_29> .

]
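
The offset-based URIs in the example above can be derived mechanically from the context string. The following Python sketch is only an illustration under stated assumptions: the function names `offset_uri` and `substring_turtle` are made up for this mail, offsets are counted in Python code points, the string is assumed to be in NFC already, and the anchor text is assumed to contain no quotes or backslashes (no literal escaping is done).

```python
DOC = "http://example.com/exampledoc.html"
CONTEXT = "Welcome to Dublin in Ireland!"


def offset_uri(doc: str, begin: int, end: int) -> str:
    """Build an offset-based fragment URI such as ...#offset_11_17."""
    return f"{doc}#offset_{begin}_{end}"


def substring_turtle(doc: str, context: str, anchor: str) -> str:
    """Emit Turtle for the first occurrence of `anchor` in `context`."""
    begin = context.index(anchor)  # raises ValueError if the anchor is absent
    end = begin + len(anchor)
    # The reference context always spans the whole string
    context_uri = offset_uri(doc, 0, len(context))
    return (
        f"<{offset_uri(doc, begin, end)}>\n"
        f"    a str:String ;\n"
        f'    str:anchorOf "{anchor}" ;\n'
        f"    str:referenceContext <{context_uri}> .\n"
    )


print(offset_uri(DOC, 0, len(CONTEXT)))
print(substring_turtle(DOC, CONTEXT, "Dublin"))
```

The prefix declarations and the `itsrdf:translate` triple are left out here to keep the sketch short.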

All the best,
Sebastian

On 10.08.2012 10:03, Felix Sasaki wrote:
> Hi Sebastian, Jirka, all,
>
> thanks for the feedback. I have tried to integrate it into the output, with
> the (X)HTML file attached. This is the output:
>
> [
>
> @prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#>.
> @prefix str: <http://nlp2rdf.lod2.eu/schema/string/>.
> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
> <http://example.com/exampledoc.html#0_29> str:anchorOf
> <http://example.com/exampledoc.html>;
> 	str:isString "Welcome to Dublin in Ireland!";
> 	a <str:Context>.
> <http://example.com/exampledoc.html#11_17> str:anchorOf
> <http://example.com/exampledoc.html>;
> 	str:isString "Dublin";
> 	a <str:Context>.
> <http://example.com/exampledoc.html#23_30> str:anchorOf
> <http://example.com/exampledoc.html>;
> 	str:isString "Ireland";
> 	a <str:Context>.
> <http://example.com/exampledoc.html#offset_0_29> str:referenceContext
> <http://example.com/exampledoc.html#0_29>;
> 	a <str:String>;
> 	itsrdf:translate "yes"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo>.
> <http://example.com/exampledoc.html#offset_11_17> str:referenceContext
> <http://example.com/exampledoc.html#11_17>;
> 	a <str:String>;
> 	itsrdf:translate "no"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo>.
> <http://example.com/exampledoc.html#offset_23_30> str:referenceContext
> <http://example.com/exampledoc.html#23_30>;
> 	a <str:String>;
> 	itsrdf:translate "no"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo>.
>
> ]
>
>
> Let me know if there are any open issues with this output. One question: I
> don't understand your reference to normalization form C - do you require
> Unicode normalization for generating the output? The above offsets are based on
> non-normalized processing, let me know if this needs to be changed. We just
> need to have clear rules with regards to whitespace and normalization.
>
> Thanks,
>
> Felix
>
> 2012/8/9 Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
>
>> Hi Felix,
>> there are some syntactic errors: <str:String> .
>>
>> Maybe this helps:
>> curl -X POST --data-urlencode input="Apache Stanbol can detect entities."
>> --data input-type=text  --data format=turtle
>> http://nlp2rdf.lod2.eu/demo/NIFStanfordCore
>> curl -X POST --data-urlencode input="Apache Stanbol can detect entities."
>> --data input-type=text  --data format=turtle --data-urlencode
>> prefix="http://example.com/exampledoc.html#"
>> http://nlp2rdf.lod2.eu/demo/NIFStanfordCore
>> curl -X POST --data-urlencode input="Apache Stanbol can detect entities."
>> --data input-type=text  --data format=turtle --data-urlencode
>> prefix="urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#"
>> http://nlp2rdf.lod2.eu/demo/NIFStanfordCore
>>
>> I also attached the output. It is the Stanford POS tagger NIF 2.0 draft
>> wrapper. (Erratum: Context uses anchorOf instead of isString.)
>> Normally, the prefix parameter is variable and set as a config option.
>> Please don't worry about UUIDs. NIF and ITS in NIF don't need them. The
>> reason I included them was that I am writing a converter for Apache
>> Stanbol to NIF and ITS, and Stanbol uses UUIDs. I removed them from the wiki
>> page.
>>
>> So here are some corrections:
>>
>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#0_50> str:isString
>> "\r\n    \r\n        Welcome to Dublin in Ireland! \r\n    \r\n";
>>      str:occursIn <http://example.com/exampledoc.html> ;
>>      a <str:Context>.
>>
>> Should be:
>> <http://example.com/exampledoc.html#0_54>
>> str:isString
>> "\r\n    \r\n        Welcome to Dublin in Ireland! \r\n    \r\n";
>>      str:occursIn <http://example.com/exampledoc.html> ;
>>      a str:Context .
>> Character length of 54 is correct as this is based on Unicode Normal Form
>> C, counted in Code Units: http://unicode.org/faq/char_combmark.html#7
>>
>>
>> **************************
>>
>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#25_31> str:isString
>> "Dublin";
>>      str:occursIn <http://example.com/exampledoc.html> ;
>>      a <str:Context>.
>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#25_32> str:isString
>> "Ireland";
>>      str:occursIn <http://example.com/exampledoc.html> ;
>>      a <str:Context>.
>> Should be:
>> <http://example.com/exampledoc.html#31_37>
>>      str:anchorOf "Dublin";
>>      str:occursIn <http://example.com/exampledoc.html> ;
>>      a str:Context.
>> <http://example.com/exampledoc.html#41_48>
>>      str:anchorOf "Ireland";
>>      str:occursIn <http://example.com/exampledoc.html> ;
>>      a str:Context.
>>
>> The counts seem to be wrong. Other than that it already looks quite close.
>> All the best,
>> Sebastian
>>
>> On 09.08.2012 13:30, Felix Sasaki wrote:
>>
>>> Hi Sebastian, all,
>>>
>>> I tried to create the NIF output (since we need two implementations) for
>>>
>>> <html xmlns:its="http://www.w3.org/2005/11/its">
>>>       <body>
>>>           <h2 its:translate="yes">Welcome to <span its:translate="no"
>>>                   >Dublin</span> in <b its:translate="no">Ireland</b>!
>>> </h2>
>>>       </body>
>>> </html>
>>>
>>> (I used an XML input here, but otherwise this is the same as your
>>> example in the wiki.)
>>>
>>> Does the below output make sense? I am sure that the uuid is wrong, but I
>>> don't know how to generate one.
>>>
>>>
>>> [
>>>
>>> @prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#>.
>>> @prefix str: <http://nlp2rdf.lod2.eu/schema/string/>.
>>> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
>>> <http://example.com/exampledoc.html#offset_0_50>
>>> str:referenceContext
>>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#0_50>;
>>>          a <str:String>;
>>>          itsrdf:translate "yes"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo>.
>>> <http://example.com/exampledoc.html#offset_14_44>
>>> str:referenceContext
>>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#14_44>;
>>>          a <str:String>;
>>>          itsrdf:translate "yes"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo>.
>>> <http://example.com/exampledoc.html#offset_25_31>
>>> str:referenceContext
>>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#25_31>;
>>>          a <str:String>;
>>>          itsrdf:translate "no"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo>.
>>> <http://example.com/exampledoc.html#offset_25_32>
>>> str:referenceContext
>>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#25_32>;
>>>          a <str:String>;
>>>          itsrdf:translate "no"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo>.
>>> <http://example.com/exampledoc.html#offset_5_49>
>>> str:referenceContext
>>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#5_49>;
>>>          a <str:String>;
>>>          itsrdf:translate "yes"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo>.
>>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#0_50> str:isString
>>> "\r\n    \r\n        Welcome to Dublin in Ireland! \r\n    \r\n";
>>>          str:occursIn <http://example.com/exampledoc.html> ;
>>>          a <str:Context>.
>>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#14_44> str:isString
>>> "Welcome to Dublin in Ireland! ";
>>>          str:occursIn <http://example.com/exampledoc.html> ;
>>>          a <str:Context>.
>>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#25_31> str:isString
>>> "Dublin";
>>>          str:occursIn <http://example.com/exampledoc.html> ;
>>>          a <str:Context>.
>>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#25_32> str:isString
>>> "Ireland";
>>>          str:occursIn <http://example.com/exampledoc.html> ;
>>>          a <str:Context>.
>>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#5_49> str:isString
>>> "\r\n        Welcome to Dublin in Ireland! \r\n    ";
>>>          str:occursIn <http://example.com/exampledoc.html> ;
>>>          a <str:Context>.
>>>
>>> ]
>>>
>>> Thanks,
>>>
>>> Felix
>>>
>>> 2012/8/9 Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
>>>
>>>> Hi Jirka,
>>>> thanks for your feedback. I thought it was a requirement that the DOM
>>>> should not be touched. I really never had any whitespace problems in any
>>>> RDF serialization formats, so this was new to me. By the way, I can
>>>> understand now what your problem with the bloated mapping is. We really
>>>> don't need to serialize it. Actually it can be kept in memory, which is
>>>> more efficient. I added serialization as optional. Also I made an XML
>>>> version, because for transferring this kind of data, XML is much better
>>>> suited. (Is the XML alright?) I made all the changes you suggested; the
>>>> new version is online here:
>>>> http://wiki.nlp2rdf.org/index.php?title=ITS2NIF2ITS&oldid=622#Example
>>>>
>>>> all the best,
>>>> Sebastian
>>>>
>>>>
>>>> On 09.08.2012 11:59, Jirka Kosek wrote:
>>>>
>>>>> On 9.8.2012 11:47, Sebastian Hellmann wrote:
>>>>>
>>>>>> you found an interesting point.
>>>>>>
>>>>>> I wrote some notes on the optimization:
>>>>>> http://wiki.nlp2rdf.org/wiki/ITS2NIF2ITS#Notes_on_optional_optimizations
>>>>>> http://wiki.nlp2rdf.org/index.php?title=ITS2NIF2ITS&oldid=614#Notes_on_optional_optimizations
>>>>>>
>>>>>> I think, it  generally depends on the use case, whether you would
>>>>>> optimize.  Do you think we should specify/limit what optimizations are
>>>>>> possible?
>>>>>> It might be easier to explain implications to help developers,
>>>>>> but leave the implementation under-specified.
>>>>>> Do you think I should remove them from the algorithm description and
>>>>>> move them to a completely different section? Would this help the
>>>>>> structure of the document?
>>>>>>
>>>>> I think that the NIF mapping is so unnatural as it is that optimization can
>>>>> make it really messy. If the goal of the optimization is to create a less
>>>>> complex RDF representation, without blank text nodes and with
>>>>> whitespace-heavy text nodes trimmed, I think an easier and more workable
>>>>> approach would be to:
>>>>>
>>>>> - remove all whitespace optimization from the mapping algorithm
>>>>>
>>>>> - say that the algorithm can produce a lot of "phantom" predicates from
>>>>> excessive whitespace
>>>>>
>>>>> - recommend normalizing whitespace in the input XML/HTML/DOM in
>>>>> order to minimize such phantom predicates
>>>>>
>>>>> This way each user/application can create custom whitespace
>>>>> normalization based on the nature of the input data and we don't have to
>>>>> care about it.
>>>>>
>>>>> For example for your sample document it is safe (knowing HTML whitespace
>>>>> handling rules) to normalize it to
>>>>>
>>>>> <html><body><h2 translate = "yes" >Welcome to <span
>>>>> its-disambig-ident-ref = "http://dbpedia.org/resource/Dublin"
>>>>> translate
>>>>> = "no">Dublin</span> in <b translate="no">Ireland</b>!</h2></body></html>
>>>>>
>>>>> (Actually one line with no excessive whitespace.)
>>>>>
>>>>> Does this sound reasonable to my SemWeb-educated friends?
>>>>>
>>>>>                           Jirka
>>>>>
>>>>>
>


-- 
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Events:
   * http://sabre2012.infai.org/mlode (Leipzig, Sept. 23-24-25, 2012)
   * http://wole2012.eurecom.fr (*Deadline: July 31st 2012*)
Projects: http://nlp2rdf.org , http://dbpedia.org
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org
Received on Wednesday, 15 August 2012 07:45:23 UTC