Re: [ISSUE-29][ACTION-164] ITS2NIF2ITS - RDF roundtrip from Sebastian Hellmann on 2012-08-09 (public-multilingualweb-lt@w3.org from August 2012)

From: Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
Date: Thu, 09 Aug 2012 15:32:02 +0200
To: Felix Sasaki <fsasaki@w3.org>
CC: Jirka Kosek <jirka@kosek.cz>, public-multilingualweb-lt@w3.org
Message-ID: <5023BBD2.6030207@informatik.uni-leipzig.de>
Hi Felix,
there are some syntactic errors, e.g. <str:String> should be written str:String (a prefixed name takes no angle brackets).

Maybe this helps:
curl -X POST --data-urlencode input="Apache Stanbol can detect entities." \
     --data input-type=text --data format=turtle \
     http://nlp2rdf.lod2.eu/demo/NIFStanfordCore

curl -X POST --data-urlencode input="Apache Stanbol can detect entities." \
     --data input-type=text --data format=turtle \
     --data-urlencode prefix="http://example.com/exampledoc.html#" \
     http://nlp2rdf.lod2.eu/demo/NIFStanfordCore

curl -X POST --data-urlencode input="Apache Stanbol can detect entities." \
     --data input-type=text --data format=turtle \
     --data-urlencode prefix="urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#" \
     http://nlp2rdf.lod2.eu/demo/NIFStanfordCore
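
In case curl is not at hand, here is a minimal Python sketch of the same request. The parameter names and the endpoint are taken from the curl calls above; the function name build_request is only for illustration, and the request is built but not sent, since the demo service may not always be reachable.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint as used in the curl calls above.
SERVICE = "http://nlp2rdf.lod2.eu/demo/NIFStanfordCore"


def build_request(text, prefix=None):
    """Build (but do not send) the form-encoded POST request.

    build_request is a hypothetical helper name, not part of any NIF tooling.
    """
    params = {"input": text, "input-type": "text", "format": "turtle"}
    if prefix is not None:
        # The prefix becomes the base of the generated string URIs.
        params["prefix"] = prefix
    body = urlencode(params).encode("ascii")
    return Request(SERVICE, data=body, method="POST")


req = build_request(
    "Apache Stanbol can detect entities.",
    prefix="http://example.com/exampledoc.html#",
)
print(req.get_method(), req.full_url)
print(req.data.decode("ascii"))
```

Passing req to urllib.request.urlopen would send it; the response body should then be the Turtle output, assuming the service behaves as it does for curl.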

I also attached the output. It comes from the Stanford POS tagger NIF 2.0 
draft wrapper. (Erratum: Context uses anchorOf instead of isString.)
Normally, the prefix parameter is variable and set as a config option. 
Please don't worry about UUIDs. NIF and ITS in NIF don't need them. I 
only included them because I am writing a converter for Apache Stanbol 
to NIF and ITS, and Stanbol uses UUIDs. I removed them from the wiki page.

So here are some corrections:
<urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#0_50> str:isString
"\r\n    \r\n        Welcome to Dublin in Ireland! \r\n    \r\n";
     str:occursIn <http://example.com/exampledoc.html>;
     a <str:Context>.

Should be:
<http://example.com/exampledoc.html#0_54> str:isString
"\r\n    \r\n        Welcome to Dublin in Ireland! \r\n    \r\n";
     str:occursIn <http://example.com/exampledoc.html>;
     a str:Context .
The character length of 54 is correct, as it is based on Unicode 
Normalization Form C, counted in code units: 
http://unicode.org/faq/char_combmark.html#7
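
A small Python sketch of this counting rule, assuming "code units" means UTF-16 code units (my reading of the FAQ entry linked above). The helper name nif_length is made up for this mail.

```python
import unicodedata


def nif_length(text):
    """Length of text in UTF-16 code units after NFC normalization."""
    nfc = unicodedata.normalize("NFC", text)
    # Every UTF-16 code unit is two bytes in the utf-16-le encoding.
    return len(nfc.encode("utf-16-le")) // 2


# The context string from the example, written out piece by piece so that
# the whitespace runs are easy to check.
context = (
    "\r\n" + " " * 4 + "\r\n" + " " * 8
    + "Welcome to Dublin in Ireland! "
    + "\r\n" + " " * 4 + "\r\n"
)

doc = "http://example.com/exampledoc.html"
print(f"{doc}#0_{nif_length(context)}")
# prints http://example.com/exampledoc.html#0_54
```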


**************************
<urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#25_31> str:isString "Dublin";
     str:occursIn <http://example.com/exampledoc.html>;
     a <str:Context>.
<urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#25_32> str:isString 
"Ireland";
     str:occursIn <http://example.com/exampledoc.html>;
     a <str:Context>.
Should be:
<http://example.com/exampledoc.html#31_37>
     str:anchorOf "Dublin";
     str:occursIn <http://example.com/exampledoc.html>;
     a str:Context.
<http://example.com/exampledoc.html#41_48>
     str:anchorOf "Ireland";
     str:occursIn <http://example.com/exampledoc.html>;
     a str:Context.
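
The bracket problem in the corrections above (<str:Context> instead of str:Context) can be spotted mechanically. The following is only a rough, regex-based sketch and not a Turtle parser; the function name is hypothetical. It reports every <prefix:local> whose prefix was declared with @prefix, since that is most likely a prefixed name that was wrapped in angle brackets by mistake.

```python
import re

# Matches "@prefix name: <iri>" declarations and captures the prefix name.
PREFIX_DECL = re.compile(r"@prefix\s+([A-Za-z][\w-]*):\s*<[^>]*>")
# Matches anything of the form <scheme:rest> and captures both parts.
BRACKETED = re.compile(r"<([A-Za-z][\w-]*):([^<>\s]*)>")


def bracketed_prefixed_names(turtle):
    """Return prefixed names that appear inside angle brackets."""
    declared = set(PREFIX_DECL.findall(turtle))
    return [
        f"{prefix}:{local}"
        for prefix, local in BRACKETED.findall(turtle)
        if prefix in declared  # http, urn, ... are not declared prefixes
    ]


sample = """@prefix str: <http://nlp2rdf.lod2.eu/schema/string/>.
<http://example.com/exampledoc.html#0_54> a <str:Context>.
<urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#0_50> a <str:String>.
"""
print(bracketed_prefixed_names(sample))
# prints ['str:Context', 'str:String']
```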

The counts seem to be wrong. Other than that, it already looks quite close.
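
To illustrate how begin and end offsets can be derived, here is a minimal Python sketch. To keep it short it assumes a whitespace-normalized context ("Welcome to Dublin in Ireland!"), which is not the context string used above, so the numbers differ from the ones in this mail. It also only looks at the first occurrence of the anchor text; substring_uri is a made-up helper name.

```python
import unicodedata


def code_units(text):
    """Number of UTF-16 code units in text."""
    return len(text.encode("utf-16-le")) // 2


def substring_uri(doc, context, anchor):
    """Build an offset-based URI for the first occurrence of anchor."""
    context = unicodedata.normalize("NFC", context)
    anchor = unicodedata.normalize("NFC", anchor)
    start = context.find(anchor)
    if start < 0:
        raise ValueError(f"{anchor!r} does not occur in the context")
    begin = code_units(context[:start])  # code units before the anchor
    end = begin + code_units(anchor)     # end offset is exclusive
    return f"{doc}#{begin}_{end}"


doc = "http://example.com/exampledoc.html"
context = "Welcome to Dublin in Ireland!"  # assumed, normalized context
for anchor in ("Dublin", "Ireland"):
    print(substring_uri(doc, context, anchor))
# prints http://example.com/exampledoc.html#11_17
# prints http://example.com/exampledoc.html#21_28
```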
All the best,
Sebastian

On 09.08.2012 13:30, Felix Sasaki wrote:
> Hi Sebastian, all,
>
> I tried to create the NIF output (since we need two implementations) for
>
> <html xmlns:its="http://www.w3.org/2005/11/its">
>      <body>
>          <h2 its:translate="yes">Welcome to <span its:translate="no"
>                  >Dublin</span> in <b its:translate="no">Ireland</b>! </h2>
>      </body>
> </html>
>
> (I used an XML input here, but otherwise this is the same as your example
> in the wiki.)
>
> Does the below output make sense? I am sure that the uuid is wrong, but I
> don't know how to generate one.
>
>
> [
>
> @prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#>.
> @prefix str: <http://nlp2rdf.lod2.eu/schema/string/>.
> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
> <http://example.com/exampledoc.html#offset_0_50> str:referenceContext
> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#0_50>;
> 	a <str:String>;
> 	itsrdf:translate "yes"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo>.
> <http://example.com/exampledoc.html#offset_14_44> str:referenceContext
> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#14_44>;
> 	a <str:String>;
> 	itsrdf:translate "yes"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo>.
> <http://example.com/exampledoc.html#offset_25_31> str:referenceContext
> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#25_31>;
> 	a <str:String>;
> 	itsrdf:translate "no"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo>.
> <http://example.com/exampledoc.html#offset_25_32> str:referenceContext
> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#25_32>;
> 	a <str:String>;
> 	itsrdf:translate "no"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo>.
> <http://example.com/exampledoc.html#offset_5_49> str:referenceContext
> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#5_49>;
> 	a <str:String>;
> 	itsrdf:translate "yes"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo>.
> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#0_50> str:isString
> "\r\n    \r\n        Welcome to Dublin in Ireland! \r\n    \r\n";
> 	str:occursIn <http://example.com/exampledoc.html>;
> 	a <str:Context>.
> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#14_44> str:isString
> "Welcome to Dublin in Ireland! ";
> 	str:occursIn <http://example.com/exampledoc.html>;
> 	a <str:Context>.
> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#25_31> str:isString "Dublin";
> 	str:occursIn <http://example.com/exampledoc.html>;
> 	a <str:Context>.
> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#25_32> str:isString "Ireland";
> 	str:occursIn <http://example.com/exampledoc.html>;
> 	a <str:Context>.
> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#5_49> str:isString
> "\r\n        Welcome to Dublin in Ireland! \r\n    ";
> 	str:occursIn <http://example.com/exampledoc.html>;
> 	a <str:Context>.
>
> ]
>
> Thanks,
>
> Felix
>
> 2012/8/9 Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
>
>> Hi Jirka,
>> thanks for your feedback. I thought it was a requirement that the DOM
>> should not be touched. I really never had any whitespace problems in any
>> RDF serialization format, so this was new to me. By the way, I can
>> understand now what your problem with the bloated mapping is. We really
>> don't need to serialize it. Actually it can be kept in memory, which is
>> more efficient. I added serialization as optional. I also made an XML
>> version, because XML is much better suited for transferring this kind of
>> data. (Is the XML alright?) I made all the changes you suggested; the
>> new version is online here:
>> http://wiki.nlp2rdf.org/index.php?title=ITS2NIF2ITS&oldid=622#Example
>>
>> all the best,
>> Sebastian
>>
>>
>> On 09.08.2012 11:59, Jirka Kosek wrote:
>>
>>   On 9.8.2012 11:47, Sebastian Hellmann wrote:
>>>   you found an interesting point.
>>>> I wrote some notes on the optimization:
>>>> http://wiki.nlp2rdf.org/wiki/ITS2NIF2ITS#Notes_on_optional_optimizations
>>>> http://wiki.nlp2rdf.org/index.php?title=ITS2NIF2ITS&oldid=614#Notes_on_optional_optimizations
>>>>
>>>> I think it generally depends on the use case whether you would
>>>> optimize. Do you think we should specify/limit what optimizations are
>>>> possible?
>>>> It might be easier to explain implications to help developers,
>>>> but leave the implementation under-specified.
>>>> Do you think I should remove them from the algorithm description and
>>>> move them to a completely different section? Would this help the
>>>> structure of the document?
>>>>
>>> I think that the NIF mapping is so unnatural as it is that optimization
>>> can make it really messy. If the goal of optimization is to create a
>>> less complex RDF representation, without blank text nodes and with text
>>> nodes that have a lot of whitespace trimmed, I think an easier and
>>> workable approach would be to:
>>>
>>> - remove all whitespace optimization from the mapping algorithm
>>>
>>> - say that the algorithm can produce a lot of "phantom" predicates from
>>> excessive whitespace
>>>
>>> - recommend normalizing whitespace in the input XML/HTML/DOM in
>>> order to minimize such phantom predicates
>>>
>>> This way each user/application can create custom whitespace
>>> normalization based on nature of input data and we don't have to care
>>> about it.
>>>
>>> For example for your sample document it is safe (knowing HTML whitespace
>>> handling rules) to normalize it to
>>>
>>> <html><body><h2 translate="yes">Welcome to <span
>>> its-disambig-ident-ref="http://dbpedia.org/resource/Dublin"
>>> translate="no">Dublin</span> in <b translate="no">Ireland</b>!</h2></body></html>
>>>
>>> (Actually one line with no excessive whitespace.)
>>>
>>> Does this sound reasonable to my SemWeb-educated friends?
>>>
>>>                          Jirka
>>>
>>>
>> --
>> Dipl. Inf. Sebastian Hellmann
>> Department of Computer Science, University of Leipzig
>> Events:
>>    * http://sabre2012.infai.org/mlode (Leipzig, Sept. 23-24-25, 2012)
>>    * http://wole2012.eurecom.fr (*Deadline: July 31st 2012*)
>> Projects: http://nlp2rdf.org , http://dbpedia.org
>> Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
>> Research Group: http://aksw.org
>>
>>
>>
>


-- 
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Events:
   * http://sabre2012.infai.org/mlode (Leipzig, Sept. 23-24-25, 2012)
   * http://wole2012.eurecom.fr (*Deadline: July 31st 2012*)
Projects: http://nlp2rdf.org , http://dbpedia.org
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org
Attachments

text/plain attachment: stanford.example.ttl
text/plain attachment: stanford.noprefix..ttl
text/plain attachment: stanford.urn.ttl
Received on Thursday, 9 August 2012 13:32:30 UTC