Re: [ISSUE-29][ACTION-164] ITS2NIF2ITS - RDF roundtrip

Hi Sebastian, Jirka, all,

thanks for the feedback. I have tried to integrate it into the output, with
the (X)HTML file attached. This is the output:

[

@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#>.
@prefix str: <http://nlp2rdf.lod2.eu/schema/string/>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
<http://example.com/exampledoc.html#0_29> str:anchorOf
<http://example.com/exampledoc.html>;
	str:isString "Welcome to Dublin in Ireland!";
	a <str:Context>.
<http://example.com/exampledoc.html#11_17> str:anchorOf
<http://example.com/exampledoc.html>;
	str:isString "Dublin";
	a <str:Context>.
<http://example.com/exampledoc.html#23_30> str:anchorOf
<http://example.com/exampledoc.html>;
	str:isString "Ireland";
	a <str:Context>.
<http://example.com/exampledoc.html#offset_0_29> str:referenceContext
<http://example.com/exampledoc.html#0_29>;
	a <str:String>;
	itsrdf:translate "yes"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo>.
<http://example.com/exampledoc.html#offset_11_17> str:referenceContext
<http://example.com/exampledoc.html#11_17>;
	a <str:String>;
	itsrdf:translate "no"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo>.
<http://example.com/exampledoc.html#offset_23_30> str:referenceContext
<http://example.com/exampledoc.html#23_30>;
	a <str:String>;
	itsrdf:translate "no"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo>.

]
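
A minimal sketch of how offset-based fragment identifiers like the ones above
could be derived. It is written in Python; the function name offset_uri is
hypothetical (not part of NIF or ITS), and it assumes zero-based offsets with
an exclusive end, counted on the plain text of the context:

```python
def offset_uri(base, context, substring, start_from=0):
    """Build '<base>#<begin>_<end>' for the first occurrence of substring.

    Offsets are zero-based, the end offset is exclusive, and counting is
    done on Python string indices (Unicode code points).
    """
    begin = context.find(substring, start_from)
    if begin < 0:
        raise ValueError(f"{substring!r} does not occur in the context")
    end = begin + len(substring)
    return f"{base}#{begin}_{end}"


if __name__ == "__main__":
    base = "http://example.com/exampledoc.html"
    context = "Welcome to Dublin in Ireland!"
    # The context string itself spans the whole text.
    print(offset_uri(base, context, context))
    # An annotated substring inside the context.
    print(offset_uri(base, context, "Dublin"))
```

For the sample sentence this gives #0_29 for the context and #11_17 for
"Dublin". Repeated substrings would need the start_from argument to pick the
intended occurrence.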


Let me know if there are any open issues with this output. One question: I
don't understand your reference to normalization form C. Do you require
Unicode normalization when generating the output? The offsets above are based
on non-normalized processing; let me know if this needs to be changed. We just
need clear rules with regard to whitespace and normalization.
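
To illustrate why such rules matter, here is a small Python sketch. The helper
names and the sample string with the accent are made up for illustration; the
sketch only assumes that offsets are counted in UTF-16 code units, optionally
after normalization. For plain ASCII text such as the example sentence the
normalized and non-normalized offsets agree, but a decomposed accent shifts
every later offset by one:

```python
import unicodedata


def utf16_units(text):
    """Number of UTF-16 code units needed to encode text."""
    return len(text.encode("utf-16-le")) // 2


def offsets(context, substring, form=None):
    """Return (begin, end) of substring in context, in UTF-16 code units.

    If form is given (e.g. "NFC"), both strings are normalized first.
    """
    if form is not None:
        context = unicodedata.normalize(form, context)
        substring = unicodedata.normalize(form, substring)
    begin_cp = context.find(substring)
    if begin_cp < 0:
        raise ValueError(f"{substring!r} does not occur in the context")
    begin = utf16_units(context[:begin_cp])
    return begin, begin + utf16_units(substring)


if __name__ == "__main__":
    # 'e' followed by U+0301 COMBINING ACUTE ACCENT (decomposed form).
    decomposed = "Cafe\u0301 in Dublin"
    print(offsets(decomposed, "Dublin"))         # non-normalized: (9, 15)
    print(offsets(decomposed, "Dublin", "NFC"))  # NFC-normalized: (8, 14)
    # Pure ASCII input: normalization makes no difference.
    print(offsets("Welcome to Dublin in Ireland!", "Dublin", "NFC"))  # (11, 17)
```

So two implementations only produce the same URIs if they agree on both the
normalization form and the counting unit.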

Thanks,

Felix

2012/8/9 Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>

> Hi Felix,
> there are some syntactic errors: <str:String> .
>
> Maybe this helps:
> curl -X POST --data-urlencode input="Apache Stanbol can detect entities." \
>   --data input-type=text --data format=turtle \
>   http://nlp2rdf.lod2.eu/demo/NIFStanfordCore
> curl -X POST --data-urlencode input="Apache Stanbol can detect entities." \
>   --data input-type=text --data format=turtle \
>   --data-urlencode prefix="http://example.com/exampledoc.html#" \
>   http://nlp2rdf.lod2.eu/demo/NIFStanfordCore
> curl -X POST --data-urlencode input="Apache Stanbol can detect entities." \
>   --data input-type=text --data format=turtle \
>   --data-urlencode prefix="urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#" \
>   http://nlp2rdf.lod2.eu/demo/NIFStanfordCore
>
> I also attached the output. It is the Stanford POS tagger NIF 2.0 draft
> wrapper. (Erratum: Context uses anchorOf instead of isString.)
> Normally, the prefix parameter is variable and set as a config option.
> Please don't worry about UUIDs. NIF and ITS in NIF don't need them. The
> reason I included them was that I am writing a converter for Apache
> Stanbol to NIF and ITS, and Stanbol uses UUIDs. I removed them from the wiki
> page.
>
> So here are some corrections:
>
> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#0_50> str:isString
> "\r\n    \r\n        Welcome to Dublin in Ireland! \r\n    \r\n";
>     str:occursIn <http://example.com/exampledoc.html>;
>     a <str:Context>.
>
> Should be:
> <http://example.com/exampledoc.html#0_54> str:isString
> "\r\n    \r\n        Welcome to Dublin in Ireland! \r\n    \r\n";
>     str:occursIn <http://example.com/exampledoc.html>;
>     a str:Context .
> Character length of 54 is correct as this is based on Unicode Normalization
> Form C, counted in code units: http://unicode.org/faq/char_combmark.html#7
>
>
> **************************
>
> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#25_31> str:isString
> "Dublin";
>     str:occursIn <http://example.com/exampledoc.html>;
>     a <str:Context>.
> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#25_32> str:isString
> "Ireland";
>     str:occursIn <http://example.com/exampledoc.html>;
>     a <str:Context>.
> Should be:
> <http://example.com/exampledoc.html#31_37>
>     str:anchorOf "Dublin";
>     str:occursIn <http://example.com/exampledoc.html>;
>     a str:Context.
> <http://example.com/exampledoc.html#41_48>
>     str:anchorOf "Ireland";
>     str:occursIn <http://example.com/exampledoc.html>;
>     a str:Context.
>
> The counts seem to be wrong. Other than that it looks already quite close.
> All the best,
> Sebastian
>
> On 09.08.2012 13:30, Felix Sasaki wrote:
>
>> Hi Sebastian, all,
>>
>> I tried to create the NIF output (since we need two implementations) for
>>
>> <html xmlns:its="http://www.w3.org/2005/11/its">
>>      <body>
>>          <h2 its:translate="yes">Welcome to <span its:translate="no"
>>                  >Dublin</span> in <b its:translate="no">Ireland</b>!
>> </h2>
>>      </body>
>> </html>
>>
>> (I used an XML input here, but otherwise this is the same as your example
>> in the wiki.)
>>
>> Does the below output make sense? I am sure that the uuid is wrong, but I
>> don't know how to generate one.
>>
>>
>> [
>>
>> @prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#>.
>> @prefix str: <http://nlp2rdf.lod2.eu/schema/string/>.
>> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
>> <http://example.com/exampledoc.html#offset_0_50>
>> str:referenceContext
>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#0_50>;
>>         a <str:String>;
>>         itsrdf:translate "yes"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo>.
>> <http://example.com/exampledoc.html#offset_14_44>
>> str:referenceContext
>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#14_44>;
>>         a <str:String>;
>>         itsrdf:translate "yes"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo>.
>> <http://example.com/exampledoc.html#offset_25_31>
>> str:referenceContext
>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#25_31>;
>>         a <str:String>;
>>         itsrdf:translate "no"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo>.
>> <http://example.com/exampledoc.html#offset_25_32>
>> str:referenceContext
>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#25_32>;
>>         a <str:String>;
>>         itsrdf:translate "no"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo>.
>> <http://example.com/exampledoc.html#offset_5_49>
>> str:referenceContext
>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#5_49>;
>>         a <str:String>;
>>         itsrdf:translate "yes"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo>.
>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#0_50> str:isString
>> "\r\n    \r\n        Welcome to Dublin in Ireland! \r\n    \r\n";
>>         str:occursIn <http://example.com/exampledoc.html>;
>>         a <str:Context>.
>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#14_44> str:isString
>> "Welcome to Dublin in Ireland! ";
>>         str:occursIn <http://example.com/exampledoc.html>;
>>         a <str:Context>.
>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#25_31> str:isString
>> "Dublin";
>>         str:occursIn <http://example.com/exampledoc.html>;
>>         a <str:Context>.
>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#25_32> str:isString
>> "Ireland";
>>         str:occursIn <http://example.com/exampledoc.html>;
>>         a <str:Context>.
>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#5_49> str:isString
>> "\r\n        Welcome to Dublin in Ireland! \r\n    ";
>>         str:occursIn <http://example.com/exampledoc.html>;
>>         a <str:Context>.
>>
>> ]
>>
>> Thanks,
>>
>> Felix
>>
>> 2012/8/9 Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
>>
>>> Hi Jirka,
>>> thanks for your feedback. I thought it was a requirement that the DOM
>>> should not be touched. I never really had any whitespace problems in any
>>> RDF serialization format, so this was new to me. By the way, I now
>>> understand what your problem with the bloated mapping is. We really
>>> don't need to serialize it. It can actually be kept in memory, which is
>>> more efficient. I made serialization optional. I also made an XML
>>> version, because XML is much better suited for transferring this kind of
>>> data. (Is the XML alright?) I made all the changes you suggested; the
>>> new version is online here:
>>> http://wiki.nlp2rdf.org/index.php?title=ITS2NIF2ITS&oldid=622#Example
>>>
>>>
>>> all the best,
>>> Sebastian
>>>
>>>
>>> On 09.08.2012 11:59, Jirka Kosek wrote:
>>>
>>>> On 9.8.2012 11:47, Sebastian Hellmann wrote:
>>>>
>>>>> you found an interesting point.
>>>>>
>>>>> I wrote some notes on the optimization:
>>>>> http://wiki.nlp2rdf.org/wiki/ITS2NIF2ITS#Notes_on_optional_optimizations
>>>>> http://wiki.nlp2rdf.org/index.php?title=ITS2NIF2ITS&oldid=614#Notes_on_optional_optimizations
>>>>>
>>>>>
>>>>> I think whether you would optimize generally depends on the use case.
>>>>> Do you think we should specify or limit which optimizations are
>>>>> possible?
>>>>> It might be easier to explain the implications to help developers,
>>>>> but leave the implementation under-specified.
>>>>> Do you think I should remove them from the algorithm description and
>>>>> move them to a completely different section? Would this help the
>>>>> structure of the document?
>>>>>
>>>> I think that the NIF mapping is so unnatural as it is that optimization
>>>> can make it really messy. If the goal of optimization is to create a less
>>>> complex RDF representation, without blank text nodes and with text nodes
>>>> trimmed of excess whitespace, I think an easier and more workable
>>>> approach would be to:
>>>>
>>>> - remove all whitespace optimization from the mapping algorithm
>>>>
>>>> - say that the algorithm can produce a lot of "phantom" predicates from
>>>> excessive whitespace
>>>>
>>>> - recommend normalizing whitespace in the input XML/HTML/DOM in
>>>> order to minimize such phantom predicates
>>>>
>>>> This way each user/application can create a custom whitespace
>>>> normalization based on the nature of the input data, and we don't have to
>>>> care about it.
>>>>
>>>> For example, for your sample document it is safe (knowing the HTML
>>>> whitespace handling rules) to normalize it to
>>>>
>>>> <html><body><h2 translate = "yes" >Welcome to <span
>>>> its-disambig-ident-ref = "http://dbpedia.org/resource/Dublin"
>>>> translate
>>>> = "no">Dublin</span> in <b translate="no">Ireland</b>!</h2></body></html>
>>>>
>>>> (Actually one line with no excessive whitespace.)
>>>>
>>>> Does this sound reasonable to my SemWeb-educated friends?
>>>>
>>>>                          Jirka
>>>>
>>>>
>>> --
>>> Dipl. Inf. Sebastian Hellmann
>>> Department of Computer Science, University of Leipzig
>>> Events:
>>>    * http://sabre2012.infai.org/mlode (Leipzig, Sept. 23-24-25, 2012)
>>>    * http://wole2012.eurecom.fr (*Deadline: July 31st 2012*)
>>> Projects: http://nlp2rdf.org , http://dbpedia.org
>>> Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
>>> Research Group: http://aksw.org
>>>
>>>
>>>
>>>
>>
>
> --
> Dipl. Inf. Sebastian Hellmann
> Department of Computer Science, University of Leipzig
> Events:
>   * http://sabre2012.infai.org/mlode (Leipzig, Sept. 23-24-25, 2012)
>   * http://wole2012.eurecom.fr (*Deadline: July 31st 2012*)
> Projects: http://nlp2rdf.org , http://dbpedia.org
> Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
> Research Group: http://aksw.org
>
>


-- 
Felix Sasaki
DFKI / W3C Fellow

Received on Friday, 10 August 2012 08:04:22 UTC