- From: Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
- Date: Wed, 15 Aug 2012 09:44:41 +0200
- To: Felix Sasaki <fsasaki@w3.org>
- CC: Jirka Kosek <jirka@kosek.cz>, public-multilingualweb-lt@w3.org
Hi Felix,

there are some minor issues:

- Turtle syntax: "<str:Context>" should be either "str:Context" (no <>) or the full IRI <http://nlp2rdf.lod2.eu/schema/string/Context>.
- "offset" is sometimes missing, e.g. in "http://example.com/exampledoc.html#23_30".
- There is the open question whether the fragment that covers the whole content of the document is equal to the document:
  <http://example.com/exampledoc.html> owl:sameAs <http://example.com/exampledoc.html#offset_0_29>
  But this might be rather philosophical.
- RDF recommends Unicode Normal Form C: http://www.w3.org/TR/rdf-concepts/#section-Literals
  This is why we will make it mandatory. Some RDF parsers might complain if any literals are not in Unicode Normal Form C. Sometimes these are just warnings, and sometimes parsing fails completely.

Please see below for the correct output for the string "Dublin" in the context "Welcome to Dublin in Ireland!" occurring in http://example.com/exampledoc.html

I validated it with the command-line tools from libraptor2 (rapper) for Unix: http://librdf.org/raptor/rapper.html

[
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#>.
@prefix str: <http://nlp2rdf.lod2.eu/schema/string/>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.

# the reference context, i.e. the whole string that occurs in http://example.com/exampledoc.html
<http://example.com/exampledoc.html#offset_0_29>
    # encodes some simple provenance
    str:occursIn <http://example.com/exampledoc.html> ;
    # includes the whole string
    str:isString "Welcome to Dublin in Ireland!" ;
    a str:Context .

# this is the substring "Dublin"
<http://example.com/exampledoc.html#offset_11_17>
    a str:String ;
    str:anchorOf "Dublin" ;
    itsrdf:translate "no"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo> ;
    # all substrings have a reference to their context
    str:referenceContext <http://example.com/exampledoc.html#offset_0_29> .
]

All the best,
Sebastian

On 10.08.2012 10:03, Felix Sasaki wrote:
> Hi Sebastian, Jirka, all,
>
> thanks for the feedback. I have tried to integrate it into the output, with
> the (X)HTML file attached. This is the output:
>
> [
>
> @prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#>.
> @prefix str: <http://nlp2rdf.lod2.eu/schema/string/>.
> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
> <http://example.com/exampledoc.html#0_29> str:anchorOf
> <http://example.com/exampledoc.html>;
> str:isString "Welcome to Dublin in Ireland!";
> a <str:Context>.
> <http://example.com/exampledoc.html#11_17> str:anchorOf
> <http://example.com/exampledoc.html>;
> str:isString "Dublin";
> a <str:Context>.
> <http://example.com/exampledoc.html#23_30> str:anchorOf
> <http://example.com/exampledoc.html>;
> str:isString "Ireland";
> a <str:Context>.
> <http://example.com/exampledoc.html#offset_0_29> str:referenceContext
> <http://example.com/exampledoc.html#0_29>;
> a <str:String>;
> itsrdf:translate "yes"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo>.
> <http://example.com/exampledoc.html#offset_11_17> str:referenceContext
> <http://example.com/exampledoc.html#11_17>;
> a <str:String>;
> itsrdf:translate "no"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo>.
> <http://example.com/exampledoc.html#offset_23_30> str:referenceContext
> <http://example.com/exampledoc.html#23_30>;
> a <str:String>;
> itsrdf:translate "no"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo>.
>
> ]
>
>
> Let me know if there are any open issues with this output. One question: I
> don't understand your reference to normalization form C - do you require
> Unicode normalization for generating the output? The above offsets are based
> on non-normalized processing; let me know if this needs to be changed. We just
> need to have clear rules with regard to whitespace and normalization.
>
> Thanks,
>
> Felix
>
> 2012/8/9 Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
>
>> Hi Felix,
>> there are some syntactic errors: <str:String> .
>>
>> Maybe this helps:
>> curl -X POST --data-urlencode input="Apache Stanbol can detect entities." \
>>   --data input-type=text --data format=turtle \
>>   http://nlp2rdf.lod2.eu/demo/NIFStanfordCore
>> curl -X POST --data-urlencode input="Apache Stanbol can detect entities." \
>>   --data input-type=text --data format=turtle \
>>   --data-urlencode prefix="http://example.com/exampledoc.html#" \
>>   http://nlp2rdf.lod2.eu/demo/NIFStanfordCore
>> curl -X POST --data-urlencode input="Apache Stanbol can detect entities." \
>>   --data input-type=text --data format=turtle \
>>   --data-urlencode prefix="urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#" \
>>   http://nlp2rdf.lod2.eu/demo/NIFStanfordCore
>>
>> I also attached the output. It is the Stanford POS tagger NIF 2.0 draft
>> wrapper. (Errata: Context uses anchorOf instead of isString)
>> Normally, the prefix parameter is variable and set as a config option.
>> Please don't worry about UUIDs. NIF and ITS in NIF don't need them. The
>> reason I included them was that I am writing a converter for Apache
>> Stanbol to NIF and ITS, and Stanbol uses UUIDs. I removed them from the wiki
>> page.
>>
>> So here are some corrections:
>>
>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#0_50> str:isString
>> "\r\n \r\n Welcome to Dublin in Ireland! \r\n \r\n";
>> str:occursIn <http://example.com/exampledoc.html> ;
>> a <str:Context>.
>>
>> Should be:
>> <http://example.com/exampledoc.html#0_54>
>> str:isString
>> "\r\n \r\n Welcome to Dublin in Ireland! \r\n \r\n";
>> str:occursIn <http://example.com/exampledoc.html> ;
>> a str:Context .
>> A character length of 54 is correct, as this is based on Unicode Normal Form
>> C, counted in code units: http://unicode.org/faq/char_combmark.html#7
>>
>>
>> **************************
>>
>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#25_31> str:isString
>> "Dublin";
>> str:occursIn <http://example.com/exampledoc.html> ;
>> a <str:Context>.
>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#25_32> str:isString
>> "Ireland";
>> str:occursIn <http://example.com/exampledoc.html> ;
>> a <str:Context>.
>> Should be:
>> <http://example.com/exampledoc.html#31_37>
>> str:anchorOf "Dublin";
>> str:occursIn <http://example.com/exampledoc.html> ;
>> a str:Context.
>> <http://example.com/exampledoc.html#41_48>
>> str:anchorOf "Ireland";
>> str:occursIn <http://example.com/exampledoc.html> ;
>> a str:Context.
>>
>> The counts seem to be wrong. Other than that, it already looks quite close.
>> All the best,
>> Sebastian
>>
>> On 09.08.2012 13:30, Felix Sasaki wrote:
>>
>>> Hi Sebastian, all,
>>>
>>> I tried to create the NIF output (since we need two implementations) for
>>>
>>> <html xmlns:its="http://www.w3.org/2005/11/its">
>>> <body>
>>> <h2 its:translate="yes">Welcome to <span its:translate="no"
>>> >Dublin</span> in <b its:translate="no">Ireland</b>!
>>> </h2>
>>> </body>
>>> </html>
>>>
>>> (I used an XML input here, but otherwise this is the same as your
>>> example in the wiki.)
>>>
>>> Does the below output make sense? I am sure that the uuid is wrong, but I
>>> don't know how to generate one.
>>>
>>>
>>> [
>>>
>>> @prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#>.
>>> @prefix str: <http://nlp2rdf.lod2.eu/schema/string/>.
>>> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
>>> <http://example.com/exampledoc.html#offset_0_50>
>>> str:referenceContext
>>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#0_50>;
>>> a <str:String>;
>>> itsrdf:translate "yes"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo>.
>>> <http://example.com/exampledoc.html#offset_14_44>
>>> str:referenceContext
>>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#14_44>;
>>> a <str:String>;
>>> itsrdf:translate "yes"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo>.
>>> <http://example.com/exampledoc.html#offset_25_31>
>>> str:referenceContext
>>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#25_31>;
>>> a <str:String>;
>>> itsrdf:translate "no"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo>.
>>> <http://example.com/exampledoc.html#offset_25_32>
>>> str:referenceContext
>>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#25_32>;
>>> a <str:String>;
>>> itsrdf:translate "no"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo>.
>>> <http://example.com/exampledoc.html#offset_5_49>
>>> str:referenceContext
>>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#5_49>;
>>> a <str:String>;
>>> itsrdf:translate "yes"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo>.
>>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#0_50> str:isString
>>> "\r\n \r\n Welcome to Dublin in Ireland! \r\n \r\n";
>>> str:occursIn <http://example.com/exampledoc.html> ;
>>> a <str:Context>.
>>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#14_44> str:isString
>>> "Welcome to Dublin in Ireland! ";
>>> str:occursIn <http://example.com/exampledoc.html> ;
>>> a <str:Context>.
>>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#25_31> str:isString
>>> "Dublin";
>>> str:occursIn <http://example.com/exampledoc.html> ;
>>> a <str:Context>.
>>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#25_32> str:isString
>>> "Ireland";
>>> str:occursIn <http://example.com/exampledoc.html> ;
>>> a <str:Context>.
>>> <urn:uuid:CEB9FD94-6779-4257-B992-C853617CB791#5_49> str:isString
>>> "\r\n Welcome to Dublin in Ireland! \r\n ";
>>> str:occursIn <http://example.com/exampledoc.html> ;
>>> a <str:Context>.
>>>
>>> ]
>>>
>>> Thanks,
>>>
>>> Felix
>>>
>>> 2012/8/9 Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
>>>
>>>> Hi Jirka,
>>>> thanks for your feedback. I thought it was a requirement that the DOM
>>>> should not be touched. I have never really had any whitespace problems in
>>>> any RDF serialization format, so this was new to me. By the way, I now
>>>> understand what your problem with the bloated mapping is. We really
>>>> don't need to serialize it. Actually, it can be kept in memory, which is
>>>> more efficient. I added serialization as optional. I also made an XML
>>>> version, because XML is much better suited for transferring this kind of
>>>> data. (Is the XML alright?) I made all the changes you suggested; the
>>>> new version is online here:
>>>> http://wiki.nlp2rdf.org/index.php?title=ITS2NIF2ITS&oldid=622#Example
>>>>
>>>> all the best,
>>>> Sebastian
>>>>
>>>>
>>>> On 09.08.2012 11:59, Jirka Kosek wrote:
>>>>
>>>> On 9.8.2012 11:47, Sebastian Hellmann wrote:
>>>>
>>>>> you found an interesting point.
>>>>>
>>>>>> I wrote some notes on the optimization:
>>>>>> http://wiki.nlp2rdf.org/wiki/ITS2NIF2ITS#Notes_on_optional_optimizations
>>>>>> http://wiki.nlp2rdf.org/index.php?title=ITS2NIF2ITS&oldid=614#Notes_on_optional_optimizations
>>>>>>
>>>>>> I think it generally depends on the use case whether you would
>>>>>> optimize. Do you think we should specify/limit what optimizations are
>>>>>> possible?
>>>>>> It might be easier to explain the implications to help developers,
>>>>>> but leave the implementation under-specified.
>>>>>> Do you think I should remove them from the algorithm description and
>>>>>> move them to a completely different section? Would this help the
>>>>>> structure of the document?
>>>>>>
>>>>> I think that the NIF mapping is so unnatural as it is that optimization
>>>>> can make it really messy. If the goal of optimization was to create a less
>>>>> complex RDF representation, with no blank text nodes and with text nodes
>>>>> trimmed of excessive whitespace, I think an easier and workable
>>>>> approach would be to:
>>>>>
>>>>> - remove all whitespace optimization from the mapping algorithm
>>>>>
>>>>> - say that the algorithm can produce a lot of "phantom" predicates from
>>>>> excessive whitespace
>>>>>
>>>>> - recommend normalizing whitespace in the input XML/HTML/DOM in
>>>>> order to minimize such phantom predicates
>>>>>
>>>>> This way each user/application can create custom whitespace
>>>>> normalization based on the nature of the input data, and we don't have to
>>>>> care about it.
>>>>>
>>>>> For example, for your sample document it is safe (knowing HTML whitespace
>>>>> handling rules) to normalize it to
>>>>>
>>>>> <html><body><h2 translate = "yes" >Welcome to <span its-disambig-ident-ref = "http://dbpedia.org/resource/Dublin" translate = "no">Dublin</span> in <b translate="no">Ireland</b>!</h2></body></html>
>>>>>
>>>>> (Actually one line with no excessive whitespace.)
>>>>>
>>>>> Does this sound reasonable to my SemWeb-educated friends?
>>>>>
>>>>> Jirka
>>>>>
>>>>>
>>>> --
>>>> Dipl. Inf. Sebastian Hellmann
>>>> Department of Computer Science, University of Leipzig
>>>> Events:
>>>> * http://sabre2012.infai.org/mlode (Leipzig, Sept. 23-24-25, 2012)
>>>> * http://wole2012.eurecom.fr (*Deadline: July 31st 2012*)
>>>> Projects: http://nlp2rdf.org , http://dbpedia.org
>>>> Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
>>>> Research Group: http://aksw.org
>>>>
>>>>
>> --
>> Dipl. Inf. Sebastian Hellmann
>> Department of Computer Science, University of Leipzig
>> Events:
>> * http://sabre2012.infai.org/mlode (Leipzig, Sept. 23-24-25, 2012)
>> * http://wole2012.eurecom.fr (*Deadline: July 31st 2012*)
>> Projects: http://nlp2rdf.org , http://dbpedia.org
>> Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
>> Research Group: http://aksw.org
>>
>

--
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Events:
* http://sabre2012.infai.org/mlode (Leipzig, Sept. 23-24-25, 2012)
* http://wole2012.eurecom.fr (*Deadline: July 31st 2012*)
Projects: http://nlp2rdf.org , http://dbpedia.org
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org
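
P.S. The offsets used in this thread for the plain string "Welcome to Dublin in Ireland!" can be reproduced with a few lines of Python. This is only a sketch under assumptions: the helper names (to_nfc, offset_uri, anchor) are made up for illustration and are not part of NIF or of any library. Python counts code points, while the earlier mail speaks of code units; assuming UTF-16 code units are meant, the two counts agree only as long as the text stays within the Basic Multilingual Plane.

```python
import unicodedata

DOC = "http://example.com/exampledoc.html"
CONTEXT = "Welcome to Dublin in Ireland!"


def to_nfc(text):
    """Return the text in Unicode Normal Form C."""
    return unicodedata.normalize("NFC", text)


def offset_uri(doc, begin, end):
    """Build an offset-based fragment URI (hypothetical helper)."""
    return f"{doc}#offset_{begin}_{end}"


def anchor(context, substring):
    """Return (begin, end) of the first occurrence of substring in context.

    Both strings are normalized to NFC first, so the offsets refer to the
    normalized text. Offsets are counted in code points (an assumption,
    see the note above). Raises ValueError if the substring is absent.
    """
    context, substring = to_nfc(context), to_nfc(substring)
    begin = context.index(substring)
    return begin, begin + len(substring)


# The context covers the whole string, from 0 to its length.
context_uri = offset_uri(DOC, 0, len(to_nfc(CONTEXT)))
begin, end = anchor(CONTEXT, "Dublin")

print(context_uri)
print(offset_uri(DOC, begin, end))
```

For the string above this prints the #offset_0_29 and #offset_11_17 URIs from the corrected output. Normalizing before counting matters because a decomposed character (for example "e" followed by a combining acute accent) is two code points before NFC and one after, which would shift every later offset.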
Received on Wednesday, 15 August 2012 07:45:23 UTC