Re: [ISSUE-29][ACTION-164] ITS2NIF2ITS - RDF roundtrip from Sebastian Hellmann on 2012-08-09 (public-multilingualweb-lt@w3.org from August 2012)

From: Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
Date: Thu, 09 Aug 2012 13:06:05 +0200
To: Jirka Kosek <jirka@kosek.cz>
CC: public-multilingualweb-lt@w3.org
Message-ID: <5023999D.4050201@informatik.uni-leipzig.de>

Hi Jirka,
thanks for your feedback. I thought it was a requirement that the DOM
should not be touched. I have never had whitespace problems in any RDF
serialization format, so this was new to me. By the way, I now
understand what your problem with the bloated mapping is. We really
don't need to serialize it; it can be kept in memory, which is more
efficient. I made serialization optional. I also made an XML version,
because XML is much better suited for transferring this kind of data.
(Is the XML alright?) I made all the changes you suggested; the new
version is online here:
http://wiki.nlp2rdf.org/index.php?title=ITS2NIF2ITS&oldid=622#Example

all the best,
Sebastian


Am 09.08.2012 11:59, schrieb Jirka Kosek:
> On 9.8.2012 11:47, Sebastian Hellmann wrote:
>
>> you found an interesting point.
>>
>> I wrote some notes on the optimization:
>> http://wiki.nlp2rdf.org/wiki/ITS2NIF2ITS#Notes_on_optional_optimizations
>> http://wiki.nlp2rdf.org/index.php?title=ITS2NIF2ITS&oldid=614#Notes_on_optional_optimizations
>>
>> I think whether you would optimize generally depends on the use
>> case. Do you think we should specify/limit what optimizations are
>> possible?
>> It might be easier to explain the implications to help developers,
>> but leave the implementation under-specified.
>> Do you think I should remove them from the algorithm description and
>> move them to a completely different section? Would this help the
>> structure of the document?
> I think that the NIF mapping is so unnatural as is that optimization
> can make it really messy. If the goal of the optimization was to create
> a less complex RDF representation, with no blank text nodes and with
> text nodes containing a lot of whitespace trimmed, I think an easier
> and more workable approach would be to:
>
> - remove all whitespace optimization from the mapping algorithm
>
> - state that the algorithm can produce a lot of "phantom" predicates
> from excessive whitespace
>
> - recommend normalizing whitespace in the input XML/HTML/DOM in
> order to minimize such phantom predicates
>
> This way each user/application can create a custom whitespace
> normalization based on the nature of its input data, and we don't have
> to care about it.
>
> For example, for your sample document it is safe (knowing the HTML
> whitespace handling rules) to normalize it to
>
> <html><body><h2 translate = "yes" >Welcome to <span
> its-disambig-ident-ref = "http://dbpedia.org/resource/Dublin" translate
> = "no">Dublin</span> in <b translate="no">Ireland</b>!</h2></body></html>
>
> (Actually one line with no excessive whitespace.)
>
> Does this sound reasonable to my SemWeb-educated friends?
>
>    Jirka
>
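
To make the normalization suggested above concrete, here is a minimal
Python sketch of one possible pre-processing step. It is only an
illustration under stated assumptions, not part of the ITS2NIF2ITS
algorithm: the function names, the STRUCTURAL and BLOCK element sets and
the indented sample input are all hypothetical, and the input is assumed
to be well-formed XML/XHTML so that the standard library parser accepts
it.

```python
import re
from xml.dom import minidom

# Assumption: whitespace-only text directly inside these elements is never rendered.
STRUCTURAL = {"html", "head", "body"}
# Assumption: leading/trailing whitespace inside these block elements is not rendered.
BLOCK = {"h1", "h2", "h3", "p", "div", "li"}


def normalize_whitespace(node):
    """Collapse whitespace runs and drop text nodes that HTML would ignore."""
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE:
            # Any run of whitespace renders as a single space in HTML.
            child.data = re.sub(r"\s+", " ", child.data)
            if child.data == " " and node.nodeName in STRUCTURAL:
                node.removeChild(child)
        elif child.nodeType == child.ELEMENT_NODE:
            normalize_whitespace(child)
    if node.nodeName in BLOCK:
        first, last = node.firstChild, node.lastChild
        if first is not None and first.nodeType == first.TEXT_NODE:
            first.data = first.data.lstrip()
        if last is not None and last.nodeType == last.TEXT_NODE:
            last.data = last.data.rstrip()
        # Trimming may leave empty text nodes behind; remove them.
        for child in list(node.childNodes):
            if child.nodeType == child.TEXT_NODE and child.data == "":
                node.removeChild(child)


def normalize_html(source):
    """Parse well-formed markup, normalize it, and serialize the root element."""
    doc = minidom.parseString(source)
    normalize_whitespace(doc.documentElement)
    return doc.documentElement.toxml()


# Hypothetical indented variant of the sample document.
SAMPLE = """<html>
  <body>
    <h2 translate="yes">
      Welcome to <span its-disambig-ident-ref="http://dbpedia.org/resource/Dublin"
      translate="no">Dublin</span> in
      <b translate="no">Ireland</b>!
    </h2>
  </body>
</html>"""

if __name__ == "__main__":
    print(normalize_html(SAMPLE))
```

Whitespace between inline elements (for example around "in") is
deliberately kept as a single space, because it is significant for
rendering. Which elements count as structural or block-level depends on
the input data, which is why this is left to each application.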


-- 
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Events:
   * http://sabre2012.infai.org/mlode (Leipzig, Sept. 23-24-25, 2012)
   * http://wole2012.eurecom.fr (*Deadline: July 31st 2012*)
Projects: http://nlp2rdf.org , http://dbpedia.org
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org

Received on Thursday, 9 August 2012 11:06:32 UTC