- From: Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
- Date: Thu, 09 Aug 2012 13:06:05 +0200
- To: Jirka Kosek <jirka@kosek.cz>
- CC: public-multilingualweb-lt@w3.org
Hi Jirka,

thanks for your feedback. I thought it was a requirement that the DOM should not be touched. I have never had whitespace problems in any RDF serialization format, so this was new to me.

By the way, I now understand what your problem with the bloated mapping is. We really don't need to serialize it. It can be kept in memory, which is more efficient. I have made serialization optional. I also made an XML version, because XML is much better suited for transferring this kind of data. (Is the XML alright?)

I made all the changes you suggested; the new version is online here:
http://wiki.nlp2rdf.org/index.php?title=ITS2NIF2ITS&oldid=622#Example

All the best,
Sebastian

On 09.08.2012 11:59, Jirka Kosek wrote:
> On 9.8.2012 11:47, Sebastian Hellmann wrote:
>
>> you found an interesting point.
>>
>> I wrote some notes on the optimization:
>> http://wiki.nlp2rdf.org/wiki/ITS2NIF2ITS#Notes_on_optional_optimizations
>> http://wiki.nlp2rdf.org/index.php?title=ITS2NIF2ITS&oldid=614#Notes_on_optional_optimizations
>>
>> I think, it generally depends on the use case, whether you would
>> optimize. Do you think we should specify/limit what optimizations are
>> possible?
>> It might be easier to explain implications to help developers,
>> but leave the implementation under-specified.
>> Do you think I should remove them from the algorithm description and
>> move them to a completely different section? Would this help the
>> structure of the document?
> I think that NIF mapping is so unnatural as is that optimization can
> make it really messy. If the goal of optimization was to create a less
> complex RDF representation with no blank text nodes and with trimmed
> text nodes instead of ones with a lot of whitespace, I think an easier
> and workable approach would be to:
>
> - remove all whitespace optimization from the mapping algorithm
>
> - say that the algorithm can produce a lot of "phantom" predicates from
> excessive whitespace
>
> - recommend normalizing whitespace in the input XML/HTML/DOM in
> order to minimize such phantom predicates
>
> This way each user/application can create custom whitespace
> normalization based on the nature of the input data and we don't have
> to care about it.
>
> For example, for your sample document it is safe (knowing HTML whitespace
> handling rules) to normalize it to
>
> <html><body><h2 translate = "yes" >Welcome to <span
> its-disambig-ident-ref = "http://dbpedia.org/resource/Dublin" translate
> = "no">Dublin</span> in <b translate="no">Ireland</b>!</h2></body></html>
>
> (Actually one line with no excessive whitespace.)
>
> Does this sound reasonable to my SemWeb-educated friends?
>
> Jirka
>

--
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Events:
* http://sabre2012.infai.org/mlode (Leipzig, Sept. 23-24-25, 2012)
* http://wole2012.eurecom.fr (*Deadline: July 31st 2012*)
Projects: http://nlp2rdf.org , http://dbpedia.org
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org
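
The input normalization proposed in the quoted message could be sketched roughly as follows. This is only an illustration under stated assumptions, not part of the ITS2NIF2ITS algorithm: the function name normalize_whitespace, the STRUCTURAL set and the simplified sample markup are all hypothetical, and the rule used (collapse whitespace runs, drop whitespace-only text nodes directly under html/head/body) is just one possible application-specific policy.

```python
import re
from xml.dom import minidom

# Assumption: whitespace-only text directly inside these elements is never
# rendered in HTML, so such text nodes can be dropped safely.
STRUCTURAL = {"html", "head", "body"}


def normalize_whitespace(node):
    """Collapse whitespace runs in text nodes below `node`, in place."""
    for child in list(node.childNodes):  # copy: we may remove children
        if child.nodeType == child.TEXT_NODE:
            collapsed = re.sub(r"\s+", " ", child.data)
            if collapsed == " " and node.nodeName in STRUCTURAL:
                # Whitespace-only node between structural elements.
                node.removeChild(child)
            else:
                child.data = collapsed
        elif child.nodeType == child.ELEMENT_NODE:
            normalize_whitespace(child)


# Simplified, hypothetical variant of the sample document with
# pretty-printing whitespace.
sample = """<html>
  <body>
    <h2 translate="yes">Welcome   to
      <span translate="no">Dublin</span>   in
      <b translate="no">Ireland</b>!</h2>
  </body>
</html>"""

doc = minidom.parseString(sample)
normalize_whitespace(doc.documentElement)
result = doc.documentElement.toxml()
print(result)
```

Running the mapping on the normalized DOM rather than on the pretty-printed input should leave only the text nodes that actually carry content, so fewer "phantom" predicates would be generated. Which elements count as structural depends on the nature of the input data, which is why the choice is left to each application.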
Received on Thursday, 9 August 2012 11:06:32 UTC