Re: [ISSUE-29][ACTION-164] ITS2NIF2ITS - RDF roundtrip from Jirka Kosek on 2012-08-09 (public-multilingualweb-lt@w3.org from August 2012)

From: Jirka Kosek <jirka@kosek.cz>
Date: Thu, 09 Aug 2012 11:59:17 +0200
To: Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
CC: public-multilingualweb-lt@w3.org
Message-ID: <502389F5.20107@kosek.cz>

On 9.8.2012 11:47, Sebastian Hellmann wrote:

> you found an interesting point.
> 
> I wrote some notes on the optimization:
> http://wiki.nlp2rdf.org/wiki/ITS2NIF2ITS#Notes_on_optional_optimizations
> http://wiki.nlp2rdf.org/index.php?title=ITS2NIF2ITS&oldid=614#Notes_on_optional_optimizations
> 
> I think it generally depends on the use case whether you would
> optimize. Do you think we should specify/limit what optimizations are
> possible?
> It might be easier to explain implications to help developers,
> but leave the implementation under-specified.
> Do you think I should remove them from the algorithm description and
> move them to a completely different section? Would this help the
> structure of the document?

I think the NIF mapping is already so unnatural that optimization can
make it really messy. If the goal of the optimization is to create a less
complex RDF representation, with no blank text nodes and with
whitespace-heavy text nodes trimmed, I think an easier and more workable
approach would be to:

- remove all whitespace optimization from the mapping algorithm

- say that the algorithm can produce a lot of "phantom" predicates from
excessive whitespace

- recommend normalizing whitespace in the input XML/HTML/DOM in
order to minimize such phantom predicates

This way each user/application can create its own whitespace
normalization based on the nature of the input data, and we don't have to
care about it.

For example, for your sample document it is safe (knowing the HTML
whitespace handling rules) to normalize it to

<html><body><h2 translate = "yes" >Welcome to <span
its-disambig-ident-ref = "http://dbpedia.org/resource/Dublin" translate
= "no">Dublin</span> in <b translate="no">Ireland</b>!</h2></body></html>

(Actually one line with no excessive whitespace.)
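
A minimal sketch of such an application-specific normalization, in
Python with the standard-library xml.dom.minidom, assuming well-formed
XHTML input. The element sets STRUCTURAL and BLOCK and the function
names are hypothetical choices for illustration, not something defined
by ITS or NIF; a real application would adapt them to its own data:

```python
import re
from xml.dom import minidom

# Assumption: containers that hold only other elements, so whitespace-only
# text nodes directly inside them are never rendered.
STRUCTURAL = {"html", "head", "body", "ul", "ol", "table", "tr"}
# Assumption: block-level elements whose leading and trailing whitespace
# is not rendered.
BLOCK = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "li", "td", "th", "div"}

HTML_WS = " \t\r\n"
WS_RUN = re.compile(r"[ \t\r\n]+")


def _normalize(node):
    """Normalize whitespace in the text nodes below `node`, in place."""
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE:
            if node.nodeName in STRUCTURAL and not child.data.strip(HTML_WS):
                # Whitespace-only node between structural elements: drop it.
                node.removeChild(child)
            else:
                # Collapse every run of whitespace into a single space.
                child.data = WS_RUN.sub(" ", child.data)
        elif child.nodeType == child.ELEMENT_NODE and child.nodeName != "pre":
            # <pre> content is whitespace-sensitive, so it is left untouched.
            _normalize(child)
    if node.nodeName in BLOCK:
        first, last = node.firstChild, node.lastChild
        if first is not None and first.nodeType == first.TEXT_NODE:
            first.data = first.data.lstrip(" ")
        if last is not None and last.nodeType == last.TEXT_NODE:
            last.data = last.data.rstrip(" ")


def normalize_whitespace(xhtml):
    """Return the root element of `xhtml` serialized with normalized whitespace."""
    doc = minidom.parseString(xhtml)
    _normalize(doc.documentElement)
    return doc.documentElement.toxml()
```

Running normalize_whitespace() on an indented, multi-line version of the
sample document before the mapping should leave only the text nodes that
actually carry content, so fewer phantom predicates would be generated.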

Does this sound reasonable to my SemWeb-educated friends?

   Jirka

-- 
------------------------------------------------------------------
  Jirka Kosek      e-mail: jirka@kosek.cz      http://xmlguru.cz
------------------------------------------------------------------
       Professional XML consulting and training services
  DocBook customization, custom XSLT/XSL-FO document processing
------------------------------------------------------------------
 OASIS DocBook TC member, W3C Invited Expert, ISO JTC1/SC34 member
------------------------------------------------------------------

Received on Thursday, 9 August 2012 09:59:42 UTC