Re: Streaming ITS processor from Asgeir Frimannsson on 2008-06-19 (public-i18n-its-ig@w3.org from June 2008)

From: Asgeir Frimannsson <asgeirf@redhat.com>
Date: Thu, 19 Jun 2008 10:17:57 +1000
To: Felix Sasaki <fsasaki@w3.org>
Cc: public-i18n-its-ig@w3.org
Message-Id: <200806191017.58144.asgeirf@redhat.com>

Hi Felix, all,

On Tuesday 17 June 2008 11:04:49 Felix Sasaki wrote:
> Jirka Kosek wrote:
> > Asgeir Frimannsson wrote:
> >> I guess this is one of the areas where you have a gut feeling that
> >> something could be done better, but have no implementations to
> >> justify that claim :) Some of the main drawbacks with ITS at the
> >> moment are:
> >> - Having to load the instance document into memory for processing
> >> - Having to traverse the in-memory DOM for each rule, as most XPath
> >> processors take one expression and return a node set.
> >
> > Please note that as long as you stick to XPath patterns (not full
> > expressions) you can use the internal pattern matching API of an XSLT
> > processor, which is optimized for this task and gives much better
> > performance than naively evaluating each XPath against the document tree.
>
> Asgeir, thanks for pointing to the Blog from Jeni, and Jirka, thanks for
> pointing out the benefit of using XPath (XSLT) patterns here. I'm
> wondering if these patterns would do the job for Asgeir, and I'm aware
> that this is no perfect solution. If you, Asgeir, still want to have
> something more streamable, "Compile a state machine based on a set of
> rules", it would be good to know how you want to construct these rules:
> based on XPath, a subset of XPath (like the XSLT patterns or the EBNF in
> the Wiki), or something completely different.

A bit of background: this topic started with a conversation between Yves
(Savourel), myself and Jim (Hardgrave), in which Yves briefly mentioned his
work on the ITS API. I argued, perhaps prematurely, that there had to be a
better solution than using a memory-intensive DOM parser for converting XML
documents to/from typical localisation formats.

Now, thanks largely to the wisdom of Jirka and Felix, I do see that this
problem is not as simple as I initially thought :)

The deeper question I'm asking is perhaps whether the full ITS spec is a bit
overkill for many situations. For most formats (DocBook, DITA, etc.), isn't a
very limited knowledge of the structure of a document enough to determine
these i18n attributes? Look e.g. at the example ITS rules in the 'best
practices' document, where the majority of rules use a very simple
"contextual subset" of XPath. In most cases the namespace + element name (or
attribute + parent element) is enough information to determine the i18n
attributes. This looks more like a schema-like language than the ITS
pattern-based approach, and perhaps a way of annotating the schema/DTD/etc.
would be a better approach for many formats?
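
To make that idea concrete, here is a rough sketch in Python of what such a
lookup could look like on top of a streaming (SAX) parser. It is not an ITS
implementation: the rule table, the class and function names, the sample
document and the choice of DocBook elements are all assumptions made up for
illustration. It only handles a translate yes/no flag keyed on
(namespace, element name), with child elements inheriting the value from
their parent.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

DOCBOOK_NS = "http://docbook.org/ns/docbook"

# Hypothetical rule table: (namespace URI, local name) -> translate?
# Elements not listed inherit the value of their parent.
ELEMENT_RULES = {
    (DOCBOOK_NS, "programlisting"): False,
    (DOCBOOK_NS, "keycap"): False,
}


class TranslateFilter(ContentHandler):
    """Collects translatable text without building a DOM."""

    def __init__(self, rules, default=True):
        super().__init__()
        self.rules = rules
        self.stack = [default]   # translate flag of each open element
        self.segments = []

    def startElementNS(self, name, qname, attrs):
        # name is the (namespace URI, local name) pair; fall back to the
        # parent's flag when no rule mentions this element.
        self.stack.append(self.rules.get(name, self.stack[-1]))

    def endElementNS(self, name, qname):
        self.stack.pop()

    def characters(self, content):
        # Note: a parser may deliver one text run in several chunks;
        # a real tool would buffer them. This sketch does not.
        if self.stack[-1] and content.strip():
            self.segments.append(content.strip())


def translatable_segments(xml_bytes, rules):
    handler = TranslateFilter(rules)
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return handler.segments


# Made-up sample document, not taken from any real DocBook source.
SAMPLE = (
    b'<article xmlns="http://docbook.org/ns/docbook">'
    b"<para>Run the tool.</para>"
    b"<programlisting>make install</programlisting>"
    b"<para>Press <keycap>Enter</keycap> now.</para>"
    b"</article>"
)

if __name__ == "__main__":
    for segment in translatable_segments(SAMPLE, ELEMENT_RULES):
        print(segment)
```

The only state kept is a stack with one flag per open element, so memory use
grows with nesting depth rather than document size. An "attribute + parent
element" rule could presumably be handled the same way, with a second table
consulted in startElementNS, but anything that needs to look at siblings or
later content would not fit this model.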

Now, I'm NOT suggesting a change to ITS itself, as it serves many other
use cases than the ones I deal with. And once we go beyond the use case I
described above, ITS suddenly becomes very powerful and attractive. I do not
have an immediate need for a streaming ITS processor, and hence no time to
develop one. Although at some point, when we do start using ITS more heavily,
I might have to revisit this. It's nevertheless a very interesting
problem :)

cheers,
asgeir

Received on Thursday, 19 June 2008 00:19:01 UTC