Re: forbiddenCharacters data category - related to [ACTION-189] from Felix Sasaki on 2012-08-27 (public-multilingualweb-lt@w3.org from August 2012)

From: Felix Sasaki <fsasaki@w3.org>
Date: Mon, 27 Aug 2012 17:52:05 +0200
To: Jirka Kosek <jirka@kosek.cz>
Cc: Yves Savourel <ysavourel@enlaso.com>, public-multilingualweb-lt@w3.org
Message-ID: <CAL58czrF+j_i6CzXi9WuBQi6MHiOqxoZ1f5k3MQYALU--NhK9g@mail.gmail.com>

I agree with all points Jirka made. Sorry, Yves, I disagree with the
proposal you made at

http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Aug/0288.html

"The set of characters that are forbidden is specified using a regular
expression limited to a simple set pattern supported by most regular
expression engines: ..."

Jirka's solution is much cleaner. And, as Jirka said, we are starting from
HTML and XML content, so it makes sense to require that standardized XML
Schema regex is mapped to other expressions, if needed - not the other way
round.
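
A rough sketch of what such a mapping could look like, using the
allowedCharacters pattern from Jirka's message quoted below. Python's re
module has no character class subtraction, so the subtracted set is written
as a negative lookahead in front of the base range. The names ALLOWED_CHAR
and find_disallowed are made up for illustration and are not part of ITS or
any library; the upper bound U+1FFFF is copied as-is from the quoted pattern.

```python
import re

# XML Schema form, after entity expansion:
#   [\x20-\U0001FFFF-[<>:"\\/|\?*]]
# Assumed mapping for Python: base range guarded by a negative lookahead
# that rejects the subtracted characters.
BASE_RANGE = r'[\x20-\U0001FFFF]'
EXCLUDED = r'[<>:"\\/|?*]'
ALLOWED_CHAR = re.compile('(?!' + EXCLUDED + ')' + BASE_RANGE)


def find_disallowed(text):
    """Return the characters of text that fall outside the allowed set."""
    return [ch for ch in text if ALLOWED_CHAR.fullmatch(ch) is None]


if __name__ == "__main__":
    print(find_disallowed("report.txt"))
    print(find_disallowed("a:b*c"))
```

The lookahead form is only one option; an engine that supports class
subtraction or set intersection natively could take the pattern more directly.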

Best,

Felix

2012/8/27 Jirka Kosek <jirka@kosek.cz>

> On 27.8.2012 16:36, Yves Savourel wrote:
>
> Hi Yves,
>
> > While Jirka’s solution would help in reducing the number of cases where
> > C0-etc. need to be specified, I think it still doesn’t provide a full
> > solution: we would not be able to allow such characters.
>
> But such characters can't appear inside a generic XML document unless you
> are using some custom, application-specific escaping mechanism. So until
> there is a generic mechanism for entering such characters in XML,
> it doesn't make sense to support them in ITS.
>
> > Another aspect to take into account is that, at least from my
> > experience, it’s often much easier to define the list of forbidden
> > characters than the list of allowed characters.
>
> However, such an approach has been the source of many security flaws in
> the past. And with my proposal you can still do what you want by using
> the [^...] notation.
>
> > If we look again at the issue: it seems to boil down to two problems:
> > a) the inability of the XML regex to handle invalid XML characters, and
> > b) \uHHHH not supported by all engines.
> >
> > For a) it's an XML problem that keeps coming back in localization. More
> > and more formats are working around that problem because they need to
> > store those characters in XML. For example Unicode-LDML, TS and XLIFF2
> > define elements to 'escape' them.
>
> But ITS will never understand custom escape syntaxes; it operates on
> raw XML content. I see that you want to use ITS to transfer additional
> information to a processing system which can work on a superset of the XML
> data model, for example by supporting C0 characters. But it feels strange
> to standardize something which doesn't work on our data model.
>
> > A use case like forbidden characters in Windows' path is a good
> > illustration that we do need to support those characters in the regex.
>
> Why can't you use something like
>
> allowedCharacters="[&#x20;-&#x1ffff;-[&lt;>:&quot;\\/|\?*]]"
>
> instead of
>
> [\u0000-\u001F<>:"\\/|\?*]
>
> It's not that big a difference, and we would stay inside the XML Schema
> regex syntax.
>
>                         Jirka
>
> --
> ------------------------------------------------------------------
>   Jirka Kosek      e-mail: jirka@kosek.cz      http://xmlguru.cz
> ------------------------------------------------------------------
>        Professional XML consulting and training services
>   DocBook customization, custom XSLT/XSL-FO document processing
> ------------------------------------------------------------------
>  OASIS DocBook TC member, W3C Invited Expert, ISO JTC1/SC34 member
> ------------------------------------------------------------------
>
>


-- 
Felix Sasaki
DFKI / W3C Fellow

Received on Monday, 27 August 2012 15:52:34 UTC