TrimTextNodes in Streaming parameter set from Meiko Jensen on 2010-05-05 (public-xmlsec@w3.org from May 2010)

From: Meiko Jensen <Meiko.Jensen@ruhr-uni-bochum.de>
Date: 5 May 2010 13:14:24 +0200
To: "Pratik Datta" <PRATIK.DATTA@oracle.com>, "XMLSec WG Public List" <public-xmlsec@w3.org>
Message-ID: <4BE15310.7010804@ruhr-uni-bochum.de>

Hi Pratik,

regarding the trimTextNodes parameter in my streaming proposal, here is an
example:

<A>
   <B>  stupid                                        




example...




  </B>
</A>

In SAX, the contents of B might end up split into, say, three separate
characters() events. The first contains "stupid", so removing the leading
whitespace is no issue. The trailing whitespace already poses a problem:
one cannot be sure that no non-whitespace text follows. This requires
caching the trailing whitespace until one can decide whether it is
trailing or embedded.

The second characters() event contains only whitespace. We still don't
know whether we may safely discard it. However, I know at least one
programmer who will implement trimTextNodes so that characters() events
consisting of whitespace only are simply discarded. We may add a hint on
this issue to the spec, but it remains somewhat tricky.

The third characters() event contains "example...". Now it turns out that
the cached whitespace was in fact embedded, not trailing, so we have to
flush the cache to the c14n. Again, the trailing whitespace triggers
caching.

Then comes the endElement() event of the B element. Here it turns out
that the cache can be discarded, as the whitespace it contains was indeed
trailing. As a result, however, every event method must be implemented to
take care not only of the event itself, but also of the whitespace cache.
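
To make the caching concrete, here is a minimal sketch of such a handler, using Python's xml.sax instead of Java for brevity. The names TrimmingHandler, pending_ws, seen_text and run_chunks are made up for illustration, and the out list merely stands in for the c14n output; this is not taken from the proposal itself.

```python
import xml.sax
from xml.sax.handler import ContentHandler

XML_WS = " \t\r\n"  # the four XML whitespace characters


class TrimmingHandler(ContentHandler):
    """Sketch: trim each text node, however the parser chunks it."""

    def __init__(self):
        super().__init__()
        self.out = []           # stands in for the c14n output
        self.pending_ws = ""    # whitespace cache: trailing or embedded?
        self.seen_text = False  # non-whitespace already emitted for this node

    def _end_text_node(self):
        # A markup event ends the text node, so the cache was trailing.
        self.pending_ws = ""
        self.seen_text = False

    def startElement(self, name, attrs):
        self._end_text_node()
        self.out.append("<%s>" % name)

    def endElement(self, name):
        self._end_text_node()
        self.out.append("</%s>" % name)

    def characters(self, content):
        stripped = content.rstrip(XML_WS)
        trailing = content[len(stripped):]
        if not stripped:
            # Whitespace-only chunk: it can only matter after real text.
            if self.seen_text:
                self.pending_ws += content
            return
        if self.seen_text:
            # The cached whitespace turned out to be embedded: flush it.
            self.out.append(self.pending_ws + stripped)
        else:
            # First real text of the node: drop the leading whitespace.
            self.out.append(stripped.lstrip(XML_WS))
        self.pending_ws = trailing
        self.seen_text = True


def run_chunks(chunks):
    """Feed the text of a B element to the handler as the given chunks."""
    handler = TrimmingHandler()
    handler.startElement("B", {})
    for chunk in chunks:
        handler.characters(chunk)
    handler.endElement("B")
    return "".join(handler.out)
```

Calling run_chunks(["  stupid  \n", "\n\n", "example...\n\n  "]) gives the same result as passing the whole text in one chunk, which is exactly the property the naive "discard whitespace-only events" implementation loses. Note that every markup callback has to reset the cache; a full implementation would presumably need the same in processingInstruction() and friends.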

I know this is a rather contrived example, but the issue exists and
may cause the "WTF happened here?" kind of bugs in real-world scenarios.

Additionally, the issue is complicated a little by the
ignorableWhitespace() event that is used only by validating parsers, and
would get called e.g. for the whitespace between <A> and <B> in the
example above. In fact, that's why I suggested considering a third
option (besides trim and noTrim) that would only erase whitespace
reported via ignorableWhitespace(). However, that one does not work if
used in non-validating parser environments (but could be emulated).
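
A rough sketch of what that third option and its emulation might look like, again in Python's xml.sax. The class name IgnorableOnlyHandler and the element_only parameter are hypothetical: the emulation assumes the application itself knows which elements have element-only content, which is my assumption and not part of the proposal.

```python
import xml.sax
from xml.sax.handler import ContentHandler

XML_WS = " \t\r\n"  # the four XML whitespace characters


class IgnorableOnlyHandler(ContentHandler):
    """Sketch: drop ignorable whitespace only, keep all text untouched."""

    def __init__(self, element_only=()):
        super().__init__()
        self.out = []  # stands in for the c14n output
        # Hypothetical emulation input: names of elements known to have
        # element-only content (a validating parser learns this from the DTD).
        self.element_only = set(element_only)
        self.stack = []

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.out.append("<%s>" % name)

    def endElement(self, name):
        self.stack.pop()
        self.out.append("</%s>" % name)

    def ignorableWhitespace(self, whitespace):
        # Whatever the parser reports through this callback is erased.
        pass

    def characters(self, content):
        parent = self.stack[-1] if self.stack else None
        if parent in self.element_only and not content.strip(XML_WS):
            # Emulation: whitespace-only text directly inside an
            # element-only parent is treated as ignorable.
            return
        self.out.append(content)
```

With element_only={"A"}, the whitespace between <A> and <B> is dropped while the contents of B pass through unchanged, and no whitespace cache is needed because each characters() event can be decided on its own.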

That's why I proposed setting trimTextNodes=false.

What do you think?

best regards

Meiko

-- 
Dipl.-Inf. Meiko Jensen
Chair for Network and Data Security 
Horst Görtz Institute for IT-Security 
Ruhr University Bochum, Germany
_____________________________
Universitätsstr. 150, Geb. IC 4/150
D-44780 Bochum, Germany
Phone: +49 (0) 234 / 32-26796
Telefax: +49 (0) 234 / 32-14347
http://www.nds.rub.de

Received on Wednesday, 5 May 2010 11:14:54 UTC