RE: Grinding to a halt on Issue 27. from Joshua Allen on 2003-04-30 (www-tag@w3.org from April 2003)

From: Joshua Allen <joshuaa@microsoft.com>
Date: Wed, 30 Apr 2003 08:45:03 -0700
To: <robin.berjon@expway.fr>
Cc: <www-tag@w3.org>
Message-ID: <4F4182C71C1FDD4BA0937A7EB7B8B4C108E89116@red-msg-08.redmond.corp.microsoft.com>
> > This dance of "strcmp for namespaces and canonicalization for
> > anyUri" has led to a status quo that is buggy and ambiguous,
> > particularly for any non-trivial scenario.
> 
> Could you exemplify more clearly?

OK.  Based on what I've seen of our XML APIs, there is a surprising
variety of places where URI equivalence testing is encountered.  Some of
these places include (I am sure this list is not comprehensive, but it is
good enough for illustration):

1) checking identity constraints
2) testing namespace equivalence; for example for purposes of duplicate
attribute detection
3) checking to see if a particular schema or entity has already been
cached/compiled in a collection
4) testing for circular references

And each of these has the facet of relative vs. absolute:
a) Some scenarios allow URIs to be absolutized before comparing
b) Some scenarios assume that the URI is absolute already
c) For some URIs it is impossible to determine whether they are
relative or not
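
As a rough illustration of how facet (a) interacts with case 3, here is a
minimal Python sketch of a schema cache that either absolutizes a reference
against its base URI before lookup or uses the reference string as-is.  The
class name, the example.org URIs and the "compile" step are all hypothetical;
this is not a description of any real processor.

```python
from urllib.parse import urljoin


class SchemaCache:
    """Hypothetical cache of compiled schemas, keyed by URI."""

    def __init__(self, absolutize: bool):
        self.absolutize = absolutize
        self._compiled = {}
        self.compile_count = 0

    def _key(self, base: str, ref: str) -> str:
        # Facet (a): resolve the reference against its base before comparing.
        # Otherwise the raw string is the key, as if it were already absolute.
        return urljoin(base, ref) if self.absolutize else ref

    def load(self, base: str, ref: str) -> str:
        key = self._key(base, ref)
        if key not in self._compiled:
            self.compile_count += 1  # stand-in for an expensive compile
            self._compiled[key] = f"compiled:{key}"
        return self._compiled[key]


BASE = "http://example.org/schemas/main.xsd"  # assumed base URI
REFS = ["types.xsd", "http://example.org/schemas/types.xsd"]

raw = SchemaCache(absolutize=False)
resolved = SchemaCache(absolutize=True)
for ref in REFS:
    raw.load(BASE, ref)
    resolved.load(BASE, ref)

print(raw.compile_count)       # 2: the two spellings are treated as distinct
print(resolved.compile_count)  # 1: both spellings resolve to the same URI
```

The same two references are one schema under one convention and two schemas
under the other, which is the kind of mismatch that appears once the factors
above are mixed.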

Now start mixing these factors.

One of the examples I have seen posted on this list is something like:

<e xmlns:ns1="http://bar" ns1:a="1" xmlns:ns2="http://bar" ns2:a="2" />

This illustrates one case in the very common scenario of "compound
document processing".  When people construct XML documents from multiple
sources, you see variations on this (namespaces redundantly declared
with different prefixes, elements or attributes jammed inline, etc.)

And I would guess that about 90% of existing processors would do the
right thing here (throw an error).  However, interop gets put to the
test if the namespaces include any URL-encoded characters or worse yet,
Unicode, or even worse yet, MBCS.  Since the parts of the document are
coming from multiple sources, it is impossible to guarantee that they all
choose to "normalize" their namespace URIs the same way.  If you built a
test matrix including these sorts of combinatorial cases, I expect you
would find consistency of behavior across implementations drop very
quickly.
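
To make the divergence concrete, here is a minimal Python sketch of
duplicate-attribute detection run under two different equivalence rules.
Everything in it is an assumption for illustration: the example.org namespace
names, the helper names, and especially naive_normalize, which is a
deliberately crude stand-in (lowercase scheme and host, percent-decode the
path) and not the canonicalization that any specification or product mandates.

```python
from urllib.parse import unquote, urlsplit, urlunsplit


def same_by_strcmp(a: str, b: str) -> bool:
    # Character-for-character comparison of the namespace names.
    return a == b


def naive_normalize(uri: str) -> str:
    # Crude, assumed normalization: lowercase scheme and host,
    # percent-decode the path. Real canonicalization rules differ.
    parts = urlsplit(uri)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       unquote(parts.path), parts.query, parts.fragment))


def same_by_normalization(a: str, b: str) -> bool:
    return naive_normalize(a) == naive_normalize(b)


def has_duplicate_attribute(attrs, same_ns) -> bool:
    """attrs is a list of (namespace name, local name) pairs."""
    for i, (ns_i, name_i) in enumerate(attrs):
        for ns_j, name_j in attrs[i + 1:]:
            if name_i == name_j and same_ns(ns_i, ns_j):
                return True
    return False


# The simple case from the list: both prefixes bound to the same string.
plain = [("http://bar", "a"), ("http://bar", "a")]
# Same attributes, but the sources escaped the namespace name differently.
escaped = [("http://example.org/%7Ejoe/ns", "a"),
           ("http://example.org/~joe/ns", "a")]

print(has_duplicate_attribute(plain, same_by_strcmp))           # True
print(has_duplicate_attribute(plain, same_by_normalization))    # True
print(has_duplicate_attribute(escaped, same_by_strcmp))         # False
print(has_duplicate_attribute(escaped, same_by_normalization))  # True
```

Both rules agree on the simple case and disagree as soon as the two sources
escape the same name differently.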

In fact, this particular scenario gets more difficult, IMO, as people
are swayed by the RDDL folks into using HTTP identifiers as namespace
names.  Two different divisions in the same company may use what they
*think* is the same namespace for their documents (and when they click on the
namespace name it sure enough connects them to the right place, so how
are they to know differently?).  Furthermore, all of their XML documents
work fine within their own department.  It is only two years later when
corporate IT tries to combine those documents that things break.  In
fact, depending on how much of the data is affected, it's quite possible
that IT won't even notice, and will just blindly throw away bits of
data.  And in this RDDL scenario, it is rather difficult for a vendor to
argue with the burnt customer and say "screw you, you should have
realized that we were going to do strcmp" as if the guy even knows what
that means.
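
A small Python sketch of that failure mode, using the standard library's
ElementTree.  The two documents, the example.com namespace names (which differ
only by a trailing slash) and the collect_order_ids helper are hypothetical;
the point is only that a consumer matching on the exact namespace string drops
the other division's data without raising any error.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents from two divisions. The namespace names differ
# only by a trailing slash.
DIVISION_A = ('<orders xmlns="http://example.com/ns/orders">'
              '<order id="1"/></orders>')
DIVISION_B = ('<orders xmlns="http://example.com/ns/orders/">'
              '<order id="2"/></orders>')


def collect_order_ids(documents, namespace):
    """Gather the id of every <order> element in the given namespace."""
    wanted = f"{{{namespace}}}order"  # ElementTree's {namespace}local form
    ids = []
    for text in documents:
        root = ET.fromstring(text)
        ids.extend(el.get("id") for el in root.iter(wanted))
    return ids


ids = collect_order_ids([DIVISION_A, DIVISION_B],
                        "http://example.com/ns/orders")
print(ids)  # ['1'] -- division B's order is skipped, with no error raised
```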

You *could* argue that the above scenario is adequately specified, but
that's little consolation.  We have seen "bugs" like this with respect
to identity constraints as well.  When you hit scenarios that mix the
different URI comparison conventions, the customer expectations can
become confused and the consistency across implementations becomes less
trustworthy (exacerbating customer confusion about expectations - "my
Perl regex parser handles this just fine, and so does Xalan; why the
heck does MSFT have so much money and can't even write a simple XML
processor?").

Thanks,
Joshua
Received on Wednesday, 30 April 2003 11:45:21 UTC