Re: Options for dealing with IDs from Chris Lilley on 2003-01-10 (www-tag@w3.org from January 2003)

From: Chris Lilley <chris@w3.org>
Date: Fri, 10 Jan 2003 20:02:32 +0100
To: www-tag@w3.org, Norman Walsh <Norman.Walsh@Sun.COM>
Message-ID: <67323336296.20030110200232@w3.org>
On Friday, January 10, 2003, 6:57:18 PM, Norman wrote:


NW> -----BEGIN PGP SIGNED MESSAGE-----
NW> Hash: SHA1

NW> / Chris Lilley <chris@w3.org> was heard to say:
NW> | On Friday, January 10, 2003, 5:13:53 PM, Norman wrote:
| NW>> / noah_mendelsohn@us.ibm.com was heard to say:
| NW>> | I think I agree with Tim's other conclusion:  do nothing is probably the 
| NW>> | least risky solution.  We've got too many typing mechanisms already.
NW> |
| NW>> I have mixed feelings, but I think I agree with Tim and Noah.
NW> |
| NW>> "IDness" is a consequence of validation. That means you have to
| NW>> validate.
NW> |
NW> | So, your solution is option 1 or option 8 (DTD or Schema validation in
NW> | all cases).

NW> Yes. Or an internal subset as you point out further down. "The status quo."

| NW>>  I understand that sometimes has painful consequences. If a
| NW>> language wants to have IDs so that authors can point into documents,
| NW>> the workaround is to establish a MIME type for that language and
| NW>> describe what fragment identifiers mean independent of validation.
NW> |
NW> | That does not give you IDs. It gives you pointers. It does not solve
NW> | the getElementByID problem and it does not solve the #fo selector
NW> | problem.

NW> Right. getElementsByID() returns an empty set if you haven't validated.

You mean, you propose that it *should* return the empty set if you
haven't validated.

NW> Workarounds for the #fo problem could be achieved in the CSS spec
NW> without changing XML. (No, I don't have any specific workaround in
NW> mind.)

Allow me to consider that assertion unproven, in that case, and merely
observe that fixing the IDness problem in multiple *consumers* of IDs
(probably in different ways) is clearly suboptimal to fixing it
centrally.

| NW>> Similarly, the semantics of intra-document references could be defined
| NW>> independent of validation if necessary.
NW> |
NW> | I agree that, since we have well formed documents, the semantics of
NW> | intra-document references should be defined independent of validation.
NW> | There are two ways to do this; one is to invent a whole new mechanism
NW> | that is independent of IDs and define how that works. The other way,
NW> | suggested in this thread, is to separate the assignment of IDness from
NW> | that of validation.

NW> As long as DTDs and schemas contribute "IDness" to the mix, they can't
NW> be separated. I'd be a lot happier with separation.

Well, I would be a lot happier if DTDs and Schemas could separate the
tasks of decoration and validation too - I would like to see a PSI and
a PSVI as separate things - but the solution to the problem of IDness
in well formed instances does not depend on them doing that.

NW> What's being proposed here is another, independent mechanism *in
NW> addition to* validation.

No, *before* validation.

NW> Like Noah said, "we've got too many typing mechanisms already".

And like I said, not fixing this will give us plenty more as all the
unsatisfied customers invent them one per specification. I can't
believe that you are seriously proposing that.

NW> | Which XML already does. Is it true to say that in the following
NW> | instance
NW> |
NW> | <?xml version="1.0" encoding="UTF-8"?>
NW> | <!DOCTYPE foo [
NW> | <!ATTLIST foo partnum ID #IMPLIED>
NW> | ]>
NW> | <foo  partnum="i54321" bar="toto"/>
NW> |
NW> | a) The instance is well formed
NW> | b) the instance is not valid(atable)
NW> | c) the partnum attribute on foo is of type ID

NW> Yep. All true.

Okay so the concept of IDness is *not* tied to validation.
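
A minimal sketch of that point, in Python, using the non-validating
expat parser from the standard library. It reads the ATTLIST
declaration from the internal subset and records partnum as an ID
without any validation step. The names collect_ids and DOC are
illustrative only and do not come from any specification.

```python
import xml.parsers.expat

# The instance from the thread: well formed, not valid(atable),
# with an internal subset that declares partnum as type ID.
DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
<!ATTLIST foo partnum ID #IMPLIED>
]>
<foo  partnum="i54321" bar="toto"/>
"""


def collect_ids(data):
    """Return {ID value: element name} using only well-formedness parsing."""
    id_attrs = set()  # (element name, attribute name) pairs declared as ID
    ids = {}
    parser = xml.parsers.expat.ParserCreate()

    def on_attlist(elname, attname, type_, default, required):
        # expat reports the declared attribute type as a string, e.g. "ID"
        if type_ == "ID":
            id_attrs.add((elname, attname))

    def on_start(name, attrs):
        for attname, value in attrs.items():
            if (name, attname) in id_attrs:
                ids[value] = name

    parser.AttlistDeclHandler = on_attlist
    parser.StartElementHandler = on_start
    parser.Parse(data, True)  # no validation is performed here
    return ids


print(collect_ids(DOC))
```

The parser never checks content models or ID uniqueness; it only
reports what the internal subset declared, which is all that is needed
to know that partnum is of type ID.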

| NW>> On the other hand, one of the consequences of xml:idAttr (and to a lesser
| NW>> extent xml:id) that bothers me is that it moves this validation
| NW>> semantic out into authoring space.
NW> |
NW> | To be clear; it does nothing to validation at all. It decorates a well
NW> | formed instance. It does not do any validation and the three
NW> | validation constraints that apply to IDs are not enforced unless there
NW> | is a subsequent validation step (for example, with a W3C XML Schema).

NW> Fair point. Let me rephrase. It provides an additional type annotation
NW> mechanism out in the authoring space. This provides yet another
NW> mechanism to do something and it may do so in ways that are sometimes
NW> invalid.

Of course. Well formed authoring always has the possibility of
creating things that are then determined to be invalid - duplicate
ids, incorrect content models, missing required attributes and so on.
How is this different?

NW> If you look at a document with well-formed glasses on, then again with
NW> validation glasses on, there are a small number of differences that
NW> you may perceive. These proposals all add one more thing to that set.
NW> I'd like to make that set smaller, not larger.

So would I which is why I would like there to be a way to add IDness
to the infoset of well formed documents and for the W3C XML Schema to
pick that up as its input Infoset and reflect these values back in the
PSVI so that the number of differences seen with the two sets of
glasses becomes smaller: IDness is preserved after validation.

NW> (Before someone points out xsi:type, let me just say I've never used
NW> it and I hope I never do. Every time I think about it, it whispers "I'm
NW> a design flaw, but you can't quite work out what design would be
NW> better, can you?" Then it giggles evilly.)

I hear Norm proposing option #10 (or is that 11) using xsi:type in the
instance (though that would need to be a child element not an
attribute now because we have multiple attributes ....)


NW> | Further, the validation semantic is already out in the authoring
NW> | space. Authors can plug away in the internal subset (particularly in
NW> | those DTDs that have parameter entities in their content models
NW> | precisely to allow for such extension) and can even declare the entire
NW> | DTD in the internal subset and make it up as they go along.

NW> I concede that not all uses of the internal subset are validation, but
NW> I tend to think of them that way.

I agree you think of them that way. I am trying to get you not to
think of them all that way because it complicates the architecture.

NW>  Taking advantage of DTD parameter
NW> entities more-or-less implies that you're doing full validation
NW> because they almost never have any effect on a WF-only parser that
NW> ignores the external subset. So they're mostly local modifications to
NW> the DTD that occur before validation, and they usually indicate that
NW> validation is expected.

Yes. My point was merely that users can already affect validation when
they are editing their instances, which you asserted was a bad thing and
only introduced by these id proposals.

NW> | So I believe that your concern is unfounded because
NW> |
NW> | a) people can already do that, and

NW> People can modify the schema that will be used on a per-instance
NW> basis, and some of the modifications that they can perform affect a
NW> document that isn't subjected to validation because of the minimal
NW> "DTD processing requirements" placed on a WF parser.

Yes.

NW> That usage doesn't concern me as much.

Okay so modifications to the instance that affect the IDness do not
concern you, ok that is good ....

NW> | b) these proposals do not do it.

NW> They do introduce yet another way to do something and the way that's
NW> introduced will expose new kinds of validation problems.

Just like the 15 or so schema languages all introduce another way to
do something. But yes, it's a new way. In the specific case of the
subset-less SOAP XML form, there is no other way (except for Schema
processing after parsing, which is unlikely in a messaging environment
that is security and performance conscious).

NW> I'm still concerned.

I agree that the "what happens when DTD validation is performed" is
still an issue that needs to be addressed. That might be as simple as
saying "if you have an external or internal subset and you declare
attributes to be of some other type than ID then interoperability will
suffer so you should ensure, if you are wise, that the IDness is the
same with and without DTD validation".

Or we could try and say which wins (but I am fairly sure the DTD
would win because of the ass-ba^H^H^H^H^H^H prior declaration wins
design and the instance is read last). Hence, a disparity between what
is declared as ID in the well formed instance and what is (re)declared
in the DTD might be best solved by authoring guidelines and best
practice; people who don't follow that get exactly what they used
to have, ie what is in the DTD is correct, and there is a disparity
between  well formed and valid views of the document, so they are no
worse off.

There seems to be no problem in terms of validation with W3C XML
Schema or with any other schema language that picks up an Infoset on
the way in, because this mechanism merely adds to the infoset at parse
time and can be defined to be the same sort of annotation that Schema
does, so a processor that works on the PSVI need not care where the
IDness of a particular attribute came from.

| NW>> One of the reasons that W3C XML
| NW>> Schema says that schema location information is only a hint is so that
| NW>> I can apply my own schema independent of what the author asked for.
| NW>> Well, what if I want to use some other attribute as an ID sometimes?
NW> |
NW> | Realistically, unless it was authored that way, your chances of
NW> | getting uniqueness on attribute values that were not already checked
NW> | for uniqueness are going to be spotty at best. But ok suppose you want
NW> | to ....
NW> |
| NW>> It just seems to me that moving IDness into the document is a fairly
| NW>> significant can of worms.

It might be, but your assertion about a use case of suddenly changing
the IDness of a document and re-validating it does not establish the
worminess. I can sense that you feel unease; this might be because it's
a can of worms or it might be that you have got used to treating two
concepts as the same when in fact they are architecturally different
and you are getting used to that.

NW> | Please see the example above which has the IDness in the instance and
NW> | tell me how your home-grown Schema which declares the toto attribute to
NW> | be an ID is going to deal with the input infoset that says partnum is
NW> | an ID.

NW> I didn't intend the latter comment about a can of worms as an
NW> extension of the former comment. I concede that having different
NW> schemas that use different attributes for IDness is a more theoretical
NW> than practical example. But it still raises philosophical issues to
NW> me.

Of course, and it's good to think these thought experiments through to
catch use cases. But as Len said it's a case of "figuring out who pays
which bills" and if it comes down to on the one hand having SOAP work
and having RDF/XML processable by XML tools and having multi-namespace
XML documents reliably and interoperably processed by a new generation
of XML clients so we can ditch the 1997-brand 'HTML' clients that
hamper us now - pauses for breath - on the one hand, and allowing
someone to theoretically shuffle all the datatypes in an instance and
see if it revalidates, then in terms of cost/benefit and who pays the
bills it's pretty obvious to me where the big win is.

NW> I think the worms in the can are:

NW> - - New validity problems:

NW>   <!DOCTYPE foo SYSTEM "foo.dtd">
NW>   <foo xml:id="bar"/>

NW>   If foo.dtd contains

NW>   <!ATTLIST foo name ID #IMPLIED>

NW>   Then the former document means one thing if it's accessed with a WF
NW>   parser and is rejected by a validating parser.

Yes.

Just as it would be rejected if it said

<!DOCTYPE foo SYSTEM "foo.dtd">
  <foo xml:lang="ja"/>

  If foo.dtd contains

  <!ATTLIST foo name CDATA #REQUIRED>

Validation *always* has the chance for rejecting well formed
documents. That is what it is for.

I propose that we deal with that by saying

a) the DTD view wins; if the DTD says different things than the
instance, that was your choice
b) best current practice is to reflect into the instance what the DTD
says about IDness
c) best current practice for new document types is to use a single
attribute name for all attributes of type ID, where possible
d) best current practice for namespaces which are expected to be mixed
with others is to call the ID attribute id.
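
Rules (a) and (b) can be sketched roughly as follows, in Python. This
is only an illustration under assumptions: the function name
effective_ids, its parameters, and the representation of ID
declarations as (element, attribute) pairs are all hypothetical.

```python
def effective_ids(dtd_ids, instance_ids, dtd_was_read):
    """Decide which (element, attribute) pairs count as IDs.

    dtd_ids      -- pairs declared as ID in the DTD
    instance_ids -- pairs given IDness in the instance itself
    dtd_was_read -- whether the processor actually read the DTD

    Returns (effective, disparity). Rule (a): if the DTD was read, its
    view wins. The disparity set lists pairs on which the two views
    differ, which is what rule (b) asks authors to keep empty.
    """
    if not dtd_was_read:
        # Well-formed-only processing: only the instance view is known.
        return set(instance_ids), set()
    disparity = set(dtd_ids) ^ set(instance_ids)
    return set(dtd_ids), disparity


# Norm's example: foo.dtd declares name as the ID, the instance uses xml:id.
dtd = {("foo", "name")}
instance = {("foo", "xml:id")}
print(effective_ids(dtd, instance, dtd_was_read=True))
print(effective_ids(dtd, instance, dtd_was_read=False))
```

A non-empty disparity set is the machine-detectable signal that the
well formed and valid views of the document disagree.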

NW> You could argue that
NW>   the same is true of

NW>   <!DOCTYPE foo SYSTEM "foo.dtd">
NW>   <foo id="bar"/>

NW>   But it's not the same since a WF parser would not associate "IDness" with
NW>   the 'id' attribute on foo. So xml:id really does introduce a new kind of
NW>   error.

Yes. One that is machine detectable, which is a big advance on the "if
it's a namespace that you have personal knowledge of" weasel-wording.

NW> - - Complexity, the xml:idAttr (or xml:idAttrs) and the concomitant
NW>   xml:idrefsAttr(s) add new levels of hierarchical complexity.

Well, they add some complexity to document instances at the cost of
removing some complexity and some uncertainty elsewhere, so it's unfair
to characterise it as "adding complexity". Saying "just do DTD
validation if you want IDs otherwise live without" also adds
complexity.

| NW>> If pushed, I think I could come to terms with the simple xml:id
| NW>> proposal, but the more complex variants look like too much complexity
| NW>> to me.
NW> |
NW> | Firstly, glad you could settle for xml:id. I could too, if that was
NW> | the best I was going to get but I think we can get better.
NW> |
NW> | However, it isn't simpler. If you have some XSLT template that copies
NW> | a bunch of stuff to the output and then copies foo from the sample
NW> | that I have above as a child element, then your choices are
NW> |
NW> | a) leave it alone and lose the IDness of partnum

NW> When you build a new result tree, you lose IDness anyway.

Because you can't guarantee uniqueness of the values? Sure, just as well
we are *not validating* then ;-)  or perhaps because you clearly can't
copy and paste the DTD fragments the same as you can with elements -
again, that means it's handy that we are not relying on such a
mechanism.

NW> | b) rewrite partnum to xml:id and possibly break tools that use part
NW> | numbers

NW> That's a choice the tool writer gets to make. And he or she can have
NW> different transformations that do different things in different
NW> contexts.

Yes, lots of flexibility there, plenty of room for tradeoffs. Given a
choice between two alternatives both of which broke something, then
rather than having the flexibility to write two different tools that
broke things differently and the flexibility to carefully remember
which tool to use when, I would rather have a third option that didn't
force me to choose between keeping the local name or the type, but let
me retain the local name along with its type - nice and simple.

NW> | The 'more complex' variant lets you
NW> |
NW> | c) leave it alone and retain the IDness by adding an attribute
NW> |
NW> | of course you have to have parsed the instance and looked in the
NW> | infoset to get the IDness in the first place. If the example had
NW> | instead been
NW> |
NW> | <?xml version="1.0" encoding="UTF-8"?>
NW> | <foo  partnum="i54321" bar="toto" xml:idAttr="partnum"/>
NW> |
NW> | then just copying the foo element does everything. Which is what I
NW> | meant by "aiding composability".

NW> Yeah, but it's a whole new bit of context that the parser has to keep
NW> around as it's building the infoset.

And it's right there on the element, which makes XSLT handling much
simpler. How many XSLT sheets have you seen that read the DTD and
carefully constructed little internal subsets in the output document??
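
A rough sketch of that composability argument, in Python with the
standard library's ElementTree. It assumes, purely for illustration,
that xml:idAttr names an ID attribute on the element that carries it;
xml:idAttr is only a proposal in this thread, and any scoping to
descendant elements is deliberately left out. The helper index_ids is
a hypothetical name.

```python
import copy
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"
# Hypothetical attribute: xml:idAttr is a proposal, not a standard.
ID_ATTR = "{%s}idAttr" % XML_NS


def index_ids(root):
    """Map ID values to elements, driven only by xml:idAttr in the instance."""
    ids = {}
    for el in root.iter():
        id_name = el.get(ID_ATTR)
        if id_name is not None and id_name in el.attrib:
            ids[el.attrib[id_name]] = el
    return ids


source = ET.fromstring(
    '<foo partnum="i54321" bar="toto" xml:idAttr="partnum"/>'
)

# Copy foo into a new result tree, as a transformation would.
result = ET.Element("wrapper")
result.append(copy.deepcopy(source))

# The IDness travels with the element; no DTD fragment has to be copied.
print(sorted(index_ids(result)))
```

Nothing outside the copied element is consulted, which is the sense in
which the attribute keeps the local name and its type together.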

NW> Yes, it's clear how it would be implemented and taken by itself it's
NW> clearly not *that complex*, but I feel like over the last few years
NW> we've taken a simple idea (a subset of SGML useful to the desperate
NW> perl hacker)

That simple idea is now the basis for the world's information system
and its electronic commerce system. So it's grown a bit beyond the
desperate perl hacker, who I don't see doing a great job on a PSVI
anytime this century.

NW> and added processing expectations and complexities (large
NW> and small) on top of each other again and again and again.

You persist in portraying it as complexity - who could argue for
complexity - and I will persist in showing that not doing any of these
options merely leaves great complexity in other places.

After all, my point is not to gratuitously add complexity just to annoy
people. My point is to have a simple, easily understood, rapidly
retrofittable method to get a real interoperability and authoring
benefit within a year or so.

NW> All of the decisions to add stuff, taken in isolation, looked
NW> tractable, but the whole is starting to appear ponderous. (Some would
NW> argue it became ponderous long ago, but this is not a troll.).

NW> I'm not sure that doing nothing is exactly the right answer,

I am very sure that doing nothing is not the right answer.

NW> but today I feel pretty strongly that something as complex as
NW> xml:idAttrs is too much.

Unfortunately you have not really demonstrated that it is. You have
demonstrated that you feel uneasy about it, and that it is a change.
You have argued that it increases complexity and I have argued that
introducing one of these methods would decrease complexity of
authoring multimedia documents for the Web and writing
multi-namespace-aware XML Web clients.

-- 
 Chris                            mailto:chris@w3.org
Received on Friday, 10 January 2003 14:02:37 UTC