Re: Options for dealing with IDs from Chris Lilley on 2003-01-11 (www-tag@w3.org from January 2003)

From: Chris Lilley <chris@w3.org>
Date: Sat, 11 Jan 2003 19:51:42 +0100
To: www-tag@w3.org, Norman Walsh <Norman.Walsh@Sun.COM>
Message-ID: <91409086843.20030111195142@w3.org>
On Friday, January 10, 2003, 9:27:18 PM, Norman wrote:


NW> / Chris Lilley <chris@w3.org> was heard to say:
NW> | On Friday, January 10, 2003, 6:57:18 PM, Norman wrote:
| NW>> / Chris Lilley <chris@w3.org> was heard to say:
| NW>> | On Friday, January 10, 2003, 5:13:53 PM, Norman wrote:
| | NW>>> / noah_mendelsohn@us.ibm.com was heard to say:
| | NW>>> | I think I agree with Tim's other conclusion:  do nothing is probably the 
| | NW>>> | least risky solution.  We've got too many typing mechanisms already.
| NW>> |
| | NW>>> I have mixed feelings, but I think I agree with Tim and Noah.
| NW>> |
| | NW>>> "IDness" is a consequence of validation. That means you have to
| | NW>>> validate.
| NW>> |
| NW>> | So, your solution is option 1 or option 8 (DTD or Schema validation in
| NW>> | all cases).
NW> |
| NW>> Yes. Or an internal subset as you point out further down. "The status quo."
NW> |
| | NW>>>  I understand that sometimes has painful consequences. If a
| | NW>>> language wants to have IDs so that authors can point into documents,
| | NW>>> the workaround is to establish a MIME type for that language and
| | NW>>> describe what fragment identifiers mean independent of validation.
| NW>> |
| NW>> | That does not give you IDs. It gives you pointers. It does not solve
| NW>> | the getElementByID problem and it does not solve the #fo selector
| NW>> | problem.
NW> |
| NW>> Right. getElementsByID() returns an empty set if you haven't validated.
NW> |
NW> | You mean, you propose that it *should* return the empty set if you
NW> | haven't validated.

NW> Isn't that what it does today (if you'll allow that an internal subset
NW> with a few attlist decls is "validation" in this context)?

It isn't validation, in this or any other context and no, that is not
what it does today. Some DOM implementations (and some CSS parsers), when
presented with a random bit of XML, will only provide IDs if there is a
full DTD available and validation succeeds; some will provide them
only if there is some form of DTD available and that part mentions
some IDs; some will recognize the namespace and use a cached DTD or
some other internal data structure (which may not correspond to the
one which may or may not be linked from the instance) and some will
just say there are no IDs (even if there is an external DTD subset
that says otherwise). All of those possibilities are justifiable based
on some reading or other of one of the relevant specifications.

There are also other less defensible implementations such as "only
HTML has IDs" and "anything called id is an ID" and "anything called
[iI][dD] is an ID" - I mention these only as evidence of confusion in
the marketplace.
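
To make the divergence concrete, here is a minimal sketch of three of the ID-recognition policies described above, applied to the same well-formed document. It is not taken from any real DOM or CSS implementation; the function names and the sample document are assumptions made purely for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Illustrative document: three attributes an author might intend as identifiers.
DOC = '<doc><a id="x1"/><b ID="x2"/><c name="x3"/></doc>'


def ids_named_id(root):
    """Policy: "anything called id is an ID"."""
    return {el.attrib["id"] for el in root.iter() if "id" in el.attrib}


def ids_named_id_any_case(root):
    """Policy: "anything called [iI][dD] is an ID"."""
    name_pattern = re.compile(r"[iI][dD]")
    return {
        value
        for el in root.iter()
        for name, value in el.attrib.items()
        if name_pattern.fullmatch(name)
    }


def ids_without_dtd(root):
    """Policy: no declarations were read, so nothing is reported as an ID."""
    return set()


root = ET.fromstring(DOC)
for policy in (ids_named_id, ids_named_id_any_case, ids_without_dtd):
    # Each policy gives a different answer for the same input document.
    print(policy.__name__, sorted(policy(root)))
```

The same instance yields one, two, or zero IDs depending on which policy the consumer happens to implement, and none of them sees the `name` attribute that a DTD might have declared as the ID.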

This is the current mess. This is 7) Muddle along. This is "let's
insert some user agent sniffing on the server so that we can get a bit
more interoperability". And this is, to be harsh but fair, the utter
shambles that Tim Bray proposes we get comfy living in.

| NW>> Workarounds for the #fo problem could be achieved in the CSS spec
| NW>> without changing XML. (No, I don't have any specific workaround in
| NW>> mind.)
NW> |
NW> | Allow me to consider that assertion unproven, in that case, and merely
NW> | observe that fixing the IDness problem in multiple *consumers* of IDs
NW> | (probably in different ways) is clearly suboptimal to fixing it
NW> | centrally.

NW> Yep.

Thanks.

| NW>> What's being proposed here is another, independent mechanism *in
| NW>> addition to* validation.
NW> |
NW> | No, *before* validation.

NW> You can't mean "no it's not in addition to". It's clearly "in addition
NW> to" if it happens before validation and then I do validation.

OK, I read you to mean "in addition to" as in "happening in parallel
with validation". I agree with your reformulation.

NW> In any event, it introduces a new opportunity for errors that hitherto
NW> did not occur.

| NW>> Like Noah said, "we've got too many typing mechanisms already".
NW> |
NW> | And like I said, not fixing this will give us plenty more as all the
NW> | unsatisfied customers invent them one per specification. I can't
NW> | believe that you are seriously proposing that.

NW> Hmm. I don't think I'd seriously considered the possibility that other
NW> specs would solve the problem by saying "in FooML, all attributes
NW> named 'id' are of type ID by definition and must appear in the infoset
NW> with that [attribute type]". But maybe they would.

I cite exhibit A, the SOAP specification, as an existence proof.

| NW>> Fair point. Let me rephrase. It provides an additional type annotation
| NW>> mechanism out in the authoring space. This provides yet another
| NW>> mechanism to do something and it may do so in ways that are sometimes
| NW>> invalid.
NW> |
NW> | Of course. Well formed authoring always has the possibility of
NW> | creating things that are then determined to be invalid - duplicate
NW> | ids, incorrect content models, missing required attributes and so on.
NW> | How is this different?

NW> Maybe it isn't. It feels different, I guess, because it will make an
NW> error that almost never-ever happens today one that occurs fairly
NW> frequently (namely, having two attributes of type ID on the same
NW> element).

Yes. It means, in effect, that well formedness constraints are well
formedness constraints and validation constraints are validation
constraints and that validation constraints are only enforced because
of validation. That seems a whole lot clearer to me.

| NW>> If you look at a document with well-formed glasses on, then again with
| NW>> validation glasses on, there are a small number of differences that
| NW>> you may perceive. These proposals all add one more thing to that set.
| NW>> I'd like to make that set smaller, not larger.
NW> |
NW> | So would I which is why I would like there to be a way to add IDness
NW> | to the infoset of well formed documents and for the W3C XML Schema to
NW> | pick that up as its input Infoset and reflect these values back in the
NW> | PSVI so that the number of differences seen with the two sets of
NW> | glasses becomes smaller: IDness is preserved after validation.

NW> You can't add something to the set and make it smaller.

You misunderstand me. If I add something to one set that is already in
the other set, then clearly the difference set gets smaller.
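
The set argument can be checked with a small worked example. The property names below are illustrative assumptions, not Infoset terminology; the point is only that adding to the well-formed view something the validated view already contains shrinks the difference between the two.

```python
# What each "pair of glasses" can see (illustrative property names only).
wf_view = {"elements", "attributes"}
valid_view = {"elements", "attributes", "id_attributes", "default_attributes"}

# Differences perceived today between the two views.
before = valid_view - wf_view

# Add IDness to the well-formed view; it was already in the validated view.
wf_view_with_ids = wf_view | {"id_attributes"}
after = valid_view - wf_view_with_ids

print(sorted(before))
print(sorted(after))
```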

NW> With any of these proposals it will become possible to have IDs in
NW> the WF view and validity errors in other view in ways that do not
NW> occur today.

Half correct. You already agreed that we have IDs in the WF view
today. I am only arguing that the WF view remains a WF view and not a
"WF plus validation of some sort in some sense" view. It makes things
clearer and simpler.

NW> One logical extension of what you're saying would be to remove xs:ID
NW> from XML Schema and say that IDness really is separate. Then XML
NW> Schema would have only key/keyref not id/idref and key/keyref.

That is one possibility but not my preferred option; those people who
are using W3C XML Schema to produce IDness and are happy to do so
should be able to continue the practice. This is why I prefer to
define the WF IDness in terms of contributions to an Infoset, adding
the same properties to a PreSVI as W3C XML Schema would add to a PSVI.
This is much the same as the Infoset that occurs when full DTD
validation is done during parsing and then a W3C XML Schema is used as
a second step. The input infoset already has some xs:IDs in it.

| NW>> (Before someone points out xsi:type, let me just say I've never used
| NW>> it and I hope I never do. Everytime I think about it, it whispers "I'm
| NW>> a design flaw, but you can't quite work out what design would be
| NW>> better, can you?" Then it giggles evilly.)
NW> |
NW> | I hear Norm proposing option #10 (or is that 11) using xsi:type in the
NW> | instance (though that would need to be a child element not an
NW> | attribute now because we have multiple attributes ....)

NW> Egad! I'm not proposing that. I'm not even remotely proposing
NW> something that bears a faint resemblance to that!

Heh! Well, I proposed options that I was really not happy with, for
completeness; feel free to do the same. It's better to list an option
and explain why it is not a realistic option than it is to not list it
because of "obviousness" and have someone else read the document and
assume it was missed out because we never thought of it (though that
can happen too, and several of those options have been brought
forward).

| NW>> | Further, the validation semantic is already out in the authoring
| NW>> | space. Authors can plug away in the internal subset (particularly in
| NW>> | those DTDs that have parameter entities in their content models
| NW>> | precisely to allow for such extension) and can even declare the entire
| NW>> | DTD in the internal subset and make it up as they go along.
NW> |
| NW>> I concede that not all uses of the internal subset are validation, but
| NW>> I tend to think of them that way.
NW> |
NW> | I agree you think of them that way. I am trying to get you not to
NW> | think of them all that way because it complicates the architecture.

NW> Document-instance schema modifications definitely complicates the
NW> architecture. There's no question about that.

Validation that is not validation but in some sense is validation
definitely complicates the architecture too. I mean, I am following XML
1.0 here. It says there are three validation constraints on IDs and I
am saying that when validation has not occurred, those validation
constraints do not apply. This hardly seems contentious.


NW> | Okay so modifications to the instance that affect the IDness do not
NW> | concern you, ok that is good ....

NW> I think it'd be fairer to say that existing mechanisms for such
NW> modifications don't concern me as much. :-)

Here we come back to comfort and familiarity, which is important, but
can grow with time.

NW> | There seems to be no problem in terms of validation with W3C XML
NW> | Schema or with any other schema language that picks up an Infoset on
NW> | the way in, because this mechanism merely adds to the infoset at parse
NW> | time and can be defined to be the same sort of annotation that Schema
NW> | does, so a processor that works on the PSVI need not care where the
NW> | IDness of a particular attribute came from.

NW> But it could still result in an element having multiple attributes of
NW> type ID. Are you proposing that that should no longer be an error?

No, I am proposing that, just like the XML 1.0 spec says, this is a
validity constraint. So, it is a validation error. If you validate
then you look for these sorts of errors and if you don't then you
don't.
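
As a rough sketch of that separation, the following treats the ID checks as something that runs only when validation is requested. The helper names, the `validate` flag, and the set of ID-typed attribute names are hypothetical, and the "more than one ID attribute" check is a simplification applied to the instance rather than to declarations as XML 1.0 states it.

```python
import xml.etree.ElementTree as ET

# Expanded name that ElementTree uses for the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def id_errors(root, id_attr_names):
    """Report ID problems for attributes assumed (by the caller) to be of type ID."""
    errors = []
    seen_values = set()
    for el in root.iter():
        present = [name for name in el.attrib if name in id_attr_names]
        if len(present) > 1:
            errors.append(f"<{el.tag}> has more than one ID attribute: {sorted(present)}")
        for name in present:
            value = el.attrib[name]
            if value in seen_values:
                errors.append(f"duplicate ID value {value!r}")
            seen_values.add(value)
    return errors


def process(xml_text, id_attr_names, validate):
    """Parse for well-formedness; apply the ID checks only when validating."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well formed
    return id_errors(root, id_attr_names) if validate else []


DOC = '<foo xml:id="bar" name="baz"/>'
ID_ATTRS = {XML_ID, "name"}  # hypothetical: both attributes carry IDness

print(process(DOC, ID_ATTRS, validate=False))
print(process(DOC, ID_ATTRS, validate=True))
```

The well-formed pass accepts the document without complaint; the error only appears once the caller asks for validation.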

Are you uncomfortable with an existing XML 1.0 instance with an
incomplete (decorating, not validating) DTD being parsed by an
existing non-validating but (internal and) external subset fetching
parser, and the resulting infoset, on being validated by a W3C XML
Schema parser, generating validation errors?

Are you comfortable with an existing XML 1.0 instance with a complete
DTD being parsed by an existing validating parser, and the resulting
infoset, on being validated by a W3C XML Schema parser, generating
validation errors? (For example, due to a restriction of some kind on
a string, which the DTD cannot express)?

If so, then the existing proposals are no different. If not, then the
cause of your discomfort is how DTDs and Schemas work together (and
clearly they have to because W3C XML Schema deliberately does not
provide an entity declaration mechanism) and should perhaps be worked
out in a different thread.

NW> | It might be, but your assertion about a use case of suddenly changing
NW> | the IDness of a document and re-validating it does not establish the
NW> | worminess. I can sense that you feel unease; this might be because it's
NW> | a can of worms or it might be that you have got used to treating two
NW> | concepts as the same when in fact they are architecturally different
NW> | and you are getting used to that.

NW> Maybe.

Ponder on it some more, while reading your first sentence of reply
where you talk about validation "in this context" - validation is not
context sensitive - it either happens or it doesn't. XML does not have
a class of "maybe validated" instance. It's either well formed or
valid.

| NW>> I think the worms in the can are:
NW> |
| NW>> - - New validity problems:
NW> |
| NW>>   <!DOCTYPE foo SYSTEM "foo.dtd">
| NW>>   <foo xml:id="bar"/>
NW> |
| NW>>   If foo.dtd contains
NW> |
| NW>>   <!ATTLIST foo name ID #IMPLIED>
NW> |
| NW>>   Then the former document means one thing if it's accessed with a WF
| NW>>   parser and is rejected by a validating parser.
NW> |
NW> | Yes.
NW> |
NW> | Just as it would be rejected if it said
NW> |
NW> | <!DOCTYPE foo SYSTEM "foo.dtd">
NW> |   <foo xml:lang="ja"/>
NW> |
NW> |   If foo.dtd contains
NW> |
NW> |   <!ATTLIST foo name CDATA #REQUIRED>
NW> |
NW> | Validation *always* has the chance for rejecting well formed
NW> | documents. That is what it is for.

NW> The point I've been trying to make is that these proposals introduce
NW> *a new chance*.

Yes? And the various restrictions and data types in W3C XML Schemas
introduced new chances, too. They introduced a chance that what a DTD
considered to be perfectly valid CDATA was not in fact a valid gDate or
USPostalCode. This is an advantage, not a disadvantage.
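
A small sketch of that "new chance": a value that any DTD would accept as CDATA can still fail a stricter, schema-style check. The postal code pattern below is an assumed stand-in for a user-defined simple type, not a definition from W3C XML Schema, and both function names are hypothetical.

```python
import re

# Assumed pattern for a hypothetical USPostalCode type: ZIP or ZIP+4.
US_POSTAL_CODE = re.compile(r"\d{5}(-\d{4})?")


def dtd_cdata_ok(value):
    """A DTD's CDATA attribute type places no lexical restriction on the string."""
    return isinstance(value, str)


def schema_postal_code_ok(value):
    """The stricter second-pass check that a schema datatype might impose."""
    return US_POSTAL_CODE.fullmatch(value) is not None


for candidate in ("02139", "02139-4307", "somewhere nice"):
    print(candidate, dtd_cdata_ok(candidate), schema_postal_code_ok(candidate))
```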

I suspect that in your day to day work you use DTD valid instances all
the time and thus the root of your disquiet is that I am making well
formed instances more visible. Yes, well formed instances can turn out
to have errors of various sorts when validated. Valid instances can
turn out to have errors of various sorts when further validated to a
stricter, more restrictive or simply different set of validation
constraints.

NW> Maybe on balance that's the right thing to do. Maybe.

I am pretty sure it is.

| NW>> But it's not the same since a WF parser would not associate
| NW>> "IDness" with the 'id' attribute on foo. So xml:id really does
| NW>> introduce a new kind of error.

NW> | Yes. One that is machine detectable, which is a big advance on the
NW> | "if its a namespace that you have personal knowledge of"
NW> | weasel-wording.

NW> Point taken.

OK.

| NW>> and added processing expectations and complexities (large
| NW>> and small) on top of each other again and again and again.

NW> | You persist in portraying it as complexity - who could argue for
NW> | complexity - and I will persist in showing that not doing any of
NW> | these options merely leaves great complexity in other places.

NW> Fair enough :-)

Thanks once again.

| NW>> but today I feel pretty strongly that something as complex as
| NW>> xml:idAttrs is too much.
NW> |
NW> | Unfortunately you have not really demonstrated that it is.

NW> What, I wonder, would constitute such a demonstration?

Well, something a bit more strongly articulated than statements that
well formed (or indeed valid) instances can generate validation errors
when further validated, or that this is new and makes you a little
uneasy.

NW> You haven't
NW> demonstrated that anything more than simple xml:id is necessary.

I have presented it as an option and I have pointed out both its
strengths and its drawbacks (such as the need to change existing
content). I have also said that I could live with it if it's the best
we can manage and that I think we can do better. I can't really be
fairer than that.

Maybe you consider the need to revise other specifications (such as
renaming the ID in RDF/XML to xml:id and renaming the id in SOAP to
be in the XML namespace) to be trivial modifications. They might be
but then again they might produce resistance from the authors of those
specifications or they might not adopt that solution where they would
have adopted a different solution. In those cases, that would be an
argument that a different option to xml:id is necessary.

NW> You've argued that it would be somewhat more convenient for some
NW> authors of some documents (that have legacy schemas) to be able to
NW> have nested, scoped ID declarations but you haven't convinced me
NW> that's in the 80% case.

Okay, fair enough. In my normal work I am handling multi-namespace
XML documents routinely so I do consider this to fall into the 80%
case. I also consider the insertion of XML snippets into XML templates
to be a common operation in XML processing and that this process
should continue to be simple while also allowing IDness and local
names to be preserved. Again, that seems to fall fairly easily into
the 80% of many people's paying XML work.

A solution that requires a small change also seems easier to
adopt and thus more likely to succeed. If someone picked ID or Id
instead of id and does not much care, or put id into their own
namespace because there was not a compelling reason to do otherwise
then sure, they could switch to xml:id with little pain. I would
expect the resistance to be from people who picked PartNum or
Catalog-Number or something in a non-English language and want to keep
it that way.

It may be that these cases are not frequent, not vocal, or tend to use
DTD validation all the time anyway in which case they could be
plausibly shuffled off into the 20% and told they were too expensive
to cater for. Or it might be that they are not, and we need to try to
cater for them (which still might fail). We can't really tell until
there is a more readable document for them, which can have wider
review than this one, albeit public, list.

I believe that the correct thing to do at this stage is to get wider
review and to hear what the use cases are and which solutions suit
which people.

I therefore propose that the options document be revised to include
the new options that were proposed and to add the advantages and
disadvantages that became known as a result of this discussion.  I
volunteer to write such an update, although not next week because of
travel (I am chairing a f2f meeting in Australia; email access will be
sporadic in the next week).

I would like to make one rev of such a document in this forum, to be
sure that I did not miss out or misunderstand anyone's point, and then
issue the document as a W3C Note to gather feedback. This Note will
list all the options, and invite feedback as to which are preferred
options, which could be lived with, which are unacceptable, and to
also collect use cases for real-world ID usage.

I hope that the data collected will be helpful to the XML Core WG in
deciding what if anything needs to be done in this area.



-- 
 Chris                            mailto:chris@w3.org
Received on Saturday, 11 January 2003 13:51:51 UTC