RE: property value clarification

At 11:07 PM -0400 11/1/98, Babich, Alan wrote:
>"By allowing dead properties to be expressed in XML, we are giving
>people a notation and telling them to use it. They need to know either
>that their notation will be preserved or what equivalence class of
>notations is being used."
>
>Notation need *not* be preserved. Presumably the
>end user is totally oblivious to the protocol being
>sent over the wire. Thus, the end user doesn't even know
>what the notation on the wire is. Nor should he.
>The end user is only concerned with his property values
>as he understands them being preserved, and with
>their semantics being preserved, and with querying
>them.

Right. But the end user is not really relevant here. The question is what
the user's _agent_, the client program, can expect the server to do when
presented with an XML fragment that the client wants associated with a
resource.

That this XML fragment must represent something of some value to the user
is true, but not amazingly relevant at the moment. If the writer of the
client knows what the server will do with arbitrary XML chunks, she can
then ensure that the _user's_ semantics are represented in XML, in such a
way that the server will preserve those semantics.

> Thus, the property *values* and their semantics
>are what need to be preserved, not some arbitrary
>notation buried in the software of the system that
>goes quietly over the wire in the dark of night.
>The value corresponds to your "equivalence class of
>notations", and is a better way to think about the
>problem, so that's what I'll do.

That's fine. When I attempted to talk about data types, I got a bunch of
unintelligible assertions about the totally unconstrained data model of
WebDAV servers. I don't care how servers implement their data models, or
whether XML is providing a data model or a notation.

In answer to your insistence that I treat XML as a notation, I phrased the
concept in terms of notational equivalence. I agree that data structure is
a more easily understood notion than equivalence class of notations (in a
complex grammar).
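
To make "equivalence class of notations" concrete, here is a minimal
sketch. It assumes XML canonicalization (C14N 2.0, as exposed by Python
3.8+ in xml.etree.ElementTree.canonicalize) as the equivalence relation.
That choice is only an illustration; nothing in WebDAV says it is the
relation a server uses. The helper name same_under_c14n is hypothetical.

```python
import xml.etree.ElementTree as ET

def same_under_c14n(xml_a, xml_b):
    # Two fragments are "equivalent" here if their canonical forms match.
    # Assumption: C14N 2.0 is the chosen equivalence; a server could
    # legitimately define a coarser or finer one.
    return ET.canonicalize(xml_a) == ET.canonicalize(xml_b)

# Same data, different notation: attribute order, quote style and
# whitespace inside the start tag all differ.
first = '<prop b="2" a=\'1\'><v>x</v></prop>'
second = '<prop a="1" b="2" ><v>x</v></prop>'

# Different under this relation: whitespace in character content counts.
third = '<prop a="1" b="2"><v> x </v></prop>'
```

Under this relation the first two fragments fall in the same class and
the third does not. A client can only rely on that if the standard says
which relation the server honors.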

>It may be that the disconnect in our communications
>centers around datatypes. You seem to have one view
>(which, to me, seems to be that the data type is a
>mystery to the server), while I have another (the
>server knows what it is).

WebDAV explicitly allows the extension of property values, and explicitly
states that only some of them are significant to the server. These
"extension values" and their treatment need to be defined.

>I believe that there are two types of property values
>in the world, simple values, and compound values.

I can believe that too: types and type constructors, with their associated
instances.

>People think they *already know* what the syntax and
>semantics of simple property values are. They have been
>using them for decades. They are integers, strings,
>datetimes, etc. People know, and, therefore, servers know,
>exactly what to do with them. That is why doing anything
>different than what people think they already know will
>result in cognitive dissonance (negative transfer of
>training) at best, and failure of the standard at worst.
>It would be bad design in any case.

But given an arbitrary identifier from some arbitrary schema, and a
character string encoding a value for a field in that schema, I don't, and
_can't_, make any assumption about how that string should be interpreted.

>Compound values are just a hierarchical arrangement
>of simple values. Think of C structures (with no
>pointer valued fields).

This would be reasonable if we had a notation whose data model corresponds
to this (simplistic) notion of composite. Since we don't, we either need to
explain what the differences are, or use the more powerful model (which
has attributes, ordering, repetition, unconstrained nesting, etc.).

None of these powerful features is useless, though they may be overkill.
Personally, I don't think so, because of the complexity of some metadata
proposals, and because we want to be extensible in ways that we don't
currently understand.

The fact that XML software is already available for free seems like a good
argument against claims that implementation is insanely hard. I also find
it telling that a byte-string capture facility is sufficient for a
reliable (if less searchable) implementation. This is an extremely simple
strategy to implement in almost any data storage model I can think of. The
practical difficulties raised so far are annoying, but not even close to
show-stopping. They strike me as summer intern fodder, though you might not
ship summer intern code.
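
As a rough sketch of the byte-string capture strategy, assuming an
in-memory dictionary in place of a real storage back end: the class name
VerbatimPropertyStore and its methods are hypothetical, not part of any
WebDAV server.

```python
# Hypothetical sketch: a dead-property store that keeps exactly the
# characters the client sent, keyed by resource and property name.
class VerbatimPropertyStore:
    def __init__(self):
        self._props = {}  # (resource, name) -> XML text as received

    def put(self, resource, name, xml_text):
        # No parsing and no normalization: the fragment is kept unchanged.
        self._props[(resource, name)] = xml_text

    def get(self, resource, name):
        # The client gets back precisely what it stored.
        return self._props[(resource, name)]

    def find_substring(self, needle):
        # The only cheap "search" this strategy offers: a raw text match.
        return sorted(k for k, text in self._props.items() if needle in text)
```

This is reliable in the sense that whatever notation the client chose
comes back intact. It is less searchable because the store knows
nothing about structure or data types.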

>The WebDAV property model is harmonious with this view
>of property values.

What property model? It says you can send XML documents, and defines a DTD
for a subset of the documents you can send.

>By induction on the hierarchy, as the hierarchy
>is traversed, you encounter nothing but simple values,
>and you know the semantics of each one. Therefore,
>the server *does* know the semantics of a whole
>*dead* compound property value. (Dead properties
>do *not* have null semantics.)

This is true only if you _add_ data type definitions not expressed in XML
(integer, date, etc.), and if you outlaw features XML does support
(repetition, optionality, unlimited nesting, attribute lists).

This seems weird to me, if potentially acceptable. It's certainly not
something that can be left unsaid in the standard.

>I feel very strongly that it should not be more
>complicated than this. By allowing arbitrary XML
>documents as property values, it becomes considerably
>more complicated than this, and there is no
>justification for complexifying the situation.

They're already allowed by the protocol. The question is what do they mean?

>The WebDAV draft ducked the datatypes issue,
>unfortunately. When an end user does queries, he
>has very ingrained expectations about what strings,
>integers, datetime, etc. do -- and we MUST match the end
>user's expectations to have a useful design. So, query
>forced the issue. Consequently, datatypes show up in
>the DASL draft. It turns out that there is a small
>universal set of simple datatypes adequate for most
>needs.

Can it encode a TEI header? Can it handle XML encodings of MARC records
(the LC cataloguing data format)?

>Furthermore, the XML Data effort is introducing
>datatypes to XML (note that I said XML, not just WebDAV).
>I am assuming that the XML Data effort will succeed.
>The XML Data effort uses an attribute to decorate
>the value (i.e., content) of an element with its
>datatype. XML data defines the universal datatypes
>(integer, string, datetime, etc.) and refinements of
>them (eight bit signed integer, etc.). XML Data
>illustrates the best use of attributes -- as decorations
>of the value (i.e., the element content), not as part
>of the value. (Clearly, the datatype of a property
>is not part of its value, nor is its name.)

If DAV requires the use of XML-data (and proscribes the use of other schema
formats), then that must be explicit. You can't choose a data format, leave
unspecified what that data format is supposed to mean, and then arbitrarily
restrict it later. That's not layered standards, that's intentional
booby-trapping of early implementers.

And XML-data doesn't exist yet, but XML itself already does.

>So, the client can tell the server what the datatypes
>are of each property in his hierarchical property.
>For values that are just bits, servers can typically
>store arrays of binary bytes. If they are truly arrays
>of bytes, querying them is a useless thing to do in most
>cases. However, if some compound data structure is
>obscured by being defined as an array of bytes, then
>the array of bytes should be defined as a compound
>property instead.

XML documents are sequences of characters, not arrays of bytes. There's not
even a transparent way to embed an arbitrary string of bytes in an XML
document. Clients that want to store binary data can work around this if,
and _only if_, they know what the server will do with the data they request
it to store.
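
One such workaround, sketched under the assumption that the client uses
base64 and that the server returns the element's character content
unchanged: the function names and the "thumbnail" element are
hypothetical illustrations, not anything defined by WebDAV.

```python
import base64
import xml.etree.ElementTree as ET

def wrap_bytes(tag, payload):
    # Base64 maps arbitrary bytes to characters that are legal XML content.
    elem = ET.Element(tag)
    elem.text = base64.b64encode(payload).decode("ascii")
    return ET.tostring(elem, encoding="unicode")

def unwrap_bytes(xml_text):
    # Only works if the server preserved the character content; leading
    # and trailing whitespace is tolerated, other rewriting is not.
    elem = ET.fromstring(xml_text)
    return base64.b64decode((elem.text or "").strip())

# Every byte value, including ones that cannot appear in XML directly.
payload = bytes(range(256))
fragment = wrap_bytes("thumbnail", payload)
```

The round trip is safe only when the client knows the server will not
reinterpret or re-encode that content, which is exactly the guarantee
under discussion.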

>Furthermore, in the case of arrays of bytes, byte
>ordering rears its ugly head. If a client is running on
>a system with a byte order different than that of the
>client that stored the value, it gets an array of
>binary bytes that has all the embedded longs, shorts,
>floats, doubles, integer64's, etc. byte swapped,
>which makes them unusable until they are unswapped.
>Byte ordering can be a good reason why an array of
>bytes should have been defined as a hierarchy of simple
>values.

Given the foregoing:
  1. a clear data model for what a server does with the XML data stream it
     is fed, and
  2. knowledge of the schema (XML Data, DTD, or whatever) for the property
     in question,

a client can issue a sensible query that avoids any such problems. Network
byte order is not a new invention, and it's not rocket science. It doesn't
even require special server support, if the data model is clear.
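
For illustration, a client-side sketch of the network byte order
convention using Python's struct module. The record layout (a 32-bit
integer, a 16-bit integer and a 64-bit float) is a made-up example, not
a format from any draft.

```python
import struct

# Hypothetical record layout: 32-bit signed int, 16-bit signed int,
# 64-bit float. "!" selects network (big-endian) byte order with
# standard sizes and no padding, whatever the host's native order is.
RECORD = struct.Struct("!ihd")

def encode_record(count, flags, ratio):
    # Always produces the same bytes on any platform.
    return RECORD.pack(count, flags, ratio)

def decode_record(data):
    # Reverses encode_record on any platform.
    return RECORD.unpack(data)

data = encode_record(1, 2, 0.5)
```

The writing and reading clients agree on the byte order between
themselves. The server only has to hand back what it was given.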

Nothing in my proposal that we use XML as-is would prevent a server
presented with an XML-data schema from performing the kinds of search and
storage optimizations you propose.

   -- David

>
>(If you think the value is a string, not an array
>of bytes, then you don't have byteorder issues.
>But, of course, then you have the simple data type called
>String, and the server knows about its semantics.)
>
>As far as the server rewriting values, it does not -- it
>can, however, return them in a different format.
>The server merely stores and returns the abstract
>values of live and dead properties in a concrete
>serialized form. Any equivalent representation of a simple
>value will do as an input or output value. But, of course,
>the server can *not* "return any damn thing it wants". It
>must produce an equivalent value. Clients can reformat
>input values to what the server requires, and can reformat
>output values into whatever format they please. Of course,
>defining a canonical form for values can simplify life
>by reducing possibilities. That is why WebDAV defines the
>"creationdate" property to be in ISO 8601 format, and
>even replicates a tiny part of the 8601 standard in an
>appendix. But DAV rejected forcing this format on servers,
>and left them free to define, accept, and return whatever
>format they like for specific datetime properties. IMHO,
>that is the right choice for DAV. That also accommodates
>the way that many existing systems work -- they are
>liberal in accepting datetimes, but they always output
>them in a canonical way (which can usually be specified
>by the system administrator).
>
>This is very predictable behavior, not "unpredictable"
>behavior.
>
>The way servers operate today is to store the property
>values in native binary format and discard any memory of
>the character format in which the property value was
>expressed on input. No commercial server I am aware of stores
>property values as XML documents. The property values
>get converted by software on the way in and the
>way out to some human readable form. There are no
>commercial implementations I am aware of that store
>property values as the literal input strings they got
>over the wire or from an application program.
>I strongly believe that this will continue to be true
>of commercial systems in the future, i.e., that the
>default behavior of most implementations will *not*
>be to store the literal XML as the property value.
>
>RDBMS's will have no trouble in principle storing
>hierarchical property values as outlined above,
>because SQL supports the universal set of basic
>datatypes -- integers, strings, datetimes, etc. All
>you have to do is tell the server what the datatype
>is by using the XML Data's approach, if the server
>doesn't already know the datatype. (There might be a
>default data type, and that default might be String.
>I'd have to look it up to be sure.)
>
>The "game" the IETF plays (and should play) is that
>what is sent across the wire is merely an on-the-wire
>representation of the actual state of the thing being
>transported. One must think of the property model
>in the abstract, and forget about XML or any other
>serialization format when doing so.
>Once the property model (or set of property models)
>to be supported is chosen, then the adequacy
>of any particular serialization format, e.g., XML,
>can be evaluated. That is the proper way for the
>design to proceed.
>
>All of the above should not be controversial.
>
>Metadata (information about properties) other than
>property name and data type, on the other hand, is not
>directly addressed by any draft in the WebDAV draft
>family. So, retrieval and manipulation of property
>metadata should not be part of the current discussion.
>An industrial strength metadata effort would be
>a separate effort.
>
>Alan Babich

_________________________________________
David Durand              dgd@cs.bu.edu  \  david@dynamicDiagrams.com
Boston University Computer Science        \  Sr. Analyst
http://www.cs.bu.edu/students/grads/dgd/   \  Dynamic Diagrams
--------------------------------------------\  http://www.dynamicDiagrams.com/
MAPA: mapping for the WWW                    \__________________________

Received on Monday, 2 November 1998 13:25:16 UTC