The xmlChunk-44 problem statement (resend)

I originally sent this message on 12 Feb, moments before I sent my
reply to it[1]. However, the original is not in the archives, and I
never received a copy from the list, so I suspect it was never
delivered. Here it is again.

[1] http://lists.w3.org/Archives/Public/www-tag/2004Feb/0036.html

-- begin original message --

On 2 Feb, I accepted an action item to summarize issue xmlChunk-44 and
solicit input. Herewith is my draft summary.

XML documents are self-contained. By that, I mean that all of the
questions that can be asked about a single document, or about a
particular point in a single document, can be answered definitively if
the entire document is available. Some examples of questions that one
might ask about a document are:

  - What is the base URI of the root of the document?
  - What version of XML does the document use?
  - How many namespaces does the document use?
  - How many character information items does the document contain?
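As a rough illustration, a few of these document-level questions can be
answered with nothing but the document itself. The sketch below uses
Python's standard xml.etree.ElementTree module. The sample document and
the function names are invented for the example, "namespaces used" is
taken to mean "namespaces declared", and the character count only
approximates the Infoset notion of character information items.

```python
import io
import xml.etree.ElementTree as ET

XML_NS = "{http://www.w3.org/XML/1998/namespace}"

# Hypothetical sample document, invented for this sketch.
DOC = """<?xml version="1.0" encoding="utf-8"?>
<a xmlns="http://example.org/ns/one" xmlns:two="http://example.org/ns/two"
   xml:base="http://example.org/base/">
  <two:b>abc</two:b>
</a>"""


def declared_namespaces(xml_text):
    """Collect every namespace URI declared anywhere in the document."""
    source = io.BytesIO(xml_text.encode("utf-8"))
    events = ET.iterparse(source, events=("start-ns",))
    return {uri for _event, (_prefix, uri) in events}


def character_count(xml_text):
    """Count characters of element content, whitespace included."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return sum(len(text) for text in root.itertext())


def root_base(xml_text):
    """Return the xml:base attribute on the root element, or None."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return root.get(XML_NS + "base")


print(sorted(declared_namespaces(DOC)))
print(character_count(DOC))
print(root_base(DOC))
```

Every answer comes from the one document; no outside context is needed.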

Some examples of questions that one might ask about any particular
point in a document are:

  - What is the current base URI?
  - What namespaces are in-scope?
  - What is the current value of xml:lang?
  - More generally, what is the most recently seen value for any
    particular attribute?
  - How many ancestors are there?
  - How many preceding siblings are there?
  - How many following siblings are there?
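The structural questions in this list can be sketched the same way.
ElementTree keeps no parent pointers, so the example builds a
child-to-parent map first. The tiny document and the position_info
helper are hypothetical, and only element siblings are counted (text
nodes, comments, and processing instructions are ignored).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this sketch.
DOC = "<r><a/><b><c/><d/><e/></b></r>"


def position_info(root, target):
    """Count ancestors and element siblings of target within root's tree."""
    # ElementTree has no parent links, so map each child to its parent.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    ancestors = 0
    node = target
    while node in parent_of:
        node = parent_of[node]
        ancestors += 1

    parent = parent_of.get(target)
    if parent is None:
        # The root element has no ancestors and no siblings.
        return {"ancestors": 0, "preceding_siblings": 0, "following_siblings": 0}

    siblings = list(parent)
    index = next(i for i, s in enumerate(siblings) if s is target)
    return {
        "ancestors": ancestors,
        "preceding_siblings": index,
        "following_siblings": len(siblings) - index - 1,
    }


root = ET.fromstring(DOC)
print(position_info(root, root.find("./b/d")))
```

None of these answers can be computed from the target element alone,
which is the point: they depend on the rest of the document.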

Given another XML document, we can ask the additional question "are
these two documents the same?" The answer to that question clearly
depends on how you define "the same", and experience suggests that there
is no single definition that will garner universal acceptance.

At the heart of xmlChunk-44 is the observation that we sometimes want
to extract portions of an XML document and use those fragments or "chunks"
in other contexts. For example, we might want to:

  - Use a chunk as the value of a property in an RDF graph
  - Perform some operation on a portion of a document extracted with
    an XPath expression
  - Transform a small portion of a large document
  - Transmit a signed chunk inside the body of a larger document
  - Compare two chunks to see if they're the same

The question then becomes, how can we communicate context information
about the chunk so that the recipient of the chunk can get the
expected answers?

For example, consider this document:

  <?xml version="1.0" encoding="utf-8"?>
  <article xmlns="http://docbook.org/docbook-ng" version="bourbon"
           xml:lang="en" xml:base="http://example.org/not/really/here">
  <info>
    <title>Unit Test: article.001.xml</title>
    <authorgroup>
      <author>
        <personname>
          <firstname>Norman</firstname>
          <surname>Walsh</surname>
        </personname>
      </author>
    </authorgroup>
  </info>

  <para>There's no content here.</para>
  </article>

Now let's consider the "author" chunk. As I described above, we can
answer questions about the author:

  - It has the base URI "http://example.org/not/really/here"
  - It has the xml:lang "en"
  - It has the DocBook version "bourbon"
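A minimal sketch of how those three answers fall out of the complete
document, again using Python's xml.etree.ElementTree. The nearest_value
helper is hypothetical. It simply returns the value found on the nearest
ancestor-or-self element, which is enough here because the only xml:base
is absolute; real XML Base processing would also resolve relative values.

```python
import xml.etree.ElementTree as ET

XML_NS = "{http://www.w3.org/XML/1998/namespace}"
DB_NS = "{http://docbook.org/docbook-ng}"

DOC = """<?xml version="1.0" encoding="utf-8"?>
<article xmlns="http://docbook.org/docbook-ng" version="bourbon"
         xml:lang="en" xml:base="http://example.org/not/really/here">
<info>
  <title>Unit Test: article.001.xml</title>
  <authorgroup>
    <author>
      <personname>
        <firstname>Norman</firstname>
        <surname>Walsh</surname>
      </personname>
    </author>
  </authorgroup>
</info>

<para>There's no content here.</para>
</article>"""


def nearest_value(root, target, attr):
    """Return attr from the nearest ancestor-or-self of target, or None."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    node = target
    while node is not None:
        if attr in node.attrib:
            return node.attrib[attr]
        node = parent_of.get(node)
    return None


root = ET.fromstring(DOC.encode("utf-8"))
author = root.find(".//" + DB_NS + "author")
context = {
    "base": nearest_value(root, author, XML_NS + "base"),
    "lang": nearest_value(root, author, XML_NS + "lang"),
    "version": nearest_value(root, author, "version"),
}
print(context)
```

All three values live on the article element, not on the author element
itself.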

Suppose I take that chunk and place it in some new context:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/stuff/1.0/">
  <rdf:Description rdf:about="http://example.org/not/really/here#author">
    <ex:prop rdf:parseType="Literal"
             xmlns="http://docbook.org/docbook-ng">
      <author>
        <personname>
          <firstname>Norman</firstname>
          <surname>Walsh</surname>
        </personname>
      </author>
    </ex:prop>
  </rdf:Description>
</rdf:RDF>

I've lost important information about that chunk. I can't tell what
language it's in or what base URI it should have, for example, or what
version of DocBook it uses. (It might not be appropriate in all
applications to preserve all of the context, but it should be possible
to preserve the context when it's important to the application.)
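To make the problem concrete, here is one possible way (an illustration
only, not a proposed answer to the issue) to carry some context along
with a chunk: copy the inherited xml:lang and xml:base values onto the
chunk's root element before extracting it. The extract_with_context
helper is hypothetical, and the document is an abbreviated copy of the
one above. Note that the sketch has nowhere generic to put the DocBook
version, since that attribute is only meaningful on the article element.

```python
import copy
import xml.etree.ElementTree as ET

XML_NS = "{http://www.w3.org/XML/1998/namespace}"
DB_NS = "{http://docbook.org/docbook-ng}"
# Attributes assumed, for this sketch, to be worth carrying along.
INHERITED = (XML_NS + "lang", XML_NS + "base")

# Abbreviated copy of the example document.
DOC = """<article xmlns="http://docbook.org/docbook-ng" version="bourbon"
         xml:lang="en" xml:base="http://example.org/not/really/here">
<info>
  <authorgroup>
    <author>
      <personname>
        <firstname>Norman</firstname>
        <surname>Walsh</surname>
      </personname>
    </author>
  </authorgroup>
</info>
</article>"""


def extract_with_context(root, target):
    """Copy target and pin inherited xml:lang and xml:base onto the copy."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    chunk = copy.deepcopy(target)
    chunk.tail = None  # whitespace after the element belongs to the old context
    for attr in INHERITED:
        node = target
        while node is not None:
            if attr in node.attrib:
                chunk.set(attr, node.attrib[attr])
                break
            node = parent_of.get(node)
    return chunk


root = ET.fromstring(DOC)
author = root.find(".//" + DB_NS + "author")
chunk = extract_with_context(root, author)
print(ET.tostring(chunk, encoding="unicode"))
```

The copy now answers the language and base URI questions on its own,
while the original author element is left untouched.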

There is also the deeper question of establishing a canonical form for the
logical XML chunk. We might, for example, wish it to be the case that the
following RDF statement

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/stuff/1.0/">
  <rdf:Description rdf:about="http://example.org/not/really/here#author">
    <ex:prop rdf:parseType="Literal"
             xmlns:db="http://docbook.org/docbook-ng">
      <db:author>
        <db:personname>
          <db:firstname>Norman</db:firstname>
          <db:surname>Walsh</db:surname>
        </db:personname>
      </db:author>
    </ex:prop>
  </rdf:Description>
</rdf:RDF>

be considered "the same" as the former statement.
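One definition of "the same" under which the two chunks match is
equality of expanded names, attributes, and text, ignoring the prefixes
used to spell the names. The sketch below assumes that definition and
additionally ignores leading and trailing whitespace in text; both are
choices made for the example, not anything the issue prescribes. The
same_chunk function is hypothetical. ElementTree already reports each
element name as {namespace-uri}local-name, so prefixes drop out.

```python
import xml.etree.ElementTree as ET

CHUNK_DEFAULT_NS = """<author xmlns="http://docbook.org/docbook-ng">
  <personname>
    <firstname>Norman</firstname>
    <surname>Walsh</surname>
  </personname>
</author>"""

CHUNK_PREFIXED = """<db:author xmlns:db="http://docbook.org/docbook-ng">
  <db:personname>
    <db:firstname>Norman</db:firstname>
    <db:surname>Walsh</db:surname>
  </db:personname>
</db:author>"""


def same_chunk(a, b):
    """Compare two elements by expanded name, attributes, text, and children."""
    if a.tag != b.tag or a.attrib != b.attrib:
        return False
    if (a.text or "").strip() != (b.text or "").strip():
        return False
    if (a.tail or "").strip() != (b.tail or "").strip():
        return False
    if len(a) != len(b):
        return False
    return all(same_chunk(x, y) for x, y in zip(a, b))


first = ET.fromstring(CHUNK_DEFAULT_NS)
second = ET.fromstring(CHUNK_PREFIXED)
print(same_chunk(first, second))
```

A comparison of the serialized text would report the two chunks as
different, which is why the choice of definition matters.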

I think the issue xmlChunk-44 asks, essentially:

 1. Should there be a standard way to communicate context information
    for a portion of an XML document?
 2. If so, what should it be?
 3. And to what extent should it provide a "canonical" form?

                                        Be seeing you,
                                          norm

-- 
Norman.Walsh@Sun.COM / XML Standards Architect / Sun Microsystems, Inc.

Received on Wednesday, 18 February 2004 11:13:57 UTC