Exploring the Underworld Wide Web from Rick Jelliffe on 2003-04-06 (www-tag@w3.org from April 2003)

From: Rick Jelliffe <ricko@topologi.com>
Date: Sun, 6 Apr 2003 20:44:47 +1000
To: <www-tag@w3.org>
Message-ID: <02bd01c2fc29$8b385a20$4bc8a8c0@AlletteSystems.com>
The I18n WG's Charmod draft has been in development for a long time,
and it is an excellent work.  

However, there is still resistance to many of its ideas. Some of the resistance
is based on the idea that it is telling people how to write their software,
which is seen as impractical and none of the WG's business. Michael Kay[1] has 
voiced several concerns on XML-DEV recently, for example.

Apart from the specific technical concerns about who does what where
and to whom, which are appropriate for the I18n IG forum and XML-DEV
and XML Plenary, I think there is an architectural question here which 
the Architecture document should clarify, even if by just providing a 
vocabulary of terms that specs can use.

The question is how to express that the WWW is complex, but that
it can be partitioned in ways that help understand the appropriateness
and fixity of W3C specifications for different uses.

To be concrete:--

Suggestion
-------------

Add to Architecture ideas to the effect:--

1) The "Standard" World Wide Web:  representations and protocols
 use standard specifications with no incompatible extensions
 or behaviours. The public WWW must conform to Charmod
 and WAI.  Senders may make assumptions that recipients
 will use particular software processes or have available 
 particular files.  For example, we write an HTML page for a
 browser and expect the browser to know how to process it
 even if there is no DOCTYPE declaration.
"Public Identifiers" may be used for things that are built-in.

2) The "Extended" World Wide Web: where standard representations
 are used, but there is a layer of user defined usage.  In particular,
 an XML document that does not use a publicly-available language.
Senders and recipients must make no assumptions that the other end 
has anything other than a complete implementation,  nor that the 
other end will use any particular processing software or have 
available any particular information. 

3) The "Private" World Wide Web: this is where there is private
agreement between parties to use WWW protocols, but they
have negotiated to only use a profile of a specification or that
certain processing is expected at the recipient.

4) The "Underworld Wide Web": this is the realm of  processing software
which creates, maintains, transforms, processes, etc. documents
but whose input and output are not directly available to 
strangers.  This includes, for example, data capture software
working with incomplete documents.

With these four definitions, TAG then should say:

A) The Standard WWW MUST conform to W3C specs
B) The Extended WWW MUST conform to W3C specs, 
  or compatible profiles:  these profiles are "subsets-of-agreement"
  (e.g. "we won't send the letter A") rather than "subsets-of-implementation"
  (e.g. "a processor may fail if it receives the letter A")
C) The Private WWW SHOULD conform to W3C specs, in particular
  in only using compatible profiles. However, technical practicalities
  are important.
D) The Underworld Wide Web MAY conform to W3C specs, 
  however, technical practicalities are king. 
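
As a rough illustration of points A) to D), the four sectors and the
conformance level each is held to could be written down as a small lookup
table. This is only a sketch: the names Zone, CONFORMANCE and must_conform
are invented for this example and do not come from any W3C specification.

```python
from enum import Enum


class Zone(Enum):
    """The four sectors proposed above (names are illustrative)."""
    STANDARD = "Standard WWW"
    EXTENDED = "Extended WWW"
    PRIVATE = "Private WWW"
    UNDERWORLD = "Underworld Wide Web"


# RFC 2119-style level each sector is held to, following A) to D).
CONFORMANCE = {
    Zone.STANDARD: "MUST",
    Zone.EXTENDED: "MUST",    # or compatible "subsets-of-agreement" profiles
    Zone.PRIVATE: "SHOULD",   # technical practicalities are important
    Zone.UNDERWORLD: "MAY",   # technical practicalities are king
}


def must_conform(zone):
    """True if conformance to W3C specs is mandatory in this sector."""
    return CONFORMANCE[zone] == "MUST"
```

A spec could then label each feature with the set of zones it is
appropriate for, which is what points i) and ii) ask for.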

Then TAG should say:

 i) W3C specs SHOULD distinguish which features are appropriate for
the Standard, Extended, Private and Underworld WWW.
ii) W3C specs MAY provide features that are not appropriate for the
Standard WWW.  

----------

Where does this get us?  Well, let's look at XML: it allows us to
say the following important and useful architectural directives:

* Any required post-processing based on the recipient implying
 certain infoset augmentations is not appropriate for the
 Extended WWW.  This means that it is not appropriate to
 require validation or schema-augmentation of XML files 
 on the Extended WWW.
 
  This is especially true with W3C XML Schemas, because 
  there is no way of assuring that the PSVI the sender wants
  will be the PSVI that the receiver gets.  This has a lot of
  consequences.

* Getting back to Charmod, it allows us to say that Standard
WWW and Extended WWW data MUST be early normalized
by the sender and clients MUST fail if they detect unnormalized data.
 But Private and Underworld WWW clients MAY choose not to fail, as suits them.

* It shows why the standard character entities are appropriate
  for text/xhtml+mathml  but not for text/xml
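
The Charmod point above can be sketched as a receiver-side check that is
strict for the Standard and Extended WWW and lenient elsewhere. This is a
minimal sketch under stated assumptions: NFC stands in here for the
normalization form Charmod requires, the zone names are plain strings chosen
for the example, and accept_text is a hypothetical helper, not part of any
spec or library.

```python
import unicodedata

# Sectors in which a client MUST fail on unnormalized input (assumed labels).
STRICT_ZONES = {"standard", "extended"}


def accept_text(text, zone):
    """Return text that is acceptable for the given zone.

    Strict zones reject unnormalized input outright; the Private and
    Underworld zones are free to repair it instead, as suits them.
    """
    normalized = unicodedata.normalize("NFC", text)
    if normalized == text:
        return text  # the sender normalized early, nothing to do
    if zone in STRICT_ZONES:
        raise ValueError("unnormalized text is not acceptable here")
    return normalized  # lenient zones: quietly repair
```

For example, "e" followed by U+0301 (combining acute accent) would be
rejected in the "standard" zone but repaired to U+00E9 in the "private" zone.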

 I think this would go a good way to alleviate the unproductive
concerns about the scope of Charmod in particular, but also
clarify other topics as well. 

For example, it enables us to suggest that XML is popular because
while it is intended as "SGML on the Web", it also provides 
excellent support for Underworld activity (entities, multiple character
sets, and PIs).   

It also lets us postulate the best-practice that representations on the Public 
and Extended WWW need to be atomically parseable: they should 
not require multiple accesses.  Under this best-practice, XML external
entities are not appropriate for the Extended WWW but they are appropriate
for the Standard WWW.  It is appropriate for HTML to use entities,
but not for SOAP data, for example. 

I suspect that many criticisms of W3C technology can come down
to a lack of awareness (whether by the specification developers, 
by the specification editors, or by the punter outside) that a good
specification should either provide broad support for all these four
sectors of the WWW *or* be explicit about which areas it is
appropriate for.  

It provides a vocabulary and ground for profiling, for example:

 * XML Schema WG can warn that value defaulting is inherently
  unreliable for XML Schemas used on the Extended WWW
 * XQuery WG can state that, because Schema validation is
  not reliable for the Extended WWW, type-reliant queries are
  not reliable (and therefore not appropriate) for the Extended
  WWW. (Same true for XPath2 and XSLT2.)
 * XML Core WG can warn that entity boundaries SHOULD NOT
  be part of the infoset of a document on the Public and Extended WWW,
  but they may be part of the infoset for the UnderWWW. 


Cheers
Rick Jelliffe

(Invited expert, W3C I18n IG, not speaking for them)



[1]See  http://lists.xml.org/archives/xml-dev/200304/maillist.html
(not compiled yet)
Received on Sunday, 6 April 2003 06:40:51 UTC