Re: PIs considered harmful Was: XML-SW, a thought experiment

/ Jacek Kopecky <jacek@systinet.com> was heard to say:
|  I'm speaking here as a relative newcomer to the depths of XML, 
| but I have a feeling that you wish for three things which 
| together contradict themselves:
|
|  1) maintain tight control over your vocabulary,
|  2) extend it nevertheless in specific applications,
|  3) validate the extended documents according to the original 
| tight schema.

That's not actually quite what I want. I want to ignore my
instructions for how the document should be processed when I'm testing
the validity of the document. They aren't relevant.

|  Why does the specific application not validate against a 
| specific schema? You could get the benefit of validating the 
| extensions, too.

Let's look at a concrete example.

DocBook has <variablelist>s. They're basically like HTML DLs. Suppose
I have a book that contains a whole bunch of these. I write an XSL
stylesheet to produce PDF (via XSL FOs) from this book. I print it and
the design department reviews it and says, "Yep, perfect, exactly what
the publishing specs say. Go ahead and send it to the printer."

Next, I write a stylesheet to produce HTML for online publication of
the book. This time the design department says, "You know, Norm, a
bunch (but not all) of these lists look sort of awkward as HTML lists.
Could you make them into tables instead?"

Naturally, I flatly refuse. They aren't tables semantically and it
would be wrong to turn them into tables in the XML source just because
someone thinks they'd look prettier in HTML. And besides, even if I
were willing to do that, I'd have to go through the whole print
approval cycle again. I'd rather have a root canal.

What I really want here is, uh, how can I describe this? What I want
is an instruction that I can insert into my document that will tell a
particular processor that it should do something special. I want a,
wait for it, a processing instruction!

So I add a few PIs to my source document:

  <variablelist>
    <?dbhtml format="table"?>
    ...
  </variablelist>

I tweak my HTML stylesheet and voila, I'm finished in an afternoon.
And the print stylesheets still do exactly what they should. And the
design department is happy with what the HTML stylesheet produces. And
I get to go home before bedtime and have a cookie because I met all my
deadlines.
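
To make the "tweak" concrete, here is a minimal sketch of the logic a
processor needs: look for a dbhtml PI among the children of the list
and switch the output container if it asks for a table. The real
stylesheets are XSLT; this is Python only to show the idea. The
function names, the sample entry, and the regex used to read the
pseudo-attributes are all assumptions made for illustration.

```python
import re
from xml.dom import Node, minidom

# Sample source: a variablelist carrying the dbhtml PI from the example above.
# The varlistentry content is made up for illustration.
SOURCE = """<variablelist>
  <?dbhtml format="table"?>
  <varlistentry><term>PI</term>
    <listitem><para>processing instruction</para></listitem>
  </varlistentry>
</variablelist>"""


def pi_pseudo_attributes(node, target):
    """Return the pseudo-attributes of the first child PI with this target.

    PI content is just a string as far as XML is concerned; treating it
    as name="value" pairs is a convention, parsed here with a simple regex.
    """
    for child in node.childNodes:
        if (child.nodeType == Node.PROCESSING_INSTRUCTION_NODE
                and child.target == target):
            return dict(re.findall(r'([\w.-]+)\s*=\s*"([^"]*)"', child.data))
    return {}


def html_container(variablelist):
    """Pick the HTML element used to render a variablelist."""
    fmt = pi_pseudo_attributes(variablelist, "dbhtml").get("format")
    return "table" if fmt == "table" else "dl"


doc = minidom.parseString(SOURCE)
print(html_container(doc.documentElement))  # prints: table
```

A list without the PI falls through to the default "dl", which is why
the print stylesheets (which never look for dbhtml) are unaffected.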

The most commonly suggested alternative to PIs is an element
in a foreign namespace:

  <variablelist>
    <dbhtml:format-as-table/>
    ...
  </variablelist>

I'm sorry, that's just not a reasonable suggestion:

1. I have a $35,000 editing, content, and workflow management system
that took six months to build, install, and debug that is built around
the DocBook schema. You want me to make a local change to that system
to support one formatting request?

2. I exchange files with 11 authors and 6 translators on 3 continents.
You want me to propagate my schema change to all of them?

3. Some of the folks that I exchange documents with work for stuffy
organizations that insist on industry standard schemas. DocBook does
not now allow random namespaced cruft, nor is it ever likely to. You
want me to get the DocBook Technical Committee to accept a request to
change the DTD to support my formatting request? (Here's a tip: as the
chair of that TC, I know what the likely answer is going to be :-)

4. *Every* stylesheet that processes the document has to go to special
effort to deal with or ignore the extra elements. (The stock HTML stylesheets
for DocBook will turn this into <font color="red">&lt;dbhtml:format-as-table&gt;</font>,
for example.)

The only reasonable answer that I see (please, don't suggest using CSS
instead of a table; that may or may not be reasonable depending on the
browsers involved and it isn't what the design department told me to
do (and if you really wanted me to, I could come up with a similar
example that isn't amenable to a CSS solution)), is to move this
formatting information completely out of band.

But that's a lot more work and it's a lot more fragile. The PI is
entirely harmless (and invisible) to processors that don't care about
it, but provides useful information for processors that go out of
their way to look for it.
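
A small sketch of that difference, using Python's ElementTree, whose
default parser discards PIs unless told otherwise. The same naive
"processor" sees nothing extra in the PI version, but trips over the
foreign element. The namespace URI and the helper name are made up
for illustration.

```python
import xml.etree.ElementTree as ET

# The PI version of the list.
WITH_PI = (
    '<variablelist><?dbhtml format="table"?>'
    '<varlistentry/></variablelist>'
)

# The foreign-namespace version. The namespace URI is hypothetical.
WITH_ELEMENT = (
    '<variablelist xmlns:dbhtml="http://example.org/dbhtml">'
    '<dbhtml:format-as-table/><varlistentry/></variablelist>'
)


def child_tags(xml_text):
    """Return the tags a processor sees when it walks the list's children."""
    return [child.tag for child in ET.fromstring(xml_text)]


print(child_tags(WITH_PI))
# ['varlistentry']
print(child_tags(WITH_ELEMENT))
# ['{http://example.org/dbhtml}format-as-table', 'varlistentry']
```

Every consumer of the second document has to know about the extra
element and decide what to do with it; consumers of the first need
do nothing at all.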

The argument that PIs are a security danger doesn't move me at all.
Anyone who implements a system that processes
<?runthis cmd="rm -rf ~/"?> knows full well what door they've left
open and had better take precautions.

                                        Be seeing you,
                                          norm

-- 
Norman.Walsh@Sun.COM   | A great deal may be done by severity, more by
XML Standards Engineer | love, but most by clear discernment and
XML Technology Center  | impartial justice.--Goethe
Sun Microsystems, Inc. | 

Received on Wednesday, 13 February 2002 10:01:22 UTC