Directed vs. Generic syntax reprise (Was: Re: Syntax noodling) from Jeni Tennison on 2006-05-08 (public-xml-processing-model-wg@w3.org from May 2006)

From: Jeni Tennison <jeni@jenitennison.com>
Date: Mon, 08 May 2006 09:55:28 +0100
To: public-xml-processing-model-wg@w3.org
Message-ID: <445F0780.2080100@jenitennison.com>

Hi,

Norm Walsh wrote:
> On the whole, I favor the more explicit options, but maybe there's
> something to be said for implicit names. Implicit steps and implicit
> I/O require less typing but they require the reader to know even more
> to make sense of the pipeline.

I think there are two sets of syntax-level options we have to explore:

1. directed vs. generic syntax (e.g. implicit/explicit steps)
2. defaulting attributes, elements, steps (e.g. input/output names)

This mail discusses the first. To summarise, I'd be happy with either 
syntax as long as elements (rather than attributes) were used to 
represent inputs/outputs/parameters. I also hope we don't end up with a
situation where there are options such that different users or different
components use different syntax.

There's a real continuum between directed and generic syntax. From 
completely generic syntax:

<p:step name="p:xslt">
   <p:input name="source" ref="..."/>
   <p:input name="stylesheet" ref="..."/>
   <p:output name="result"/>
</p:step>

through using the name of the element to indicate the component:

<p:xslt>
   <p:input name="source" ref="..."/>
   <p:input name="stylesheet" ref="..."/>
   <p:output name="result"/>
</p:xslt>

through using directed syntax elements for the inputs/outputs/parameters:

<p:xslt>
   <p:source ref="..."/>
   <p:stylesheet ref="..."/>
   <p:result/>
</p:xslt>

to using directed syntax attributes for the inputs/outputs/parameters:

<p:xslt source="..." stylesheet="..."/>

There's obviously also the possibility of allowing more than one of
these syntaxes (in the way that RDF does), but I think we should avoid 
that if at all possible.

There's also the possibility of having different components have 
different XML structures for their steps. Again, I think we should try 
to avoid that if we can, as it would raise the barrier on learning to 
put together pipelines.


I find it very hard to decide between generic and directed syntax. I've 
tried to look at these options from the standpoint of the usual
criteria I try to apply when designing markup languages:

0. information capture
1. human understandability
2. ease of processing
3. maintainability/extensibility
4. size

Fundamentally, the syntax needs to be able to do what we need it to be
able to do. For example, I think we will want to allow users to embed
documents within the input definitions, e.g.

<p:xslt>
   <p:source href="document.xml" />
   <p:stylesheet>
     <xsl:stylesheet ...>
       ...
     </xsl:stylesheet>
   </p:stylesheet>
   <p:result />
</p:xslt>

We may also want to provide other meta-information about the inputs and
outputs (such as the schemas they comply with), which would be 
impossible if they were represented by attributes. I think that rules
out using directed syntax attributes for the inputs/outputs/parameters,
but it doesn't rule out using directed syntax child elements.
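
As a rough sketch of what I mean (the p:schema element and its href
attribute are names made up purely for illustration, and the file names
are placeholders), child elements leave room for something like:

<p:xslt>
   <p:source href="document.xml">
      <!-- hypothetical: points at a schema the source should comply with -->
      <p:schema href="document.rng"/>
   </p:source>
   <p:stylesheet href="stylesheet.xsl"/>
   <p:result/>
</p:xslt>

With the attribute form there is nowhere to hang that extra information.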

The more generic options are obviously longer and arguably less
immediately understandable. (I think it's easier to understand the
plumbing behind the step -- what's the component, what are the inputs
and outputs -- but harder to grok what the step is actually doing in the
pipeline.)

I'm imagining that users will want to write their pipeline definitions
by hand in their favourite XML-editing software, and to use schemas to
provide auto-completion. Directed syntax is better for schema validation
because it's a lot easier to hang content models off element names than
off attribute values (the old thing of XML Schema not supporting
co-occurrence constraints). In a directed syntax, the schema declaration
for the p:xslt element would mean users could get prompted for an input
and a stylesheet rather than having to look up the component definition
to work out what names have been given to the inputs, parameters and
outputs.
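
To make that concrete, here is a minimal RELAX NG sketch of a directed
p:xslt declaration. The namespace URI bound to p: is a placeholder, and
the content model (one source, one stylesheet, one result, each input
referenced via ref) is an assumption for illustration rather than a
proposal:

<element name="p:xslt"
         xmlns="http://relaxng.org/ns/structure/1.0"
         xmlns:p="http://example.org/ns/pipeline">
   <!-- the p: namespace URI above is a placeholder -->
   <element name="p:source">
      <attribute name="ref"/>
   </element>
   <element name="p:stylesheet">
      <attribute name="ref"/>
   </element>
   <element name="p:result">
      <empty/>
   </element>
</element>

An editor driven by a schema like this can offer p:source, p:stylesheet
and p:result as soon as the user opens p:xslt. In the generic syntax the
same information is carried by the values of name attributes, which XML
Schema cannot use to select a content model.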

On the other hand, I'm also imagining that pipeline engines will make
available their own, implementation-defined, components as well as
the standard ones that we define. Put these assumptions together, and if
we had a directed syntax we would have a situation where every pipeline
engine would effectively use a different schema (because they support
different components). I think that would rapidly become difficult for
users to handle.

[Note: I know that XSLT extension elements are essentially the same
thing and XSLT users have been able to manage. One big difference is
that XSLT is inherently un-validatable using most schema technologies,
so XML editors have built-in auto-completion assistance rather than
using schemas. We want XProc to be validatable using XML Schema & RELAX
NG. Another is that XSLT extension elements are relatively thin on the
ground compared to the built-in XSLT elements, and I'm not sure whether
we're going to have a similar situation here: what's the proportion of 
built-in components vs. engine-specific components going to be?]

In addition, I'm imagining that users will want to write their own
reusable pipelines which they reference as components. I think it will
prove difficult for those pipelines to be referenced using directed
syntax. We could end up with a definition like:

<p:pipeline name="my:process">
   ...
</p:pipeline>

in the same file as a reference like:

<my:process ... />

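Schema validation goes out the window, because no fixed schema can
declare an element whose name is only introduced by the pipeline
document itself. Of course, we could treat user-defined pipelines as
different from other components, and use something like: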

<p:call-pipeline name="my:process">
   ...
</p:call-pipeline>

but if we're going to have that kind of generic syntax for user-defined
pipeline components, perhaps it would be simpler to unify and use it for
all components.

Another downside of directed elements is that they make it harder to
define a language that can be extended with documentation/test
cases/engine-specific annotations etc. A rule like "if it's not in the
XProc namespace, the pipeline engine can ignore it" is easier to
understand than having (as XSLT does) attributes to indicate which
namespaces actually should be understood by the pipeline engine.
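
For illustration, assuming a generic p:step syntax and a made-up
documentation namespace (the doc: prefix, its URI and the doc:note
element are all hypothetical):

<p:step name="my:process" xmlns:doc="http://example.org/ns/doc">
   <!-- not in the XProc namespace, so the engine can skip it -->
   <doc:note>Cleans up the source before it is transformed.</doc:note>
   <p:input name="source" ref="..."/>
   <p:output name="result"/>
</p:step>

With directed syntax the same step would be a my:process element, and
the engine would need to be told that my:process is a component to run
while doc:note is an annotation to ignore.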

As I say, I'm undecided.

Cheers,

Jeni
-- 
Jeni Tennison
http://www.jenitennison.com
Received on Monday, 8 May 2006 08:55:47 UTC