Minutes for the XML Processing Model WG f2f 2006 August 4 morning

Minutes 2006-08-04: Friday morning

Present:
   Norm (chair)
   Murray (host)
   Jeni (scribe)
   Henry
   Mohamed
   Alex
   Richard

Norm: What do we want to say about pipelines, pipeline libraries, 
recursive pipelines etc.? First: is it reasonable to have a pipeline 
inside another pipeline?

Henry: I would like to, for modularity. It's a choice to package up 
steps into a named pipeline.

Richard: You should be able to do that. But other programming 
languages have multiple functions and usually do not put functions 
inside other functions. If you do have functions inside functions, it's 
usually to give the inner function access to information in the outer 
function. There's also hiding the function from the outside environment...

Henry: Hiding isn't a big deal.

Richard: If we don't want to access names in the outer pipeline, then 
you don't need this.

Henry: I don't want the inner pipelines to access information from the 
outside one. In Java, if I'm a novice, I write classes inside other classes.

Jeni: But in Java, you have to create another file. In our language, you 
don't, so why would you want to embed it?

Murray: What's the difference whether it's embedded or not?

Henry: When I pass the file to another user, it's no longer obvious 
which pipeline should be run.

Murray: Not yet, but we could provide a mechanism to say which one is 
going to be run. But I was really asking Richard why he cared...

Richard: Because in other languages, there are other semantics 
associated with nesting a function inside another one, to do with 
accessing information in the outer function.

Henry: That's why it doesn't work for me. Suppose my common code used a 
pipeline parameter.

Murray: Why isn't it all in scope when the functions are at the same 
level? It doesn't matter if our pipeline language works differently from 
programming languages.

Norm (and others): Yes it does.

Richard: We also might, in the future, want to provide some semantics to 
nested pipelines. Here's a Java example:

class foo {
   int a;
   class bar {
     int b, c;
     // ...
     void sum() {
       a = b + c;   // the inner class can use the outer field 'a'
     }
   }
}

'a' is available in the inner class 'bar', but not from outside.

Murray: In the case the function is outside, you have to pass the 
arguments. If the function is inside, you don't have to pass the 
argument into the function: it's just there.
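
The same distinction can be sketched outside Java. The following Python 
fragment is only an analogy, and the function names are invented for 
illustration: a function defined at the top level must be passed the 
value, while a nested function sees the enclosing variable through 
lexical scoping.

```python
def add_outside(a, b):
    # A function defined at the top level only sees what it is passed.
    return a + b


def outer(a):
    def add_inside(b):
        # A nested function reads 'a' from the enclosing scope;
        # it does not need to be passed in.
        return a + b

    return add_inside(10)


print(add_outside(1, 10))
print(outer(1))
```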

Henry: I want it to be a software engineering choice. If I have:

<pipe>
   S1
   S2
   S3
   ...
</pipe>

Suppose I want to package S2 and S3 into a named pipeline. I'm happy to 
have named inputs and outputs, and to have it encapsulated, but I want 
to have parameters passed in automatically. I want parameters to be 
lexically scoped, but not ports.

Richard: A choose can use ports from outside: normally ports are 
lexically scoped as well.

Alex: What would be the problem with saying that you have to declare 
parameters? If you want to pass a parameter, then you should declare it.

Henry: I could live with that. But then it doesn't matter whether it's 
inside or outside. I have my mind on the simple user with a single 
pipeline element.

Richard: Nesting should correspond to scoping.

Alex: Nesting with encapsulation makes sense to me: the pipelines are 
only accessible in the parent pipeline.

Norm: It seems odd that it's only one level deep.

Henry: I agree with Alex.

Alex: But because it's a black box, this doesn't solve the pipeline 
library problem.

Henry: We have that as well.

Jeni: We need pipeline libraries, and they do what we need to do, so why 
make the language more complex by adding this ability?

Henry: You can't use that argument, because removing constructs from the
language doesn't make it simpler to use the language.

Murray: Using nested pipelines makes absolute sense to me.

Example:

<pipeline>
   <pipeline name="a">
   </pipeline>

   <pipeline name="b">
     <pipeline name="c">
     </pipeline>
   </pipeline>

   <step>...</step>
</pipeline>

Can you call 'a' from 'b'?

Henry: No.

Murray: Then I don't understand.

Norm: If 'b' can't call 'a', then my user who wants to modularise 
something that's common from 'a' and 'b' to 'd', and pulls it out, but 
can't call it, is completely baffled.

Murray: The step asks me to run 'b'. Surely I should be aware of 'a'. 
That's what makes sense to the naive user.

Henry: So named pipelines always get put into the pipeline library. You 
can run one of those by name.

Norm: So now 'a', 'b' and 'c' are all peers and all callable from each 
other.

Alex: The library changes as you go in: you add things to it. When you 
go inside 'b', 'c' is added to the pipeline scope.

Murray: Naive user. That 'c' is inside of 'b', and the only way I can 
run 'c' is by invoking 'b'. So I can run a pipeline that's inside of me, 
or outside of me, but no one else can run pipelines that are inside me.
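
The rule Murray and Alex describe (a pipeline can call pipelines nested 
directly inside itself or declared in any enclosing pipeline, but 
nothing declared inside a sibling) can be sketched as a walk up the 
nesting tree. This is a minimal illustration, not an agreed design; the 
class and method names are hypothetical.

```python
class Pipeline:
    """A named pipeline that may contain other named pipelines."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}

    def add(self, name):
        child = Pipeline(name, parent=self)
        self.children[name] = child
        return child

    def resolve(self, name):
        # Look at pipelines nested directly inside this one,
        # then walk outwards through the enclosing pipelines.
        scope = self
        while scope is not None:
            if name in scope.children:
                return scope.children[name]
            scope = scope.parent
        return None


# The example from the minutes: 'a' and 'b' at the top, 'c' inside 'b'.
root = Pipeline("root")
a = root.add("a")
b = root.add("b")
c = b.add("c")

print(b.resolve("a") is a)   # 'b' can call 'a' (declared outside it)
print(b.resolve("c") is c)   # 'b' can call 'c' (declared inside it)
print(a.resolve("c"))        # 'a' cannot see 'c'
```

Under this reading, 'b' can call 'a', which is the behaviour Norm's 
user expects, while 'c' stays private to 'b'.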

Richard: I agree that we can do this, but if we do, we will have to 
decide a lot of things that are quite complicated, and we should leave 
it 'til version 2.0.

Murray: That's a good reason for not doing this.

Alex: Personally, pipeline libraries are useful, but let's leave *them* 
to version 2.0. Because import is complicated.

Henry: I assumed that you'd just specify all your pipeline libraries on 
the command line.

Norm: I think we need to have pipeline libraries. I think Richard's 
right that pipeline libraries with pipelines all at the same level is 
sufficient. We might later think it's too much work for naive users. We 
can always do that later.

Richard: I think we should do it later. We shouldn't pre-empt the 
semantics of nested pipelines, which we might add later.

Murray: We have to do the libraries, with some include mechanism. I like 
the nesting, but I understand Richard's argument that this is too much 
for us to take it on right now. I don't think we should make the 
decision now: I think we should include it in the document, say we're 
uncertain, and then later pull it, unless users come back saying that 
they really need it.

Alex: Don't we have group? Can't we use that?

Henry, Norm: It's not the same thing.

Murray: Can we call this procedure rather than pipeline?

Norm: Let's talk about that later.

Murray: How is this nesting thing not like groups?

Richard, Norm: Groups get executed when you come across them: they just 
provide some scope: you can't call them again.

Murray: Can we conflate them?

Norm: I don't like the idea of asking the public whether we should do 
something. All we'll ever get from the public is "yes, we should do it". 
We should give them the minimum, and get them to ask for more.

Alex: So can we talk about pipeline libraries?

Henry: <pipeline-library> contains zero or more <pipeline> elements. 
We're done.

Jeni: We need defaulting.

Henry: <pipeline-library> contains zero or more <pipeline> elements, and 
a default-pipeline attribute that points to one of them.

Norm: Let's get agreement on pipeline libraries.

Alex: We shouldn't have default-pipeline. We just supply the QName when 
we call the pipeline.

Norm: If you have to point to the library, then it's no cost to provide 
the name as well. To review: A pipeline library contains zero or more 
pipelines, all of which have names. (Zero-or-more or one-or-more...) I 
don't feel strongly about defaulting.

Richard: I want to just refer directly to the library, just like in C, 
you have a 'main'. A library can have a default pipeline in it, that 
gets executed if you get given the library.

Norm: Java has this functionality. It seems no effort, and has some use.

Mohamed: What about including other libraries?

Richard: We should use import rather than include. Include implies 
textual inclusion. With import, the pipeline library might be already 
compiled, and the only things that are available are some packaged 
information.

Alex: Can you import inside the pipeline library?

Richard: Yes, you have to import from the pipeline library.

Norm: I suggest we leave off default-pipeline attribute for now. The
<import> has a source attribute that points to the imported library. It
can go in <pipeline-library> and in <pipeline>.

Murray: I think pipeline libraries should have a name for debugging 
purposes, so if I loaded it, debugging information would be raised.

Richard: I think it should have a name as well.

Norm: OK, optionally have a name.

Jeni: We shouldn't allow <import> within <pipeline>.

Richard: You might have a single <pipeline> element in a file; you 
should be able to import pipelines into it.

Jeni: No: if you need to reuse pipelines, you have to ramp up to having 
a pipeline library.

Mohamed: You should import pipeline by QName rather than URI.

Alex: I would be happy with an <import> that excluded the URI and 
instead told the implementation which pipelines you need by name.

Richard: So do you expect a catalog mechanism so that I can get 
libraries by URI when I'm not connected to the 'net? Is this our problem?

Norm: This isn't our problem, just as it isn't XSLT's or Schema's 
problem: it's implementation-defined how the documents are retrieved 
given a URI.

Henry: Suppose I have a pipeline and Richard says he has a library. I 
thought that I had to say on the command line where the library is, but 
everyone said that was crazy. So I need an import library statement that 
I can put in my pipeline.

Jeni: You add <pipeline-library> around it and add <import>.

...much discussion about the requirement for naive users to add <import> 
in their standalone <pipeline> skipped...

Norm: I'm looking for a compromise. Suppose we go back to the GCC model: 
you supply the pipeline libraries at the command line.

Richard: I think pipelines are going to be little things that they want 
to run. They don't want to have to do this at the pipeline. I think we 
should allow <import> within <pipeline> when <pipeline> is a document 
element. But in a pipeline library, you have to put it at the top level.

Alex: So if I rip out a pipeline from the pipeline library and try to 
run it, then it would be invalid. Plus if I put a pipeline into a 
library, I need to move the <import> into the top level of the pipeline 
library.

Murray: What was the logic behind not having the wrapper with a 
standalone pipeline and putting the <import> inside that wrapper?

Norm: Most users are going to have simple pipelines, and they're not 
going to want to write the wrapper.

Alex: If <pipeline> can have <import> inside it, then it should be able 
to do that within a pipeline library.

Jeni: Are the imported pipelines visible within the <pipeline> itself or 
in the entire library?

Alex: Only in the <pipeline> that contains the <import>.

Norm: Recap:

We will have a <pipeline-library> element that can contain pipelines. It 
has an optional name. You can import pipelines from another pipeline 
library. A pipeline can also stand by itself, which can import other 
pipeline libraries. You can import a standalone pipeline.

Jeni: Circularity?

Norm: If you import a library that you've already imported, you don't 
worry: all the pipelines you import are available.

Murray: I should be able to have an import in a pipeline in a pipeline 
library, so I can cut and paste.

...

Norm: What about saying that a standalone pipeline can't be imported? We 
have a syntactic wart in allowing import within a pipeline in one place 
and not another; this is a way of getting around it.

Richard: To go back: if A imports B and C, then C shouldn't be able to 
access pipelines in B.

Alex: In XSLT, you can.

Richard: In C you can't.

Norm: In XSLT you can.

Richard: It means that there are libraries that will work in some 
contexts but not in others.

Norm: We can say that if any pipeline library contains a step that 
references a pipeline that isn't imported then it's an error.

Richard: So names are globally scoped.

    A
   / \
  B  C
     |
     D

A can see things in B and C and D. B can only see things in B. C can see 
things in C and D. D can only see things in D.

Richard: So everything in the libraries that you import gets 
automatically exported. What about circularity?

    A <-+
   / \  |
  B  C  |
     |  |
     D -+

Henry: Where you start is the top (A). You stop at D.

(agreement)
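
The visibility rule and the handling of circular imports can be 
sketched as a graph traversal that never follows a library it has 
already visited. This is a minimal sketch under the assumption that 
imports are recorded as a simple mapping from library name to the list 
of libraries it imports; the function name is hypothetical.

```python
def reachable(imports, start):
    """Return the libraries visible from `start`, in first-visit order."""
    seen = []
    stack = [start]
    while stack:
        lib = stack.pop()
        if lib in seen:
            continue  # already loaded: do not follow it again
        seen.append(lib)
        # reversed() so that imports are visited in declaration order
        stack.extend(reversed(imports.get(lib, [])))
    return seen


# The first diagram: A imports B and C, C imports D.
tree = {"A": ["B", "C"], "C": ["D"]}
# The second diagram: as above, but D also imports A.
cyclic = {"A": ["B", "C"], "C": ["D"], "D": ["A"]}

print(reachable(tree, "A"))
print(reachable(tree, "C"))
print(reachable(cyclic, "A"))
```

Starting from A in the circular case, the traversal reaches D, finds 
that A has already been visited, and stops there.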

Norm: The name for the import statement is <import> with an attribute 
called 'source' (this is consistent with what we do with <input>).

Alex: In pipeline libraries, we also have to deal with declaring components.

Norm: Yes, we need to deal with extension components.

Alex: We should put it in the pipeline libraries.

DECISION: We have pipeline libraries with <pipeline-library> document 
elements, with an optional name attribute and containing multiple 
pipelines. We have standalone pipelines with <pipeline> document 
element. Both can have <import source="URI" />* as children of the 
document element. This points to either a pipeline library or a 
standalone pipeline. As well as the built-in components and 
implementation-defined components, a pipeline library or a standalone 
pipeline has in scope all the pipelines of all the pipeline libraries or 
standalone pipelines that it imports, recursively.
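
As an illustration of the decision, the following sketch collects the 
pipeline names in scope for a document by following <import source="..."/> 
children recursively. It rests on several assumptions not settled in 
the minutes: the elements are in no namespace, every pipeline (including 
a standalone one) carries a name attribute, and documents are looked up 
in an in-memory table rather than retrieved by URI. The file names and 
the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents keyed by URI; a real implementation would fetch them.
DOCS = {
    "main.xml": '<pipeline name="main"><import source="lib.xml"/></pipeline>',
    "lib.xml": (
        '<pipeline-library name="lib">'
        '<import source="util.xml"/>'
        '<pipeline name="a"/><pipeline name="b"/>'
        "</pipeline-library>"
    ),
    "util.xml": '<pipeline-library><pipeline name="c"/></pipeline-library>',
}


def pipelines_in_scope(uri, docs, seen=None):
    """Collect names of pipelines defined in `uri` and everything it imports."""
    seen = set() if seen is None else seen
    if uri in seen:
        return set()  # already processed; also guards against import cycles
    seen.add(uri)
    root = ET.fromstring(docs[uri])
    if root.tag == "pipeline":
        # A standalone pipeline: the document element is the pipeline.
        names = {root.get("name")}
    else:
        # A pipeline library: pipelines are children of the document element.
        names = {p.get("name") for p in root.findall("pipeline")}
    for imp in root.findall("import"):
        names |= pipelines_in_scope(imp.get("source"), docs, seen)
    return names


print(sorted(pipelines_in_scope("main.xml", DOCS)))
```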

No consensus on a default pipeline to run within a pipeline library.

BREAK

Inputs and outputs.

Henry: This isn't a proposal for naming, it's an analysis that may help.

A component is a named box with named things that data comes into and 
named things that data comes out of. We have the ability to replicate 
them, and use things to connect these boxes together. I propose 
declaring components with:

<comp name="xslt">
   <inputs>
     <port name="doc" arity="1" />
     <port name="ss" arity="1" />
   </inputs>
   <outputs>
     <port name="result" arity="1" />
   </outputs>
</comp>

and parameters go in here as well, but this discussion doesn't 
incorporate parameters.
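
For illustration only, a declaration of this shape could be read into 
a simple structure as follows. The element and attribute names are the 
ones from Henry's example; the function name and the dictionary layout 
are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

DECL = """
<comp name="xslt">
   <inputs>
     <port name="doc" arity="1" />
     <port name="ss" arity="1" />
   </inputs>
   <outputs>
     <port name="result" arity="1" />
   </outputs>
</comp>
"""


def read_component(xml_text):
    """Return the component name and its input/output ports with arities."""
    comp = ET.fromstring(xml_text)

    def ports(section):
        # Map each port name to its declared arity (kept as a string).
        return {p.get("name"): p.get("arity")
                for p in comp.findall(f"{section}/port")}

    return {
        "name": comp.get("name"),
        "inputs": ports("inputs"),
        "outputs": ports("outputs"),
    }


print(read_component(DECL))
```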

We have something new now, which covers four language constructs: group,
for-each/viewport, choose and when. These are all containers for steps, 
with their own paired in/out at the top and out/in at the bottom. Choose 
actually looks almost like this, but the things inside are containers as 
well.

<step kind="xslt">
   <input name="doc" (source="p!x" | href="http://...")
                     [select="..."] />
</step>

This is similar to what we've talked about before, except that 
source->href and ref->source.

So how do we do the in/out and the out/in for the containers? We have 
a combination of <port> and <input>:

<iface name="x" arity="..."
        (source="p!x" | href="http://...")
        [select="..."] />

<oface name="y" arity="..."
        (@source | @href), @select? />

Richard: What about pipelines?

Henry: Pipelines are like components, in that they have some named ports 
at the top and the bottom. But we can't call them inputs and outputs.

The value of the source attribute must always be either the name of a 
component, '!', and the name of a port on that component (as in "p!x"), 
or the name of a port on oface.
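
A minimal sketch of reading such a source value, assuming the 
"component!port" form used in the examples above (source="p!x") and 
assuming that a value without '!' names a port on the enclosing 
container. The function name is hypothetical.

```python
def parse_source(value):
    """Split a source reference into (component, port)."""
    component, sep, port = value.partition("!")
    if not sep:
        # No '!': treat the whole value as a port name on the container.
        return (None, value)
    return (component, port)


print(parse_source("p!x"))
print(parse_source("y"))
```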

Richard: I think pipelines have all of these things. You need to say 
what inputs they have, just like for component definitions. And you need 
to define the inputs for within the pipeline, and you need to bind an 
input for use within the pipeline.

General agreement.

Richard: An input for a pipeline doesn't have a source.

Henry: It *could*.

Richard: But it doesn't *need* it. <iface> got its source from <input>.

Jeni makes the point that the out-facing ports may have different names 
from the in-facing ports within the container.

Henry combines them by making them siblings and writes up:

<iface|oface>
   <input @name, (@source | @href), @select? />
   <port @name, @arity />
</iface|oface>

Henry: I'd like to digest this for a while before we discuss names.

Richard objects to the naming of one thing <port> and another thing 
<input> since an input is a port.

We decide to think on the naming for a while.

---

Core components
---------------

Norm: We've talked about various components like XInclude, validate, 
XSLT. What are the others?

List:

XInclude
XSLT[1|2]
validate*
xquery
load
save
identity
httprequest
aggregate
disaggregate
subsequence
escape (string to XML for RSS)
unescape (XML to string for RSS)
XPath[1|2]filter
wrap
wrap-sequence
insert (attributes|elements|change values)
ns-rename
delete (subtrees|attributes)
rename (attributes|elements)
strip whitespace
absolutize (absolutize selected URIs)
prettyprint
exec
os-access (get directory/environment variable etc)
sort (sorts elements)
regex (destructures a string)
bitbucket/sink
doc-replace (replaces an input with another one)
diff
c14n
encrypt
decrypt
sign
verify
label (adds IDs to all elements)
line number
push-tag (wrap selected elements with a wrapper)
soap-exchange
SPARQL
manifest/packaging
render XSL-FO/SVG/MathML
tagsoup
wikify
sgml-in
schema-check
apply (pipeline)
grddl (returns RDF from XML document)
STX (streaming transformation)
NVDL (namespace validation)
uptranslate
downtranslate
forward-chain-RDF
replicate
load-escaping-entity-references
save-disable-output-escaping

(During the course of generating the list)

Henry: I need two versions of load/save/identity, for different arities.

Richard: We've agreed that a sequence of one document is acceptable to a 
port with an arity of 1.

Henry: I think we should declare arities and enforce them statically. 
But if we're not doing that, then we don't need two versions of 
load/save/identity.
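
If arities are checked when a step runs rather than statically, the 
check might look like the sketch below. The minutes only show 
arity="1"; treating any other arity value as unconstrained is an 
assumption made for this illustration, as is the function name.

```python
def check_arity(port_name, arity, documents):
    """Check a sequence of documents against a port's declared arity."""
    if arity == "1" and len(documents) != 1:
        raise ValueError(
            f"port {port_name!r} expects exactly one document, "
            f"got {len(documents)}"
        )
    # Assumption: any other arity value places no limit on the sequence.
    return documents


# A sequence of one document is acceptable to a port with an arity of 1.
print(check_arity("doc", "1", ["<doc/>"]))
```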

Jeni: We need, for example, load as well as a href/source attribute, to 
allow the URI to be, for example, passed in as a parameter.

...

We have agreement that xml:base processing happens automatically, but we 
have to talk about what happens in terms of the base URI of outputs.

We also need to talk about security at some point.

...

Alex: We should have modules of components that vendors may implement.

(general agreement)

...

Murray: What about entities?

Henry explains a case where the entities were escaped on load and 
unescaped on save. We need to talk about character encodings in the 
pipeline: we need to provide a way of preserving a character encoding 
through the components.

Murray: I use entities for reuse: I don't want them expanded.

Norm: You have to use XInclude or another mechanism.

Richard: Nothing else in the XML stack does this.

Henry: I want a load-while-escaping-entities step.

Richard: We could have a component that turns the DTD into an XML 
document that can be passed through to a later component, that can then 
reconstruct the DTD for the entities.

We want to come back to preserving entities.

...

Henry: I'd like to talk about built-in parameters which have information 
from the XML declaration.

Richard: Encoding and version are in the Infoset already.

...

Alex: I want to have some general declarations on serialization parameters.

Henry: We should put those on the output port declaration, to give hints 
to the implementation.

...

What about core components? If no one objects, they're included...

XInclude
XSLT 1.0
validate
identity
aggregate

Alex objects to load because he wants httprequest.

Murray objects to all the rest.

We decide to take a different tack.

BREAK FOR LUNCH

Received on Friday, 4 August 2006 17:49:25 UTC