Re: Composing language descriptions: tree automata and language design? from MURATA Makoto on 1998-03-02 (www-html@w3.org from March 1998)

From: MURATA Makoto <murata@apsdc.ksp.fujixerox.co.jp>
Date: Sun, 1 Mar 1998 21:04:44 -0500 (EST)
To: Paul Prescod <papresco@technologist.com>
Cc: Dan Connolly <connolly@w3.org>, fork@xent.ics.uci.edu, www-html@w3.org
Message-Id: <199803020205.AA00358@murata.apsdc.ksp.fujixerox.co.jp>

In message "Re: Composing language descriptions: tree automata and language design?",
	Paul Prescod wrote...
> 
> And I also happen to think that tree automata is really important. But I 
> must admit that I don't see a big connection there yet. 

As for the namespace issue, forest automatons are not directly related.  Forest 
regular expressions, which are equivalent to forest automatons, are much more 
closely related.  In particular, horizontal concatenation is
immediately applicable to a schema language for XML.

> > Am I the only person who's been studying language design for
> > the last few years who hasn't studied tree autonoma?
> 
> No, I first heard of it at SGML/XML 97. My interest now is in arbitrary 
> transformations of DTDs and instances as is demonstrated in the paper. I 
> hope Murata-san can provice examples of more advanced transformations. 
> The paper only has renaming-in-context, which is not too difficult 
> without tree automata.

In the case of n-ary tree automatons, we can use any tree homomorphism.  
Not only renaming but a variety of rewriting (node-deletion, subtree-deletion, 
node-insertion, subtree-insertion, reordering, etc.) is possible.  For 
the definition of linear tree homomorphisms, see the text book by Gecsec 
and Steinby.  (But this book is not easy to get, I am afraid.)

@book{gs:Automata, 
author =	"F. G\'{e}cseg and M. Steinby",
 title =	"Tree Automata", 
publisher =	"Akad\'{e}miai Kiad\'{o}, Budapest",
 year=		"1984",}

In the case of forest automatons, there is a subtle issue:  we do not know 
how many subordinates a node has and how these subordinates are related in 
advance.  For example, given a sequence <sec> <h1/> <p/> <h1/> <p/> <p/> </sec>, 
I would like to create <sec> <subsec><head/><p/> </subsec> <subsec> <head/> 
<p/> </p> </subsec> </sec>.  My solution is to use an unambiguous regular 
expression (e.g., (h1 p*)* ) to capture underlying structures for such sequences.  

My upcoming paper for PODDP(Principles of Digital Document Processing) '98 
provides an algebra and a logic-based language that are extensions of 
the RDB and datalog.  This paper also sketches an operator for such rewriting.  
I should disclose this paper to the public soon.

In message "Composing language descriptions: tree automata and language design?", 
Dan Connolly wrote...
> I'm starting to get a "where have you been all my life?" feeling.

When I said "The use of extended CFG's as document schemas is a bad idea" 
in my talk at PODDP'96, everybody laughed.  After explaining shortcomings 
of the traditional approach and advantages of the forest automaton theory, 
I repeated the same claim.  This time, everybody was silent and serious.

The forest regular language theory was studied in 60's and 70's (away too 
early!).  Unfortunately, it has not been used for real applications, although 
it is very powerful.  Kil-Ho Shin (one of my colleagues at Fuji Xerox) 
rediscovered the forest regular language theory when he was studying structured 
documents.  Since then, I have firmly believed that the forest automaton 
theory deserves much more attention from document processing communities.  
I am happy to learn that more and more people are interested.

Cheers,

Makoto
 
Fuji Xerox Information Systems
 
Tel: +81-44-812-7230   Fax: +81-44-812-7231
E-mail: murata@apsdc.ksp.fujixerox.co.jp

Received on Monday, 2 March 1998 03:12:22 UTC