Re: including a schema with "HTML: The Markup Language" Clarifying TAG Re: Courtesy notification from Maciej Stachowiak on 2010-03-17 (www-tag@w3.org from March 2010)

From: Maciej Stachowiak <mjs@apple.com>
Date: Wed, 17 Mar 2010 04:15:33 -0700
To: Graham Klyne <GK-lists@ninebynine.org>
Cc: Larry Masinter <LMM@acm.org>, 'Dan Connolly' <connolly@w3.org>, "'Michael(tm) Smith'" <mike@w3.org>, noah_mendelsohn@us.ibm.com, 'Paul Cotton' <paul.cotton@microsoft.com>, 'Philippe Le Hegaret' <plh@w3.org>, 'Sam Ruby' <rubys@intertwingly.net>, www-tag@w3.org
Message-id: <14205235-DE72-405F-BCEC-33E4F8706D08@apple.com>

On Mar 17, 2010, at 3:53 AM, Graham Klyne wrote:

> OK, now I understand better where you are coming from.
>
> All of which I guess underscores Larry's point: it's hard (if not  
> generally impossible) to use a grammar/schema/other-formal-description
> to check *all* aspects of program/input correctness, but
> that doesn't take away from the value of using one to validate those  
> aspects that are amenable to such validation.
>
> In my experience, it is often the process of expressing/reviewing a  
> language in some formalism that is of greatest value, for  
> understanding implications of and problems in its design.  I believe  
> Dan Connolly reported some similar experiences w.r.t. XQuery a few  
> years ago (Amsterdam WWW conference, developer day, IIRC).

I think that is probably true if one is truly inventing a syntax. But  
schemas for markup languages generally assume the surface syntax is  
all taken care of and describe how the resulting pieces are allowed to  
be assembled.

>
> <aside>
> (I'm not sure about the HTML lexer, but an XML lexer can't (easily)  
> be described in terms of a finite state machine because of context  
> sensitivity of the tokenization process - something I learned trying  
> to fix up an XML parser written in Haskell, which might in turn be  
> regarded in some ways as being pretty close to a general-purpose,  
> machine processable formal specification language.)
> </aside>

You can check it out yourself if you want:
<http://dev.w3.org/html5/spec/Overview.html#tokenization>

My hypothesis that it's expressible as an FSM is based on the fact
that the specification is written explicitly in terms of input
characters and the resulting state transitions, although I may have
missed instances of reading hidden unbounded state. There is also the
fact that side effects can modify the input stream in the middle of
parsing, but I think the tokenizer in isolation is still an FSM.
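
To make "input characters and resulting state transitions" concrete,
here is a minimal sketch in Python. It is not the HTML5 tokenizer: the
state names, the token tuples, and the tokenize function are all
hypothetical, and attributes, comments, and character references are
left out. The only point is that each step looks at nothing but the
current state and the current character.

```python
from enum import Enum, auto


class State(Enum):
    # Hypothetical, heavily reduced set of states (assumption, not the spec's list).
    DATA = auto()
    TAG_OPEN = auto()
    END_TAG_OPEN = auto()
    START_TAG_NAME = auto()
    END_TAG_NAME = auto()


def tokenize(text):
    """Return (kind, value) tokens; every transition depends only on (state, ch)."""
    state = State.DATA
    tokens = []
    buf = []  # output buffer for the token being built; never consulted for transitions

    def emit(kind):
        if buf:
            tokens.append((kind, "".join(buf)))
            buf.clear()

    for ch in text:
        if state is State.DATA:
            if ch == "<":
                emit("text")
                state = State.TAG_OPEN
            else:
                buf.append(ch)
        elif state is State.TAG_OPEN:
            if ch == "/":
                state = State.END_TAG_OPEN
            elif ch.isalpha():
                buf.append(ch.lower())
                state = State.START_TAG_NAME
            else:
                # Not a tag after all: treat '<' and this character as text.
                buf.extend(["<", ch])
                state = State.DATA
        elif state is State.END_TAG_OPEN:
            if ch.isalpha():
                buf.append(ch.lower())
                state = State.END_TAG_NAME
            else:
                # Not an end tag: treat '</' and this character as text.
                buf.extend(["<", "/", ch])
                state = State.DATA
        elif state is State.START_TAG_NAME:
            if ch == ">":
                emit("start_tag")
                state = State.DATA
            else:
                buf.append(ch.lower())
        elif state is State.END_TAG_NAME:
            if ch == ">":
                emit("end_tag")
                state = State.DATA
            else:
                buf.append(ch.lower())

    # At end of input, pending text is flushed; an unfinished tag is dropped.
    if state is State.DATA:
        emit("text")
    return tokens
```

For example, tokenize("a<b>c</B>") yields a text token, a start tag,
another text token, and an end tag with the name lowercased. Whether the
real tokenizer stays this simple once all of its states are considered
is exactly the open question above.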

Regards,
Maciej

Received on Wednesday, 17 March 2010 11:16:07 UTC