Re: Mixed content considered harmful...

John Cowan wrote:
> 
> Can you sketch an algorithm that will convert SGML-style (or &-less
> SGML-style) content models involving #PCDATA into content models
> involving #PCDATA and #WS, where #WS is a data type that matches
> only white space, such that random white space around tags will be properly
> accounted for?

Thanks for asking.

I don't think that you would convert the content models. You leave the
content models alone and just change your matching algorithm slightly.

#PCDATA is a token that matches any character data. Given A,#PCDATA,B,
#PCDATA matches the longest stretch of character data between A and B. 
#WS matches a stretch of whitespace.

When you are parsing, you always try to match (all) characters against
#PCDATA. If that fails AND the characters are whitespace then you ignore
or suppress them. If it fails but the characters are NOT whitespace then
of course you have a validity error.

Token                Text     Result

#PCDATA              "abc"    "abc"
#PCDATA              "   "    "   "
#PCDATA not allowed  "abc"    ERROR
#PCDATA not allowed  "   "    "ignorable:[   ]"
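
The rule in the table can be sketched in a few lines of Python. This is
only an illustration of the matching step, not a real validator: the names
classify_text, ValidityError and XML_WS are made up for the example, and
"whitespace" is assumed to mean space, tab, CR and LF.

```python
# Hypothetical sketch of the matching rule; none of these names come from
# an existing parser or library.

XML_WS = " \t\r\n"  # assumption: whitespace means these four characters


class ValidityError(Exception):
    """Raised when non-whitespace text appears where #PCDATA is not allowed."""


def classify_text(text, pcdata_allowed):
    """Classify a run of character data at the current content-model position.

    Returns ("pcdata", text) if the text matches #PCDATA, or
    ("ignorable", text) if #PCDATA is not allowed but the text is
    all whitespace. Raises ValidityError otherwise.
    """
    # Always try to match against #PCDATA first.
    if pcdata_allowed:
        return ("pcdata", text)
    # The match failed: whitespace is suppressed, anything else is an error.
    if all(ch in XML_WS for ch in text):
        return ("ignorable", text)
    raise ValidityError("character data not allowed here: %r" % text)
```

Note that whitespace is kept as data whenever #PCDATA can match it; it only
becomes ignorable after the #PCDATA match has failed.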

---

The only danger arises if you put datatype nodes beside each other, or
datatype nodes beside #PCDATA. Then you could have problems with ambiguity
in the formal-grammar sense of the word (which IS a real problem). We could
handle this either by disallowing content models that allow datatypes to be
adjacent, or by requiring schema processors to detect and report a possible
ambiguity based on the actual definitions of the datatypes.
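
As a rough illustration of the first option, a processor could scan a
content model for adjacent data-matching tokens. The sketch below assumes
the simplest possible case, a flat sequence of token names with no groups,
alternation or occurrence indicators, and the function name and datatype
names are invented for the example. A real check would have to work on the
full content-model automaton.

```python
# Hypothetical sketch: flag adjacent data-matching tokens in a flat
# sequence content model such as ["A", "date", "#PCDATA", "B"].

def adjacent_data_tokens(sequence, datatypes):
    """Return (index, left, right) for each adjacent pair of data-matching
    tokens in which at least one token is a datatype."""
    data_like = set(datatypes) | {"#PCDATA"}
    problems = []
    for i, (left, right) in enumerate(zip(sequence, sequence[1:])):
        both_match_text = left in data_like and right in data_like
        involves_datatype = left in datatypes or right in datatypes
        if both_match_text and involves_datatype:
            problems.append((i, left, right))
    return problems
```

A processor taking the second option would start from the same pairs, but
would then inspect the datatype definitions to decide whether the two
tokens can really compete for the same characters.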

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for only himself
 http://itrc.uwaterloo.ca/~papresco

Diplomatic term: "Emerging Markets"
Translation: Poor countries. The great euphemism of the Asian financial
             meltdown. Investors got much more excited when they thought
             they could invest in up-and-comers than when they heard they
             could invest in the Third World. (Brills Content, Apr. 1999)

Received on Tuesday, 11 May 1999 15:53:15 UTC