- From: Paul Prescod <paul@prescod.net>
- Date: Tue, 11 May 1999 14:32:56 -0500
- To: XML Dev <xml-dev@ic.ac.uk>
- CC: www-xml-schema-comments@w3.org
John Cowan wrote:
>
> Can you sketch an algorithm that will convert SGML-style (or &-less
> SGML-style) content models involving #PCDATA into content models
> involving #PCDATA and #WS, where #WS is a data type that matches
> only white space, such that random white space around tags will be properly
> accounted for?

Thanks for asking. I don't think that you would convert the content models. You leave the content models alone and just change your matching algorithm slightly.

#PCDATA is a token that matches any character data. Given A,#PCDATA,B, #PCDATA matches the longest stretch of character data between A and B. #WS matches a stretch of whitespace.

When you are parsing, you always try to match (all) characters against #PCDATA. If that fails AND the characters are whitespace then you ignore or suppress them. If it fails but the characters are NOT whitespace then of course you have a validity error.

Token                 Text     Result
#PCDATA               "abc"    "abc"
#PCDATA               " "      " "
#PCDATA not allowed   "abc"    ERROR
#PCDATA not allowed   " "      "ignorable:[ ]"

---

The only danger is if you put datatype nodes beside each other or datatype nodes beside PCDATA. Then you could have problems with ambiguity in the formal-grammar sense of the word (which IS a real problem). We could handle this either by disallowing content models that allow datatypes to be adjacent, or by requiring schema processors to detect and report a possible ambiguity based on the actual definitions of the datatypes.

--
 Paul Prescod - ISOGEN Consulting Engineer speaking for only himself
 http://itrc.uwaterloo.ca/~papresco

Diplomatic term: "Emerging Markets"
Translation: Poor countries. The great euphemism of the Asian financial meltdown. Investors got much more excited when they thought they could invest in up-and-comers than when they heard they could invest in the Third World. (Brill's Content, Apr. 1999)
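The matching rule described in the message (try #PCDATA first; if that fails, suppress the run when it is all whitespace, otherwise report a validity error) could be sketched roughly as follows. This is only an illustration under assumptions: the names match_character_data, ValidityError and XML_WHITESPACE are hypothetical, and "whitespace" is taken to mean the four XML 1.0 whitespace characters. It does not address the adjacent-datatype ambiguity discussed above.

```python
# Hypothetical sketch of the matching rule; none of these names come
# from the original message.

# Assumption: "whitespace" means the XML 1.0 whitespace characters.
XML_WHITESPACE = " \t\r\n"


class ValidityError(Exception):
    """Raised when non-whitespace text appears where #PCDATA is not allowed."""


def match_character_data(pcdata_allowed: bool, text: str):
    """Classify one run of character data at the current content-model position.

    pcdata_allowed says whether the content model accepts #PCDATA here.
    Returns ("data", text) or ("ignorable", text), or raises ValidityError.
    """
    # Always try to match the characters against #PCDATA first.
    if pcdata_allowed:
        return ("data", text)
    # The match failed: whitespace-only runs are ignored or suppressed.
    if all(ch in XML_WHITESPACE for ch in text):
        return ("ignorable", text)
    # The match failed and the run contains real text.
    raise ValidityError(f"character data not allowed here: {text!r}")


if __name__ == "__main__":
    # The four rows of the table in the message.
    for allowed, text in [(True, "abc"), (True, " "), (False, "abc"), (False, " ")]:
        try:
            print(allowed, repr(text), match_character_data(allowed, text))
        except ValidityError as err:
            print(allowed, repr(text), "ERROR:", err)
```

The content model itself is never rewritten here; only the decision made at each run of character data changes, which is the point of the approach.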
Received on Tuesday, 11 May 1999 15:53:15 UTC