RE: Regex comments

> -----Original Message-----
> From:	James Clark []
> Sent:	Tuesday, December 05, 2000 12:24 AM
> To:
> Subject:	Regex comments
> Some comments on Appendix F of Schema Part 2.
> 1. The section seems to be crying out for a formal grammar.
We thought about including a grammar, but decided against it, preferring to
follow the lead of the Perl Camel book, which does not use a grammar to
describe Perl's regexes.  The justification was something along the lines
that the EBNF would get pretty hairy, making the regex language
unintelligible to "normal" schema authors.  Plus, it's hard to describe the
"semantics" of regexes with a grammar.  However, I can understand that a
grammar might be very helpful for implementors of processors.  I'll raise
the question of including a grammar *in addition to* the existing
> 2. The definition of character class escapes should mention "block
> escapes". (It also should say that the "valid character class escapes
> *are* ..." not "include ...".)
Agreed on both points.

> 3. The terminology in the description of category escapes is broken. 
> "Lu", "Ll" etc are not character properties but are possible values of
> the "General Category" property.  It is not satisfactory to say "the
> following table specifies the main character properties".  There needs
> to be a precise statement of exactly what is allowed as a category
> escape.  It seems like what you mean is any two-letter sequence that
> occurs as the value of the General Category property of some character,
> or the first letter of such a two-letter sequence. It would be helpful
> to refer to Section 4.5 of Unicode3.
Agreed, that was sloppy on my part.  I picked up the terminology from the
Perl 5.005_58 documentation (where the \p{} syntax comes from), which says

	\pP		Match P, named property.  Use \p{Prop} for longer
	\PP		Match non-P

I will correct that to be much more specific in the use of the proper
Unicode terminology.  And yes, the intent is that all of the 2 letter values
are acceptable; and again, following Perl 5.6, the 1 letter value is a
shorthand for the character class consisting of the 2 letter values which
share that first letter (e.g., \p{L} == [\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}]).
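
To illustrate that shorthand, here is a minimal sketch in Python.  It is not
part of the spec text.  Python's re module has no \p{} support, so the helper
names below (in_category, in_any_letter_subcategory) are hypothetical and lean
on unicodedata.category, which reports a character's General Category value.

```python
import unicodedata

# The five two-letter General Category values that share the first letter "L".
LETTER_SUBCATEGORIES = ("Lu", "Ll", "Lt", "Lm", "Lo")


def in_category(ch, cat):
    """Hypothetical stand-in for \\p{cat}.

    A two-letter cat must equal the character's General Category value;
    a one-letter cat only has to match the first letter of that value.
    """
    return unicodedata.category(ch).startswith(cat)


def in_any_letter_subcategory(ch):
    """Hypothetical stand-in for [\\p{Lu}\\p{Ll}\\p{Lt}\\p{Lm}\\p{Lo}]."""
    return any(in_category(ch, cat) for cat in LETTER_SUBCATEGORIES)
```

Under these assumptions, in_category(ch, "L") and
in_any_letter_subcategory(ch) should agree for every character.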

> 4. It seems strange to have an escape for name characters but not for
> name start characters (the characters allowed at the beginning of a
> name).  This means I cannot conveniently write a regex that matches XML
> names. (Or cannot I do it with \c
But there *is* an escape for name start characters: \i (i for "initial
character", s was already taken).  True, the description of \i doesn't say
that it matches name start characters, but I will make sure that it does in
the next draft.

Note, this is how the Name datatype is defined:

<simpleType name='Name'>
	<restriction base='token'>
		<pattern value='\i\c*'/>
	</restriction>
</simpleType>

(BTW, my reading of production [84] from XML 1.0 equates "name start
character" with [\p{L}\p{Nl}:_], which is how \i is defined.  Could it be
that that is not the correct translation of name start character and hence,
why you didn't realize that there was such an escape?)
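
For anyone who wants to experiment with the \i\c* pattern outside a schema
processor, here is a rough sketch in Python.  The \i test follows the
[\p{L}\p{Nl}:_] reading given above.  The \c test is only an assumption on my
part (the initial characters plus decimal digits, combining marks, '.' and
'-'); it is an approximation, not the definition from the draft.  The function
names are hypothetical.

```python
import unicodedata


def is_initial(ch):
    """Approximates \\i as [\\p{L}\\p{Nl}:_]."""
    cat = unicodedata.category(ch)
    return ch in ":_" or cat.startswith("L") or cat == "Nl"


def is_name_char(ch):
    """Approximates \\c.  ASSUMPTION: \\i plus digits, marks, '.' and '-'."""
    cat = unicodedata.category(ch)
    return is_initial(ch) or ch in ".-" or cat in ("Nd", "Mn", "Mc", "Me")


def matches_name(s):
    """Checks a whole string against the pattern \\i\\c*."""
    if not s:
        return False
    return is_initial(s[0]) and all(is_name_char(ch) for ch in s[1:])
```

With these definitions, matches_name("xs:element") holds, while
matches_name("1abc") does not, because a digit is not an initial character.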

> 5. It would be helpful to say exactly where the definitive list of block
> names is to be found: in the Blocks.txt file of the Unicode Character
> Database ( The Unicode
> standard itself doesn't quite do it: for example, the chart for 000-007F
> is entitled "C0 Controls and Basic Latin", whereas Blocks.txt calls it
> simply "Basic Latin".
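
A small sketch of how a processor might read block names from Blocks.txt, in
Python.  The "start..end; name" layout and the inline sample are assumptions
based on recent versions of the file; the version under discussion here may be
laid out differently.  parse_blocks and block_of are hypothetical names.

```python
import re

# Hypothetical excerpt, in the layout used by recent Blocks.txt files.
SAMPLE_BLOCKS = """\
# Start Code..End Code; Block Name
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
0100..017F; Latin Extended-A
"""

BLOCK_LINE = re.compile(r"^([0-9A-F]+)\.\.([0-9A-F]+);\s*(.+?)\s*$")


def parse_blocks(text):
    """Returns a list of (start, end, name) tuples, skipping comments."""
    blocks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        match = BLOCK_LINE.match(line)
        if match:
            start = int(match.group(1), 16)
            end = int(match.group(2), 16)
            blocks.append((start, end, match.group(3)))
    return blocks


def block_of(ch, blocks):
    """Returns the name of the block containing ch, or None if not listed."""
    code_point = ord(ch)
    for start, end, name in blocks:
        if start <= code_point <= end:
            return name
    return None
```

For the sample above, block_of("A", parse_blocks(SAMPLE_BLOCKS)) gives the
Blocks.txt name "Basic Latin" rather than the chart title.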

> 6. If I turn the prose description of character class subtraction into a
> grammar I get:
See my reply to your subsequent message on this point.


Received on Tuesday, 5 December 2000 14:31:44 UTC