Experimental MicroXML lexer for Python3

Folks,

It has been a maniacal couple of weeks, with a major project deadline last
week leading into a long trip to San Jose this week.  I've seen a flurry of
activity in MicroXML, which is great.  Here's my little bit.  On the plane
back home I got most of the way through a lexer for MicroXML, written for
the PLY parser generator on Python 3 (it requires Python 3.3, currently in
its third release candidate, for fixes to Unicode handling).

https://github.com/uogbuji/amara3/tree/master/lib/uxml

A brief example:

$ cat test1.uxml
<a b="1&amp;2">3<!--4-->5<b>spam</b></a>

$ python3 ~/dev/amara3/lib/uxml/lex.py "`cat test1.uxml`"
LexToken(STARTTAG_LEAD,'<a ',1,0)
LexToken(NAME,'b',1,3)
LexToken(EQ,'=',1,4)
LexToken(DBL_QUOTE,'"',1,5)
LexToken(CHARDATA,'1',1,6)
LexToken(AMP_ENT,'&amp;',1,7)
LexToken(CHARDATA,'2',1,12)
LexToken(DBL_QUOTE,'"',1,13)
LexToken(GT,'>',1,14)
LexToken(CHARDATA,'3',1,15)
LexToken(COMMENT,'<!--4-->',1,16)
LexToken(CHARDATA,'5',1,24)
LexToken(STARTTAG_LEAD,'<b',1,25)
LexToken(GT,'>',1,27)
LexToken(CHARDATA,'spam',1,28)
LexToken(ENDTAG,'</b>',1,32)
LexToken(ENDTAG,'</a>',1,36)

I already have a simple parser that wraps the lexer and completes the
picture, and I should have that checked in as well, but I figure the lexer
alone might be useful to others, especially the set of token regexes worked
up from the spec.
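
For anyone who wants to see the general shape of a regex-driven lexer
without installing PLY, here is a minimal standard-library sketch.  To be
clear, this is not the code in the repository and it does not use PLY.  The
token names are taken from the output above, but the regexes, the three
state names, the Token tuple and the lex() function are my own simplified
assumptions (ASCII-only names, only the &amp; entity, no single-quoted
attributes, no empty-element tags), not the patterns worked up from the spec.

```python
import re
from collections import namedtuple

# Hypothetical token record mirroring PLY's (type, value, lineno, lexpos).
Token = namedtuple("Token", "type value lineno lexpos")

# Simplified, ASCII-only name pattern (an assumption, not the spec's rule).
NAME = r"[A-Za-z_][\w.\-]*"

# Each state maps to ordered (token type, regex, next state) rules.
# A next state of None means "stay in the current state".
RULES = {
    "content": [
        ("COMMENT", r"<!--.*?-->", None),
        ("ENDTAG", r"</" + NAME + r"\s*>", None),
        ("STARTTAG_LEAD", r"<" + NAME + r"\s*", "tag"),
        ("AMP_ENT", r"&amp;", None),
        ("CHARDATA", r"[^<&]+", None),
    ],
    "tag": [
        ("WS", r"\s+", None),          # skipped, never emitted
        ("NAME", NAME, None),
        ("EQ", r"=", None),
        ("DBL_QUOTE", r'"', "attr"),
        ("GT", r">", "content"),
    ],
    "attr": [
        ("DBL_QUOTE", r'"', "tag"),
        ("AMP_ENT", r"&amp;", None),
        ("CHARDATA", r'[^"&<]+', None),
    ],
}

COMPILED = {
    state: [(tok, re.compile(pat, re.DOTALL), nxt) for tok, pat, nxt in rules]
    for state, rules in RULES.items()
}


def lex(text):
    """Yield Token tuples for text, raising SyntaxError on unmatched input."""
    state, pos, lineno = "content", 0, 1
    while pos < len(text):
        # Try the current state's rules in order; the first match wins.
        for tok_type, regex, next_state in COMPILED[state]:
            match = regex.match(text, pos)
            if match:
                break
        else:
            raise SyntaxError(
                "unexpected %r at offset %d (line %d, state %s)"
                % (text[pos], pos, lineno, state)
            )
        value = match.group()
        if tok_type != "WS":
            yield Token(tok_type, value, lineno, pos)
        lineno += value.count("\n")
        pos = match.end()
        if next_state is not None:
            state = next_state


if __name__ == "__main__":
    for token in lex('<a b="1&amp;2">3<!--4-->5<b>spam</b></a>'):
        print(token)
```

Run on the contents of test1.uxml, this sketch yields the same token types,
values and offsets as the listing above.  The separate "tag" and "attr"
states are what let the same characters lex differently inside and outside
a start tag.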

A couple of notes:

 * It's definitely experimental and there are a couple of bugs I'm aware
of.  We should start putting together a test suite we can all use to bring
up compliance across the various implementations.
 * Error messages are rather imprecise, as is not unusual for regex-based
lexers.
 * Performance is likely to be so-so for large input.  I hope to switch to
a DFA-based lexer soon to address this.
 * PLY is here: http://www.dabeaz.com/ply/

-- 
Uche Ogbuji                       http://uche.ogbuji.net
Founding Partner, Zepheira        http://zepheira.com
http://wearekin.org
http://www.thenervousbreakdown.com/author/uogbuji/
http://copia.ogbuji.net
http://www.linkedin.com/in/ucheogbuji
http://twitter.com/uogbuji

Received on Sunday, 30 September 2012 06:00:15 UTC