Re: Concrete syntax, character sets



Although I have a fairly limited grasp of ISO 10646 / Unicode
Version 2, I do have experience encoding TEI documents in a variety
of unusual languages, and as such I strongly support Gavin's ideas.
I believe that, unless a strong case can be made about the difficulty
of implementation, standardizing on the ISO 10646 character
repertoire, and not on a particular encoding of that repertoire, is
the best choice.

I agree with Tim Bray that the XML parser should not be responsible
for encoding / decoding a particular character representation, and
as such I do not want UTF-8 to be specified as its internal format.
As a matter of practicality, I realize that UCS-4 is not something
that one would want to send across the "wire", and in that sense
specifying UTF-8 as the initial encoding for transmission is fine.
(Although the Reuters encoding scheme that I just read about in the
Unicode conference proceedings is very intriguing.)  I also agree
with Tim that a library of encoding conversion routines that function
outside of the parser would be very useful in a reference
implementation.
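
To make the separation concrete, here is a rough sketch (in Python,
purely for illustration) of conversion routines that sit outside the
parser.  The function names are my own invention and not part of any
proposal; the only assumption is that the parser consumes ISO 10646
code points and never sees the bytes that came over the wire.

```python
from typing import List


def decode_to_code_points(data: bytes, encoding: str = "utf-8") -> List[int]:
    """Conversion layer: turn wire bytes into ISO 10646 code points."""
    return [ord(ch) for ch in data.decode(encoding)]


def encode_from_code_points(code_points: List[int], encoding: str = "utf-8") -> bytes:
    """Conversion layer: turn code points back into a wire encoding."""
    return "".join(chr(cp) for cp in code_points).encode(encoding)


def count_start_tags(code_points: List[int]) -> int:
    """Toy stand-in for a parser: it works on code points only.

    Counts every '<' that is followed by something other than '/'.
    """
    lt, slash = ord("<"), ord("/")
    return sum(
        1
        for i, cp in enumerate(code_points)
        if cp == lt and i + 1 < len(code_points) and code_points[i + 1] != slash
    )
```

With this split, the same document arriving as UTF-8 or as Latin-1
yields the same code point sequence, so the parser stand-in behaves
identically for both; only the conversion layer knows the encoding.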

Let me also weigh in on the side of having all markup and content in
the same character set, including GIs and attributes.

Finally, I am a little concerned about the partitioning of
functionality between XML and SGML.  In particular, if XML becomes
as successful as I hope, there may be strong reasons (e.g. product
availability / price) for not using SGML and making do with XML.  XML
therefore cannot rely too heavily on the availability of its more
functional parent to mitigate its own limitations.


B. Todd Bauman
Graduate Student
University of Maryland, Baltimore County