Re: ZWJ&XML

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Wed, 13 Sep 2006 11:38:52 +0300 (EEST)
To: Jose <jose_stephen@cdactvm.in>
cc: unicode@unicode.org, www-international@w3.org
Message-ID: <Pine.GSO.4.64.0609131108160.25762@mustatilhi.cs.tut.fi>

On Wed, 13 Sep 2006, Jose wrote:

> Unicode Technical Report #20 (Unicode in XML and other Markup
> Languages), http://www.Unicode.org/Unicode/reports/tr20/, specifies that
> zero-width joiners and non-joiners (ZWJ and ZWNJ) are suitable for use
> within markup.

Yes, for affecting ligature and joining behavior. I mention this because 
there is a popular word processor that uses ZWJ and ZWNJ quite 
inappropriately for line break control.

Of course, the statement is of a general nature: those characters are in
principle suitable for use in marked-up text. It does not guarantee or
prescribe that a particular markup system allows them or that they will be
interpreted according to their Unicode semantics.

> But when an XML file with tags written in Malayalam using ZWJs (in
> Malayalam, ZWJ is used to form certain characters) is processed, an
> error is reported that the tag contained an invalid character.

Reported by which program? I first suspected that you had tried to
enter these characters but that they were not represented correctly in
the declared or implied character encoding.

But reading again, I notice that you are referring to _tags_ and might
actually mean the use of characters in element or attribute names, as
opposed to their use in content between tags. UTR #20 discusses the
latter, i.e. what you can use in document content proper - together with
markup, not _inside_ markup (tags).

The use of characters in element and attribute names is governed by the
rules of each markup language, basically by its _identifier_ syntax.
Generally, and in XML 1.0, control characters are excluded in that syntax,
and ZWJ and ZWNJ are format control characters (General Category: Cf).
Thus, an attempt to use them in element names would violate
well-formedness constraints, and an XML parser would report an error - not
about an invalid character per se but about a syntax error.

In XML 1.1, ZWJ and ZWNJ are allowed in identifiers, but this is probably 
of little practical value.
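
For comparison, the XML 1.1 name syntax can be checked directly against
its NameStartChar and NameChar productions. The sketch below transcribes
those ranges as I read them from the XML 1.1 Recommendation; the function
name is_xml11_name is made up for this example, and it tests only the name
syntax, not whether any given parser supports XML 1.1.

```python
# Code point ranges of the XML 1.1 NameStartChar production.
NAME_START_RANGES = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]

# Additional ranges that NameChar allows after the first character:
# "-", ".", digits, U+00B7, combining marks, U+203F-U+2040.
NAME_EXTRA_RANGES = [
    (0x2D, 0x2E), (0x30, 0x39), (0xB7, 0xB7),
    (0x300, 0x36F), (0x203F, 0x2040),
]

def _in_ranges(ch, ranges):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)

def is_xml11_name(name):
    """Check a string against the XML 1.1 Name production (syntax only)."""
    if not name:
        return False
    if not _in_ranges(name[0], NAME_START_RANGES):
        return False
    return all(
        _in_ranges(ch, NAME_START_RANGES) or _in_ranges(ch, NAME_EXTRA_RANGES)
        for ch in name[1:]
    )

if __name__ == "__main__":
    # Malayalam NA + VIRAMA + ZWJ, used here only as an illustration.
    print(is_xml11_name("\u0d28\u0d4d\u200d"))
```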

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
Received on Wednesday, 13 September 2006 08:38:58 GMT