
suggestion to add func:normalize-unicode to DTB ([Fwd: Re: UCS vs Unicode])

From: Axel Polleres <axel.polleres@deri.org>
Date: Mon, 01 Sep 2008 06:59:04 +0100
Message-ID: <48BB84A8.4090804@deri.org>
To: "Public-Rif-Wg (E-mail)" <public-rif-wg@w3.org>

In the rdf:text task force, we were having some discussions on Unicode 
issues. It seems to make sense to include

http://www.w3.org/TR/xpath-functions/#func-normalize-unicode

for Unicode normalization among the DTB functions.
Any objections to that?
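For illustration, the effect of fn:normalize-unicode can be sketched with Python's stdlib `unicodedata` module, which implements the same Unicode normalization forms (the variable names are just for the example):

```python
import unicodedata

# "René" written two ways: precomposed é (U+00E9) vs. e + COMBINING ACUTE (U+0301)
precomposed = "Ren\u00e9"
decomposed = "Rene\u0301"

# Compared code point by code point, the two strings differ.
print(precomposed == decomposed)   # False

# fn:normalize-unicode defaults to form NFC; after NFC both spellings
# normalize to the same precomposed sequence.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == precomposed)          # True
```

This is exactly the equivalence question raised in the forwarded thread below.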

Axel

-------- Original Message --------
Subject: 	Re: UCS vs Unicode
Date: 	Sat, 30 Aug 2008 23:58:22 +0100
From: 	Felix Sasaki <fsasaki@w3.org>
To: 	Boris Motik <boris.motik@comlab.ox.ac.uk>
CC: 	Richard Ishida <ishida@w3.org>, <baojie@cs.rpi.edu>, W3C Member
Archive <w3c-archive@w3.org>, <axel@polleres.net>, Ivan Herman <ivan@w3.org>


Hello Boris,

Boris Motik wrote:
  > Hello,
  >
  > Thanks for this in-depth explanation -- I really appreciate it. Just
to see whether I understood things: is it the case that the
  > main difference between UCS and Unicode is that the latter provides
the semantics to the characters, rules for string equivalence,
  > normal forms, etc.? Thus, UCS is the part of Unicode that contains
pretty much a big table of characters (which are synonymous to
  > code points)? Assuming that the answer to this question is "yes",
we can call UCS strings "semantic-less".
  >
  > The XML Schema Datatypes specification uses UCS strings (i.e., it
defines strings as finite sequences of code points without any
  > mention of normal forms or identity rules); thus, strings in XML
Schema are equally "semantic-less". This has certain consequences
  > for OWL, which I'll illustrate on an example.
  >
  >
  > Example 1. In OWL, you can express key constraints -- that is, you
can say that each person can have at most one name. You do this
  > by introducing an axiom of this form:
  >
  > (1) FunctionalProperty( name )
  >
  > Assume now that you want to say that the object identified with ID1
has the name "René". Furthermore, assume that you've actually
  > added two statements for the name of ID1: once you use the é
character, and once you use the semantically equivalent sequence e´.
  > Thus, your OWL ontology contains the following statements:
  >
  > (2) PropertyAssertion( name ID1 "René" )
  > (3) PropertyAssertion( name ID1 "Rene´" )
  >
  > The question is now whether the set of axioms (1)+(2)+(3) is inconsistent. The
answer to this question depends on whether the strings "René"
  > and "Rene´" are identical.
  >

These strings are not identical in XML Schema
http://www.w3.org/TR/xmlschema-2/#string


  > Now if I understood things correctly, Unicode provides semantics to
characters such that one could treat these two strings as
  > identical.
Take a look at http://www.w3.org/TR/charmod-norm/#sec-GeneralExamples :
"The string suc¸on (U+0073 U+0075 U+0063 U+0327 U+006F U+006E), where
U+0327 is the COMBINING CEDILLA, encoded in a Unicode encoding form, is
not Unicode-normalized (since the combining sequence 'c¸' (U+0063
U+0327) should appear instead as the precomposed 'ç' (U+00E7))."
It depends on what you are looking at in Unicode.
So it depends on whether you apply normalization. XML Schema does not
require it. You should do the same.

  > In UCS, however, these two strings are not identical, because they
consist of different characters (i.e., code points).
  > Right?
  >
  > If the answer to the above question is "yes", then it seems to me
that these two strings are not identical in XML Schema either:

Correct.

  > as
  > already mentioned, XML Schema seems to use "semantic-less" strings.
But then, by adopting the XML Schema path in OWL 2, we'd also
  > have "semantic-less" strings, so "René" and "Rene´" are distinct
objects and (1)+(2)+(3) would be inconsistent.
  >
  >
  >
  > Example 2. In OWL, you can say, for example, that the name of some
person can have at most 4 characters. (This is a silly example,
  > but I want to use "René" again for illustration purposes. You can get
a more realistic example by constraining the number of
  > characters in, say, a UK post code.) You'd do this using the
following axiom:
  >
  > (4) PropertyRange( name DatatypeRestriction( xsd:string maxLength 4 ) )
  >
  > Clearly, (4)+(2) is consistent: the string "René" contains four
characters. The question is, however, what happens when you consider
  > (4)+(3).
  >
  > If we take the XML Schema path and treat strings as "semantic-less",
then the string "Rene´" has five characters, so (4)+(3) is
  > unsatisfiable in OWL 2.
  >

I think you need a normalization function to make explicit when
something is normalized or not. See e.g.
(1) http://www.w3.org/TR/xpath-functions/#func-matches
a matching function independent of collations or normalization, and
(2) http://www.w3.org/TR/xpath-functions/#func-normalize-unicode
You'd apply either (1) alone, or (2) followed by (1), depending on your needs.
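A sketch of the two options Felix describes, using Python's `re` and `unicodedata` as stand-ins for fn:matches and fn:normalize-unicode (the `matches` helper is hypothetical, just for the example):

```python
import re
import unicodedata

def matches(s: str, pattern: str) -> bool:
    # Stand-in for fn:matches: a plain code-point-level regular-expression
    # match, with no implicit normalization or collation.
    return re.search(pattern, s) is not None

decomposed = "Rene\u0301"   # "Rene" + COMBINING ACUTE ACCENT

# Option (1) alone: the pattern expects a four-code-point string, but the
# decomposed spelling has five code points, so the match fails.
print(matches(decomposed, "^Ren.$"))                                # False

# Option (2) followed by (1): normalize first, then match.
print(matches(unicodedata.normalize("NFC", decomposed), "^Ren.$"))  # True
```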

  >
  >
  > My final point is about the infinite supply of characters. I'm by no
means proposing to define our own version of UCS or Unicode.
  > I'm just saying that, in anticipation of future extensions to the set
of code points, we proclaim the supply of UCS characters to be
  > infinite in OWL 2. In typical applications of Unicode and UCS the
total number of characters is not of interest (as in XML Schema),
  > because such applications typically deal with concrete strings which
use concrete characters. In OWL, however, we can draw
  > conclusions about the number of strings of a certain form, which
clearly depends on the total number of characters. Defining the
  > number of characters to be infinite is a simple way of avoiding this
dependence.
  >

I am not sure why you need to draw conclusions about the number of
total characters, but if that's necessary then yes, set it to infinite,
and as Richard pointed out, do not relate it to a specific version of
Unicode.

Felix

  > Regards,
  >
  >       Boris
  >
  >
  >
  >> -----Original Message-----
  >> From: Richard Ishida [mailto:ishida@w3.org]
  >> Sent: 28 August 2008 21:44
  >> To: 'Boris Motik'; 'Felix Sasaki'
  >> Cc: baojie@cs.rpi.edu; 'W3C Member Archive'; axel@polleres.net;
'Ivan Herman'
  >> Subject: RE: UCS vs Unicode
  >>
  >> Hi Boris,
  >>
  >> Sorry for the delayed response.
  >>
  >> Ideally I'd prefer to copy this to the i18n WG for additional
  >> comments/corrections, but I'll give it a go here.  I will, however,
copy in
  >> Felix Sasaki, who has more knowledge of the XML environment than I.
  >>
  >> See comments inline...
  >>
  >> ============
  >> Richard Ishida
  >> Internationalization Lead
  >> W3C (World Wide Web Consortium)
  >>
  >> http://www.w3.org/International/
  >> http://rishida.net/
  >>
  >>
  >>
  >>
  >>> -----Original Message-----
  >>> From: Boris Motik [mailto:boris.motik@comlab.ox.ac.uk]
  >>> Sent: 27 August 2008 09:44
  >>> To: 'Ivan Herman'
  >>> Cc: baojie@cs.rpi.edu; 'Richard Ishida'; 'W3C Member Archive';
  >>>
  >> axel@polleres.net
  >>
  >>> Subject: RE: UCS vs Unicode
  >>>
  >>> Hello,
  >>>
  >>> It is highly likely that I made an error in the distinction 
between UCS
  >>>
  >> and Unicode:
  >>
  >>> I don't claim to understand either of the specs
  >>> in sufficient detail. Richard's help will therefore be quite
appreciated
  >>>
  >> on this point.
  >>
  >>> Here is why I raised this issue. Someone told me that, in Unicode,
there
  >>>
  >> is no
  >>
  >>> intrinsic notion of a character: what you may want to
  >>> consider a character might depend on the way your string is
written. This
  >>>
  >> person
  >>
  >>> gave me an example: he said that you can represent
  >>> a particular accented letter either directly using one code point, or
  >>>
  >> indirectly,
  >>
  >>> using a sequence of code points of the form "basic
  >>> char + accent 1 + accent 2". In both cases, you are representing
one and
  >>>
  >> the same
  >>
  >>> thing; however, you are using a different number
  >>> of code points.
  >>>
  >> Yes, there are canonically equivalent sequences in Unicode.  For
example, e
  >> acute can be written using a single character, or a combination of e
plus
  >> acute.  This usually reflects the need for round-trip convertibility
between
  >> different encodings that map to Unicode.  Note, however, that most
scripts
  >> are more complicated than the Latin script, and there may be several
  >> characters involved in composing what appears to be a single form in
the
  >> visual rendering, e.g. for Indic scripts.  Nevertheless, a character is
still a
  >> character. e, the acute accent and e-acute are all characters in
their own
  >> right. So you represent é using one character or two characters: 
not one
  >> character or one-sort-of-compound-character.
  >>
  >> Importantly, note that ISO 10646 has *all the same codepoints* as
Unicode,
  >> ie. it has the three characters described above also.  So you don't 
gain
  >> anything from this perspective in choosing one rather than the other.
  >>
  >> Unicode characters come with semantics attached.  Some of this semantic
  >> information indicates that e-acute is equivalent to e followed by a
  >> combining acute accent.  This becomes important when you want to 
compare
  >> strings for equivalence. See the "Character decomposition mapping"
entry at
  >> http://rishida.net/scripts/uniview/?char=00E9.  The standard also
provides,
  >> as you say, Unicode normalization forms that use these semantics to
allow an
  >> implementation to normalize data in one of 4 ways before running
  >> comparisons.
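The four normalization forms Richard mentions can be compared directly (a Python sketch via the stdlib `unicodedata`; the sample string is illustrative):

```python
import unicodedata

# e + combining acute, followed by the compatibility ligature "fi" (U+FB01)
sample = "e\u0301\ufb01"

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    cps = " ".join(f"U+{ord(c):04X}" for c in unicodedata.normalize(form, sample))
    print(form, cps)

# NFC composes é to U+00E9 but keeps the compatibility ligature;
# NFD leaves é decomposed and keeps the ligature;
# NFKC/NFKD additionally replace the ligature with the plain letters "fi".
```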
  >>
  >> The characters in the Unicode 'repertoire' (ie. character set) are
mapped
  >> onto bytes in the data stream via one of three possible character
encodings:
  >> UTF-8, UTF-16, or UTF-32.  UTF-8 uses one to four bytes to represent a
  >> single character, depending on how close that character is to the
beginning
  >> of the repertoire. For more information see:
  >> http://www.w3.org/TR/charmod/#sec-Digital
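The variable byte lengths can be checked directly (a Python sketch; `str.encode` performs the UTF-8 encoding):

```python
# UTF-8 uses more bytes the further a character sits from the start of
# the repertoire: 1 byte for ASCII, up to 4 for supplementary planes.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}: {len(ch.encode('utf-8'))} byte(s)")
```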
  >>
  >>
  >>
  >>> It is therefore not clear whether these two strings are equivalent
  >>> or what their length is. He mentioned that
  >>> Unicode uses a notion of normal forms for this purpose.
Furthermore, this
  >>>
  >> person
  >>
  >>> suggested dropping the length facet from xsd:string
  >>> because it is unclear what the length of a string actually is
  >>>
  >> I'm not totally clear here about the length facet, but if you are
counting
  >> bytes, that's definitely not a good thing.  If you are counting
characters,
  >> it's not clear to me why you'd want to restrict the length of the 
string
  >> anyway.
  >>
  >>
  >>> Clearly, we need to settle this issue in OWL 2. I believe that a
typical
  >>>
  >> OWL 2 user
  >>
  >>> will not be able to understand full details of
  >>> writing i18n software (I certainly don't). Furthermore, notions 
such as
  >>>
  >> string length
  >>
  >>> are so ubiquitous in computer science that
  >>> explaining to people why they can't have the length facet on 
xsd:string
  >>>
  >> might be
  >>
  >>> more difficult than providing *any* (even if not a
  >>> perfect) solution.
  >>>
  >> Does this architectural spec from the W3C help ?
  >> http://www.w3.org/TR/charmod/#sec-Indexing
  >>
  >>
  >>
  >>> XML Schema seems to solve these problems without too much 
complication.
  >>> Here is how the XML Schema Datatypes 1.1 specification
  >>> defines strings:
  >>>
  >>> [[The value space of string is the set of finite-length sequences
of zero
  >>>
  >> or more
  >>
  >>> characters (as defined in [XML]) that match the
  >>> Char production from [XML]. A character is an atomic unit of
  >>>
  >> communication; it is
  >>
  >>> not further specified except to note that every
  >>> character has a corresponding Universal Character Set (UCS) code 
point,
  >>>
  >> which is
  >>
  >>> an integer.]]
  >>>
  >>> This seems simple enough: a string is a sequence of characters,
which are
  >>> identified with the notion of code points in UCS. It is
  >>> now obvious what the length of a string is and how to apply a regular
  >>>
  >> expression to
  >>
  >>> a string. The specification seems simple, and it
  >>> does not confuse me with technicalities such as normal forms or the
  >>>
  >> inability to
  >>
  >>> sensibly define what the length of a string is.
  >>>
  >>>
  >>> I thought we could use exactly the same formulation in rdf:text and
thus
  >>>
  >> simply
  >>
  >>> be on the safe side. If people are able to implement
  >>> XML Schema Datatypes, they should then be able to implement OWL 2 in
  >>> exactly the same way.
  >>>
  >> I would have thought so.  And not only XML Schema, but almost 
everything
  >> these days is based on Unicode.
  >>
  >>
  >>
  >>> Any help by an expert on this matter will be greatly appreciated.
  >>>
  >> Felix may be able to offer more advice specific to OWL.
  >>
  >>
  >>
  >>>
  >>>
  >>> Let's assume that we can agree on a basic definition of what a
character
  >>>
  >> is. Even
  >>
  >>> then, there is another problem with rdf:text that
  >>> is, I believe, peculiar to OWL. The main question is how many (UCS)
  >>>
  >> characters are
  >>
  >>> there, as the semantics of OWL ontologies
  >>> critically depends on this number. Consider the following example.
  >>>
  >>> ClassMember(
  >>>   MinCardinality( n P
  >>>     DatatypeRestriction( xsd:string length 1 )
  >>>   )
  >>>   a
  >>> )
  >>>
  >>> If the number n is smaller than the number of characters, then the
  >>>
  >> ontology is
  >>
  >>> satisfiable; if it is larger, then the ontology is
  >>> unsatisfiable. Note that we are not talking here about particular
strings
  >>>
  >> (which is
  >>
  >>> the case in XML Schema); rather, we are
  >>> quantifying over the set of all strings of length one.
  >>>
  >>> Now the problem is two-fold.
  >>>
  >>> - How many characters are there? Requiring people to go into a
particular
  >>>
  >> version
  >>
  >>> of the existing specs and determine the number by
  >>> themselves would put way too much burden on implementors and is
bound to
  >>> result in incompatibilities.
  >>>
  >>> - What happens if the character set is extended (which is what happens
  >>>
  >> quite
  >>
  >>> often)? Then, the ontology might be unsatisfiable
  >>> before the extension, but after the extension it may become
satisfiable.
  >>>
  >> Clearly,
  >>
  >>> this is an undesirable situation.
  >>>
  >>>
  >>> To avoid such problems, I thought that, for the purposes of OWL 2, we
  >>>
  >> might
  >>
  >>> simply make the set of characters to be infinite. Note
  >>> that, at any given point, we'd have ways of writing only a finite
number
  >>>
  >> of
  >>
  >>> characters; this, however, is irrelevant for
  >>> satisfiability questions such as the one above.
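The counting argument behind Boris's example can be made explicit (an illustrative Python sketch; `satisfiable` and the repertoire sizes are hypothetical, not part of any spec):

```python
def satisfiable(n: int, repertoire_size: int) -> bool:
    # MinCardinality(n P DatatypeRestriction(xsd:string length 1)) demands
    # n distinct one-character strings, which exist iff n <= |repertoire|.
    return n <= repertoire_size

print(satisfiable(100, 2**20))     # True: plenty of characters
print(satisfiable(2**21, 2**20))   # False: more strings demanded than exist
# With an infinite repertoire, the question is satisfiable for every n,
# independent of the Unicode version.
```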
  >>>
  >> I can't say I understand the expression above, but I strongly
recommend that
  >> you avoid tying yourself to one version of Unicode.  This has been 
a big
  >> problem for XML and IDNA recently, and you don't want to go there.
Unicode
  >> will indeed continue adding characters for some time to come.
  >>
  >> I hope that's been of some help.  Let's regroup at this point and
see what
  >> else we need to look at.
  >>
  >> RI
  >>
  >>
  >>
  >>
  >>> Regards,
  >>>
  >>>     Boris
  >>>
  >>>
  >>>> -----Original Message-----
  >>>> From: Ivan Herman [mailto:ivan@w3.org]
  >>>> Sent: 26 August 2008 16:16
  >>>> To: Boris Motik
  >>>> Cc: baojie@cs.rpi.edu; Richard Ishida; W3C Member Archive;
  >>>>
  >> axel@polleres.net
  >>
  >>>> Subject: UCS vs Unicode
  >>>>
  >>>> Boris,
  >>>>
  >>>> I hope it is all right if I take this separately and not on the
list; it
  >>>> looks to me fairly superfluous to involve the whole group here...
  >>>>
  >>>> The issue came up on the last call, you said:
  >>>>
  >>>> [[[
  >>>> Boris Motik: The problem is that in unicode, there are several 
ways to
  >>>> represent certain characters, eg accented characters. That's why they
  >>>> have Normal Forms, etc maximally composed, maximally decomposed, etc.
  >>>> ]]]
  >>>>
  >>>> Hence you proposed to use UCS instead of Unicode for [1].
  >>>>
  >>>> I then got confused because a bunch of references told me that 
UCS and
  >>>> Unicode are identical.  I decided to ask Richard Ishida to help 
us out
  >>>> (cc-d on this mail). He is the Internationalization activity lead
at W3C
  >>>> (and Unicode guru:-). His reaction is that your information is
actually
  >>>> not true... Ie, Unicode and UCS are completely interchangeable at 
that
  >>>> level. So we would both be interested to understand what exactly the
  >>>> problem is and what you refer to. Could you clarify?
  >>>>
  >>>> (Although referring to UCS is no problem, see below, but we should
have
  >>>> a common understanding on what is happening...)
  >>>>
  >>>> While looking around for this, I also found some good references
for Jie
  >>>> and Axel (Richard, they are the editors of the document we are 
talking
about [1]) that might be useful for the final document. Indeed, XML 1.1
  >>>> refers to UCS, and uses the following formulation:
  >>>>
  >>>> [[[
  >>>> [Definition: A parsed entity contains text, a sequence of characters,
  >>>> which may represent markup or character data.] [Definition: A
character
  >>>> is an atomic unit of text as specified by ISO/IEC 10646 [ISO/IEC
10646].
  >>>> Legal characters are tab, carriage return, line feed, and the legal
  >>>> characters of Unicode and ISO/IEC 10646. The versions of these
standards
  >>>> cited in A.1 Normative References were current at the time this
document
  >>>> was prepared. New characters may be added to these standards by
  >>>> amendments or new editions. Consequently, XML processors MUST
accept any
  >>>> character in the range specified for Char.]
  >>>> ]]] -> http://www.w3.org/TR/2006/REC-xml11-20060816/#charsets
  >>>>
  >>>> Referring to the future possibilities is probably wise (there were
quite
  >>>> a few issues in XML land around XML 1.1, because XML 1.0 restricted
  >>>> itself to Unicode version 2.0 only). Ie, something similar should be
  >>>> considered for that, too.
  >>>>
  >>>> The other important reference is [2] which says, essentially, that
even
  >>>> if we decide to use UCS, we should also refer to Unicode. 
Something to
  >>>> take into account...
  >>>>
  >>>> Thanks
  >>>>
  >>>> Ivan
  >>>>
  >>>>
  >>>> [1] http://www.w3.org/2007/OWL/wiki/InternationalizedStringSpec
  >>>> [2] http://www.w3.org/TR/charmod/#sec-RefUnicode
  >>>>
  >>>> --
  >>>>
  >>>> Ivan Herman, W3C Semantic Web Activity Lead
  >>>> Home: http://www.w3.org/People/Ivan/
  >>>> PGP Key: http://www.ivan-herman.net/pgpkey.html
  >>>> FOAF: http://www.ivan-herman.net/foaf.rdf
  >>>>
  >
  >
  >


-- 
Dr. Axel Polleres, Digital Enterprise Research Institute (DERI)
email: axel.polleres@deri.org  url: http://www.polleres.net/

Everything is possible:
rdfs:subClassOf rdfs:subPropertyOf rdfs:Resource.
rdfs:subClassOf rdfs:subPropertyOf rdfs:subPropertyOf.
rdf:type rdfs:subPropertyOf rdfs:subClassOf.
rdfs:subClassOf rdf:type owl:SymmetricProperty.
Received on Monday, 1 September 2008 05:59:47 GMT
