Re: [ACTION 189] Split special requirements into several data categories

From:   Arle Lommel <>
To:     Multilingual Web LT Public List 
Date:   17/08/2012 10:42
Subject:        Re: [ACTION 189] Split special requirements into several data categories

A few notes apropos of Shaun's message on this:

Display Size:
I agree with Shaun on the displaySize comment. I think the most common 
usage scenario for something like this is a case where you have a 
fixed-width display (e.g., a calculator-style LCD display) or limited 
space, and you need to convey that the *entire* string must fit within 
some specified constraints.

There are probably instances where word length matters as well, e.g., you 
have a fixed-width display that scrolls vertically. In this case you might 
know that the display is 13 characters wide and five lines tall, so you 
don't want words over 13 characters.
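
The two readings can be sketched roughly as follows. This is only an
illustration: the function names are made up, length is counted in code
points rather than display columns, and "word" is naively taken to mean a
whitespace-delimited token, which will not hold for every language.

```python
def fits_total(text: str, max_chars: int) -> bool:
    """Whole-string reading: the entire string must fit in max_chars."""
    return len(text) <= max_chars


def fits_per_word(text: str, max_word_chars: int) -> bool:
    """Per-word reading: no whitespace-delimited word may exceed the width.

    Assumption: words are separated by whitespace, which is not true
    for all writing systems.
    """
    return all(len(word) <= max_word_chars for word in text.split())


if __name__ == "__main__":
    print(fits_total("TOTAL 12.50", 13))
    print(fits_per_word("Donaudampfschifffahrt ahoy", 13))
```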

I suspect the first case is the more common one, however.  <pr>I 

Forbidden Characters:
Did we decide to go with just a list, or with the limited reg-ex that I 
had proposed and Yves simplified? It seemed that there was support for the 
latter, but I could be wrong. At the very least it seems Shaun and Yves 
want this, so I think we should make a decision before moving this into 
the spec. (Note that if we go that route, it would invalidate the current 
examples because they are using the comma as a separator) <pr>+1 for the 
regex (though I have no plans to implement)</pr>
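
To make the regex option concrete, here is a rough sketch of how a checker
might apply such a value. It assumes the value is the body of a
character class (ranges plus literal characters); the exact syntax is still
undecided, so the sample values and the function name are hypothetical. Note
that a comma inside the value is a literal character here, not a separator.

```python
import re
from typing import List


def find_forbidden(text: str, char_class: str) -> List[str]:
    """Return every forbidden character found in text, in order.

    Assumption: char_class is the body of a regex character class,
    i.e. what would go between the square brackets.
    """
    pattern = re.compile("[" + char_class + "]")
    return pattern.findall(text)


if __name__ == "__main__":
    # Hypothetical value: C0 control characters plus "<" and ">".
    print(find_forbidden("a<b>\tc", r"\u0000-\u001F<>"))
    # Hypothetical value: comma and ampersand as literal characters.
    print(find_forbidden("a,b", r",&"))
```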

Storage Size:
This could be either characters or bytes, so it isn't inherently 
byte-based. Many databases have object types that have character 
limitations in length, so this proposal seems right to me if characters 
are what we want. At the same time, bytes matter a lot in other 
applications, so it's not clear which is more important.

If it is characters we want, we need not only to specify the encoding but 
also the normalization form (normally you would use Form C, but there might be 
exceptions) since  is one character and  (A + combining ) is two. If 
strings do not share the same normalization, unexpected results may arise. 
So if we do want characters here, then storageEncoding would need a way to 
indicate the normalization. Of course if we're using bytes, we don't need 
this (and I don't think we particularly need to account for non-octet uses 
of byte).
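
A small sketch of the normalization point, using U+00C5 and A + U+030A
(combining ring above) purely as an illustrative pair. The helper name and
its parameters are assumptions, not anything from the proposal.

```python
import unicodedata

COMPOSED = "\u00C5"      # LATIN CAPITAL LETTER A WITH RING ABOVE
DECOMPOSED = "A\u030A"   # A + COMBINING RING ABOVE


def storage_counts(text, encoding, form=None):
    """Return (character count, byte count) for text.

    If form is given (e.g. "NFC"), the text is normalized first.
    Characters are counted as code points.
    """
    if form is not None:
        text = unicodedata.normalize(form, text)
    return len(text), len(text.encode(encoding))


if __name__ == "__main__":
    print(storage_counts(COMPOSED, "utf-8"))
    print(storage_counts(DECOMPOSED, "utf-8"))
    print(storage_counts(DECOMPOSED, "utf-8", form="NFC"))
```

Without a stated normalization form the two spellings give different
character counts for what a user sees as the same text. The byte counts in
this example differ too (2 vs. 3 in UTF-8), which may be worth keeping in
mind for the byte-based reading.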

There are some other potential issues I could think of with either bytes 
or characters in an XML environment: the counts you get for both would 
depend in part on interaction with preserveSpace: if preserveSpace is set 
to "default" then all strings of whitespace should be collapsed and 
counted as one space, but if preserveSpace is set to "yes" then the 
whitespace needs to be passed on and counted as bytes/characters. We 
probably need to note that explicitly because implementers of StorageSize 
will need to be aware of this interaction. (And, actually, the same issue 
applies to displaySize if it refers to total length and not word length.)
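
Roughly, the interaction might look like this. The sketch assumes the
"default"/"yes" values as used above, treats XML whitespace as space, tab,
CR and LF, and uses a made-up function name; it is not a statement of what
the spec should require.

```python
import re

# XML whitespace characters: space, tab, carriage return, line feed.
XML_WS_RUN = re.compile(r"[ \t\r\n]+")


def counted_length(text, preserve_space, encoding=None):
    """Length of text as it would be counted against a size limit.

    preserve_space "default": each run of whitespace counts as one space.
    preserve_space "yes": whitespace is counted exactly as given.
    With encoding=None the result is in characters, otherwise in bytes.
    """
    if preserve_space == "default":
        text = XML_WS_RUN.sub(" ", text)
    elif preserve_space != "yes":
        raise ValueError("unexpected preserveSpace value: %r" % preserve_space)
    return len(text) if encoding is None else len(text.encode(encoding))


if __name__ == "__main__":
    sample = "a  \n b"
    print(counted_length(sample, "default"))
    print(counted_length(sample, "yes"))
```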

Finally, if a forbiddenCharacters value included any whitespace in it 
(e.g., forbidding any new line characters) then implementers need to be 
aware that if preserveSpace is set to "yes" there might be some 
interaction issues as well if whitespace characters are introduced in the 
content and considered significant.

Based on these uncertainties, I'll hold off implementing these in the spec 
until we have clarity. (I think that was the expectation anyway.)



On Aug 15, 2012, at 18:54 , Shaun McCance <> wrote:

On Mon, 2012-08-13 at 09:30 +0000, Michael Kruppa wrote:
Dear all,

please find attached a first draft of the data categories:

Display Size:
I think I missed the discussion on this. Is it really intended that
it limits the number of characters for each word, and not for the
whole string? Is word a well-defined concept in East Asian languages?

What's an actual use-case for this? The proposal says it can be used
to "limit the maximum number of characters to be used for each word",
which just restates the description. I'm not clear what real-world
things would require you to do word-per-word character limits.

Forbidden Characters:
"list of pointers to unicode code points identifying chars which
 may not be used"
The word "pointers" threw me. I looked and looked for a term for the
U+ lexical representation and couldn't find one. Perhaps "list of
Unicode code points using the U+ representation"?

At any rate, I do think that *at least* some sort of basic ranges
are going to be necessary. Ideally we should enable well-defined
character classes so people don't have to reinvent them.

Storage Size:
Storage seems like an inherently byte-based thing. If it's giving
a maximum number of characters, I don't see why you would specify
the storage encoding. Of course, XML files could be reserialized
using any character encoding. I assume the point of this category
is to say "This data will be pushed to another medium using this
character encoding, and when stored in that encoding, this is the
maximum number of bytes". Is that correct?



Received on Friday, 17 August 2012 10:03:24 UTC