Re: [ACTION 189] Split special requirements into several data categories

A few notes apropos of Shaun's message on this:

Display Size:
I agree with Shaun on the displaySize comment. I think the most common usage scenario for something like this is a case where you have a fixed-width display (e.g., a calculator-style LCD display) or limited space, and you need to convey that the *entire* string must fit within some specified constraints.

There are probably instances where word length matters as well, e.g., you have a fixed-width display that scrolls vertically. In this case you might know that the display is 13 characters wide and five lines tall, so you don't want words over 13 characters.

I suspect the first case is the more common one, however.
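
To make the two readings concrete, here is a minimal sketch of both interpretations, assuming a hypothetical 13-character-wide, 5-line display (function names are illustrative, not from the draft):

```python
import textwrap

def fits_whole_string(text, width, lines):
    """Whole-string interpretation: the text, wrapped at word
    boundaries, must fit within width x lines."""
    return len(textwrap.wrap(text, width)) <= lines

def words_fit(text, width):
    """Per-word interpretation: no single word may exceed the width."""
    return all(len(word) <= width for word in text.split())

msg = "Warning: fuel level critically low"
print(fits_whole_string(msg, 13, 5))   # True (wraps to 4 lines)
print(words_fit(msg, 13))              # True (longest word is 10 chars)
```

Note that the per-word check can pass while the whole-string check fails, and vice versa, which is why we should be explicit about which one displaySize means.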

Forbidden Characters:
Did we decide to go with just a list, or with the limited reg-ex that I had proposed and Yves simplified? It seemed that there was support for the latter, but I could be wrong. At the very least it seems Shaun and Yves want this, so I think we should make a decision before moving this into the spec. (Note that if we go that route, it would invalidate the current examples, because they use the comma as a separator.)
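
For illustration, the two candidate representations might be checked like this (the attribute values and names here are hypothetical, not from the draft):

```python
import re

# Plain-list form: discrete code points in U+ notation.
forbidden_list = ["U+002C", "U+000A"]       # comma, newline

# Limited character-class form: supports ranges, as Shaun suggests.
forbidden_class = r"[,\n\u0080-\u00FF]"

def violates_list(text, points):
    """Check a string against a list of U+XXXX code points."""
    banned = {chr(int(p[2:], 16)) for p in points}
    return any(ch in banned for ch in text)

def violates_class(text, char_class):
    """Check a string against a regex-style character class."""
    return re.search(char_class, text) is not None

print(violates_list("a,b", forbidden_list))       # True
print(violates_class("touché", forbidden_class))  # True: é is U+00E9
```

The class form subsumes the list form, and it is the only one of the two that handles ranges without enumerating every code point.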

Storage Size:
This could be either characters or bytes, so it isn't inherently byte-based. Many databases have object types whose lengths are limited in characters, so this proposal seems right to me if characters are what we want. At the same time, bytes matter a lot in other applications, so it's not clear which is more important.

If it is characters we want, we need to specify not only the encoding but also the normalization form (normally you would use Form C, but there might be exceptions), since Ä is one character while Ä (A + combining ¨) is two. If strings do not share the same normalization form, unexpected results may arise. So if we do want characters here, then storageEncoding would need a way to indicate the normalization. Of course, if we're using bytes, we don't need this (and I don't think we particularly need to account for non-octet uses of "byte").
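
A quick demonstration of the problem, using the Ä example:

```python
import unicodedata

precomposed = "\u00C4"    # Ä as a single precomposed code point (NFC)
decomposed  = "A\u0308"   # A + combining diaeresis (NFD): same visible string

print(len(precomposed), len(decomposed))              # 1 2
print(len(unicodedata.normalize("NFC", decomposed)))  # 1
print(len(unicodedata.normalize("NFD", precomposed))) # 2

# Byte counts also differ, both between the two forms and by encoding,
# which is why the encoding (and, for characters, the normalization
# form) would need to be stated:
print(len(precomposed.encode("utf-8")))   # 2 bytes
print(len(decomposed.encode("utf-8")))    # 3 bytes
```

So a "maximum of N characters" is simply not well-defined until the normalization form is pinned down.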

There are some other potential issues I can think of with either bytes or characters in an XML environment: the counts you get for both depend in part on the interaction with preserveSpace. If preserveSpace is set to "default", then all runs of whitespace should be collapsed and counted as a single space, but if preserveSpace is set to "yes", then the whitespace must be passed on and counted as bytes/characters. We probably need to note that explicitly, because implementers of StorageSize will need to be aware of this interaction. (And, actually, the same issue applies to displaySize if it refers to total length rather than word length.)
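
A minimal sketch of how the flag changes the count, assuming the XML whitespace handling described above (the function name is illustrative):

```python
import re

def counted_length(text, preserve_space):
    """Character count under the two preserveSpace settings:
    collapse whitespace runs to a single space when not preserved."""
    if not preserve_space:
        text = re.sub(r"\s+", " ", text).strip()
    return len(text)

s = "Hello \n  world"
print(counted_length(s, preserve_space=True))   # 14
print(counted_length(s, preserve_space=False))  # 11
```

A three-character difference on a fourteen-character string is easily enough to push content over (or under) a limit, so the interaction is worth calling out in the spec.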

Finally, if a forbiddenCharacters value includes any whitespace (e.g., forbidding newline characters), then implementers need to be aware that, if preserveSpace is set to "yes", there may also be interaction issues when whitespace characters are introduced in the content and considered significant.


Based on these uncertainties, I'll hold off implementing these in the spec until we have clarity. (I think that was the expectation anyway.)

Best,

Arle

On Aug 15, 2012, at 18:54 , Shaun McCance <shaunm@gnome.org> wrote:

> On Mon, 2012-08-13 at 09:30 +0000, Michael Kruppa wrote:
>> Dear all,
>> 
>> 
>> 
>> please find attached a first draft of the data categories:
> 
> Display Size:
> I think I missed the discussion on this. Is it really intended that
> it limits the number of characters for each word, and not for the
> whole string? Is word a well-defined concept in east Asian languages?
> 
> What's an actual use-case for this? The proposal says it can be used
> to "limit the maximum number of characters to be used for each word",
> which just restates the description. I'm not clear what real-world
> things would require you to do word-per-word character limits.
> 
> Forbidden Characters:
> "list of pointers to unicode code points identifying chars which
>  may not be used"
> The word "pointers" threw me. I looked and looked for a term for the
> U+ lexical representation and couldn't find one. Perhaps "list of
> Unicode code points using the U+ representation"?
> 
> At any rate, I do think that *at least* some sort of basic ranges
> are going to be necessary. Ideally we should enable well-defined
> character classes so people don't have to reinvent them.
> 
> Storage Size:
> Storage seems like an inherently byte-based thing. If it's giving
> a maximum number of characters, I don't see why you would specify
> the storage encoding. Of course, XML files could be reserialized
> using any character encoding. I assume the point of this category
> is to say "This data will be pushed to another medium using this
> character encoding, and when stored in that encoding, this is the
> maximum number of bytes". Is that correct?
> 
> --
> Shaun
> 

Received on Friday, 17 August 2012 09:42:05 UTC