- From: Felix Sasaki <fsasaki@w3.org>
- Date: Fri, 17 Aug 2012 12:22:11 +0200
- To: Arle Lommel <arle.lommel@dfki.de>, Michael Kruppa <Michael.Kruppa@cocomore.com>
- Cc: Multilingual Web LT Public List <public-multilingualweb-lt@w3.org>
- Message-ID: <CAL58czrmidC1x2Ti4UwZEfb_hguGjk49TXtL=ML4jYteAvJRdw@mail.gmail.com>
Hi Arle, all and FYI Michael,

2012/8/17 Arle Lommel <arle.lommel@dfki.de>

> A few notes apropos of Shaun's message on this:
>
> *Display Size:*
> I agree with Shaun on the displaySize comment. I think the most common
> usage scenario for something like this is a case where you have a
> fixed-width display (e.g., a calculator-style LCD display) or limited
> space, and you need to convey that the *entire* string must fit within
> some specified constraints.
>
> There are probably instances where word length matters as well, e.g.,
> you have a fixed-width display that scrolls vertically. In this case you
> might know that the display is 13 characters wide and five lines tall,
> so you don't want words over 13 characters.
>
> I suspect the first case is the more common one, however.
>
> *Forbidden Characters*
> Did we decide to go with just a list, or with the limited reg-ex that I
> had proposed and Yves simplified? It seemed that there was support for
> the latter, but I could be wrong. At the very least it seems Shaun and
> Yves want this, so I think we should make a decision before moving this
> into the spec. (Note that if we go that route, it would invalidate the
> current examples because they are using the comma as a separator.)
>
> *Storage Size*
> This could be either characters or bytes, so it isn't inherently
> byte-based. Many databases have object types whose length is limited in
> characters, so this proposal seems right to me if characters are what we
> want. At the same time, bytes matter a lot in other applications, so
> it's not clear which is more important.
>
> If it is characters we want, we need to specify not only the encoding
> but also the normalization form (normally you would use Form C, but
> there might be exceptions), since Ä (U+00C4) is one character and Ä
> (A + combining ¨, U+0041 U+0308) is two. If strings do not share the
> same normalization, unexpected results may arise. So if we do want
> characters here, then storageEncoding would need a way to indicate the
> normalization. Of course if we're using bytes, we don't need this (and I
> don't think we particularly need to account for non-octet uses of byte).
>
> There are some other potential issues I could think of with either bytes
> or characters in an XML environment: the counts you get for both would
> depend in part on interaction with preserveSpace. If preserveSpace is
> set to "default" then all strings of whitespace should be collapsed and
> counted as one space, but if preserveSpace is set to "yes" then the
> whitespace needs to be passed on and counted as bytes/characters. We
> probably need to note that explicitly because implementers of
> StorageSize will need to be aware of this interaction. (And, actually,
> the same issue applies to displaySize if it refers to total length and
> not word length.)
>
> Finally, if a forbiddenCharacters value included any whitespace in it
> (e.g., forbidding any new line characters) then implementers need to be
> aware that if preserveSpace is set to "yes" there might be some
> interaction issues as well if whitespace characters are introduced in
> the content and considered significant.
>
> Based on these uncertainties, I'll hold off implementing these in the
> spec until we have clarity. (I think that was the expectation anyway.)

Correct, and I think Michael is going to make a new draft based on the
discussions in this thread anyway - Michael?

Best,

Felix

>
> Best,
>
> Arle
>
>
> On Aug 15, 2012, at 18:54, Shaun McCance <shaunm@gnome.org> wrote:
>
> On Mon, 2012-08-13 at 09:30 +0000, Michael Kruppa wrote:
> > Dear all,
> >
> > please find attached a first draft of the data categories:
>
> Display Size:
> I think I missed the discussion on this. Is it really intended that
> it limits the number of characters for each word, and not for the
> whole string? Is word a well-defined concept in East Asian languages?
>
> What's an actual use-case for this? The proposal says it can be used
> to "limit the maximum number of characters to be used for each word",
> which just restates the description. I'm not clear what real-world
> things would require you to do word-per-word character limits.
>
> Forbidden Characters:
> "list of pointers to unicode code points identifying chars which
> may not be used"
> The word "pointers" threw me. I looked and looked for a term for the
> U+ lexical representation and couldn't find one. Perhaps "list of
> Unicode code points using the U+ representation"?
>
> At any rate, I do think that *at least* some sort of basic ranges
> are going to be necessary. Ideally we should enable well-defined
> character classes so people don't have to reinvent them.
>
> Storage Size:
> Storage seems like an inherently byte-based thing. If it's giving
> a maximum number of characters, I don't see why you would specify
> the storage encoding. Of course, XML files could be reserialized
> using any character encoding. I assume the point of this category
> is to say "This data will be pushed to another medium using this
> character encoding, and when stored in that encoding, this is the
> maximum number of bytes". Is that correct?
>
> --
> Shaun

--
Felix Sasaki
DFKI / W3C Fellow
Received on Friday, 17 August 2012 10:37:43 UTC
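
Editorial illustration of the Storage Size points in the message above: a
minimal sketch, not part of the proposal, of how the count for one string
changes with the normalization form, the encoding, and whitespace handling.
The function names storage_counts and collapse_xml_whitespace are
hypothetical, and treating preserveSpace="default" as "collapse each
whitespace run to one space" is an assumption taken from Arle's description,
not from any spec text.

```python
import re
import unicodedata


def storage_counts(text: str, encoding: str = "utf-8", form: str = "NFC") -> tuple:
    """Return (code point count, byte count) after normalizing to `form`.

    Hypothetical helper: shows why a character limit needs a stated
    normalization form, while a byte limit needs a stated encoding.
    """
    normalized = unicodedata.normalize(form, text)
    return len(normalized), len(normalized.encode(encoding))


def collapse_xml_whitespace(text: str) -> str:
    """Collapse each run of XML whitespace (space, tab, CR, LF) to one space.

    Assumption: this stands in for the preserveSpace="default" behaviour
    described in the thread.
    """
    return re.sub(r"[ \t\r\n]+", " ", text)


precomposed = "\u00C4"   # Ä as a single code point
decomposed = "A\u0308"   # A followed by a combining diaeresis

print(storage_counts(precomposed, form="NFC"))  # (1, 2)
print(storage_counts(decomposed, form="NFD"))   # (2, 3)
print(storage_counts(decomposed, form="NFC"))   # (1, 2)

spaced = "a  \n\tb"
print(storage_counts(spaced))                           # (6, 6)
print(storage_counts(collapse_xml_whitespace(spaced)))  # (3, 3)
```

The same visible letter counts as one or two characters depending on the
normalization form, and the same string counts as six or three units
depending on whether whitespace is preserved, which is why both settings
would need to be fixed before a limit can be checked consistently.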