Re: [ACTION 189] Split special requirements into several data categories from Felix Sasaki on 2012-08-17 (public-multilingualweb-lt@w3.org from August 2012)

From: Felix Sasaki <fsasaki@w3.org>
Date: Fri, 17 Aug 2012 12:22:11 +0200
To: Arle Lommel <arle.lommel@dfki.de>, Michael Kruppa <Michael.Kruppa@cocomore.com>
Cc: Multilingual Web LT Public List <public-multilingualweb-lt@w3.org>
Message-ID: <CAL58czrmidC1x2Ti4UwZEfb_hguGjk49TXtL=ML4jYteAvJRdw@mail.gmail.com>

Hi Arle, all and FYI Michael,

2012/8/17 Arle Lommel <arle.lommel@dfki.de>

> A few notes apropos of Shaun's message on this:
>
> *Display Size:*
> I agree with Shaun on the displaySize comment. I think the most common
> usage scenario for something like this is a case where you have a
> fixed-width display (e.g., a calculator-style LCD display) or limited
> space, and you need to convey that the *entire* string must fit within some
> specified constraints.
>
> There are probably instances where word length matters as well, e.g., you
> have a fixed-width display that scrolls vertically. In this case you might
> know that the display is 13 characters wide and five lines tall, so you
> don't want words over 13 characters.
>
> I suspect the first case is the more common one, however.
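
To make the two readings concrete, here is a minimal Python sketch of both checks. The function names are made up for illustration and are not part of any draft. It counts code points with len(), which is only an approximation of display columns, and it splits words on whitespace, which does not work for languages written without spaces.

```python
from typing import List


def fits_display(text: str, max_chars: int) -> bool:
    """Whole-string reading: the entire string must fit within max_chars."""
    return len(text) <= max_chars


def words_too_long(text: str, max_word_chars: int) -> List[str]:
    """Per-word reading: return the whitespace-separated words that are too long."""
    return [word for word in text.split() if len(word) > max_word_chars]
```

For the 13-character display mentioned above, words_too_long(text, 13) would list the words a translator has to shorten or hyphenate.
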
>
> *Forbidden Characters*
> Did we decide to go with just a list, or with the limited reg-ex that I
> had proposed and Yves simplified? It seemed that there was support for the
> latter, but I could be wrong. At the very least it seems Shaun and Yves
> want this, so I think we should make a decision before moving this into the
> spec. (Note that if we go that route, it would invalidate the current
> examples because they are using the comma as a separator)
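
As a rough illustration of the two options, the Python sketch below checks a string against a comma-separated list of U+ code points and, alternatively, against a character class. Both value syntaxes and all function names are assumptions for the sake of the example; nothing here has been decided for the spec.

```python
import re
from typing import List, Set


def parse_code_points(value: str) -> Set[str]:
    """Turn a comma-separated list like 'U+002A, U+002B' into a set of characters."""
    chars = set()
    for item in value.split(","):
        item = item.strip()
        if item:
            # Drop the 'U+' prefix and read the rest as a hexadecimal code point.
            chars.add(chr(int(item[2:], 16)))
    return chars


def violations_from_list(text: str, value: str) -> List[str]:
    """List option: report the forbidden characters that occur in text."""
    return sorted(set(text) & parse_code_points(value))


def violations_from_class(text: str, char_class: str) -> List[str]:
    """Reg-ex option: char_class is the body of a character class, e.g. '0-9'.

    The caller has to escape ']' and a leading '^' itself in this sketch.
    """
    pattern = re.compile("[" + char_class + "]")
    return sorted(set(pattern.findall(text)))
```

With the character class, a range such as 0-9 replaces ten list entries, and the comma becomes an ordinary character instead of a separator.
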
>
> *Storage Size*
> This could be either characters or bytes, so it isn't inherently
> byte-based. Many databases have object types that have character
> limitations in length, so this proposal seems right to me if characters are
> what we want. At the same time, bytes matter a lot in other applications,
> so it's not clear which is more important.
>
> If it is characters we want, we need not only to specify the encoding but
> also the normalization form (normally you would use Form C, but there might be
> exceptions) since Ä is one character and Ä (A + combining ¨) is two. If
> strings do not share the same normalization, unexpected results may arise.
> So if we do want characters here, then storageEncoding would need a way to
> indicate the normalization. Of course if we're using bytes, we don't need
> this (and I don't think we particularly need to account for non-octet uses
> of byte).
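
The normalization point can be checked with Python's standard unicodedata module. In the sketch below, storage_length and its parameters are hypothetical names, not proposed attributes. One thing it shows is that the UTF-8 byte count also differs between the two forms, so normalization may be worth pinning down for bytes as well.

```python
import unicodedata


def storage_length(text: str, encoding: str = "utf-8",
                   form: str = "NFC", unit: str = "bytes") -> int:
    """Length of text after normalizing to `form`, in characters or in bytes."""
    normalized = unicodedata.normalize(form, text)
    if unit == "chars":
        return len(normalized)  # number of code points
    return len(normalized.encode(encoding))  # number of bytes in that encoding


composed = "\u00C4"     # Ä as a single code point
decomposed = "A\u0308"  # A followed by the combining diaeresis

print(storage_length(composed, form="NFC", unit="chars"))    # 1
print(storage_length(decomposed, form="NFD", unit="chars"))  # 2
print(storage_length(composed, "utf-8", form="NFC"))         # 2
print(storage_length(composed, "utf-8", form="NFD"))         # 3
```
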
>
> There are some other potential issues I could think of with either bytes
> or characters in an XML environment: the counts you get for both would
> depend in part on interaction with preserveSpace: if preserveSpace is set
> to "default" then all strings of whitespace should be collapsed and counted
> as one space, but if preserveSpace is set to "yes" then the whitespace
> needs to be passed on and counted as bytes/characters. We probably need to
> note that explicitly because implementers of StorageSize will need to be
> aware of this interaction. (And, actually, the same issue applies to
> displaySize if it refers to total length and not word length.)
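
A small Python sketch of that interaction, assuming that "default" means every run of XML whitespace (space, tab, CR, LF) collapses to a single space before counting. The function name is hypothetical, and whether leading and trailing whitespace should also be trimmed is left open here.

```python
import re


def countable_text(text: str, preserve_space: str) -> str:
    """Return the text as it would be counted for a size constraint."""
    if preserve_space == "yes":
        return text  # whitespace is significant and counted as-is
    # Assumed "default" behaviour: collapse each whitespace run to one space.
    return re.sub(r"[ \t\r\n]+", " ", text)


sample = "a  \n b"
print(len(countable_text(sample, "default")))  # 3
print(len(countable_text(sample, "yes")))      # 6
```
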
>
> Finally, if a forbiddenCharacters value included any whitespace in it
> (e.g., forbidding any new line characters) then implementers need to be
> aware that if preserveSpace is set to "yes" there might be some interaction
> issues as well if whitespace characters are introduced in the content and
> considered significant.
>
>
> Based on these uncertainties, I'll hold off implementing these in the spec
> until we have clarity. (I think that was the expectation anyway.)
>


Correct, and I think Michael is going to make a new draft based on the
discussions in this thread anyway - Michael?

Best,

Felix


>
> Best,
>
> Arle
>
>
> On Aug 15, 2012, at 18:54 , Shaun McCance <shaunm@gnome.org> wrote:
>
> On Mon, 2012-08-13 at 09:30 +0000, Michael Kruppa wrote:
>
> Dear all,
>
>
>
> please find attached a first draft of the data categories:
>
>
> Display Size:
> I think I missed the discussion on this. Is it really intended that
> it limits the number of characters for each word, and not for the
> whole string? Is word a well-defined concept in East Asian languages?
>
> What's an actual use-case for this? The proposal says it can be used
> to "limit the maximum number of characters to be used for each word",
> which just restates the description. I'm not clear what real-world
> things would require you to do word-per-word character limits.
>
> Forbidden Characters:
> "list of pointers to unicode code points identifying chars which
>  may not be used"
> The word "pointers" threw me. I looked and looked for a term for the
> U+ lexical representation and couldn't find one. Perhaps "list of
> Unicode code points using the U+ representation"?
>
> At any rate, I do think that *at least* some sort of basic ranges
> are going to be necessary. Ideally we should enable well-defined
> character classes so people don't have to reinvent them.
>
> Storage Size:
> Storage seems like an inherently byte-based thing. If it's giving
> a maximum number of characters, I don't see why you would specify
> the storage encoding. Of course, XML files could be reserialized
> using any character encoding. I assume the point of this category
> is to say "This data will be pushed to another medium using this
> character encoding, and when stored in that encoding, this is the
> maximum number of bytes". Is that correct?
>
> --
> Shaun
>


-- 
Felix Sasaki
DFKI / W3C Fellow
Received on Friday, 17 August 2012 10:37:43 UTC