[minutes] Internationalization telecon 2016-08-04 with Social Web WG

https://www.w3.org/2016/08/04-i18n-minutes.html




text version follows:


Internationalization Working Group Teleconference

04 Aug 2016

    [2]Agenda

       [2] 
https://lists.w3.org/Archives/Member/member-i18n-core/2016Aug/0001.html

    See also: [3]IRC log

       [3] http://www.w3.org/2016/08/04-i18n-irc

Attendees

    Present
           Addison, eprodrom, Steven, aaronpk, tantek, JcK,
           jasnell, rhiaro, cwebber2, r12a, Francesco, Sandro

    Regrets
    Chair
           Addison Phillips

    Scribe
           aphillip, rhiaro

Contents

      * [4]Topics
          1. [5]Agenda
          2. [6]Discussion of ActivityStreams with Social WG
          3. [7]AOB?
      * [8]Summary of Action Items
      * [9]Summary of Resolutions
      __________________________________________________________

Agenda

Discussion of ActivityStreams with Social WG

    r12a: One of the most urgent questions at the moment is how to
    go about ensuring that directionality works. I think we should
    not talk about language yet, but focus on text direction. There
    are significant differences between how language and direction
    work

    <aaronpk> +1

    r12a: The key question is do we need a direction property to
    capture the base direction of the text
    ... There are two aspects to specifying the base direction
    ... The overall default base for a paragraph or sequence of
    paragraphs
    ... The other is directional changes inline

    <Francesco> finally I can hear you - sorry

    r12a: They're slightly different. certainly if we had a
    direction property it would only be capable of describing the
    default base direction for the paragraph as a whole and you'd
    still need other mechanisms to indicate inline changes in
    direction
    ... The AS2 spec, and micropub and webmention have the same
    problem
    ... What we say for AS2 is probably relevant for them as well
    ... We'll talk about AS2 to keep it simple
    ... Allows two types of message. One can take html markup and
    one cannot
    ... There is another question, whether anything can be done
    about that
    ... I think we should talk about that after the direction thing
    ... Should we capture the default base direction for the
    paragraph in a separate property, or should we just rely on the
    data in order to obtain that information?
    ... One way that you can do that by relying on the data is by
    testing the first strong character in the text
    ... You miss out any weak or neutral characters until you find
    a strong one, and if it's a rtl character you say the overall
    direction for the paragraph is rtl
    ... That works a lot of the time and if you look at people's
    twitter streams you see that most of the time it works okay,
    because they tend to have just arabic text, or they have arabic
    with some embedded latin, which either is handled specially by
    twitter or is a simple embedding which doesn't produce any
    problems
    ... But there are situations where that first-strong rule can
    be duped
    ... Which is when you need to say explicitly no this is not
    actually a ltr phrase even though it starts with a ltr strong
    character
    ... Twitter actually doesn't use first strong character to
    determine
    ... It looks at the number of characters in one direction and
    the number in another
    ... And the results due to that are unpredictable
    ... But that's the basic idea. If you use the text you can most
    of the time figure out the direction, and the rest you need to
    find a way to indicate it
    ... If you're dealing with markup you can add it there
    ... If you're dealing with name, you can add a lrm or rlm
    character at the beginning of the text
    ... (control characters)
    ... Unfortunately things are not quite so simple as that. The
    main question here is whether there is a value in having a
    separate direction property
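
    A minimal sketch of the two detection strategies mentioned
    above, assuming Python's standard unicodedata module as the
    source of bidi classes. The function names are illustrative,
    and the counting variant only approximates the kind of approach
    attributed to Twitter; it is not Twitter's actual code.

```python
import unicodedata

# Bidi classes this sketch treats as strong (assumption: L is ltr; R and AL are rtl).
STRONG_LTR = {"L"}
STRONG_RTL = {"R", "AL"}


def first_strong_direction(text, default="ltr"):
    """Skip weak and neutral characters; the first strong one decides."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in STRONG_LTR:
            return "ltr"
        if bidi_class in STRONG_RTL:
            return "rtl"
    return default  # no strong character found


def majority_direction(text, default="ltr"):
    """Count strong characters per direction and return the majority."""
    ltr = sum(1 for ch in text if unicodedata.bidirectional(ch) in STRONG_LTR)
    rtl = sum(1 for ch in text if unicodedata.bidirectional(ch) in STRONG_RTL)
    if ltr == rtl:
        return default
    return "ltr" if ltr > rtl else "rtl"
```

    The two functions can disagree on the same string, for example
    a string that opens with a short Latin word followed by longer
    Hebrew text, which is one reason the results are hard to
    predict.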

    <Zakim> sandro, you wanted to ask about why direction first,
    since it seems derived

    sandro: Does direction have to be managed separately from
    language? I would naively assume that if I knew the primary
    language of a text I'd know the primary direction of the text?

    aphillip: direction has a weak relation to the language. And
    language information isn't always available or authoritative

    sandro: The order of solving these is surprising to me. If we
    solve the language problem we solve the direction problem?

    various: no

    <cwebber2> are there any languages that are both rtl and ltr?

    aphillip: Sometimes you can use language information to help
    infer the direction, but the direction you need in order to
    process it for display. It has its own structure and needs to
    be managed in a particular way. Language does have some impact
    on display, but that's generally processes that are done
    separately from the bidi
    ... With language you're only inferring what the direction is
    likely to be for a particular paragraph

    <jasnell> there are fairly successful heuristic approaches to
    guessing the directionality from language, but it's not
    foolproof by any means

    r12a: there are many languages written in both ltr and rtl
    scripts

    sandro: that makes sense

    <aaronpk>
    [10]https://www.w3.org/International/questions/qa-bidi-unicode-
    controls

      [10] 
https://www.w3.org/International/questions/qa-bidi-unicode-controls

    <aaronpk> "how these control characters"

    aaronpk: I was reading the w3 guide on unicode controls and
    from this, unfortunately there's no anchor, but if you search
    for ^ you'll see the paragraph
    ... Specifically about the title attribute in html
    ... There's obviously no mechanism for base direction of an
    attribute in html

    Addison: You can set the base direction. You can't put markup
    inside the title

    aaronpk: The example given here is the example of mixed rtl and
    ltr text where there isn't one being dominant because the text
    is so short
    ... This seems to solve it
    ... I'm wondering why we can't just use this to solve the
    problem everywhere?

    aphillip: If you want proper complete bidi layout then you need
    to do other things in order to make that happen. That can
    include using control characters
    ... A challenge is that in annotation cases you're going to be
    taking text that doesn't necessarily include that, or includes
    markup for directionality, and trying to get things to do the
    right thing

    jasnell: There are a number of considerations. If you have a
    name that's plain text and have these control characters, not
    every implementation is going to understand these control
    characters

    aaronpk: Having an extra property will need to be understood
    too

    jasnell: Control characters can work they just add extra
    complexity

    aaronpk: They have to support control characters anyway to
    support bidi?

    aphillip: They have to support control characters to support
    unicode
    ... But again the question is where you get information in
    order to do an implementation when you're constructing text, we
    find that mostly markup generally works better than the
    invisible controls for helping people with authoring content

    aaronpk: Examples of those?

    aphillip: Some in some articles.. the challenge is the controls
    are invisible
    ... Whereas markup is visible to people trying to get the right
    direction
    ... When somebody is authoring a tweet or an annotation on a
    document, they're not authoring markup the control characters
    are generated on the fly to get the display to look correct
    ... What the text direction property is to do is to capture the
    context of the text that's entered or selected
    ... If you snippet a piece of text from an html document, the
    base direction might be declared as far back as the html
    element on a webpage
    ... The DOM structure knows what the base direction is and
    could populate a text direction property, even though there's
    no markup nearby on the text that's being snippeted

    aaronpk: the browser could also embed the control character in
    the text?

    aphillip: That's possible, but a more likely or simpler
    implementation is to take a piece of data you already have and
    apply it as metadata rather than having to mutate the text
    that's being clipped
    ... Or similarly if you write an android application you can
    know that your runtime environment is set to rtl and therefore
    the input control used to enter the text is a rtl base
    direction context
    ... not to say that you couldn't try to manage the control
    characters for the user, but you're interfering with their text
    by inserting or removing control characters based on the
    runtime context
    ... That's a reason why we might want to have a separate
    property
    ... That isn't to say that we couldn't solve it by writing
    instructions instead that says your implementation must or must
    not include control characters in a particular way

    jasnell: Just to provide more context wrt the property
    approach. AS2 is a JSON based format. It is written to be
    compatible with or aligned with JSON-LD. While we can have
    objects, embedded and nested objects, there really is no
    concept of inheritance
    ... outside of the JSON-LD context within a document
    ... So if you have an object nested 3 or 4 levels deep it
    doesn't actually inherit the properties of its parent
    ... And these individual objects can be from different sources,
    different authors
    ... In those cases declaring a base document level direction
    may not necessarily work, and we'd have to put the direction
    metadata at each object within that document
    ... So you potentially end up with multiple default direction
    properties throughout a single document

    aphillip: our best practice is to recommend that language and
    base direction information is associated with each object that
    could contain it, so each can be set separately
    ... And it's also useful to have a document level way of saying
    the default to have a fallback so you don't have to set it on
    every single thing
    ... JSON-LD itself doesn't provide any of this structuring
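
    A rough sketch of the per-object pattern with a document-level
    fallback, as discussed above. The keys "dir" and "defaultDir"
    are hypothetical names chosen for illustration; neither is
    defined by AS2 or JSON-LD. A nested object does not inherit
    from its parent here, only from the document default.

```python
def _walk(node, document_default, path="$"):
    """Yield (path, direction) for every nested object that has a 'name'."""
    if isinstance(node, dict):
        if "name" in node:
            # Own "dir" wins; otherwise fall back to the document default,
            # never to the enclosing object (no inheritance).
            yield path, node.get("dir", document_default)
        for key, value in node.items():
            yield from _walk(value, document_default, f"{path}.{key}")
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from _walk(item, document_default, f"{path}[{index}]")


def directions_by_path(document):
    """Map each named object's path to its resolved base direction."""
    default = document.get("defaultDir")  # hypothetical document-level key
    return dict(_walk(document, default))


example = {
    "defaultDir": "ltr",
    "name": "Note",
    "object": {
        "name": "\u05e9\u05dc\u05d5\u05dd",
        "dir": "rtl",
        "inReplyTo": {"name": "Hello"},
    },
}
resolved = directions_by_path(example)
```

    In this example the innermost object resolves to the document
    default rather than to the "rtl" of the object that contains
    it.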

    jasnell: The point there is that adding.. I have no quarrel
    with adding this information as properties. The tradeoff though
    is that it does have a fairly significant complexity tradeoff
    for implementors. There's also a backwards compat concern with
    as1
    ... Existing implementations, display name is a simple string
    as plain text without any language tagging or directionality
    ... We have made breaking changes from as1, so less of a
    concern now than it was before, but if we are going to provide
    this metadata we need to do it in a way that causes as little
    disruption and complexity as possible

    aphillip: I think these are all optional properties, that's
    less intrusive than requiring implementations to do control
    character insertion?

    jasnell: It is less, we just need to be careful with the
    wording
    ... If we strongly recommend, it sends a signal. Implicit MUST

    r12a: I'm for the idea of putting the information in the text
    itself, and I've been trying hard to think of scenarios where
    having a separate direction property would be advantageous and
    I haven't come up with a lot, however there are a couple of
    situations that are worth mentioning
    ... james, you mentioned increased implementor complexity
    ... If you had typed the text in a field, and it knew it's rtl
    because of the context from the html, the user wouldn't type in
    any information to say this is rtl
    ... If you're working with first-strong heuristics you wouldn't
    need that
    ... But if you had started with @mention which is in latin,
    then unless you have some very special handling in the target
    to say that's a twitter handle and you should ignore it, then
    you're going to get a situation where the first-strong char is
    ltr when the rest of the message is rtl
    ... I don't think you could expect the users to say 'this is
    going to be wrong if it goes somewhere else'
    ... The user isn't going to think about or want to add control
    characters to do that
    ... You want to get the data that the DOM knows about and apply
    that in some way to the text so it comes out appropriately
    ... Whether you do that by putting the data into a property
    value or by changing the text I'm not sure which is best, but
    they both would involve some additional complexity in terms of
    making sure that when that piece of text finally comes out
    somewhere there's information about what the default
    directionality is expected to be
    ... The other issue that we have in AS2 which we didn't have in
    WA, in WA each leaf in the object only has one text property,
    and therefore we had a direction property and a text property
    which were closely related
    ... in AS2 you can have map property with translations, summary
    and content in same object with only one direction property
    which would give a default for all of those strings which may
    be wrong
    ... that's an additional problem with having a direction
    property
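
    To illustrate the @mention case, here is a small sketch,
    assuming Python's unicodedata module for bidi classes. The
    mention-skipping regular expression stands in for the "very
    special handling" mentioned above and is purely hypothetical.

```python
import re
import unicodedata


def first_strong(text, default="ltr"):
    """Return the direction of the first strong character, else default."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return default


# Hypothetical special handling: ignore one or more leading @handles.
LEADING_MENTIONS = re.compile(r"^(?:\s*@\w+)+\s*")


def first_strong_ignoring_mentions(text, default="ltr"):
    """Apply first-strong detection after dropping leading @handles."""
    return first_strong(LEADING_MENTIONS.sub("", text), default)


# A Latin handle followed by Hebrew text typed in an rtl field.
message = "@w3c \u05e9\u05dc\u05d5\u05dd \u05e2\u05d5\u05dc\u05dd"
naive = first_strong(message)                      # decided by the "w" of the handle
adjusted = first_strong_ignoring_mentions(message)  # decided by the Hebrew text
```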

    cwebber: You were just saying that you weren't sure where it
    would be a difficulty to have markup
    ... I definitely want to support i18n. The clear case where
    it's problematic is titles, which are supposed to be just text,
    and possible to be rendered out of band, but very simply
    rendered. We don't want bold, we don't want links... just text
    ... We have one language to parse which is JSON and then you
    have another to parse which is HTML
    ... If you put HTML in a title element, it's difficult to parse
    in the first place. But it's also broad. If we permit links and
    bold and CSS in there, that's a lot more stuff to be concerned
    about than ... maybe we could reduce and say it's just <span>
    and that's all you're allowed to have
    ... Maybe that would work. It would reduce the scope, but would
    still be much more complex
    ... I have seen myself, sometimes people embed in RSS and atom
    readers you end up looking at blog entries and there are angle
    brackets rendered on the title, and I'm pretty sure that that's
    what will happen in our implementations
    ... I would like to support rtl stuff correctly, but that's why
    I feel this inkling that having the control characters would be
    nicer
    ... But I have this itching feeling that we're going to end up
    with a lot of trouble if we permit html in this element

    tantek: +1 to what chris said. The only experience we have with
    formats that are not html but then try to do embedded markup
    have basically all been failures in terms of implementation
    support, interop, and dependability by anyone trying to use
    those
    ... the hypothesis that using nested html markup in json,
    the only data we have when that hypothesis has been tested has
    shown that that is false
    ... That that solution does not work
    ... we have zero examples of that working
    ... I would go so far as to say we MUST NOT add markup in these
    elements
    ... the control character approach I'm not as familiar with
    ... but that seems to be a simpler solution to try
    ... Has anyone tried that and what are the results? I would
    defer to i18n for research on that

    r12a: Most people just want to type the text. Even fairly
    technical people who write in arabic or hebrew hate the control
    characters. They are hard to use. One of the problems is that
    they're invisible and you can never quite know whether you got
    it right
    ... If you try to edit something with the embedding, and you
    need start and end, and it gets really complicated

    tantek: my understanding is the same tools the user is using to
    input text would be generating the embedded markup
    ... No user would ever type in or see any control codes
    ... Nobody is advocating users typing control codes

    <r12a>
    [11]https://www.w3.org/International/wiki/Bidi_in_social_media

      [11] https://www.w3.org/International/wiki/Bidi_in_social_media

    r12a: most people don't have access to these control characters
    on their keyboard either. I did some testing ^
    ... If you put the rlm at the beginning and then try to make
    that work on twitter or facebook it doesn't actually work. They
    strip them out before posting the message

    <aphillip> most is probably too strong, but certainly mobile
    users

    r12a: There are all those disadvantages with control codes.
    What I wanted to understand was that there are properties like
    summary and content that can hold html. Where does that html
    come from? How do they end up with html in them?
    ... Maybe one answer is what you just said tantek, maybe it's
    created during the process of creating the text
    ... I was trying to understand.. people are not going to type
    in html either

    jasnell: If you look at like blog software, for the authoring
    UI they provide a plain text title field and a rich text or
    markup editor that allows the user to format the content
    ... The editing tool itself is providing the markup for those
    values
    ... The title tends to be plain text, and that's what would end
    up in the name property
    ... Whereas a rich text editor would provide the values for
    summary and content

    r12a: I wonder how we would manage direction in that sort of
    context

    jasnell: I'm not aware of any rich text editors that have
    directionality as default option. If they do they would be
    markup oriented not control characters

    aphillip: there are a number. In Arabic and hebrew context.
    Yahoo mail has controls for that
    ... Not necessarily obvious
    ... particularly to non-users of them

    jasnell: And they operate in terms of markup, setting the
    directional spans rather than using control characters

    aphillip: that's my understanding

    <KevinMarks> Hebrew and Arabic keyboards often have the
    relevant chars

    <aaronpk>
    [12]https://github.com/w3c/activitystreams/issues/338#issuecomm
    ent-237570361

      [12] 
https://github.com/w3c/activitystreams/issues/338#issuecomment-237570361

    tantek: I left a long but clear comment on the AS2 github
    ... that's my last point, I have to leave

    jasnell: on the point of markup in name, and I made this in 338
    too, one of the primary points in use cases, the whole semantic
    of the name property, is to provide a reliably readable label
    for the object
    ... If some implementation for instance doesn't understand the
    object type, it would still have a reliable fallback to use the
    label
    ... Allowing markup of any kind makes it more problematic and
    complicated
    ... We have to retain that ability in order for the open
    extensibility model to continue working as it has been
    ... That's something we cannot lose
    ... That was the point, part of the earlier discussion

    aphillip: It's very hard to only permit limited forms of markup
    as well
    ... Once you kind of let some html in then you're kind of
    inviting a whole bunch of other html
    ... I don't think there's a lot of success in trying to limit
    what markup is applied
    ... It's not just bs and is and ems and strongs

    cwebber: I think, building off what James said, and what you
    just said, we have to assume that it's not possible to embed
    html in that name element. So what can we do given that it's
    really not possible?
    ... There's a real semantic need to have a plain text name for
    that object which won't work if we have markup
    ... It seems the control characters, or an additional property.
    Are there any other options?

    <KevinMarks_> ‏the vreating user agent can embed the control
    chars

    cwebber: We definitely want to support that, everybody wants
    this to work
    ... If we assume that markup is not possible, what can we do at
    this point?
    ... Can we simplify the conversation if we acknowledge that?

    aphillip: A property is supplying a base direction, I made that
    distinction early
    ... The base direction is not the same as providing inline
    controls to fix.. Richard has a whole bunch of examples.. text
    that needs help with multiple directions
    ... That's why we'd additionally need to look for control
    characters inside the text
    ... If you're going to have a plaintext string, you're still
    going to need control characters for perfect bidi

    jasnell: if we're not going to allow markup, to properly
    support bidi the only way is to support control characters
    ... We do have the option right now in the json format to say
    name is an object, as an option, that has a direction and
    language property, and a value
    ... It's more complicated for implementors and consumers, but
    it does give us the option of declaring on a per-field basis
    without having to rely on markup
    ... What is the complexity tradeoff?
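
    A sketch of what the "name as an object" option could mean for
    a consumer, assuming hypothetical keys "value", "dir" and
    "lang". These key names are illustrative only and are not taken
    from the AS2 draft. It shows the extra branch every consumer
    would need.

```python
def normalize_name(name):
    """Accept either a plain string or an object form and return one shape."""
    if isinstance(name, str):
        # Plain string: no per-field metadata is available.
        return {"value": name, "dir": None, "lang": None}
    if isinstance(name, dict):
        # Object form (hypothetical keys): metadata travels with the value.
        return {
            "value": name["value"],
            "dir": name.get("dir"),
            "lang": name.get("lang"),
        }
    raise TypeError("name must be a string or an object")


plain = normalize_name("Hello")
tagged = normalize_name({"value": "\u05e9\u05dc\u05d5\u05dd", "dir": "rtl", "lang": "he"})
```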

    <cwebber2> it would be possible, but a big headache to add that
    so late to all our activitystreams libraries

    aphillip: Can we describe rules for insertion and removal of
    control codes for the bidi
    ... Properties of the field... just the base direction that
    would be a property there... vs inline metadata

    <Zakim> aaronpk, you wanted to say I completely agree with
    tantek, and was never advocating that users type control
    characters themselves

    aaronpk: I'm not sure about the comment r12a made about me, I
    want to echo tantek earlier, I fully expect that the tools
    would be the ones adding the appropriate characters to the
    string, I'd never expect users to add that themselves
    ... My understanding is that the main reason html has a base
    dir property for elements is not so much so that the string
    itself is in the correct order, but that html elements can flow
    in the correct order

    aphillip: that's not correct
    ... It doesn't change the order in which the elements flow
    ... What it has to do with is how the text is processed for
    unicode base direction, but doesn't change what order the
    elements are presented in

    aaronpk: One reason that html needs the attribute is if you
    imagine a full width element, setting the base direction on
    that element means the text will appear on the right side of
    the screen. That won't happen in control characters..

    aphillip: that's not necessarily true

    aaronpk: html is describing the layout. In most of these json
    format we're not describing the layout, just the string
    ... we don't know what format it will be presented in
    ... html is specifically describing the presentation

    aphillip: I think that's an invalid reading of the use of dir

    <Steve_Atkin> I have to drop the call now.

    aphillip: It's the case that the dir attribute causes that kind
    of rtl display that a rtl user would expect. But it's also an
    inherent property of the text. the reason it doesn't live in
    CSS is because it's an inherent property of the text

    aaronpk: that's absolutely my point
    ... outside of the context of html, the text does not have an
    inherent presentation

    aphillip: we're not talking about presentation
    ... We're talking about if I get a piece of text, I'm going to
    assume a base direction generally of ltr, and that will cause
    rtl text to display incorrectly

    r12a: aaron, there are two aspects of rendering
    ... One aspect is that if you know the base direction is rtl
    and you have "{arabic} w3c" that would determine where "w3c"
    goes in relation to the arabic text
    ... And another aspect is where the entire line of text appears
    on the page, against the left margin or against the right
    ... Sometimes you might want to sequence things rtl but keep
    them on the lefthand side
    ... If you look at twitter and facebook dealing simply with
    strings and they detect rtl direction and they move it to the
    right side of the box. That's some processing their application
    does

    aaronpk: what I'm actually trying to say is that while html is
    describing the presentation of the whole rendering of the page,
    but AS2 does not talk about presentation at all. The
    presentation is left up to the consuming application. It feels
    wrong to use a mechanism that exists in a presentation format
    in a spec that does not talk about presentation

    aphillip: I think you're missing the point. There are two kinds
    of presentation
    ... One is what you're talking about, layouts and that sort of
    thing
    ... What html is concerned with
    ... But the data itself has a direction.. the example Richard
    gave is which side of the arabic the letters "w3c" go on: that
    depends on the base direction of that text regardless of where
    you present it
    ... That's a property of the text, not a property of the
    presentation of the text
    ... same on a teletype, html, etc

    aaronpk: that's why I'm so interested in it being actually in
    the text, not as a property on the text

    r12a: aaron, I wanted to get some background information out.
    The problems we have with control characters may be something
    we have to deal with in applications rather than AS2
    ... I wanted to go back to the question chris said, what are
    the options here
    ... It seems to me that the options we are looking at currently
    are either if we know that the thing should be rtl that we
    stick a control char at the beginning of the string, or we
    stick it in an extra field
    ... I'm not sure that we're saying you should necessarily have a
    direction property partly because it's not specific enough when
    we have multiple strings within one object
    ... I'm just saying I think we have two options
    ... We change the string, or we put some metadata alongside
    each specific string where needed

    aphillip: I think that's what's necessary
    ... you can't have one text direction property that applies to
    six strings

    r12a: so which of those is the better approach

    cwebber: The control character at the start of the string will
    be fine, but having the additional metadata as a separate
    property... instead of having say name : "text" having name: {
    object } I think is going to screw up implementations just as
    much as having html in there
    ... Most of the fields in this can have html

    <cwebber2> {"name": "This is LTR", "nameDir": "ltr"}

    cwebber: The vast majority of the fields in which this applies
    is kind of a non-issue. Only name you can't
    ... So what if ^

    <KevinMarks_> ‮ inline works in reverse without implementers
    knowing

    cwebber: Just solve this for name or a few small fields where
    html is not permitted
    ... if an implementation doesn't know how to pay attention to
    nameDir they were going to fail anyway
    ... It will maybe hit the best middle ground
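
    A sketch of how a consumer might read the sibling-property
    idea, using the JSON that cwebber2 pasted. "nameDir" is the
    name floated in this discussion and is not a defined AS2
    property; the helper name is illustrative.

```python
import json


def name_with_direction(obj):
    """Return (name, direction); direction is None when absent or unrecognized."""
    direction = obj.get("nameDir")
    if direction not in ("ltr", "rtl"):
        direction = None
    # "name" stays a plain string, so consumers that ignore nameDir still work.
    return obj.get("name"), direction


with_dir = name_with_direction(json.loads('{"name": "This is LTR", "nameDir": "ltr"}'))
without_dir = name_with_direction(json.loads('{"name": "This is LTR"}'))
```

    As noted in the discussion, a single sibling property of this
    kind covers only the base direction and says nothing about
    direction changes inside the value.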

    <rhiaro> doesn't solve multiple directions in one name value
    though

    scribe: or stick a control character at the start

    <aaronpk> again you *need* to support control characters in
    strings in order to properly support bidirectional text (a
    string with text in both directions)

    KevinMarks: The advantage of doing it with injected control
    characters should work for anyone who is correctly using utf8
    ... whereas an extra property we're creating extra work for
    anyone creating and display
    ... in terms of most likely preservation of intent, putting it
    directly in the utf8 seems to be the strongest way to do that
    ... Maybe adding a note that creating user agents should do
    that
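
    A sketch of what a creating user agent could do, assuming it
    already knows the base direction of the input context. The mark
    is added only when first-strong detection would otherwise give
    a different answer, to limit how much the user's text is
    altered; that policy is an assumption of this sketch, not
    something agreed in the discussion.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK


def first_strong(text):
    """Return 'ltr', 'rtl', or None when there is no strong character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None


def mark_if_needed(text, base_dir):
    """Prepend LRM or RLM when first-strong would not yield base_dir.

    base_dir is assumed to be either 'ltr' or 'rtl'.
    """
    if first_strong(text) == base_dir:
        return text  # detection already agrees; leave the text untouched
    return (RLM if base_dir == "rtl" else LRM) + text


marked = mark_if_needed("@w3c \u05e9\u05dc\u05d5\u05dd", "rtl")
```

    Because the marks are themselves strong characters, a consumer
    that applies first-strong detection to the marked string gets
    the intended direction without knowing the mark was injected.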

    <r12a>
    [13]https://www.w3.org/International/wiki/Bidi_in_social_media

      [13] https://www.w3.org/International/wiki/Bidi_in_social_media

    r12a: The additional wrinkle here, third thing at the bottom of
    that url
    ... There's a two line text input
    ... The top line needs to be treated ltr and the second line is
    rtl
    ... If you don't do that then text is in the wrong place
    ... The rest of the stuff there shows that twitter and facebook
    don't manage this very well
    ... If the name property has multiple lines in it (haven't seen
    examples of that yet) then it's not just a question of sticking
    a control character at the beginning of the string, it's
    putting it at the beginning of each line
    ... Same applies with summary and content where you have html

    <aphillip> line == paragraph

    r12a: Perhaps it's more likely, where you have multiple
    paragraphs
    ... You probably ought to establish the basedir for each
    paragraph
    ... Or you could put a wrapper around the whole thing like <div
    dir="rtl">
    ... There are intricacies in there I'm not terribly clear about
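
    A sketch of the multi-line case for plain text, assuming each
    line is one paragraph and that the caller already knows the
    intended base direction of each one. The function name and the
    list-of-directions argument are illustrative.

```python
LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK


def mark_paragraphs(text, directions):
    """Prefix every line (paragraph) with the mark for its own base direction."""
    lines = text.split("\n")
    if len(lines) != len(directions):
        raise ValueError("need exactly one direction per line")
    marked = [
        (RLM if direction == "rtl" else LRM) + line
        for line, direction in zip(lines, directions)
    ]
    return "\n".join(marked)


# Top line treated as ltr, second line as rtl.
two_lines = "W3C \u05e9\u05dc\u05d5\u05dd\n@w3c \u05e9\u05dc\u05d5\u05dd"
result = mark_paragraphs(two_lines, ["ltr", "rtl"])
```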

    jasnell: Whatever we do with the metadata, however we indicate
    this base direction, there is definitely a tradeoff cost
    ... We already have some complexity of name and nameMap
    ... I'm suspecting that the property approach is probably going
    to be the most reliable for the base direction. Some
    combination of this property and the control codes
    ... But we need to take that time to balance the approach
    against existing complexity of name vs nameMap
    ... We should take our time, put together a proposal

    r12a: I'll try to provide some tests you can use

    jasnell: appreciate that

    aphillip: do you all want to come back next week? How shall we
    proceed?

    jasnell: works for me

    aphillip: I will reserve time next week to discuss language
    ... If there are proposals for how to discuss direction
    further, do we want to use a particular list or github issue
    for that discussion?
    ... Preferences?

    jasnell: if we can get a proposal in place by then we can
    discuss it then

    aphillip: it's taken years of our lives, so don't be
    surprised..

AOB?

    aphillip: thanks social, I'll reserve time next week

Summary of Action Items

Summary of Resolutions

    [End of minutes]

Received on Friday, 5 August 2016 12:31:43 UTC