FPWD: Internationalization Best Practices for Spec Developers

http://www.w3.org/TR/2015/WD-international-specs-20151020/

Abstract


This document provides a checklist of internationalization-related considerations when developing a specification. Most checklist items point to detailed supporting information in other documents. Where such information does not yet exist, it can be given a temporary home in this document. The dynamic page Internationalization Techniques: Developing specifications is automatically generated from this document.

The current version is still a very early draft, and it is expected that the information will change regularly as new content is added and existing content is modified in the light of experience and discussion.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document provides advice to specification developers about how to incorporate requirements for international use. What is currently available here is expected to be useful immediately, but it is a very early draft. The document is in flux and will grow over time as knowledge applied in reviews and discussions is crystallized into guidelines.

Note: Sending comments on this document
If you wish to make comments regarding this document, please raise them as GitHub issues. Only send comments by email if you are unable to raise issues on GitHub (see links below). All comments are welcome. To make it easier to track comments, please raise separate issues or emails for each comment, and point to the section you are commenting on using a URL for the dated version of the document.

This document was published by the Internationalization Working Group as a First Public Working Draft. If you wish to make comments regarding this document, please send them to www-international@w3.org (subscribe, archives). All comments are welcome.
Publication as a First Public Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

This document is governed by the 1 September 2015 W3C Process Document.

Table of Contents

1. Introduction
2. Characters
  2.1 Choosing a definition of 'character'
  2.2 Defining a Reference Processing Model
  2.3 Including and excluding character ranges
  2.4 Using the Private Use Area
  2.5 Choosing character encodings
  2.6 Identifying character encodings
  2.7 Designing character escapes
  2.8 Storing text
  2.9 Specifying sort and search functionality
  2.10 Converting to a Common Unicode Form
  2.11 Handling Case Folding
  2.12 Defining 'string'
  2.13 Indexing strings
  2.14 Referring to Unicode characters
  2.15 Referencing the Unicode Standard
3. Language
  3.1 Establishing the language of a resource as a whole
  3.2 Establishing the language of blocks, paragraphs, or similar chunks of content
  3.3 Establishing the language of inline content spans
  3.4 Defining language values
  3.5 Providing for content negotiation based on language
4. Text direction
  4.1 Setting the bidi direction for the resource as a whole
  4.2 Establishing the bidi direction for blocks, paragraphs, or similar chunks of content
  4.3 Establishing the bidi direction for spans of inline content
  4.4 Enabling vertical text display
  4.5 Setting box positioning coordinates when text direction varies
5. Typographic support
  5.1 Miscellaneous
6. Localizability
  6.1 Defining elements and attributes
7. Plain text support
  7.1 Miscellaneous
8. Case distinctions
  8.1 Miscellaneous
A. References
B. Revision Log
C. Acknowledgements

1. Introduction

Developers of specifications need advice to ensure that what they produce will work for communities around the globe.

The Internationalization (i18n) WG tries to assist working groups by reviewing specifications and engaging in discussion. Often, however, such interventions come later in the process than would be ideal, or mean that the i18n WG has to repeat the same information for each working group it interacts with.

It would be better if specification developers could access a checklist of best practices, which points to explanations, examples and rationales where developers need them. Developers would then be able to build this knowledge into their work from the earliest stages, and could thereby reduce the rework needed when the i18n WG reviews their specification.

This document contains the beginnings of a checklist, and points to locations where you can find explanations, examples and rationales for the recommendations made. If there is no such other place, that extra information will be added to this document.

It is still early days for this document, and it may also be used to develop ideas and organize them.

You may prefer to use Internationalization Techniques: Developing specifications most of the time, since it uses JavaScript to help you more quickly see what's available and drill down to the information you need. (Where needed, it links to this or other documents.)
There is also a non-dynamic version of the document available.

2. Characters

See the Character Model for the World Wide Web: Fundamentals for basic guidelines related to the use of characters and encodings. See the Encoding specification for further guidelines related to the use of character encodings.

Another Character Model document is currently in development, entitled String Matching and Searching. It looks at issues that arise when you try to compare two strings, be they identifiers or authored content.

2.1 Choosing a definition of 'character'

- Specifications, software and content MUST NOT require or depend on a one-to-one correspondence between characters and the sounds of a language.
- Specifications, software and content MUST NOT require or depend on a one-to-one mapping between characters and units of displayed text.
- Protocols, data formats and APIs MUST store, interchange or process text data in logical order.
- Independent of whether some implementation uses logical selection or visual selection, characters selected MUST be kept in logical order in storage.
- Specifications of protocols and APIs that involve selection of ranges SHOULD provide for discontiguous logical selections, at least to the extent necessary to support implementation of visual selection on screen on top of those protocols and APIs.
- Specifications and software MUST NOT require nor depend on a single keystroke resulting in a single character, nor that a single character be input with a single keystroke (even with modifiers), nor that keyboards are the same all over the world.
- Specifications, software and content MUST NOT require or depend on a one-to-one relationship between characters and units of physical storage.
- When specifications use the term 'character' the specifications MUST define which meaning they intend.
- Specifications SHOULD use specific terms, when available, instead of the general term 'character'.

2.1.1 Links
2.1.1.1 How to's
Perceptions of Characters. In W3C Recommendation, Character Model for the World Wide Web.
2.1.2 See also
Defining 'string'.

2.2 Defining a Reference Processing Model

- Textual data objects defined by protocol or format specifications MUST be in a single character encoding.
- All specifications that involve processing of text MUST specify the processing of text according to the Reference Processing Model described by the rest of the recommendations in this list.
- Specifications MUST define text in terms of Unicode characters, not bytes or glyphs.
- For their textual data objects specifications MAY allow use of any character encoding which can be transcoded to a Unicode encoding form.
- Specifications MAY choose to disallow or deprecate some character encodings and to make others mandatory. Independent of the actual character encoding, the specified behavior MUST be the same as if the processing happened as follows:
  (a) The character encoding of any textual data object received by the application implementing the specification MUST be determined and the data object MUST be interpreted as a sequence of Unicode characters; this MUST be equivalent to transcoding the data object to some Unicode encoding form, adjusting any character encoding label if necessary, and receiving it in that Unicode encoding form.
  (b) All processing MUST take place on this sequence of Unicode characters.
  (c) If text is output by the application, the sequence of Unicode characters MUST be encoded using a character encoding chosen among those allowed by the specification.
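Steps (a) to (c) of the Reference Processing Model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the set of allowed output encodings, and the whitespace-stripping "processing" step are all hypothetical and are not taken from any specification.

```python
# Hypothetical set of output encodings permitted by an imaginary specification.
ALLOWED_OUTPUT_ENCODINGS = {"utf-8", "utf-16"}


def process_text_object(data: bytes, declared_encoding: str,
                        output_encoding: str = "utf-8") -> bytes:
    """Apply the decode / process / encode pipeline to one textual data object."""
    if output_encoding.lower() not in ALLOWED_OUTPUT_ENCODINGS:
        raise ValueError(f"encoding not allowed for output: {output_encoding}")

    # (a) Interpret the received bytes as a sequence of Unicode characters.
    text = data.decode(declared_encoding)

    # (b) All processing happens on Unicode characters, never on raw bytes.
    #     Stripping surrounding whitespace is only a placeholder operation.
    processed = text.strip()

    # (c) Encode the result using an encoding the specification allows.
    return processed.encode(output_encoding)


latin1_input = "  caf\u00e9 ".encode("iso-8859-1")
print(process_text_object(latin1_input, "iso-8859-1"))  # b'caf\xc3\xa9'
```

Because step (b) only ever sees characters, the same input delivered in a different encoding produces the same output bytes.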
- If a specification is such that multiple textual data objects are involved (such as an XML document referring to external parsed entities), it MAY choose to allow these data objects to be in different character encodings. In all cases, the Reference Processing Model MUST be applied to all textual data objects.

2.2.1 Links
2.2.1.1 How to's
Digital Encoding of Characters. In W3C Recommendation, Character Model for the World Wide Web.
2.2.2 See also
Including and excluding character ranges.

2.3 Including and excluding character ranges

- Specifications SHOULD NOT arbitrarily exclude code points from the full range of Unicode code points from U+0000 to U+10FFFF inclusive.
- Specifications MUST NOT allow code points above U+10FFFF.
- Specifications SHOULD NOT allow the use of code points reserved by Unicode for internal use.
- Specifications MUST NOT allow the use of surrogate code points.
- Specifications SHOULD exclude compatibility characters in the syntactic elements (markup, delimiters, identifiers) of the formats they define.

2.3.1 Links
2.3.1.1 How to's
Digital Encoding of Characters. In W3C Recommendation, Character Model for the World Wide Web.
2.3.2 See also
Using the Private Use Area.

2.4 Using the Private Use Area

- Specifications MUST NOT require the use of private use area characters with particular assignments.
- Specifications MUST NOT require the use of mechanisms for defining agreements of private use code points.
- Specifications and implementations SHOULD NOT disallow the use of private use code points by private agreement.
- Specifications MAY define markup to allow the transmission of symbols not in Unicode or to identify specific variants of Unicode characters.
- Specifications SHOULD allow the inclusion of or reference to pictures and graphics where appropriate, to eliminate the need to (mis)use character-oriented mechanisms for pictures or graphics.
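The code point ranges referred to in sections 2.3 and 2.4 can be made concrete with a small classifier. The sketch below is illustrative only: the function name and the class labels are hypothetical, and it assumes that "code points reserved by Unicode for internal use" refers to the Unicode noncharacters.

```python
def classify_code_point(cp: int) -> str:
    """Return a coarse class for an integer code point value."""
    if not 0 <= cp <= 0x10FFFF:
        return "out-of-range"   # above U+10FFFF (or negative): never allowed
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"      # only meaningful as UTF-16 code units
    # Noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    # Private use: the BMP Private Use Area and planes 15 and 16.
    if (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD):
        return "private-use"
    return "ordinary"


for value in (0x41, 0xD800, 0xE000, 0xFFFE, 0x110000):
    print(f"{value:#x}: {classify_code_point(value)}")
```

A format definition would typically reject the "out-of-range" and "surrogate" classes outright, while leaving "private-use" code points usable by private agreement.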

2.4.1 Links
2.4.1.1 How to's
Private use code points. In W3C Recommendation, Character Model for the World Wide Web.
2.4.2 See also
Including and excluding character ranges.

2.5 Choosing character encodings

- Specifications MUST either specify a unique character encoding, or provide character encoding identification mechanisms such that the encoding of text can be reliably identified.
- When designing a new protocol, format or API, specifications SHOULD require a unique character encoding.
- When basing a protocol, format, or API on a protocol, format, or API that already has rules for character encoding, specifications SHOULD use rather than change these rules.
- When a unique character encoding is required, the character encoding MUST be UTF-8, UTF-16 or UTF-32.
  This guideline needs further consideration: UTF-16 and UTF-32 are not recommended these days. UTF-8 is the recommended encoding.
- Specifications SHOULD avoid using the terms 'character set' and 'charset' to refer to a character encoding, except when the latter is used to refer to the MIME charset parameter or its IANA-registered values. The term 'character encoding', or in specific cases the terms 'character encoding form' or 'character encoding scheme', are RECOMMENDED.
- If the unique encoding approach is not taken, specifications SHOULD require the use of the IANA charset registry names, and in particular the names identified in the registry as 'MIME preferred names', to designate character encodings in protocols, data formats and APIs.
  This guideline needs further consideration: the character encodings recommended for Web specifications are listed in the Encoding specification.
- Character encodings that are not in the IANA registry SHOULD NOT be used, except by private agreement.
- If an unregistered character encoding is used, the convention of using 'x-' at the beginning of the name MUST be followed.
- If the unique encoding approach is not chosen, specifications MUST designate at least one of the UTF-8 and UTF-16 encoding forms of Unicode as admissible character encodings and SHOULD choose at least one of UTF-8 or UTF-16 as required encoding forms (encoding forms that MUST be supported by implementations of the specification).
- Specifications that require a default encoding MUST define either UTF-8 or UTF-16 as the default, or both if they define suitable means of distinguishing them.

2.5.1 Links
2.5.1.1 How to's
Choice and identification of code points. In W3C Recommendation, Character Model for the World Wide Web.
2.5.1.2 Background reading
Document character set: What is the 'Document Character Set' for XML and HTML, and how does it relate to the encodings I use for my documents?

2.6 Identifying character encodings

- Specifications MUST NOT propose the use of heuristics to determine the encoding of data.
- Specifications MUST define conflict-resolution mechanisms (e.g. priorities) for cases where there is multiple or conflicting information about character encoding.

2.6.1 Links
2.6.1.1 How to's
Choice and identification of code points. In W3C Recommendation, Character Model for the World Wide Web.

2.7 Designing character escapes

- Specifications should provide a mechanism for escaping characters, particularly those which are invisible or ambiguous.
  It is generally recommended that character escapes be provided so that sequences that are difficult to enter or edit can be introduced using a plain text editor. Escape sequences are particularly useful for invisible or ambiguous Unicode characters, including zero-width spaces, soft hyphens, various bidi controls, Mongolian vowel separators, etc. For advice on the use of escapes in markup, most of which can be generalized to other formats, see Using character escapes in markup and CSS.
- Specifications SHOULD NOT invent a new escaping mechanism if an appropriate one already exists.
- The number of different ways to escape a character SHOULD be minimized (ideally to one).
- Escape syntax SHOULD require either explicit end delimiters or a fixed number of characters in each character escape. Escape syntaxes where the end is determined by any character outside the set of characters admissible in the character escape itself SHOULD be avoided.
- Whenever specifications define character escapes that allow the representation of characters using a number, the number MUST represent the Unicode code point of the character and SHOULD be in hexadecimal notation.
- Escaped characters SHOULD be acceptable wherever their unescaped forms are; this does not preclude that syntax-significant characters, when escaped, lose their significance in the syntax. In particular, if a character is acceptable in identifiers and comments, then its escaped form should also be acceptable.

2.7.1 Links
2.7.1.1 How to's
Character escaping. In W3C Recommendation, Character Model for the World Wide Web.

2.8 Storing text

- Protocols, data formats and APIs MUST store, interchange or process text data in logical order.
- Specifications of protocols and APIs that involve selection of ranges SHOULD provide for discontiguous logical selections, at least to the extent necessary to support implementation of visual selection on screen on top of those protocols and APIs.

2.8.1 Links
2.8.1.1 How to's
Visual rendering and logical order. In W3C Recommendation, Character Model for the World Wide Web.

2.9 Specifying sort and search functionality

- Software that sorts or searches text for users SHOULD do so on the basis of appropriate collation units and ordering rules for the relevant language and/or application.
- Where searching or sorting is done dynamically, particularly in a multilingual environment, the 'relevant language' SHOULD be determined to be that of the current user, and may thus differ from user to user.
- Software that allows users to sort or search text SHOULD allow the user to select alternative rules for collation units and ordering.
- Specifications and implementations of sorting and searching algorithms SHOULD accommodate text that contains any character in Unicode.

2.9.1 Links
2.9.1.1 How to's
Units of collation. In W3C Recommendation, Character Model for the World Wide Web.

2.10 Converting to a Common Unicode Form

- Specifications of text-based formats and protocols MAY specify that all or part of the textual content of that format or protocol is normalized using Unicode Normalization Form C (NFC).
- Specifications that do not normalize MUST document or provide a health-warning if canonically equivalent but disjoint Unicode character sequences represent a security issue.
- Specifications and implementations MUST NOT assume that content is in any particular normalization form.
- Specifications MUST specify that string matching takes the form of "code point-by-code point" comparison of the Unicode character sequence, or, if a specific Unicode character encoding is specified, code unit-by-code unit comparison of the sequences.
- Specifications that define a regular expression syntax MUST provide at least Basic Unicode Level 1 support per Unicode Technical Standard #18: Unicode Regular Expressions and SHOULD provide Extended or Tailored (Levels 2 and 3) support.
- Specifications of text-based formats and protocols that, as part of their syntax definition, require that the text be in normalized form MUST define string matching in terms of normalized string comparison and MUST define the normalized form to be NFC.
- A normalizing text-processing component which receives suspect text MUST NOT perform any normalization-sensitive operations unless it has first either confirmed through inspection that the text is in normalized form or it has re-normalized the text itself.
  Private agreements MAY, however, be created within private systems which are not subject to these rules, but any externally observable results MUST be the same as if the rules had been obeyed.
- Specifications of text-based languages and protocols SHOULD define precisely the construct boundaries necessary to obtain a complete definition of full-normalization. These definitions SHOULD include at least the boundaries between markup and character data as well as entity boundaries (if the language has any include mechanism), SHOULD include any other boundary that may create denormalization when instances of the language are processed, but SHOULD NOT include character escapes designed to express arbitrary characters.
- Where operations can produce denormalized output from normalized text input, specifications of API components (functions/methods) that implement these operations MUST define whether normalization is the responsibility of the caller or the callee. Specifications MAY state that performing normalization is optional for some API components; in this case the default SHOULD be that normalization is performed, and an explicit option SHOULD be used to switch normalization off. Specifications SHOULD NOT make the implementation of normalization optional.
- Specifications that define a mechanism (for example an API or a defining language) for producing textual data objects SHOULD require that the final output of this mechanism be normalized.

2.10.1 Links
2.10.1.1 How to's
Converting to a Common Unicode Form. In W3C Working Draft, Character Model for the World Wide Web: String Matching and Searching.

2.11 Handling Case Folding

- Case sensitive matching is RECOMMENDED as the default for new protocols and formats.
- Because the "simple" case-fold mapping removes information that can be important to forming an identity match, the "Common plus Full" (or "Unicode C+F") case fold mapping is RECOMMENDED for Unicode case-insensitive matching.
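To make the difference between Unicode and ASCII case-insensitive matching more tangible, here is a minimal Python sketch. The two function names are hypothetical. It relies on Python's built-in str.casefold(), which applies Unicode full case folding, and it deliberately does not perform any Unicode normalization, which section 2.10 treats as a separate concern.

```python
import string

# Translation table that folds only the ASCII letters A-Z to a-z.
_ASCII_FOLD = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def unicode_caseless_match(a: str, b: str) -> bool:
    """Compare two strings after Unicode full case folding."""
    return a.casefold() == b.casefold()


def ascii_caseless_match(a: str, b: str) -> bool:
    """Compare two strings, ignoring case for ASCII letters only."""
    return a.translate(_ASCII_FOLD) == b.translate(_ASCII_FOLD)


# Full folding maps U+00DF LATIN SMALL LETTER SHARP S to "ss".
print(unicode_caseless_match("Stra\u00dfe", "STRASSE"))      # True
# ASCII-only folding is adequate for an ASCII-only vocabulary...
print(ascii_caseless_match("Content-Type", "content-type"))  # True
# ...but leaves non-ASCII letters untouched.
print(ascii_caseless_match("\u00c9cole", "\u00e9cole"))      # False
```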
- ASCII case-insensitive matching MUST only be applied to vocabularies that are restricted to ASCII. Unicode case-insensitivity MUST be used for all other vocabularies.
- If the vocabulary is not restricted to ASCII or permits user-defined values that use a broader range of Unicode, ASCII case-insensitive matching MUST NOT be required.
- The Unicode C+F case-fold form is RECOMMENDED as the case-insensitive matching for vocabularies. The Unicode C+S form MUST NOT be used for string identity matching on the Web.
- Specifications and implementations that define string matching as part of the definition of a format, protocol, or formal language (which might include operations such as parsing, matching, tokenizing, etc.) MUST define the criteria and matching forms used. These MUST be one of:
  (a) Case-sensitive
  (b) Unicode case-insensitive using Unicode case-folding C+F
  (c) ASCII case-insensitive
- Specifications SHOULD NOT specify case-insensitive comparison of strings.
- Specifications that specify case-insensitive comparison for non-ASCII vocabularies SHOULD specify Unicode case-folding C+F.
- Specifications MAY specify ASCII case-insensitive comparison for portions of a format or protocol that are restricted to an ASCII-only vocabulary.
- Specifications and implementations MUST NOT specify ASCII-only case-insensitive matching for values or constructs that permit non-ASCII characters.

2.11.1 Links
2.11.1.1 How to's
Handling Case Folding. In W3C Working Draft, Character Model for the World Wide Web: String Matching and Searching.

2.12 Defining 'string'

- Specifications SHOULD NOT define a string as a 'byte string'.
- The 'character string' definition SHOULD be used by most specifications.

2.12.1 Links
2.12.1.1 How to's
String concepts. In W3C Recommendation, Character Model for the World Wide Web.
2.12.2 See also
Indexing strings and Choosing a definition of 'character'.
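The difference between a character string, a code unit string and a byte string shows up as soon as the same text is measured in each unit. The following Python sketch is illustrative only; the function name is hypothetical, and it assumes that a Python str can stand in for a character string because it is a sequence of Unicode code points.

```python
def string_lengths(text: str) -> dict:
    """Measure the same text in three different counting units."""
    return {
        # Character string: one unit per Unicode code point.
        "characters": len(text),
        # Code unit string: UTF-16 uses two 16-bit units above U+FFFF.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        # Byte string: UTF-8 uses one to four bytes per code point.
        "utf8_bytes": len(text.encode("utf-8")),
    }


# "a", U+1D11E MUSICAL SYMBOL G CLEF, U+00E9 LATIN SMALL LETTER E WITH ACUTE
sample = "a\U0001D11E\u00e9"
print(string_lengths(sample))
# {'characters': 3, 'utf16_code_units': 4, 'utf8_bytes': 7}
```

An index or length expressed in one unit is generally meaningless in another, which is one reason a specification needs to say which definition of string it uses.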

2.13 Indexing strings

- The character string is RECOMMENDED as a basis for string indexing.
- A code unit string MAY be used as a basis for string indexing if this results in a significant improvement in the efficiency of internal operations when compared to the use of character string.
- Grapheme clusters MAY be used as a basis for string indexing in applications where user interaction is the primary concern.
- Specifications that define indexing in terms of grapheme clusters MUST either: (a) define grapheme clusters in terms of default grapheme clusters as defined in Unicode Standard Annex #29, Text Boundaries [UTR #29], or (b) define specifically how tailoring is applied to the indexing operation.
  Need to check the above recommendation, since extended grapheme clusters are now recommended.
- The use of byte strings for indexing is NOT RECOMMENDED.
- Specifications that need a way to identify substrings or point within a string SHOULD provide ways other than string indexing to perform this operation.
- Specifications SHOULD understand and process single characters as substrings, and treat indices as boundary positions between counting units, regardless of the choice of counting units.
- Specifications of APIs SHOULD NOT specify single characters or single 'units of encoding' as argument or return types.
- When the positions between the units are counted for string indexing, starting with an index of 0 for the position at the start of the string is the RECOMMENDED solution, with the last index then being equal to the number of counting units in the string.

2.13.1 Links
2.13.1.1 How to's
String indexing. In W3C Recommendation, Character Model for the World Wide Web.
2.13.2 See also
Defining 'string'.

2.14 Referring to Unicode characters

- Use U+XXXX syntax to represent Unicode code points in the specification.
  The U+XXXX format is well understood when referring to Unicode code points in a specification.
  These are space separated when appearing in a sequence. No additional decoration is needed. Note that a code point may contain four, five, or six hexadecimal digits. When fewer than four digits are needed, the code point number is zero filled, e.g. U+0020.

2.15 Referencing the Unicode Standard

- Since specifications in general need both a definition for their characters and the semantics associated with these characters, specifications SHOULD include a reference to the Unicode Standard, whether or not they include a reference to ISO/IEC 10646.
- A generic reference to the Unicode Standard MUST be made if it is desired that characters allocated after a specification is published are usable with that specification. A specific reference to the Unicode Standard MAY be included to ensure that functionality depending on a particular version is available and will not change over time.
- All generic references to the Unicode Standard MUST refer to the latest version of the Unicode Standard available at the date of publication of the containing specification.
- All generic references to ISO/IEC 10646 MUST refer to the latest version of ISO/IEC 10646 available at the date of publication of the containing specification.

2.15.1 Links
2.15.1.1 How to's
Referencing the Unicode Standard and ISO/IEC 10646. In W3C Recommendation, Character Model for the World Wide Web.

3. Language

3.1 Establishing the language of a resource as a whole

Here we are talking about a whole HTML page, an XML document, a JSON file, a WebVTT script, etc.

- It must be possible to indicate the default text-processing language for the resource as a whole.
  It saves trouble to identify the language, or at least the default language, of the resource as a whole in one place. For example, in an HTML file, this is done by setting the lang attribute on the html element. Content within the resource should inherit the language so set, unless it is specifically overridden.
- Consider whether it is necessary to have separate declarations for the text-processing language and metadata about the expected linguistic characteristics of the intended consumer.
  The text-processing language is the language that is relevant for processing the content when it comes to spell-checking, styling, voice production, etc. For such operations, only one language should be identified at a time for a given range of content. Metadata about the intended linguistic characteristics of the consumer describes the intended users of the resource as a whole, rather than specific ranges of content. For example, a blog page for an Indian technical community may contain equal numbers of posts in English and Hindi. It may be useful to express that, as a whole, this resource contains content in both those languages (without specifying which bits of content are in which language).

3.1.1 Links
3.1.1.1 How to's
Establishing the language of a resource as a whole. In Internationalization Best Practices for Spec Developers.

3.2 Establishing the language of blocks, paragraphs, or similar chunks of content

The word block is used here to refer to a structural component within the resource as a whole that groups content together and separates it from adjacent content such that the boundaries between one block and another are equivalent to paragraph or section boundaries in text. For example, this could refer to a block or paragraph in XML or HTML, an object declaration in JSON, a cue in WebVTT, etc. Contrast this with inline content, which describes a range within a paragraph, sentence, etc.
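The inheritance model described in section 3.1, where content takes the resource-level language unless it is specifically overridden, can be sketched for a JSON-like tree of blocks. Everything in this example is hypothetical: the "id", "lang" and "children" keys are invented for illustration, and "und" (the BCP 47 subtag for an undetermined language) is assumed as the fallback when no language is declared anywhere.

```python
def resolve_languages(node: dict, inherited: str = "und") -> dict:
    """Map each node id to its effective text-processing language."""
    # A node's own declaration wins; otherwise it inherits from its parent.
    lang = node.get("lang", inherited)
    resolved = {node["id"]: lang}
    for child in node.get("children", []):
        resolved.update(resolve_languages(child, lang))
    return resolved


resource = {
    "id": "root", "lang": "en",            # default for the whole resource
    "children": [
        {"id": "p1"},                       # inherits "en"
        {"id": "p2", "lang": "hi",          # block-level override
         "children": [{"id": "p2-quote"}]}, # inherits "hi" from its block
    ],
}
print(resolve_languages(resource))
# {'root': 'en', 'p1': 'en', 'p2': 'hi', 'p2-quote': 'hi'}
```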
- By default, blocks of content should inherit the text-processing language set for the resource as a whole.
- It should be possible to indicate a change in language for blocks of content where the language changes.

3.2.1 Links
3.2.1.1 How to's
Establishing the language of blocks, paragraphs, or similar chunks of content. In Internationalization Best Practices for Spec Developers.

3.3 Establishing the language of inline content spans

Here we refer to content that switches to a different language in the middle of a paragraph or string.

- It should be possible to indicate language for spans of inline text where the language changes.
  Where a switch in language can affect operations on the content, such as spell-checking, rendering, styling, voice production, translation, information retrieval, and so forth, it is necessary to indicate the range of text affected and identify the language of that content.

3.3.1 Links
3.3.1.1 How to's
Establishing the language of inline content spans. In Internationalization Best Practices for Spec Developers.

3.4 Defining language values

- Where language attributes already exist and are appropriate, do not create a new one.
  For example, XML provides xml:lang which can be used in all XML formats to identify the text-processing language for a range of text.
- Language values should be BCP 47 language tags.
- Be specific about the form of language tags you expect. The word "valid" has special meaning in BCP 47. Generally "well-formed" is a better choice.
- Reference BCP 47 for language tag matching.

3.4.1 Links
3.4.1.1 How to's
Defining language values. In Internationalization Best Practices for Spec Developers.

3.5 Providing for content negotiation based on language

- In a multilingual environment it must be possible for the user to receive text in the language they prefer. This may depend on implicit user preferences based on the user's system or browser setup, or on user settings explicitly negotiated with the user.
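One way to pick the language to serve is the "lookup" matching scheme that BCP 47 defines in RFC 4647, in which each preferred language range is progressively truncated until it matches an available tag. The sketch below is a simplified illustration, not a complete implementation: the function and parameter names are hypothetical, tags are assumed to be well-formed already, and quality weights such as those in an HTTP Accept-Language header are not handled.

```python
def lookup_language(preferred_ranges, available_tags, default):
    """Return the best available tag for an ordered list of language ranges."""
    # Language tags compare case-insensitively, so index them in lowercase.
    available = {tag.lower(): tag for tag in available_tags}
    for language_range in preferred_ranges:
        if language_range == "*":
            continue  # the wildcard range is skipped in lookup
        candidate = language_range.lower()
        while candidate:
            if candidate in available:
                return available[candidate]
            # Truncate the last subtag; a single-character subtag left at
            # the end (for example "x") is removed together with it.
            subtags = candidate.split("-")[:-1]
            if subtags and len(subtags[-1]) == 1:
                subtags = subtags[:-1]
            candidate = "-".join(subtags)
    return default


print(lookup_language(["fr-CA", "en"], ["en", "fr", "de"], "en"))  # fr
print(lookup_language(["zh-Hant-TW"], ["zh", "zh-Hant"], "en"))    # zh-Hant
print(lookup_language(["ja"], ["en", "fr"], "en"))                 # en
```

The ordered list of ranges can come from either source mentioned above: implicit preferences taken from the system or browser setup, or settings chosen explicitly by the user.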

3.5.1 Links
3.5.1.1 How to's
Providing for content negotiation based on language. In Internationalization Best Practices for Spec Developers.

4. Text direction

4.1 Setting the bidi direction for the resource as a whole

- The content author must be able to indicate the RTL/LTR direction of the content as a whole, i.e. set the overall base direction.
- The default text direction should be declared as LTR.

4.2 Establishing the bidi direction for blocks, paragraphs, or similar chunks of content

- The content author must be able to indicate parts of the text where the base direction changes. This should be achieved using attributes or metadata at a block level, and not rely on Unicode control characters.
- It must be possible to also set the direction for content fragments to auto. This means that the base direction will be determined by examining the content itself.
- A typical approach here would be to set the direction based on the first strong directional character outside of any markup, but this is not the only possible method. The algorithm used to determine directionality when direction is set to auto should match that expected by the receiver.
  The first-strong algorithm looks for the first character in the paragraph with a strong directional property according to the Unicode definitions. It then sets the base direction of the paragraph according to the direction of that character. Note that the first-strong algorithm may incorrectly guess the direction of the paragraph when the first character is not typical of the rest of the paragraph, such as when a RTL paragraph or line starts with a LTR brand name or technical term.
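The first-strong heuristic described above can be sketched with Python's standard unicodedata module. This is a simplified illustration with a hypothetical function name: it assumes the input is plain text with any markup already removed, and it does not skip characters inside directional isolates as the full Unicode Bidirectional Algorithm does.

```python
import unicodedata


def first_strong_direction(text: str, default: str = "ltr") -> str:
    """Guess the base direction from the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":            # strong left-to-right
            return "ltr"
        if bidi_class in ("R", "AL"):    # strong right-to-left (incl. Arabic)
            return "rtl"
    return default                       # no strong character was found


hebrew = "\u05e9\u05dc\u05d5\u05dd"      # a Hebrew word
print(first_strong_direction(hebrew + " world"))  # rtl
print(first_strong_direction("123 + 456"))        # ltr (falls back to default)
# The misdetection described above: a RTL sentence that starts with a
# LTR technical term is classified as LTR.
print(first_strong_direction("W3C " + hebrew))    # ltr
```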
For additional information about algorithms for detecting direction, see Estimation algorithms in the document where this was discussed with reference to HTML.

If the overall base direction is set to auto for plain text, the direction of content paragraphs should be determined on a paragraph-by-paragraph basis.

To indicate the sides of a block of text relative to the start and end of its contained lines, you should use 'before' and 'after' (or possibly block-start/block-end; the terminology is changing), rather than 'top' and 'bottom'.

To indicate the start/end of a line you should use 'start' and 'end' rather than 'left' and 'right'.

Provide dedicated attributes for control of base direction and bidirectional overrides; do not rely on the user applying style properties to arbitrary markup to achieve bidi control.

4.3 Establishing the bidi direction for spans of inline content

It must be possible to indicate spans of inline text where the base direction changes. If markup is available, this is the preferred method. Otherwise your specification must require that Unicode control characters are recognized by the receiving application, and correctly implemented.

It must also be possible to set the direction for a span to auto. This means that the base direction will be determined by examining the content itself. A typical approach here would be to set the direction based on the first strong directional character outside of any markup.

The first-strong algorithm looks for the first character in the paragraph with a strong directional property according to the Unicode definitions. It then sets the base direction of the paragraph according to the direction of that character. Note that the first-strong algorithm may incorrectly guess the direction of the paragraph when the first character is not typical of the rest of the paragraph, such as when an RTL paragraph or line starts with an LTR brand name or technical term.
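Where markup is not available, a span can be given its own base direction in plain text by wrapping it in Unicode directional isolate characters. The sketch below is illustrative only: the helper name isolate and its direction parameter are hypothetical, while the code points are those Unicode assigns to LRI, RLI, FSI and PDI. FSI asks the receiver to apply first-strong detection to the span, which corresponds to setting the direction to auto.

```python
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE (direction determined from content)
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE (closes any of the above)

_OPENERS = {"ltr": LRI, "rtl": RLI, "auto": FSI}


def isolate(text: str, direction: str = "auto") -> str:
    """Wrap a span of plain text in a directional isolate.

    direction must be "ltr", "rtl" or "auto".
    """
    if direction not in _OPENERS:
        raise ValueError(f"unknown direction: {direction!r}")
    return _OPENERS[direction] + text + PDI


if __name__ == "__main__":
    # Insert a name of unknown direction into an English sentence.
    message = "Welcome, " + isolate("مريم") + "!"
    print([hex(ord(c)) for c in message if c in (FSI, PDI)])
```

Because every opener is closed by the same PDI character, the receiver only needs to track nesting depth to find the end of the isolated span.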
For additional information about algorithms for detecting direction, see Estimation algorithms in the document where this was discussed with reference to HTML.

If users use Unicode bidirectional control characters, the RLI/LRI/FSI with PDI characters must be supported by the application and recommended (rather than RLE/LRE with PDF) by the spec.

Use of RLM/LRM should be appropriate, and expectations of what those controls can and cannot do should be clear in the spec.

The Unicode bidirectional control characters U+200F RIGHT-TO-LEFT MARK and U+200E LEFT-TO-RIGHT MARK are not sufficient on their own to manage bidirectional text. They cannot produce a different base direction for embedded text. For that you need to be able to indicate the start and end of the range of the embedded text. This is best done by markup, if available, or failing that using the other Unicode bidirectional controls mentioned just above.

Provide dedicated attributes for control of base direction and bidirectional overrides; do not rely on the user applying style properties to arbitrary markup to achieve bidi control.

Allow bidi attributes on all inline elements in markup that contain text.

Provide attributes that allow the user to (a) create an embedded base direction or (b) override the bidirectional algorithm altogether; the attribute should allow the user to set the direction to LTR or RTL in either of these two scenarios.

4.4 Enabling vertical text display

It should be possible to render text vertically for languages such as Japanese, Chinese, Korean, Mongolian, etc.

Vertical text must support line progression from LTR (e.g. Mongolian) and RTL (e.g. Japanese).

4.5 Setting box positioning coordinates when text direction varies

Box positioning coordinates must take into account whether the text is horizontal or vertical.

It is typical, when localizing a user interface or web page, to create mirror-images for the RTL and LTR versions.
For example, it is likely that a box that appears near the left side of a window containing English content would appear near the right side of the window if the content is Arabic or Hebrew. This change should preferably happen automatically, based on the base direction of the current context, unless there is a strong reason for using absolute geometry. One way to achieve this is to use keywords such as start and end, rather than left and right, to indicate position.

5. Typographic support

Miscellaneous

5.1 Miscellaneous

Line heights must allow for characters that are taller than those used for English.

Box sizes must allow for text expansion in translation.

Ruby text alongside base text should be supported for CJK text.

Line wrapping should take into account the special rules needed for non-Latin scripts.

Various non-Latin writing systems don't simply wrap text on inter-word spaces. They have additional rules that must be respected. For example, Chinese, Japanese and Korean wrap after characters, but don't put certain characters at the start/end of a line. Thai and other SE Asian scripts wrap at word boundaries, but words are not delimited by spaces; spaces are instead used to separate phrases. Tibetan wraps after the tsek character that follows a syllable; words are not separated by spaces, and lines can break within a word. Indic and other complex scripts break at orthographic syllable boundaries, which are often two or more grapheme clusters. See the CSS Text Level 3 specification for additional background. (This tutorial provides additional examples, if needed.)

Avoid specifying presentational tags, such as b for bold, and i for italic.

It is best to avoid presentational markup such as b, i or u, since it isn't interoperable across writing systems and furthermore may cause unnecessary problems for localisation. In addition, some scripts have native approaches to things such as emphasis that do not involve, and can be very different from, bolding, italicisation, etc.
In the HTML case, there was a legacy issue, but unless there is one for your specification, the recommendation is that styling be used instead to determine the presentation of the text, and that any markup or tagging should allow for general semantic approaches. For an explanation of the issues surrounding b and i tags, see Using <b> and <i> elements.

6. Localizability

Defining elements and attributes

6.1 Defining elements and attributes

Do not define attribute values that will contain user-readable content. Use elements for such content.

Provide a way for authors to annotate arbitrary inline content using a span-like element or construct.

7. Plain text support

Miscellaneous

7.1 Miscellaneous

Avoid natural language text in elements that only allow for plain text and in attribute values.

Provide a span-like element that can be used for any text content to apply information needed for internationalization.

Internationalization information may include a change of language, bidirectional text behaviour changes, translate flags, etc.

8. Case distinctions

Miscellaneous

8.1 Miscellaneous

Identifiers should be case-sensitive.

A. References

None.

B. Revision Log

tbd

C. Acknowledgements

tbd

Status of the Document


This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document provides advice to specification developers about how to incorporate requirements for international use. What is currently available here is expected to be useful immediately, but is a very early draft and the document is in flux, and will grow over time as knowledge applied in reviews and discussions can be crystallized into guidelines.

Note


Sending comments on this document

If you wish to make comments regarding this document, please raise them as GitHub issues. Only send comments by email if you are unable to raise issues on GitHub (see links below). All comments are welcome.

To make it easier to track comments, please raise a separate issue or email for each comment, and point to the section you are commenting on using a URL for the dated version of the document.

This document was published by the Internationalization Working Group as a First Public Working Draft. If you wish to make comments regarding this document, please send them to www-international@w3.org (subscribe, archives). All comments are welcome.

Publication as a First Public Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

This document is governed by the 1 September 2015 W3C Process Document.

Received on Tuesday, 20 October 2015 08:39:15 UTC