[all] Notes on the list of implementation issues from Yves Savourel on 2012-10-31 (public-multilingualweb-lt@w3.org from October 2012)

From: Yves Savourel <ysavourel@enlaso.com>
Date: Tue, 30 Oct 2012 21:01:56 -0600
To: <public-multilingualweb-lt@w3.org>
Message-ID: <assp.0651e50619.assp.0651ceb246.002501cdb714$1657bc90$430735b0$@com>

Hi all,

Since I'll probably miss the start of the day Tuesday,
here are some notes on the issues listed for the implementations:
http://www.w3.org/International/multilingualweb/lt/wiki/Use_cases_-_high_level_summary

-- Algorithm for the building of the list of domain needs to be finalized.

Mauricio provided feedback.
The latest version includes provisions for removing duplicates and dealing with quoted values.
http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#domain-implementation
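
As a rough illustration of the two provisions mentioned above (duplicates and quoted values), here is a minimal sketch. It is not the algorithm from the draft: the function name `build_domain_list`, the comma separator, and the rule of dropping one pair of matching quotes are assumptions made for the example. It also does not handle commas inside a quoted value.

```python
def build_domain_list(raw_values):
    """Build an ordered, de-duplicated list of domains from raw strings.

    Assumption: each raw string may hold several comma-separated values,
    and a value may be wrapped in single or double quotes.
    """
    domains = []
    for raw in raw_values:
        for part in raw.split(","):
            value = part.strip()
            # Drop one pair of matching surrounding quotes, if present.
            if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
                value = value[1:-1].strip()
            if not value:
                continue  # skip empty entries such as "a,,b"
            if value not in domains:
                domains.append(value)  # keep first occurrence only
    return domains


print(build_domain_list(["automotive, 'engine parts'", '"automotive", law']))
```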

-- Need for a common representation of the ITS data categories in XLIFF, see XLIFF_Mapping

David and I have posted emails indicating the need for extended attributes in mrk:
https://lists.oasis-open.org/archives/xliff/201210/msg00087.html
https://lists.oasis-open.org/archives/xliff/201210/msg00101.html
So far, no feedback from other TC members.

I think that if we get extended attributes in mrk, the mapping then becomes just a matter of agreeing on names/namespaces, rather than trying to solve potentially impossible representations for some ITS data categories.

-- XML Schema subset of regex for allowed characters

I know Jirka pointed out a few solutions like the use of the source code of Xerces, etc. but IMO: a) that's either extra dependencies we don't really want, or quite a bit of work to adapt the code, and b) it's a fix only for Java.

I noticed that Linguaserve is also using Java and implements this data category.
So several of us are likely to face the same issue.
I'm curious to learn about how other implementers have coded this data category.

-- Allowed Characters regular expression for not allowing HTML tags in content nodes where only plain text content is allowed (also Allowed Characters: Find a better way to disallow HTML tags. Currently using: [^<>])

Allowed Characters is simply not meant to address this use case.
It prescribes what characters can be used, not what patterns can be used.
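
To make the distinction concrete, here is a small sketch of a character-level check. The helper name `disallowed_characters` is hypothetical, and Python's `re` module is used only for illustration: it is not the XML Schema regular expression dialect. With `[^<>]`, a lone `<` in plain text is flagged just like the brackets of a tag, because the check looks at characters one by one and knows nothing about tag patterns.

```python
import re


def disallowed_characters(text, allowed_class):
    """Return the characters of text that fall outside the allowed class.

    allowed_class is a single character class such as "[^<>]".
    """
    allowed = re.compile(allowed_class)
    return [ch for ch in text if not allowed.fullmatch(ch)]


print(disallowed_characters("plain text", "[^<>]"))
print(disallowed_characters("if a < b", "[^<>]"))
print(disallowed_characters("<b>bold</b>", "[^<>]"))
```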

-- Need to add all the document HTML tags (wrap the content with html, head, body tags) so we can add a link to a global rules XML

Not really any comments.
But it sounds like the HTML content is built from several fields.
Using XLIFF as output would have helped.

-- Troubles with namespaces in HTML5.

No comments.

-- Need to come to an agreement to map domain values to be consistent for both Lucy and DCU's MT Systems.

No comments.

-- Problems to use global rules with the provenance and quality metadata since ATLAS PW1 cannot place files on the client server.

No comments.

-- CDATA in XML for the content - How to handle Rules affecting content in CDATA from XML?

If the question is "How to apply ITS rules that are outside the CDATA section to CDATA content?" then the answer is that the content of the CDATA section is basically a text node. So a rule with a selector on that node behaves just like any rule on a text node.
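
A quick way to see this, using Python's standard `xml.etree.ElementTree` parser (the sample document and the helper name `text_and_child_count` are made up for the example): the markup inside the CDATA section comes back as plain text, and the element has no child elements.

```python
import xml.etree.ElementTree as ET

SAMPLE = "<doc><msg><![CDATA[Click <b>here</b>]]></msg></doc>"


def text_and_child_count(xml_source, path):
    """Return the text of the selected element and its number of children."""
    node = ET.fromstring(xml_source).find(path)
    return node.text, len(node)


text, children = text_and_child_count(SAMPLE, "msg")
print(repr(text))   # the CDATA content, seen as one text node
print(children)     # no <b> element exists at the XML level
```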

If the question is "How to run ITS rules that are located in the content of a CDATA section?" then it's a matter of treating that content as a sub-document that is XML or HTML5 and applying the normal ITS process to it.
There is nothing at the ITS level that can (or should) help with this. The tool has to know it needs to convert the content of the CDATA section into XML/HTML5 and work with that.
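
A minimal sketch of that conversion step, assuming the CDATA content happens to be well-formed XML markup (HTML5 content would need an HTML parser instead). The wrapper element `fragment` and the function name `parse_embedded_fragment` are hypothetical and chosen only for this example.

```python
import xml.etree.ElementTree as ET


def parse_embedded_fragment(outer_xml, path):
    """Re-parse the text of the selected element as an XML sub-document.

    The text is wrapped in a hypothetical <fragment> root so that mixed
    content (text plus inline elements) can be parsed as one tree.
    """
    node = ET.fromstring(outer_xml).find(path)
    return ET.fromstring("<fragment>" + (node.text or "") + "</fragment>")


outer = '<doc><msg><![CDATA[Click <b translate="no">here</b>]]></msg></doc>'
fragment = parse_embedded_fragment(outer, "msg")
# Local markup inside the former CDATA content is now visible to the tool.
print(fragment.find("b").get("translate"))
```

The tool would then run its usual ITS processing on the returned tree, and write the result back into the CDATA section if needed.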

Note that the recommended best practice for localization is to avoid CDATA (see http://www.w3.org/TR/xml-i18n-bp/#AuthCDATA), because it causes many problems downstream.

If you have to use CDATA, then you also have to use tools that have filters capable of doing sub-filtering (i.e. a filter that can use another filter to extract the CDATA parts). This is difficult to implement.

-- Language Information: Use-Case? When is xml:lang or lang not enough or can't be used?

No comments.

Cheers,
-yves

Received on Wednesday, 31 October 2012 03:02:27 UTC