1. Introduction to the problem

The moki intermediate document draft has several parts with informational messages that come from third-party tools.

If we want to build an intermediate document that is not coupled to specific third parties, it would be desirable to have our own message codes (or better, groups of them) and mappings between the third-party codes and ours. This way, anyone could easily replace one tool with another.

A brief example (considering grammar validity) would be:
A fragment in the moki document:

<error code="002">
   <!-- Specific tool messages -->
   <location type="line">30</location>
   <messages>Here would be specific tool message</messages>
   <location type="line">40</location>
   <messages>Here would be specific tool message</messages>
</error>

In another file we would have descriptions of the error codes:

<messages>
   <error code="002">
      <description>Brief description of what this code represents</description>
   </error>
</messages>

A mapping between third-party messages and our codes:

<messages>
   <tool>JHOVE</tool>
   <!-- mappings between our codes and tool codes -->
   <code id="002">
      <toolcodes>
         <code id="001"/>
         <code id="002"/>
         <code id="003"/>
      </toolcodes>
   </code>
</messages>

Note that in this way messages can be internationalized more easily.

Before we can make a decision on this, a deeper look inside the validation engines is necessary to analyse the feasibility and cost of this approach.

2. Analysis of validation engines

In this section we describe the tools that have currently been chosen. The analysis is centred on the kind of output these tools generate and whether they provide the information needed by the tests.

Tag Soup [http://ccil.org/~cowan/XML/tagsoup/]

This tool's objective is to clean the markup that will be the input for the checker. In the tests we have done, we only got a tidied file back, but nothing like a report of the actions taken. So we looked into the source code and found that this tool uses SAX internally to parse the document, but in an ad hoc way: when it catches something wrong, it repairs it without any report of the fix made.
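
As an illustration of this behaviour, TagSoup is exposed as a plain SAX XMLReader. A minimal sketch of one way to drive it (the file names are only illustrative, and the identity transform is just one possible way to serialize the result):

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class TagSoupTidy {
    public static void main(String[] args) throws Exception {
        // TagSoup behaves as a SAX XMLReader; an identity transform
        // serializes the repaired document.
        XMLReader tagsoup = new org.ccil.cowan.tagsoup.Parser();
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.transform(new SAXSource(tagsoup, new InputSource("dirty.html")),
                           new StreamResult("tidied.xhtml"));
        // The only output is the repaired document: no report of the fixes is produced.
    }
}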

Problems
Therefore, using Tag Soup we cannot report anything in the moki document about the actions it takes.

Other possible tidy tools, proposed in Sean's mail [http://lists.w3.org/Archives/Public/public-mobileok-checker/2007Mar/0046.html], are:

[TODO: These tools are pending analysis regarding message management]

JHOVE [http://hul.harvard.edu/jhove/]

This tool has been selected for the validation of images and of the XHTML code. JHOVE has several modules to validate the input. Although at first sight it seems that it does not validate the Basic profile [http://hul.harvard.edu/jhove/index.html], a deeper look inside the source code reveals the opposite.
We describe the JPEG/GIF modules and the XHTML module separately.

JHOVE Image Modules

JHOVE has modules for both GIF and JPEG images. It is possible to validate the formats against the specifications imposed by mobileOK Basic:
JPEG [http://hul.harvard.edu/jhove/references.html#t.81]
GIF [http://hul.harvard.edu/jhove/references.html#gif89a]

Problems
The output provided by these modules does not include any kind of error identification, and the messages are embedded in the source code (so internationalization is not possible).
For example, this is an error reported by the GifModule:

info.setMessage(new ErrorMessage("End of file reached without encountering Trailer block",_nByte) );

Another problem is the impossibility of checking whether all pixels of an image are transparent. JHOVE only detects whether the alpha channel (transparency) is used; it does not check whether every pixel is transparent.

Possible Solutions
Other image tools
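
For the transparency problem in particular, the low-level javax.imageio package (also listed in the comparative table at the end of this document) could be used to inspect every pixel. A minimal sketch, not part of any of the analysed tools and only meant to show the idea:

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public class TransparencyCheck {

    // Hypothetical helper: returns true only if every pixel is fully transparent.
    public static boolean allPixelsTransparent(File imageFile) throws IOException {
        BufferedImage image = ImageIO.read(imageFile);
        if (image == null || !image.getColorModel().hasAlpha()) {
            return false;                       // unreadable image, or no alpha channel at all
        }
        for (int y = 0; y < image.getHeight(); y++) {
            for (int x = 0; x < image.getWidth(); x++) {
                int alpha = (image.getRGB(x, y) >>> 24) & 0xFF;
                if (alpha != 0) {
                    return false;               // found a visible pixel
                }
            }
        }
        return true;
    }
}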

JHOVE XHTML Module

The validation of the XHTML grammar is done by the JHOVE XML module. This module uses the SAX interface internally. SAX performs the validation using the declared DTD and reports the messages. The problem is that, although SAX uses message codes (as small strings) internally, the API only exposes the long message strings without any code. For the following internal message, we only get the error string:

XMLLangInvalid=The xml\:lang attribute value "{0}" is an invalid language identifier.
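
To illustrate the point: a SAX ErrorHandler only receives a SAXParseException, which exposes the message text and the location but no code we could map to our own codes. A minimal sketch (the class name is hypothetical):

import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

// Sketch of a SAX error handler: only the long message string and the
// location are available, never an error code.
public class MokiXhtmlErrorHandler implements ErrorHandler {

    public void warning(SAXParseException e)    { report("warning", e); }
    public void error(SAXParseException e)      { report("error", e); }
    public void fatalError(SAXParseException e) { report("fatal", e); }

    private void report(String severity, SAXParseException e) {
        System.out.println(severity + " at line " + e.getLineNumber()
                + ": " + e.getMessage());
    }
}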

The JHOVE XHTML module includes some common DTDs (XHTML/HTML) as resources, but the XHTML Basic/Mobile Profile DTDs are not included. For performance reasons (to avoid the overhead of network connections and so on) it would be desirable to include them as resources.

Possible Solutions
Other tools

JXCSS

JXCSS is a SAX parser adapter for SAC parsing, so it shares the features (and flaws) of SAC parsing. JXCSS is a library for writing a CSS document in XML format. It does not do any processing of the CSS grammar.

SAC

SAC is an event-driven API (like SAX) which provides access to the different tokens of CSS. A SAC parser accepts two different handlers: a DocumentHandler and an ErrorHandler. The DocumentHandler basically registers selectors, properties, at-rules and other events like the start of the document.

SAC is a low-level API: it just provides access to the different tokens, and in our code we must check that properties have the expected values. For example, looking for absolute font-size values:

public void property(String property, LexicalUnit value, boolean important) throws CSSException {
    if ( property.equalsIgnoreCase("font-size") ) {
        if ( absoluteFontSize(value.getLexicalUnitType()) ) {
            // Do something
        }
    }
}

private boolean absoluteFontSize(short lexicalUnitType) {
    switch ( lexicalUnitType ) {
        case LexicalUnit.SAC_PIXEL:
        case LexicalUnit.SAC_INCH:
        case LexicalUnit.SAC_CENTIMETER:
        case LexicalUnit.SAC_MILLIMETER:
        case LexicalUnit.SAC_POINT:
        case LexicalUnit.SAC_PICA:
            return true;
        default:
            return false;
    }
}


Some CSS properties have a shorthand form (font-size can also be defined through the font property); in that case we must skip the values we are not interested in. (We are not sure at this point whether we will have to deal with shorthand properties, but just in case.)

    public void property(String property, LexicalUnit value, boolean important) throws CSSException {
        if ( property.equalsIgnoreCase("font") ) {
            // Skip the shorthand values that come before font-size
            while ( value != null && !isFontSizeValue(value) )
                value = value.getNextLexicalUnit();
            if ( value != null && absoluteFontSize(value.getLexicalUnitType()) ) {
                // Do something
            }
        }
    }

    private boolean isFontSizeValue(LexicalUnit lu) {
        // font: font-style font-variant font-weight font-size/line-height font-family ...
        switch ( lu.getLexicalUnitType() ) {
            case LexicalUnit.SAC_IDENT:
                String value = lu.getStringValue().toLowerCase();
                return value.equals("xx-small") || value.equals("x-small") || value.equals("small") ||
                       value.equals("xx-large") || value.equals("x-large") || value.equals("large") ||
                       value.equals("medium")   || value.equals("smaller") || value.equals("larger");
            case LexicalUnit.SAC_PIXEL:
            case LexicalUnit.SAC_INCH:
            case LexicalUnit.SAC_CENTIMETER:
            case LexicalUnit.SAC_MILLIMETER:
            case LexicalUnit.SAC_POINT:
            case LexicalUnit.SAC_PICA:
            case LexicalUnit.SAC_EM:
            case LexicalUnit.SAC_EX:
            case LexicalUnit.SAC_PERCENTAGE:
                return true;
            default:
                return false;
        }
    }

The strong point of the SAC library is its speed: it is really fast. On the other hand, SAC does not perform grammar validation; it only reports lexical errors (such as unclosed brackets). For example, a well-formed but grammatically invalid CSS chunk would be: body { non-existent-property: nonExistentValue }

Error messages are handled by the ErrorHandler class and split into three categories: warning, error and fatal. Each category is reported through its own method and carries the message string but no error code. Error messages can be localized with the setLocale method, so we could at least get the error message in a locale-dependent manner.

SAC is just an API and there are several implementations; the two best-known are probably Flute (from W3C) and Batik (from Apache). Unfortunately, the Flute library does not implement the setLocale method yet, so only the Batik implementation remains as a choice. Batik provides internationalization through properties files, so we would need to translate them and set the locale.
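
A minimal sketch of how the handlers and the locale could be wired together with the Batik implementation (the handler body is only illustrative):

import java.util.Locale;
import org.w3c.css.sac.CSSParseException;
import org.w3c.css.sac.ErrorHandler;
import org.w3c.css.sac.InputSource;
import org.w3c.css.sac.Parser;

public class CssParsing {
    public static void parse(String uri) throws Exception {
        // Batik implementation of the SAC Parser interface
        Parser parser = new org.apache.batik.css.parser.Parser();
        parser.setLocale(new Locale("es"));   // localized messages, taken from properties files
        parser.setErrorHandler(new ErrorHandler() {
            public void warning(CSSParseException e)    { report("warning", e); }
            public void error(CSSParseException e)      { report("error", e); }
            public void fatalError(CSSParseException e) { report("fatal", e); }
            private void report(String severity, CSSParseException e) {
                // Only the message string is available, no error code
                System.out.println(severity + " at line " + e.getLineNumber()
                        + ": " + e.getMessage());
            }
        });
        // A DocumentHandler like the one in the previous examples would be set here as well
        parser.parseStyleSheet(new InputSource(uri));
    }
}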

CSS-Validator

CSS-Validator is a high-level API which performs grammar checking against different CSS profiles. After a style sheet is parsed, there is a method to get all the selectors, and for each selector you can get its properties. The first example (font-size) with this library would look something like:

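    // 'css' is assumed to be the parser object (e.g. a StyleSheetParser) that has already parsed the style sheet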
    org.w3c.css.css.StyleSheet ss = css.getStyleSheet();
    org.w3c.css.parser.CssStyle style;
    org.w3c.css.properties.css1.CssFontSizeCSS2 fontSize;

    java.util.Enumeration e = ss.getRules().keys();
    while ( e.hasMoreElements() )    {
        style = ss.getStyle( (org.w3c.css.parser.CssSelectors)e.nextElement() );
        Css1Style css1 = (Css1Style) style;
        fontSize = css1.getFontSizeCSS2();
        if ( fontSize!=null && fontSize.isByUser() )
        {
            if ( fontSize.get() instanceof org.w3c.css.values.CssLength )
            {
                org.w3c.css.values.CssLength cssLength = (org.w3c.css.values.CssLength)fontSize.get();               
                if ( !cssLength.getUnit().equalsIgnoreCase("em") && !cssLength.getUnit().equalsIgnoreCase("ex") )
                {
                    // Do something
                }
            }
        }
    }

With this library we also get the font sizes defined by the shorthand font property, so no extra code is required to handle them. Furthermore, when you parse a CSS file with this library it transparently adds any imported style sheets, so at the end you get all the styles.

Errors are handled through exceptions and only the message string is provided. As with the SAC parser, css-validator can localize its error messages (this time through an ApplContext object), so it can provide messages in different languages. Error messages are taken from properties files and there are at least 8 translations.

Whichever CSS tool (or combination of tools) we finally use, none of them uses message codes. We would have to modify their source code.

3. Conclusions

Before reaching our conclusions, we summarize all the information in the following table.

Comparative table of message code and properties file usage among possible mobile checker third-party tools

Library                     | Message code     | Properties file    | Notes
Tag Soup                    | -                | -                  | Does not provide any kind of message
JHOVE Image Module          | no               | no                 | Validates the specific mobileOK Basic formats
Package javax.imageio       | no               | no                 | Low-level API useful for checking transparency. Any error will be wrapped by mobile checker code.
JHOVE XHTML Module          | no               | no                 | Uses SAX as validation engine
SAX                         | internal         | no                 | SAX parser messages can be localized (setLocale). Implementation dependent?
W3C Markup Validator        | yes (see OpenSP) | yes* (see OpenSP)  | A wrapper library written in Perl for OpenSP
OpenSP                      | yes              | yes*               | The properties are loaded into the code during the build process. This library could be useful if we build a JNI binding.
W3C SOAP Validation Service | internal         | yes*               | SOAP entry point to the W3C Markup Validator
SAC (Batik)                 | internal         | yes                | Useful for searching CSS properties
CSS-Validator               | internal         | yes                | Useful for validating the CSS grammar
JXCSS                       | no               | no                 | Useful for the representation of CSS in XML

None of these tools has message management that satisfies the needs introduced at the beginning of this document. So far we see two possible solutions:

  1. Modify the tools so that they provide message codes (and internationalization and so on; the next option is implied by this one).
  2. Modify the tools to make them internationalizable (using properties files), making it possible to select a language for moki (e.g. we could have a Spanish moki with third-party messages in Spanish, and so on).

Perhaps the best solution is a combination of both. (Note that in some cases, such as Tag Soup, it is not possible to include any kind of messages.)

As message handling across the tools is very heterogeneous, we think a reasonable solution would be to treat each tool separately: not looking for the best solution in all cases, but for a balance between development agility and quality.