1. Introduction to the problem

The moki intermediate document draft has several parts with informational messages that come from third-party tools.

If we want to build an intermediate document that is not coupled to specific third parties, it would be desirable to have our own message codes (or better, groups of them) and mappings between the third-party codes and ours. This way, anyone could easily replace one tool with another.

A brief example (considering grammar validity) would be:
A fragment in the moki document:

<error code="002">
   <!-- Specific tool messages -->
   <location type="line">30</location>
   <messages>Here would be specific tool message</messages>
   <location type="line">40</location>
   <messages>Here would be specific tool message</messages>
</error>

In another file we would have descriptions of the error codes:

<messages>
   <error code="002">
      <description>Brief description of what this code represents</description>
   </error>
</messages>

A mapping between third-party messages and our codes:

<messages>
   <tool>JHOVE</tool>
   <!-- mappings between our codes and tool codes -->
   <code id="002">
      <toolcodes>
         <code id="001"/>
         <code id="002"/>
         <code id="003"/>
      </toolcodes>
   </code>
</messages>

Note that in this way messages can be internationalized more easily.

Before we can make a decision on this, a deeper look inside the validation engines is necessary to analyse the feasibility and cost of this approach.

2. Analysis of validation engines

In this section we describe the tools that have currently been chosen. The analysis is centred on the kind of output these tools generate and whether they provide the information needed by the tests.

Tag Soup [http://ccil.org/~cowan/XML/tagsoup/]

This tool's objective is to clean the markup that will be the input for the checker. In the tests we have done, we only got a tidied file back, but nothing like a report of the actions taken. So we looked into the source code and found that this tool uses SAX internally to parse the document, but in an ad hoc way: when it catches something wrong, it repairs it without any report of the fix made.
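
As an illustration of this behaviour, TagSoup is exposed as a plain SAX XMLReader. A minimal sketch of one way to drive it (the file names are only illustrative, and the identity transform is just one possible way to serialize the result):

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class TagSoupTidy {
    public static void main(String[] args) throws Exception {
        // TagSoup behaves as a SAX XMLReader; an identity transform
        // serializes the repaired document.
        XMLReader tagsoup = new org.ccil.cowan.tagsoup.Parser();
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.transform(new SAXSource(tagsoup, new InputSource("dirty.html")),
                           new StreamResult("tidied.xhtml"));
        // The only output is the repaired document: no report of the fixes is produced.
    }
}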

Problems
Therefore, using Tag Soup we cannot report anything in the moki document about the actions it takes.

Other possible tidy tools, proposed in Sean's mail [http://lists.w3.org/Archives/Public/public-mobileok-checker/2007Mar/0046.html], are:

[TODO: These tools are pending analysis regarding message management]

JHOVE [http://hul.harvard.edu/jhove/]

This tool has been selected for the validation of images and of the XHTML code. JHOVE has several modules to validate the input. Although at first sight it seems that it does not validate the Basic profile [http://hul.harvard.edu/jhove/index.html], a deeper look inside the source code reveals the opposite.
We describe the JPEG/GIF modules and the XHTML module separately.

JHOVE Image Modules

JHOVE has modules for both GIF and JPEG images. It is possible to validate the formats against the specifications imposed by mobileOK Basic:
JPEG [http://hul.harvard.edu/jhove/references.html#t.81]
GIF [http://hul.harvard.edu/jhove/references.html#gif89a]

Problems
The output provided by these modules does not include any kind of error identification, and the messages are embedded in the source code (so internationalization is not possible).
For example, this is an error reported by the GifModule:

info.setMessage(new ErrorMessage("End of file reached without encountering Trailer block",_nByte) );

Another problem is the impossibility of checking whether all pixels of an image are transparent. JHOVE only detects whether the alpha channel (transparency) is used; it does not check whether every pixel is transparent.

Possible Solutions
Other image tools
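
For the transparency problem in particular, the low-level javax.imageio package (also listed in the comparative table at the end of this document) could be used to inspect every pixel. A minimal sketch, not part of any of the analysed tools and only meant to show the idea:

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public class TransparencyCheck {

    // Hypothetical helper: returns true only if every pixel is fully transparent.
    public static boolean allPixelsTransparent(File imageFile) throws IOException {
        BufferedImage image = ImageIO.read(imageFile);
        if (image == null || !image.getColorModel().hasAlpha()) {
            return false;                       // unreadable image, or no alpha channel at all
        }
        for (int y = 0; y < image.getHeight(); y++) {
            for (int x = 0; x < image.getWidth(); x++) {
                int alpha = (image.getRGB(x, y) >>> 24) & 0xFF;
                if (alpha != 0) {
                    return false;               // found a visible pixel
                }
            }
        }
        return true;
    }
}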

JHOVE XHTML Module

The validation of the XHTML grammar is done by the JHOVE XML module. This module uses the SAX interface internally. SAX performs the validation using the declared DTD and reports the messages. The problem is that, although SAX uses message codes (as small strings) internally, the API only exposes the long message strings without any code. For the following internal message, we only get the error string:

XMLLangInvalid=The xml\:lang attribute value "{0}" is an invalid language identifier.
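
To illustrate the point: a SAX ErrorHandler only receives a SAXParseException, which exposes the message text and the location but no code we could map to our own codes. A minimal sketch (the class name is hypothetical):

import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

// Sketch of a SAX error handler: only the long message string and the
// location are available, never an error code.
public class MokiXhtmlErrorHandler implements ErrorHandler {

    public void warning(SAXParseException e)    { report("warning", e); }
    public void error(SAXParseException e)      { report("error", e); }
    public void fatalError(SAXParseException e) { report("fatal", e); }

    private void report(String severity, SAXParseException e) {
        System.out.println(severity + " at line " + e.getLineNumber()
                + ": " + e.getMessage());
    }
}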

The JHOVE XHTML module includes some common DTDs (XHTML/HTML) as resources, but the XHTML Basic/Mobile Profile DTDs are not included. For performance reasons (to avoid the overhead of network connections and so on) it would be desirable to include them as resources.

Possible Solutions
Other tools

JXCSS

JXCSS is a SAX parser adapter for SAC parsing, so it shares the features (and flaws) of SAC parsing. JXCSS is a library for writing a CSS document in XML format. It does not do any processing of the CSS grammar.

SAC

SAC is an event-driven API (like SAX) which provides access to the different tokens of CSS. A SAC parser accepts two different handlers: a DocumentHandler and an ErrorHandler. The DocumentHandler basically registers selectors, properties, at-rules and other events like the start of the document.

SAC is a low-level API: it just provides access to the different tokens, and in our code we must check that properties have the expected values. For example, looking for absolute font-size values:

public void property(String property, LexicalUnit value, boolean important) throws CSSException {
    if ( property.equalsIgnoreCase("font-size") ) {
        if ( absoluteFontSize(value.getLexicalUnitType()) ) {
            // Do something
        }
    }
}

private boolean absoluteFontSize(short lexicalUnitType) {
    switch ( lexicalUnitType ) {
        case LexicalUnit.SAC_PIXEL:
        case LexicalUnit.SAC_INCH:
        case LexicalUnit.SAC_CENTIMETER:
        case LexicalUnit.SAC_MILLIMETER:
        case LexicalUnit.SAC_POINT:
        case LexicalUnit.SAC_PICA:
            return true;
        default:
            return false;
    }
}


Some CSS properties have a shorthand form (font-size can also be defined through the font property); in that case we must skip the values we are not interested in. (We are not sure at this point whether we will have to deal with shorthand properties, but just in case.)

    public void property(String property, LexicalUnit value, boolean important) throws CSSException {
        if ( property.equalsIgnoreCase("font") ) {
            // Skip the shorthand values that come before font-size
            while ( value != null && !isFontSizeValue(value) )
                value = value.getNextLexicalUnit();
            if ( value != null && absoluteFontSize(value.getLexicalUnitType()) ) {
                // Do something
            }
        }
    }

    private boolean isFontSizeValue(LexicalUnit lu) {
        // font: font-style font-variant font-weight font-size/line-height font-family ...
        switch ( lu.getLexicalUnitType() ) {
            case LexicalUnit.SAC_IDENT:
                String value = lu.getStringValue().toLowerCase();
                return value.equals("xx-small") || value.equals("x-small") || value.equals("small") ||
                       value.equals("xx-large") || value.equals("x-large") || value.equals("large") ||
                       value.equals("medium")   || value.equals("smaller") || value.equals("larger");
            case LexicalUnit.SAC_PIXEL:
            case LexicalUnit.SAC_INCH:
            case LexicalUnit.SAC_CENTIMETER:
            case LexicalUnit.SAC_MILLIMETER:
            case LexicalUnit.SAC_POINT:
            case LexicalUnit.SAC_PICA:
            case LexicalUnit.SAC_EM:
            case LexicalUnit.SAC_EX:
            case LexicalUnit.SAC_PERCENTAGE:
                return true;
            default:
                return false;
        }
    }

The strong point of the SAC library is its speed: it is really fast. On the other hand, SAC does not perform grammar validation; it only reports lexical errors (such as unclosed brackets). For example, a well-formed but grammatically invalid CSS chunk would be: body { non-existent-property: nonExistentValue }

Error messages are handled by the ErrorHandler class and split into three categories: warning, error and fatal. Each category is reported through its own method and carries the message string but no error code. Error messages can be localized with the setLocale method, so we could at least get the error message in a locale-dependent manner.

SAC is just an API and there are several implementations; the two best-known are probably Flute (from W3C) and Batik (from Apache). Unfortunately, the Flute library does not implement the setLocale method yet, so only the Batik implementation remains as a choice. Batik provides internationalization through properties files, so we would need to translate them and set the locale.
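
A minimal sketch of how the handlers and the locale could be wired together with the Batik implementation (the handler body is only illustrative):

import java.util.Locale;
import org.w3c.css.sac.CSSParseException;
import org.w3c.css.sac.ErrorHandler;
import org.w3c.css.sac.InputSource;
import org.w3c.css.sac.Parser;

public class CssParsing {
    public static void parse(String uri) throws Exception {
        // Batik implementation of the SAC Parser interface
        Parser parser = new org.apache.batik.css.parser.Parser();
        parser.setLocale(new Locale("es"));   // localized messages, taken from properties files
        parser.setErrorHandler(new ErrorHandler() {
            public void warning(CSSParseException e)    { report("warning", e); }
            public void error(CSSParseException e)      { report("error", e); }
            public void fatalError(CSSParseException e) { report("fatal", e); }
            private void report(String severity, CSSParseException e) {
                // Only the message string is available, no error code
                System.out.println(severity + " at line " + e.getLineNumber()
                        + ": " + e.getMessage());
            }
        });
        // A DocumentHandler like the one in the previous examples would be set here as well
        parser.parseStyleSheet(new InputSource(uri));
    }
}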

CSS-Validator

CSS-Validator is a high-level API which performs grammar checking against different CSS profiles. After a style sheet is parsed, there is a method to get all the selectors, and for each selector you can get its properties. The first example (font-size) with this library would look something like:

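    // 'css' is assumed to be the parser object (e.g. a StyleSheetParser) that has already parsed the style sheet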
    org.w3c.css.css.StyleSheet ss = css.getStyleSheet();
    org.w3c.css.parser.CssStyle style;
    org.w3c.css.properties.css1.CssFontSizeCSS2 fontSize;

    java.util.Enumeration e = ss.getRules().keys();
    while ( e.hasMoreElements() )    {
        style = ss.getStyle( (org.w3c.css.parser.CssSelectors)e.nextElement() );
        Css1Style css1 = (Css1Style) style;
        fontSize = css1.getFontSizeCSS2();
        if ( fontSize!=null && fontSize.isByUser() )
        {
            if ( fontSize.get() instanceof org.w3c.css.values.CssLength )
            {
                org.w3c.css.values.CssLength cssLength = (org.w3c.css.values.CssLength)fontSize.get();               
                if ( !cssLength.getUnit().equalsIgnoreCase("em") && !cssLength.getUnit().equalsIgnoreCase("ex") )
                {
                    // Do something
                }
            }
        }
    }

With this library we also get the font sizes defined by the shorthand font property, so no extra code is required to handle them. Furthermore, when you parse a CSS file with this library it transparently adds any imported style sheets, so at the end you get all the styles.

Errors are handled through exceptions and only the message string is provided. As with the SAC parser, css-validator can localize its error messages (this time through an ApplContext object), so it can provide messages in different languages. Error messages are taken from properties files and there are at least 8 translations.

Whichever CSS tool (or combination of tools) we finally use, none of them uses message codes. We would have to modify their source code.

3. Conclusions

Before reaching our conclusions, we summarize all the information in the following table.

Comparative table of message code and properties file usage among possible mobile checker third-party tools

Library                     | Message code     | Properties file    | Notes
Tag Soup                    | -                | -                  | Does not provide any kind of message
JHOVE Image Module          | no               | no                 | Validates the specific mobileOK Basic formats
Package javax.imageio       | no               | no                 | Low-level API useful for checking transparency. Any error will be wrapped by mobile checker code.
JHOVE XHTML Module          | no               | no                 | Uses SAX as validation engine
SAX                         | internal         | no                 | SAX parser messages can be localized (setLocale). Implementation dependent?
W3C Markup Validator        | yes (see OpenSP) | yes* (see OpenSP)  | A wrapper library written in Perl for OpenSP
OpenSP                      | yes              | yes*               | The properties are loaded into the code during the build process. This library could be useful if we build a JNI binding.
W3C SOAP Validation Service | internal         | yes*               | SOAP entry point to the W3C Markup Validator
SAC (Batik)                 | internal         | yes                | Useful for searching CSS properties
CSS-Validator               | internal         | yes                | Useful for validating the CSS grammar
JXCSS                       | no               | no                 | Useful for the representation of CSS in XML

None of these tools has message management that satisfies the needs introduced at the beginning of this document. So far we see two possible solutions:

  1. Modify the tools so that they provide message codes (and internationalization and so on; the next option is implied by this one).
  2. Modify the tools to make them internationalizable (using properties files), making it possible to select a language for moki (e.g. we could have a Spanish moki with third-party messages in Spanish, and so on).

Perhaps the best solution is a combination of both. (Note that in some cases, such as Tag Soup, it is not possible to include any kind of messages.)

As message handling across the tools is very heterogeneous, we think a reasonable solution would be to treat each tool separately: not looking for the best solution in all cases, but for a balance between development agility and quality.