tidy C interface question about XHTML Strict from Eugeny N Dzhurinsky on 2008-11-03 (html-tidy@w3.org from October to December 2008)

From: Eugeny N Dzhurinsky <bofh@redwerk.com>
Date: Mon, 3 Nov 2008 19:50:20 +0200
To: html-tidy@w3.org
Message-ID: <20081103175020.GC78786@office.redwerk.com>

Hello, all!

I want to use tidy as a DLL in my Java application. I wrote JNI bindings for
tidy and everything works fine except for one issue: the converted documents,
which are supposed to be XHTML 1.0 Strict, do not pass validation on w3.org.

For example, if I have the tag <a href=".." target="_blank">, the target
attribute is not removed.

Could you please take a look at the code below and tell me what I am missing?

#include "TidyWrapper.h"
#include <jni.h>
#include <string.h>   // strlen
#include <unistd.h>   // write
#include <tidy.h>
#include <buffio.h>

#define DEBUG 1
#define DUMP_DOCUMENT 1

JNIEXPORT jbyteArray JNICALL Java_xhtml_tidy_jni_TidyWrapper_convert
(JNIEnv *env, jobject self, jobject options, jbyteArray data) {

    TidyBuffer output = {0};
    TidyBuffer errbuf = {0};
    Bool ok;
    jboolean isCopy = 0;
    // GetByteArrayElements returns jbyte*, so cast explicitly
    const char* input = (const char*) (*env)->GetByteArrayElements(env, data, &isCopy);
    jfieldID fid;
    jclass cls;


    // get instance of class used for options
    cls = (*env)->GetObjectClass(env, options);

    TidyDoc tdoc = tidyCreate();

    tidySetErrorBuffer( tdoc, &errbuf );

    tidyOptSetBool( tdoc, TidyXhtmlOut, yes );
    tidyOptSetValue(tdoc, TidyEncoding, "UTF-8");
    tidyOptSetValue(tdoc, TidyCharEncoding, "UTF-8");

    tidyOptSetValue(tdoc, TidyDoctype, "strict");

    // parse document
    if (DEBUG)
        write(2,"Parsing document\n",17);
   
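    if (DUMP_DOCUMENT) {
        write(2,"Original document\n",18);
        // the Java byte array is not NUL-terminated, so use its real length
        write(2,input,(*env)->GetArrayLength(env, data));
    }
    // parse from a sized buffer instead of relying on a terminating NUL
    TidyBuffer inbuf = {0};
    tidyBufAttach( &inbuf, (byte*) input, (*env)->GetArrayLength(env, data) );
    tidyParseBuffer( tdoc, &inbuf );
    tidyBufDetach( &inbuf );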
    
    // convert document
    tidyCleanAndRepair( tdoc );
    tidyRunDiagnostics( tdoc );
    tidyOptSetBool(tdoc, TidyForceOutput, yes);
    tidySaveBuffer( tdoc, &output );
    if (DUMP_DOCUMENT) {
        write(2,"Converted document\n",19);
        write(2,output.bp,output.size);
    }

    // create an array used to store the results
    jbyteArray results = (*env)->NewByteArray(env,output.size);

    // copy data from resulting buffer to Java array
    (*env)->SetByteArrayRegion(env,results,0,output.size,(const jbyte*) output.bp);

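    // release the input array; JNI_ABORT skips the copy-back, as it was not modified
    (*env)->ReleaseByteArrayElements(env, data, (jbyte*) input, JNI_ABORT);
    tidyBufFree( &errbuf );
    tidyBufFree( &output );
    tidyRelease( tdoc );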
    
    return results;
}
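
A side note on the input handling in the code above: a Java byte[] obtained
through GetByteArrayElements carries an explicit length and no terminating NUL,
so any routine that expects a C string (strlen, tidyParseString) needs a
terminated copy. The following is only a minimal sketch of one way to make such
a copy; bytes_to_cstring is a hypothetical helper name, not part of tidy or
JNI, and it assumes jbyte is a signed char, as jni.h defines it on common
platforms.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of tidy or JNI): copy `len` raw bytes into a
 * freshly allocated, NUL-terminated string. Returns NULL if allocation fails.
 * The caller owns the result and must free() it. */
static char *bytes_to_cstring(const signed char *bytes, size_t len) {
    char *copy;

    /* a NULL source is only acceptable for an empty array */
    assert(bytes != NULL || len == 0);

    copy = malloc(len + 1);          /* one extra byte for the terminator */
    if (copy == NULL)
        return NULL;
    if (len > 0)
        memcpy(copy, bytes, len);
    copy[len] = '\0';
    return copy;
}
```

In the JNI function this would presumably be called with the pointer returned
by GetByteArrayElements and the length from GetArrayLength, and the copy freed
once tidy has parsed it.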

While the document gets the XHTML Strict doctype, it still contains attributes
that do not conform to the XHTML Strict specification. I can see this in the
debug output for the converted document right after Tidy has processed it, so
it is not related to the Java code that calls the native routine.

With the tidy command-line utility, HTML documents are converted just fine. I
looked at the sources and did not find any setting other than TidyXhtmlOut
being set for the -asxhtml command-line option.

Thank you in advance!

-- 
Best regards
Eugene Dzhurinsky

Received on Monday, 3 November 2008 18:24:51 UTC