Re: tidy C interface question about XHTML Strict from Arnaud Desitter on 2008-11-03 (html-tidy@w3.org from October to December 2008)

From: Arnaud Desitter <arnaud02@users.sourceforge.net>
Date: Mon, 3 Nov 2008 21:15:32 +0000
To: "Eugeny N Dzhurinsky" <bofh@redwerk.com>
Cc: html-tidy@w3.org
Message-ID: <a240ddd00811031315w5e223207x9edd5e85312dcdbd@mail.gmail.com>

Hi,

If I remember correctly, "doctype: strict"
(http://tidy.sourceforge.net/docs/quickref.html#doctype) sets
the DTD to strict but does not remove non-conforming attributes.

Could you provide an HTML example along with the tidy command line you
are using?

Lastly,
the "TidyBuffer output = {0};" initialisation should be replaced by a call to
"tidyBufInit( &output );"

Regards,

2008/11/3 Eugeny N Dzhurinsky <bofh@redwerk.com>:
> Hello, all!
>
> I want to use tidy as a DLL in my Java application. I wrote JNI bindings for
> tidy and everything works fine except for one issue: the converted documents,
> which are supposed to be XHTML 1.0 Strict, do not pass validation on w3.org.
>
> For example, if I have the tag <a href=".." target="_blank">, the target
> attribute is not removed.
>
> Could you please take a look at the code below and tell me what I am
> missing?
>
> #include "TidyWrapper.h"
> #include <jni.h>
> #include <string.h>   /* strlen */
> #include <tidy.h>
> #include <buffio.h>
>
> #define DEBUG 1
> #define DUMP_DOCUMENT 1
>
> JNIEXPORT jbyteArray JNICALL Java_xhtml_tidy_jni_TidyWrapper_convert
> (JNIEnv *env, jobject self, jobject options, jbyteArray data) {
>
>    TidyBuffer output = {0};
>    TidyBuffer errbuf = {0};
>    Bool ok;
>    jboolean isCopy = 0;
>    const char* input = (*env)->GetByteArrayElements(env, data, &isCopy);
>    jfieldID fid;
>    jclass cls;
>
>
>    // get instance of class used for options
>    cls = (*env)->GetObjectClass(env, options);
>
>    TidyDoc tdoc = tidyCreate();
>
>    tidySetErrorBuffer( tdoc, &errbuf );
>
>    tidyOptSetBool( tdoc, TidyXhtmlOut, yes );
>    tidyOptSetValue(tdoc, TidyEncoding, "UTF-8");
>    tidyOptSetValue(tdoc, TidyCharEncoding, "UTF-8");
>
>    tidyOptSetValue(tdoc, TidyDoctype, "strict");
>
>    // parse document
>    if (DEBUG)
>        write(2,"Parsing document\n",17);
>
>    if (DUMP_DOCUMENT) {
>        write(2,"Original document\n",18);
>        write(2,input,strlen(input));
>    }
>    tidyParseString( tdoc, input );
>
>    // convert document
>    tidyCleanAndRepair( tdoc );
>    tidyRunDiagnostics( tdoc );
>    tidyOptSetBool(tdoc, TidyForceOutput, yes);
>    tidySaveBuffer( tdoc, &output );
>    if (DUMP_DOCUMENT) {
>        write(2,"Converted document\n",19);
>        write(2,output.bp,output.size);
>    }
>
>    // create an array used to store the results
>    jbyteArray results = (*env)->NewByteArray(env,output.size);
>
>    /// copy data from resulting buffer to Java array
>    (*env)->SetByteArrayRegion(env,results,0,output.size,output.bp);
>
>    tidyBufFree( &errbuf );
>    tidyBufFree( &output );
>    tidyRelease( tdoc );
>
>    return results;
> }
>
> While the document gets the XHTML Strict doctype, it still contains
> attributes which do not conform to the XHTML Strict specification. I can see
> this in the debug output for the converted document right after it is
> processed by Tidy, so it is not related to the Java code that calls the
> native routine.
>
> With the tidy command-line utility, HTML documents are converted just fine.
> I looked at the sources and did not find any settings other than
> TidyXhtmlOut being set by the -asxhtml command-line option.
>
> Thank you in advance!
>
> --
> Best regards
> Eugene Dzhurinsky
>

Received on Tuesday, 4 November 2008 09:21:41 UTC