- From: Arnaud Desitter <arnaud02@users.sourceforge.net>
- Date: Mon, 3 Nov 2008 21:15:32 +0000
- To: "Eugeny N Dzhurinsky" <bofh@redwerk.com>
- Cc: html-tidy@w3.org
Hi,
If I remember correctly, "doctype: strict"
(http://tidy.sourceforge.net/docs/quickref.html#doctype) sets
the DTD to strict but does not remove non-conforming attributes.
Could you provide an HTML example along with the tidy command line you
are using?
Lastly,
"output = {0};" should be replaced by "tidyBufInit( &output );"
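
One more thing that may be worth checking in the quoted code: the buffer
returned by GetByteArrayElements is not guaranteed to be NUL-terminated,
while strlen() and tidyParseString() both expect a C string. Below is a
minimal sketch of copying a length-delimited byte buffer into a
NUL-terminated one. It is plain C with no JNI or Tidy dependency, and
to_cstring is a hypothetical helper name, not part of either API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of Tidy or JNI): copy `len` raw bytes into a
 * freshly allocated, NUL-terminated buffer. Returns NULL if allocation fails.
 * The caller owns the result and must free() it. */
char *to_cstring(const signed char *bytes, size_t len)
{
    /* A NULL source is only acceptable for an empty buffer. */
    assert(bytes != NULL || len == 0);

    char *copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;

    if (len > 0)
        memcpy(copy, bytes, len);
    copy[len] = '\0';   /* terminator that strlen() and string parsers rely on */
    return copy;
}
```

In the JNI function the length would come from
(*env)->GetArrayLength(env, data), and the original elements would be handed
back with (*env)->ReleaseByteArrayElements once the copy has been made.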
Regards,
2008/11/3 Eugeny N Dzhurinsky <bofh@redwerk.com>:
> Hello, all!
>
> I want to use Tidy as a DLL in my Java application. I wrote JNI bindings for
> Tidy and everything works fine except for one issue: the converted documents,
> which are supposed to be XHTML 1.0 Strict, do not pass validation on w3.org.
>
> For example, if I have the tag <a href=".." target="_blank">, the attribute
> target is not removed.
>
> Could you please take a look at the code below and tell me what I am missing
> there?
>
> #include "TidyWrapper.h"
> #include <jni.h>
> #include <tidy.h>
> #include <buffio.h>
>
> #define DEBUG 1
> #define DUMP_DOCUMENT 1
>
> JNIEXPORT jbyteArray JNICALL Java_xhtml_tidy_jni_TidyWrapper_convert
>   (JNIEnv *env, jobject self, jobject options, jbyteArray data) {
>
>     TidyBuffer output = {0};
>     TidyBuffer errbuf = {0};
>     Bool ok;
>     jboolean isCopy = 0;
>     const char* input = (*env)->GetByteArrayElements(env, data, &isCopy);
>     jfieldID fid;
>     jclass cls;
>
>     // get instance of class used for options
>     cls = (*env)->GetObjectClass(env, options);
>
>     TidyDoc tdoc = tidyCreate();
>
>     tidySetErrorBuffer( tdoc, &errbuf );
>
>     tidyOptSetBool( tdoc, TidyXhtmlOut, yes );
>     tidyOptSetValue(tdoc, TidyEncoding, "UTF-8");
>     tidyOptSetValue(tdoc, TidyCharEncoding, "UTF-8");
>
>     tidyOptSetValue(tdoc, TidyDoctype, "strict");
>
>     // parse document
>     if (DEBUG)
>         write(2,"Parsing document\n",17);
>
>     if (DUMP_DOCUMENT) {
>         write(2,"Original document\n",18);
>         write(2,input,strlen(input));
>     }
>     tidyParseString( tdoc, input );
>
>     // convert document
>     tidyCleanAndRepair( tdoc );
>     tidyRunDiagnostics( tdoc );
>     tidyOptSetBool(tdoc, TidyForceOutput, yes);
>     tidySaveBuffer( tdoc, &output );
>     if (DUMP_DOCUMENT) {
>         write(2,"Converted document\n",19);
>         write(2,output.bp,output.size);
>     }
>
>     // create an array used to store the results
>     jbyteArray results = (*env)->NewByteArray(env,output.size);
>
>     /// copy data from resulting buffer to Java array
>     (*env)->SetByteArrayRegion(env,results,0,output.size,output.bp);
>
>     tidyBufFree( &errbuf );
>     tidyBufFree( &output );
>     tidyRelease( tdoc );
>
>     return results;
> }
>
> While the document gets the XHTML Strict doctype, it still contains
> attributes which don't conform to the XHTML Strict specification. I can see this
> in the debug output for the converted document right after it has been processed
> by Tidy, so it is not related to any Java code that calls the native routine.
>
> With the tidy command-line utility, HTML documents are converted just fine. I've
> looked at the sources and did not find any additional settings, other than
> TidyXhtmlOut, being set for the -asxhtml command-line option.
>
> Thank you in advance!
>
> --
> Best regards
> Eugene Dzhurinsky
>
Received on Tuesday, 4 November 2008 09:21:41 UTC