- From: Arnaud Desitter <arnaud02@users.sourceforge.net>
- Date: Mon, 3 Nov 2008 21:15:32 +0000
- To: "Eugeny N Dzhurinsky" <bofh@redwerk.com>
- Cc: html-tidy@w3.org
Hi,

If I remember correctly, "doctype: strict" (http://tidy.sourceforge.net/docs/quickref.html#doctype) sets the DTD to strict but does not remove non-conforming attributes.

Could you provide an HTML example along with the tidy command line you are using?

Lastly, the "output = {0};" initialization should be replaced by "tidyBufInit( &output );".

Regards,

2008/11/3 Eugeny N Dzhurinsky <bofh@redwerk.com>:
> Hello, all!
>
> I want to use tidy as a DLL in my Java application. I wrote the JNI bindings for
> tidy and everything works fine, except for one issue: the converted documents,
> which are supposed to be XHTML 1.0 Strict, do not pass validation on w3.org.
>
> For example, if I have the tag <a href=".." target="_blank">, the attribute
> target is not removed.
>
> Could you please take a look at the code below and tell me what I am missing
> there?
>
> #include "TidyWrapper.h"
> #include <jni.h>
> #include <tidy.h>
> #include <buffio.h>
>
> #define DEBUG 1
> #define DUMP_DOCUMENT 1
>
> JNIEXPORT jbyteArray JNICALL Java_xhtml_tidy_jni_TidyWrapper_convert
> (JNIEnv *env, jobject self, jobject options, jbyteArray data) {
>
>     TidyBuffer output = {0};
>     TidyBuffer errbuf = {0};
>     Bool ok;
>     jboolean isCopy = 0;
>     const char* input = (*env)->GetByteArrayElements(env, data, &isCopy);
>     jfieldID fid;
>     jclass cls;
>
>
>     // get instance of class used for options
>     cls = (*env)->GetObjectClass(env, options);
>
>     TidyDoc tdoc = tidyCreate();
>
>     tidySetErrorBuffer( tdoc, &errbuf );
>
>     tidyOptSetBool( tdoc, TidyXhtmlOut, yes );
>     tidyOptSetValue(tdoc, TidyEncoding, "UTF-8");
>     tidyOptSetValue(tdoc, TidyCharEncoding, "UTF-8");
>
>     tidyOptSetValue(tdoc, TidyDoctype, "strict");
>
>     // parse document
>     if (DEBUG)
>         write(2,"Parsing document\n",17);
>
>     if (DUMP_DOCUMENT) {
>         write(2,"Original document\n",18);
>         write(2,input,strlen(input));
>     }
>     tidyParseString( tdoc, input );
>
>     // convert document
>     tidyCleanAndRepair( tdoc );
>     tidyRunDiagnostics( tdoc );
>     tidyOptSetBool(tdoc, TidyForceOutput, yes);
>     tidySaveBuffer( tdoc, &output );
>     if (DUMP_DOCUMENT) {
>         write(2,"Converted document\n",19);
>         write(2,output.bp,output.size);
>     }
>
>     // create an array used to store the results
>     jbyteArray results = (*env)->NewByteArray(env,output.size);
>
>     /// copy data from resulting buffer to Java array
>     (*env)->SetByteArrayRegion(env,results,0,output.size,output.bp);
>
>     tidyBufFree( &errbuf );
>     tidyBufFree( &output );
>     tidyRelease( tdoc );
>
>     return results;
> }
>
> While the document gets the XHTML Strict doctype, it still contains stray
> attributes which don't conform to the XHTML Strict specification. I can see this
> in the debug output for the converted document right after it was processed by
> Tidy, so it is not related to any Java code that calls the native routine.
>
> With the tidy command-line utility, HTML documents are converted just fine. I've
> looked at the sources and did not find any additional settings except
> TidyXhtmlOut being set by the -asxhtml command-line option.
>
> Thank you in advance!
>
> --
> Best regards
> Eugene Dzhurinsky
>
Received on Tuesday, 4 November 2008 09:21:41 UTC