W3C home > Mailing lists > Public > html-tidy@w3.org > October to December 2002

Re: UTF8 without tempfiles

From: Charles Reitzel <creitzel@rcn.com>
Date: Tue, 12 Nov 2002 11:51:09 -0500
Message-Id: <4.3.2.7.2.20021112105505.023515f0@pop.rcn.com>
To: "Moshe Plotkin" <mplotkin@hotmail.com>
Cc: html-tidy@w3.org

Hi Moshe,

I think I figured out the "right thing to do" with ParseString() and 
SaveString().  Anyway, I updated TidyATL and put it up on my Tidy 
page.  Works on my system, but it did before as well.   Can you give it a 
try on yours?

http://users.rcn.com/creitzel/tidy.html#comatl

For those just tuning in, Microsoft COM uses Unicode (UTF16LE) 
internally.  So, for the ParseString() and SaveString() methods, TidyATL 
now temporarily forces the encoding to UTF16LE.  It restores the original 
setting upon return for use w/ file operations.

This seems like a general issue when using memory-to-memory operations with 
the library.  One needs to be aware that the in-memory representation can 
be different than the on disk and set the Tidy encoding options appropriately.

Btw, the function is tidySetCharEncoding(), my mistake.  You can find most 
of the tidy functions in tidy.h.  The enum definitions are in tidyenum.h 
and there are a handful of methods in buffio.h and fileio.h relating to, 
buffer and file sources and sinks.

take it easy,
Charlie


At 10:16 AM 11/12/2002 -0800, Moshe Plotkin wrote:
>B"H
>
>Thank you very much for all of your help
>
>where is tidyOptSetCharEncoding() defined? I dont think I have it in the
>version I am using.
>
>I now have everything working thanks to all of your help. However the
>Unicode still comes out garbled. heres my code:
>
>tidyBufAttach( &buf, strToFix, ::SysStringByteLen(strToFix) );
>
>//tidyOptSetCharEncoding( tdoc, _T("UTF16LE") ); //Hopefuly the CharEncoding
>//is set right in the configfile
>
>if ( rc >= 0 )
>     rc = tidyLoadConfig( tdoc, cfgFile);
>if ( rc >= 0 )
>     rc = tidyParseBuffer( tdoc, &buf ); // Parse the input
>if ( rc >= 0 )
>     rc = tidyCleanAndRepair( tdoc ); // Tidy it up!
>if ( rc >= 0 )
>     rc = tidyRunDiagnostics( tdoc ); // Kvetch
>if ( rc >= 0 )
>     rc = tidySaveBuffer( tdoc, &output ); // Pretty Print
>*retVal = SysAllocString((OLECHAR*)output.bp);
>
>cfgFile is the File Name of a file whose body is:
>
>indent-spaces: 4
>indent: auto
>indent-attributes: yes
>tidy-mark: no
>char-encoding: UTF16LE
>it seems is that input-encoding and output-encoding dont make a diference.
>
>buf.pb contains the string perfectly
>
>but output.bp has a few strange chars instead of the unicode chars that 
>went in
>
>thank you very much for all of your help.
>
>
>----- Original Message -----
>From: "Charles Reitzel" <<mailto:creitzel@rcn.com>creitzel@rcn.com>
>To: "Moshe Plotkin" <<mailto:mplotkin@hotmail.com>mplotkin@hotmail.com>
>Cc: <<mailto:html-tidy@w3.org>html-tidy@w3.org>
>Sent: Monday, November 11, 2002 11:05 AM
>Subject: Re: UTF8 without tempfiles
>
>
> > Hi Moshe,
> >
> > Just to get it straight, are you using TidyLib directly or via the ATL
> > wrapper?   It will make a difference.
> >
> > If using TidyLib and the BSTR directly, I'd just use UTF16LE character
> > encoding.  If the BSTR is not NULL terminated (my memory escapes me), just
> > attach a TidyBuffer to it and call tidyParseBuffer().
> >
> >
> > int tidyParseBSTR( TidyDoc tdoc, BSTR content )
> > {
> >      int rc = 0;
> >      TidyBuffer buf;
> >      tidyBufAttach( &buf, content, ::SysStringByteLen(content) );
> >      tidyOptSetCharEncoding( tdoc, _T("UTF16LE") );
> >      return tidyParseBuffer( tdoc, &buf );
> > }
> >
> > hth,
> > Charlie
> >
> >
> > At 01:19 PM 11/11/2002 -0800, Moshe Plotkin wrote:
> >
> > >B"H
> > >
> > >Actualy I work with Meir Kogan, and I have the string as a BSTR Which as
> > >far as I understand is just wchar_t * with a length.
> > >The vbscript page that I am getting it from is set to use codepage 65001
> > >i.e. utf8. So I am asuming its utf8 stored in an array of wchar_t however
> > >that works. I was thinking of redfining the tidy string (whats it called
> > >cbmtstr?) to use wchar_t and then rely on the config file to alter the
> > >internal ... or to just cast the bstr to byte* and put it in the buffer.
> > >or maybe I'm way off.
> > >
> > >thanks for all the help.
> > >
> > >
> > >  ----- Original Message -----
> > >From: "Charles Reitzel" <<mailto:creitzel@rcn.com>creitzel@rcn.com>
> > >To: "Moshe Plotkin" <<mailto:mplotkin@hotmail.com>mplotkin@hotmail.com>
> > >Cc: <<mailto:html-tidy@w3.org>html-tidy@w3.org>
> > >Sent: Monday, November 11, 2002 6:27 AM
> > >Subject: Re: UTF8 without tempfiles
> > >
> > >
> > > > Hi Moshe,
> > > >
> > > > wchar_t is usually UTF16.  What platform are you on?  It
> > > > helps to figure out if you should use Little or Big Endian
> > > > unicode (UTF16LE and UTF16BE, respectively).  If you can
> > > > manage to save your documents with a byte-order mark (two
> > > > bytes at the beginning of the file that indicate the byte
> > > > order), you can specify plain UTF16.
> > > >
> > > > For example, Intel (Windows and Linux) are LE.  Sparc
> > > > (Solaris) and PowerPC (Mac, IBM AIX) are BE.  Alpha (Linux)
> > > > can be either, but is usually LE.
> > > >
> > > > take it easy,
> > > > Charlie
> > > >
> > > > At 01:22 PM 11/10/2002 -0800, Moshe Plotkin wrote:
> > > > >B"H
> > > > >
> > > > > Can someone please send me a very simple example of
> > > > > using TidyLib with UTF8 strings.
> > > > >
> > > > >I have the data in a wchar_t* and would like to return
> > > > >a wchar_t*
> > > > >
> > > > >thank you verry much
> > > >
> >
Received on Tuesday, 12 November 2002 11:49:42 UTC

This archive was generated by hypermail 2.3.1 : Wednesday, 5 February 2014 23:39:48 UTC