W3C home > Mailing lists > Public > html-tidy@w3.org > October to December 2002

Re: UTF8 without tempfiles

From: Charles Reitzel <creitzel@rcn.com>
Date: Mon, 11 Nov 2002 14:05:36 -0500
Message-Id: <4.3.2.7.2.20021111134310.0278b4e8@pop.rcn.com>
To: "Moshe Plotkin" <mplotkin@hotmail.com>
Cc: <html-tidy@w3.org>

Hi Moshe,

Just to get it straight, are you using TidyLib directly or via the ATL 
wrapper?   It will make a difference.

If using TidyLib and the BSTR directly, I'd just use UTF16LE character 
encoding.  If the BSTR is not NULL terminated (my memory escapes me), just 
attach a TidyBuffer to it and call tidyParseBuffer().


int tidyParseBSTR( TidyDoc tdoc, BSTR content )
{
     int rc = 0;
     TidyBuffer buf;
     tidyBufAttach( &buf, content, ::SysStringByteLen(content) );
     tidyOptSetCharEncoding( tdoc, _T("UTF16LE") );
     return tidyParseBuffer( tdoc, &buf );
}

hth,
Charlie


At 01:19 PM 11/11/2002 -0800, Moshe Plotkin wrote:

>B"H
>
>Actualy I work with Meir Kogan, and I have the string as a BSTR Which as 
>far as I understand is just wchar_t * with a length.
>The vbscript page that I am getting it from is set to use codepage 65001 
>i.e. utf8. So I am asuming its utf8 stored in an array of wchar_t however 
>that works. I was thinking of redfining the tidy string (whats it called 
>cbmtstr?) to use wchar_t and then rely on the config file to alter the 
>internal ... or to just cast the bstr to byte* and put it in the buffer. 
>or maybe I'm way off.
>
>thanks for all the help.
>
>
>  ----- Original Message -----
>From: "Charles Reitzel" <creitzel@rcn.com>
>To: "Moshe Plotkin" <mplotkin@hotmail.com>
>Cc: <html-tidy@w3.org>
>Sent: Monday, November 11, 2002 6:27 AM
>Subject: Re: UTF8 without tempfiles
>
>
> > Hi Moshe,
> >
> > wchar_t is usually UTF16.  What platform are you on?  It
> > helps to figure out if you should use Little or Big Endian
> > unicode (UTF16LE and UTF16BE, respectively).  If you can
> > manage to save your documents with a byte-order mark (two
> > bytes at the beginning of the file that indicate the byte
> > order), you can specify plain UTF16.
> >
> > For example, Intel (Windows and Linux) are LE.  Sparc
> > (Solaris) and PowerPC (Mac, IBM AIX) are BE.  Alpha (Linux)
> > can be either, but is usually LE.
> >
> > take it easy,
> > Charlie
> >
> > At 01:22 PM 11/10/2002 -0800, Moshe Plotkin wrote:
> > >B"H
> > >
> > > Can someone please send me a very simple example of
> > > using TidyLib with UTF8 strings.
> > >
> > >I have the data in a wchar_t* and would like to return
> > >a wchar_t*
> > >
> > >thank you verry much
> >
Received on Monday, 11 November 2002 14:04:26 UTC

This archive was generated by hypermail 2.3.1 : Wednesday, 5 February 2014 23:39:48 UTC