MBOX-Line: From jeff.mckay at comaxis.com  Wed Nov  2 14:45:00 2011
To: imap-protocol@u.washington.edu
From: Jeff Mckay <jeff.mckay@comaxis.com>
Date: Fri Jun  8 12:34:47 2018
Subject: [Imap-protocol] Character encoding question
In-Reply-To: <alpine.BSO.2.00.1111021135050.15757@morgaine.smi.sendmail.com>
References: <4EB0C241.6060900@comaxis.com>
	<20111101230240.Horde.I0QRQoF5lbhOsM7wX2WU_SA@bigworm.curecanti.org>
	<4EB18391.4040202@comaxis.com>
	<alpine.BSO.2.00.1111021135050.15757@morgaine.smi.sendmail.com>
Message-ID: <4EB1B9DC.9080408@comaxis.com>

You're right - I understand now and have my code working.  Thanks for
your help.

Philip Guenther wrote:
> On Wed, 2 Nov 2011, Jeff Mckay wrote:
>
>> Thanks for your comments.  I'm still a bit confused. Let me clarify what
>> I am seeing in these two examples.  In the first, one of the characters
>> in question is "lower case o with acute" which is supposed to be xF3 in
>> ISO-8859-2 and xC3 xB3 in UTF-8.  The imap server represents this as
>> ampersand followed by AMP followed by a dash (I am writing out the
>> description so it does not get interpreted incorrectly somewhere).  If I
>> take the AMP and run it through a base64 decoder, I get xF3.
>>
>
> No, when you run APM (not AMP) through a base64 decoder you get *two*
> bytes, 00 F3 in hex.  This is the big-endian UTF-16 representation
> of "lower case o with acute".
>
>
>
>> In the second example, we have the letters Temp/New followed by a couple of
>> Chinese characters that I don't know the names of.  The two Chinese
>> characters are represented in imap by ampersand followed by bUuL1Q and
>> the closing dash.  When I base64 decode this I end up with x6D x4B x8B
>> xD5.  This appears to be big-endian UTF-16.
>>
>
> Yep.  This is *exactly* what is specified by RFC 2152 ("UTF-7"), as
> modified by RFC 3501.
>
>
>
>> I have to byte-reverse each 2 byte sequence, but then I can convert it
>> to UTF8 (my target) and see the Chinese characters.
>>
>
> Uh, I think you mean you do a conversion from the UTF-16BE to the UTF-8
> that your display routines expect, right?
>
>
>
>> I could also take the original data and stick a + in front of it (ending
>> up with +bUuL1Q) and convert this from UTF7 to UTF8 and end up with
>> valid characters.  This last part I really don't understand - if it is
>> base64 encoded, how is that valid UTF7?
>>
>
> Please go read RFC 2152 again.  base64 encoding is a step in generating
> UTF-7 encoded text.
>
>
>
>> Anyway, I don't seem to have an algorithm that will work on both of
>> these examples, and no way to detect which one I should use.  Obviously
>> I am totally confused about what I am doing, but any further insight
>> would be appreciated.
>>
>
> I think you lost track of the NUL byte in the first example, and from that
> ended up thinking a different conversion was necessary.  The rules are
> consistent.  For a given &.....- chunk:
> 	strip & - delimiters
> 	base64 decode the ..... part
> 	convert that from UTF-16BE to whatever encoding you want to use
>
>
> Philip Guenther
>
>
>
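
For anyone finding this thread later, the three steps Philip lists can be
sketched in Python roughly as follows. This is a minimal sketch, not the code
I actually ended up with: the function name decode_imap_utf7 is made up for
illustration, and the full mailbox string "Temp/New&bUuL1Q-" is my
reconstruction of the second example from the description above. It also
handles two details from RFC 3501 section 5.1.3 that the steps do not spell
out: "&-" stands for a literal ampersand, and "," replaces "/" in the base64
alphabet.

```python
import base64


def decode_imap_utf7(name: str) -> str:
    """Decode an IMAP mailbox name (modified UTF-7) to a Python str.

    Hypothetical helper for illustration; follows the three steps:
    strip the & - delimiters, base64 decode, convert from UTF-16BE.
    """
    out = []
    i = 0
    while i < len(name):
        ch = name[i]
        if ch != "&":
            # Ordinary printable ASCII passes through unchanged.
            out.append(ch)
            i += 1
            continue
        end = name.find("-", i + 1)
        if end == -1:
            raise ValueError("unterminated '&' sequence in %r" % name)
        chunk = name[i + 1:end]
        if chunk == "":
            # "&-" is the escape for a literal ampersand.
            out.append("&")
        else:
            # Modified base64 uses "," instead of "/" and omits "=" padding,
            # so restore both before handing it to the standard decoder.
            b64 = chunk.replace(",", "/")
            b64 += "=" * (-len(b64) % 4)
            raw = base64.b64decode(b64)
            # The decoded bytes are big-endian UTF-16, e.g. 00 F3 for "APM".
            out.append(raw.decode("utf-16-be"))
        i = end + 1
    return "".join(out)


if __name__ == "__main__":
    # First example: two bytes 00 F3, which is C3 B3 once re-encoded as UTF-8.
    print(decode_imap_utf7("&APM-").encode("utf-8").hex())
    # Second example: 6D 4B 8B D5 decodes to two Chinese characters.
    print(decode_imap_utf7("Temp/New&bUuL1Q-"))
```

The same rules cover both examples, so there is nothing to detect. No manual
byte-reversal is needed either, because the UTF-16BE decoder handles the byte
order. The "stick a + in front" trick works because the chunk is ordinary
UTF-7 once "&" is swapped for "+" (and "," for "/").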
