MBOX-Line: From jeff.mckay at comaxis.com Wed Nov 2 14:45:00 2011
To: imap-protocol@u.washington.edu
From: Jeff Mckay <jeff.mckay@comaxis.com>
Date: Fri Jun 8 12:34:47 2018
Subject: [Imap-protocol] Character encoding question
In-Reply-To: <alpine.BSO.2.00.1111021135050.15757@morgaine.smi.sendmail.com>
References: <4EB0C241.6060900@comaxis.com>
	<20111101230240.Horde.I0QRQoF5lbhOsM7wX2WU_SA@bigworm.curecanti.org>
	<4EB18391.4040202@comaxis.com>
	<alpine.BSO.2.00.1111021135050.15757@morgaine.smi.sendmail.com>
Message-ID: <4EB1B9DC.9080408@comaxis.com>

You're right - I understand now and have my code working. Thanks for
your help.

Philip Guenther wrote:
> On Wed, 2 Nov 2011, Jeff Mckay wrote:
>
>> Thanks for your comments. I'm still a bit confused. Let me clarify what
>> I am seeing in these two examples. In the first, one of the characters
>> in question is "lower case o with acute" which is supposed to be xF3 in
>> ISO-8859-2 and xC3 xB3 in UTF-8. The imap server represents this as
>> ampersand followed by AMP followed by a dash (I am writing out the
>> description so it does not get interpreted incorrectly somewhere). If I
>> take the AMP and run it through a base64 decoder, I get xF3.
>>
>
> No, when you run APM (not AMP) through a base64 decoder you get *two*
> bytes, in hex 00 F3. This is the big-endian UTF-16 representation
> of "lower case o with acute".
>
>
>
>> In the second example, we have the letters Temp/New followed by a couple
>> Chinese characters that I don't know the names of. The two Chinese
>> characters are represented in imap by ampersand followed by bUuL1Q and
>> the closing dash. When I base64 decode this I end up with x6D x4B x8B
>> xD5. This appears to be big-endian UTF-16.
>>
>
> Yep. This is *exactly* what is specified by RFC 2152 ("UTF-7"), as
> modified by RFC 3501.
>
>
>
>> I have to byte-reverse each 2 byte sequence, but then I can convert it
>> to UTF8 (my target) and see the Chinese characters.
>>
>
> Uh, I think you mean you do a conversion from the UTF-16BE to the UTF-8
> that your display routines expect, right?
>
>
>
>> I could also take the original data and stick a + in front of it (ending
>> up with +bUuL1Q) and convert this from UTF7 to UTF8 and end up with
>> valid characters. This last part I really don't understand - if it is
>> base64 encoded, how is that valid UTF7?
>>
>
> Please go read RFC 2152 again. base64 encoding is a step in generating
> UTF-7 encoded text.
>
>
>
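For anyone following along, the point about base64 being a step inside
UTF-7 can be checked with a few lines of code. This is only a sketch and
assumes Python 3 with its built-in "utf-7" codec (the thread does not say
which language is in use); the variable names are made up for
illustration. Swapping the "&" for "+" happens to work for this chunk
because its base64 part contains no ","; it is not a general way to
convert IMAP mailbox names.

```python
# Sketch: plain UTF-7 (RFC 2152) shifts into base64 with '+' and out with '-'.
# IMAP mailbox names (RFC 3501) use '&' as the shift character instead.
imap_chunk = "&bUuL1Q-"  # the encoded part of the second example in this thread

# Replace the leading '&' with '+' to get an ordinary UTF-7 sequence.
# This shortcut is only safe here because the chunk contains no ','.
utf7_bytes = ("+" + imap_chunk[1:]).encode("ascii")

# The codec does the base64 decode and the UTF-16BE interpretation itself.
text = utf7_bytes.decode("utf-7")
utf8_bytes = text.encode("utf-8")

# Print escaped forms so the output does not depend on the terminal encoding.
print(ascii(text), utf8_bytes.hex())
```

The decoded text is the same two characters obtained by base64-decoding
bUuL1Q by hand and reading the four bytes as UTF-16BE.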
>> Anyway, I don't seem to have an algorithm that will work on both of
>> these examples, and no way to detect which one I should use. Obviously
>> I am totally confused about what I am doing, but any further insight
>> would be appreciated.
>>
>
> I think you lost track of the NUL byte in the first example, and from that
> ended up thinking a different conversion was necessary. The rules are
> consistent. For a given &.....- chunk:
>     strip & - delimiters
>     base64 decode the ..... part
>     convert that from UTF-16BE to whatever encoding you want to use
>
>
> Philip Guenther
>
>
>
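The three steps above can be sketched roughly as follows. This is an
illustration rather than the code actually used: it assumes Python 3, and
the function names decode_chunk and decode_mailbox_name are hypothetical.
Two details come from RFC 3501 rather than from this thread: the modified
base64 uses "," where standard base64 uses "/", and "&-" stands for a
literal ampersand. The "=" padding is re-added only because Python's
decoder expects it.

```python
import base64


def decode_chunk(chunk: str) -> str:
    """Decode the text found between '&' and '-' in one encoded chunk."""
    b64 = chunk.replace(",", "/")        # RFC 3501 uses ',' in place of '/'
    b64 += "=" * (-len(b64) % 4)         # restore the padding IMAP leaves off
    raw = base64.b64decode(b64)          # e.g. "APM" gives the bytes 00 F3
    return raw.decode("utf-16-be")       # the bytes are big-endian UTF-16


def decode_mailbox_name(name: str) -> str:
    """Decode every &...- chunk in a mailbox name; copy other text as is."""
    out = []
    i = 0
    while i < len(name):
        if name[i] != "&":
            out.append(name[i])
            i += 1
            continue
        end = name.index("-", i)         # '-' never occurs inside base64
        chunk = name[i + 1:end]          # strip the & and - delimiters
        out.append("&" if chunk == "" else decode_chunk(chunk))
        i = end + 1
    return "".join(out)


# Both examples from this thread go through the same code path.
first = decode_mailbox_name("&APM-")
second = decode_mailbox_name("Temp/New&bUuL1Q-")
print(ascii(first), ascii(second), second.encode("utf-8").hex())
```

The result is an ordinary Python string, so getting UTF-8 out is a single
.encode("utf-8") call and no manual byte swapping is involved.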

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman13.u.washington.edu/pipermail/imap-protocol/attachments/20111102/ea2e2f7f/attachment.html>