MBOX-Line: From Pidgeot18 at verizon.net Tue Jan 14 19:04:04 2014
To: imap-protocol@u.washington.edu
From: Joshua Cranmer (In a comment <Pidgeot18@verizon.net>)
Date: Fri Jun 8 12:34:51 2018
Subject: [Imap-protocol] Email charset statistics
In-Reply-To: <CABa8R6tmm4eBkjMfcmpLkFvvkLB==ua3+h7P97DG3wkvCBj54g@mail.gmail.com>
References: <52D5D84D.6070208@verizon.net>
 <CABa8R6tmm4eBkjMfcmpLkFvvkLB==ua3+h7P97DG3wkvCBj54g@mail.gmail.com>
Message-ID: <52D5FAA4.6030406@verizon.net>

On 1/14/2014 7:08 PM, Brandon Long wrote:
> I don't have exact numbers for you, but we recently did some large
> (>10M messages) investigations because we were doing some invasive
> changes to our mime parser.

Coincidence, I'm also doing invasive changes to our mime parser. Where
"invasive" means "rewriting from scratch." :-P
>
> And the answer is, it sucks. See also
> http://cr.yp.to/smtp/8bitmime.html It also makes the downgrade part
> of RFC5738 look pretty silly, as I imagine almost every IMAP server is
> basically doing the same thing today, sending 8bits blindly, assuming
> the headers were 7bit clean. It seems like implementing RFC5738 makes
> your server break more messages than it fixes.

The more I dig into emitting MIME properly, the more I myself am tempted
to ignore its strictures outright.
>
>
> On Tue, Jan 14, 2014 at 4:37 PM, Joshua Cranmer (In a comment)
> <Pidgeot18@verizon.net <mailto:Pidgeot18@verizon.net>> wrote:
>
>     A recent concern of mine has been attempting to work out the grand
>     messiness that is charsets in the context of reading and parsing
>     email messages. I am not aware of any prior attempts to assess the
>     practice of charsets in email, so I can only offer evidence from
>     personal anecdote and culling of bug reports on open-source
>     software, neither of which are a good source of information. I was
>     wondering if anyone else on this list had access to a larger
>     database of messages that they could check or have more specific
>     generalities that are needed.
>
>     The particular questions that are useful to me, in order of
>     importance:
>     1. What charsets (and aliases) are needed to support mail?
>
>
> Is there a reason you wouldn't just use some existing library with
> support for just about everything?

Largely because the library we're using has decided to be proactive
against charset proliferation and accepts only a narrow selection of
charsets. Thus, anything not inside that set needs to be manually
implemented. The set of charsets may be found at
<http://encoding.spec.whatwg.org/#encodings>, if you scroll down a bit.
So far, the only extra charset I've found that I need to support is the
abominable UTF-7 (grrrrrr).
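
To make the "narrow set plus manual extras" arrangement concrete, here
is a minimal Python sketch. It is not the library or parser discussed in
this thread: the allowlist is a deliberately tiny, hypothetical subset,
the names ALLOWED_LABELS, EXTRA_LABELS, resolve_label and decode_part
are made up for illustration, and Python's built-in codecs stand in for
the real decoders.

```python
from typing import Optional

# Hypothetical, deliberately tiny allowlist: charset label -> Python codec.
# A real table would follow the Encoding Standard's full label list.
ALLOWED_LABELS = {
    "utf-8": "utf-8",
    "windows-1252": "cp1252",
    "koi8-r": "koi8_r",
}

# Charsets outside the allowlist that mail still needs (assumption: only UTF-7).
EXTRA_LABELS = {
    "utf-7": "utf-7",
}


def resolve_label(label: str) -> Optional[str]:
    """Map a declared charset label to a codec name, or None if unsupported."""
    key = label.strip().lower()
    return ALLOWED_LABELS.get(key) or EXTRA_LABELS.get(key)


def decode_part(data: bytes, label: str) -> str:
    """Decode a body part, refusing labels outside both tables."""
    codec = resolve_label(label)
    if codec is None:
        raise LookupError(f"unsupported charset label: {label!r}")
    # Undecodable bytes become U+FFFD instead of aborting the whole message.
    return data.decode(codec, errors="replace")
```

Keeping the extras in their own table makes it easy to see exactly which
charsets had to be added by hand on top of the narrow set.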
>
>     3. Do we need to support charset autodetection? Which charsets,
>     and for which languages?
>
>
> Well, that depends on what you do with the data. If you don't need to
> do conversions and such, I guess it won't matter. We need to make our
> messages display by default in UTF8 in the browser, forcing users to
> work around that is pretty poor. If you mostly rely on third party
> clients, you may be able to rely on them to translate your data for you.

I need to be able to fairly reliably convert to UTF-8 (although I do
have APIs in place in the parser to be able to override charsets on a
per-message basis). I think, for the moment, Thunderbird only (attempts
to) autodetects Japanese charsets, and the functionality required to get
that working in my new API is much more difficult and annoying, so I
want to avoid it if I possibly can. Given that we've gotten away without
autodetection in most of the world thus far, I was hoping that we could
get away without it in the one or two locales that still have it.
>
>     4. How prevalent is HTML <meta charset> (or the <meta http-equiv>
>     variant) in a message? How often do these parts have no charset
>     declaration in their Content-Type? How often are they different,
>     and who wins?
>
>
> They do differ, but we just let the auto-detection take care of it
> instead of trying to parse it directly. Yahoo used to also generate a
> lot of html entity references even in text/plain parts, not on
> purpose, but if you submit an html textdata form with unicode in it
> when the page is in a non-utf8 charset, the browser does the
> conversion using entity refs.

Since I really don't want to implement autodetection, I guess that means
I have to do the HTML <meta> scan. :-/ I was really, really hoping to
not have to write yet another (streaming) HTML parser just to figure out
what charset the message is supposed to be.
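
For a sense of scale, a <meta> scan of this kind can be sketched in
Python on top of the standard library's html.parser, which accepts input
in chunks. This is only an illustration under stated assumptions, not
the parser being written here: MetaCharsetScanner and sniff_meta_charset
are hypothetical names, the bytes are assumed to be in an
ASCII-compatible encoding (so no UTF-16), and the first declaration
found wins.

```python
import re
from html.parser import HTMLParser

# Pulls the charset parameter out of a Content-Type style string.
_CHARSET_RE = re.compile(r'charset\s*=\s*["\']?([^\s"\';]+)', re.IGNORECASE)


class MetaCharsetScanner(HTMLParser):
    """Records the first charset declared by a <meta> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("charset"):
            # <meta charset="...">
            self.charset = attr["charset"].strip().lower()
        elif attr.get("http-equiv", "").lower() == "content-type":
            # <meta http-equiv="Content-Type" content="text/html; charset=...">
            match = _CHARSET_RE.search(attr.get("content", ""))
            if match:
                self.charset = match.group(1).lower()


def sniff_meta_charset(chunks):
    """Feed byte chunks until a declaration is seen; return it or None."""
    scanner = MetaCharsetScanner()
    for chunk in chunks:
        # latin-1 maps every byte to one code point, so ASCII markup
        # survives whatever the real charset of the text turns out to be.
        scanner.feed(chunk.decode("latin-1"))
        if scanner.charset:
            break
    return scanner.charset
```

The parser buffers an incomplete tag between feed() calls, so a <meta>
element split across two chunks is still recognized.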
>
>     5. How prevalent are non-UTF-8 8-bit headers?
>
>
> UTF8 is the first or second most prevalent, but there were many others
> we saw (koi8-r is probably the next highest).

The algorithm I was looking at implementing was to try UTF-8 first and
fall back to the message body's charset if that fails. Do you have any
sense of how well that heuristic would work?
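
That heuristic is small enough to sketch in a few lines of Python. The
function name decode_raw_header is hypothetical, and the last-resort
step (UTF-8 with replacement characters) is an assumption added so the
function always returns something; it is not part of the heuristic as
described.

```python
from typing import Optional


def decode_raw_header(raw: bytes, body_charset: Optional[str]) -> str:
    """Decode an unencoded 8-bit header value."""
    # Step 1: strict UTF-8. Text in legacy 8-bit charsets rarely happens
    # to be valid UTF-8, so a clean decode is a strong signal.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    # Step 2: fall back to whatever charset the message body declared.
    if body_charset:
        try:
            return raw.decode(body_charset)
        except (UnicodeDecodeError, LookupError):
            pass  # bytes invalid in that charset, or label unknown
    # Step 3 (assumption): never fail, mark the damage with U+FFFD.
    return raw.decode("utf-8", errors="replace")
```

For example, a KOI8-R subject line in a message whose body is labeled
koi8-r fails the strict UTF-8 attempt and is then decoded with the
body's charset.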
>
>     6. How prevalent are confused charsets (e.g., text that is labeled
>     as ISO 8859-1 that needs to be decoded as Windows-1252)?
>
>
> Prevalent enough that we basically ignore when people say they're
> 8859-1 and don't provide it as a hint to our detector. That's not
> just for 1252, though, there are a lot of bad web forms to email
> gateways out there which do a lot of garbage.

Do you happen to have a list of confusions?
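
The one confusion named in this thread (an ISO 8859-1 label on
Windows-1252 text) can be handled with a relabeling table. A minimal
Python sketch: the entries are labels that the WHATWG Encoding Standard
itself maps to windows-1252, the names RELABEL_TO_1252 and
decode_declared are hypothetical, and any further confusions would have
to come from real data rather than from this table.

```python
# Labels treated as windows-1252 regardless of what they claim.
RELABEL_TO_1252 = {"iso-8859-1", "latin1", "us-ascii"}


def decode_declared(data: bytes, label: str) -> str:
    """Decode using the declared label, correcting known mislabelings."""
    key = label.strip().lower()
    if key in RELABEL_TO_1252:
        try:
            return data.decode("cp1252")
        except UnicodeDecodeError:
            # Python's cp1252 leaves a few bytes (e.g. 0x81) undefined;
            # latin-1 accepts every byte, so fall back to it.
            return data.decode("latin-1")
    return data.decode(key)
```

With this, the bytes 0x93 and 0x94 in a part labeled ISO-8859-1 come out
as curly quotes rather than as C1 control characters.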

--
Beware of bugs in the above code; I have only proved it correct, not tried it. -- Donald E. Knuth