
MBOX-Line: From Pidgeot18 at verizon.net  Tue Jan 14 19:04:04 2014
To: imap-protocol@u.washington.edu
From: Joshua Cranmer (In a comment <Pidgeot18@verizon.net>)
Date: Fri Jun  8 12:34:51 2018
Subject: [Imap-protocol] Email charset statistics
In-Reply-To: <CABa8R6tmm4eBkjMfcmpLkFvvkLB==ua3+h7P97DG3wkvCBj54g@mail.gmail.com>
References: <52D5D84D.6070208@verizon.net>
	<CABa8R6tmm4eBkjMfcmpLkFvvkLB==ua3+h7P97DG3wkvCBj54g@mail.gmail.com>
Message-ID: <52D5FAA4.6030406@verizon.net>

On 1/14/2014 7:08 PM, Brandon Long wrote:
> I don't have exact numbers for you, but we recently did some large
> (>10M messages) investigations because we were doing some invasive
> changes to our mime parser.

Coincidentally, I'm also making invasive changes to our MIME parser, where
"invasive" means "rewriting from scratch." :-P
>
> And the answer is, it sucks.  See also
> http://cr.yp.to/smtp/8bitmime.html  It also makes the downgrade part
> of RFC5738 look pretty silly, as I imagine almost every IMAP server is
> basically doing the same thing today, sending 8bits blindly, assuming
> the headers were 7bit clean.  It seems like implementing RFC5738 makes
> your server break more messages than it fixes.

The more I dig into emitting MIME properly, the more I myself am tempted
to ignore its strictures outright.
>
>
> On Tue, Jan 14, 2014 at 4:37 PM, Joshua Cranmer (In a comment)
> <Pidgeot18@verizon.net <mailto:Pidgeot18@verizon.net>> wrote:
>
>     A recent concern of mine has been attempting to work out the grand
>     messiness that is charsets in the context of reading and parsing
>     email messages. I am not aware of any prior attempts to assess the
>     practice of charsets in email, so I can only offer evidence from
>     personal anecdote and culling of bug reports on open-source
>     software, neither of which are a good source of information. I was
>     wondering if anyone else on this list had access to a larger
>     database of messages that they could check or have more specific
>     generalities that are needed.
>
>     The particular questions that are useful to me, in order of
>     importance:
>     1. What charsets (and aliases) are needed to support mail?
>
>
> Is there a reason you wouldn't just use some existing library with
> support for just about everything?

Largely because the library we're using is trying to be proactive
against charset proliferation and accepts only a narrow selection of
charsets. Thus, anything outside that set needs to be implemented
manually. The set of charsets may be found at
<http://encoding.spec.whatwg.org/#encodings>, if you scroll down a bit.
So far, the only extra charset I've found that I need to support is the
abominable UTF-7 (grrrrrr).
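
A minimal sketch of that arrangement, in Python rather than the actual
parser's language: a narrow core set of charsets plus a separately handled
extra. The names CORE_CHARSETS, EXTRA_CHARSETS, and decode_part are
hypothetical, and the core set shown is a tiny illustrative subset, not the
real list from the Encoding spec.

```python
# Hypothetical allow-list; a small illustrative subset, not the spec's list.
CORE_CHARSETS = {"utf-8", "windows-1252", "koi8-r"}
# Charsets outside the core set that mail still needs (assumption: only UTF-7).
EXTRA_CHARSETS = {"utf-7"}


def decode_part(data: bytes, label: str) -> str:
    """Decode data if its charset label is in the accepted sets."""
    name = label.strip().lower()
    if name in CORE_CHARSETS or name in EXTRA_CHARSETS:
        # errors="replace" keeps going on malformed input instead of raising.
        return data.decode(name, errors="replace")
    raise LookupError(f"unsupported charset: {label!r}")
```

Anything not in either set is rejected outright, which is the point of
keeping the list narrow.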
>
>     3. Do we need to support charset autodetection? Which charsets,
>     and for which languages?
>
>
> Well, that depends on what you do with the data.  If you don't need to
> do conversions and such, I guess it won't matter.  We need to make our
> messages display by default in UTF8 in the browser, forcing users to
> work around that is pretty poor.  If you mostly rely on third party
> clients, you may be able to rely on them to translate your data for you.

I need to be able to fairly reliably convert to UTF-8 (although I do
have APIs in place in the parser to be able to override charsets on a
per-message basis). I think, for the moment, Thunderbird only (attempts
to) autodetects Japanese charsets, and the functionality required to get
that working in my new API is much more difficult and annoying, so I
want to avoid it if I possibly can. Given that we've gotten away without
autodetection in most of the world thus far, I was hoping that we could
get away without it in the one or two locales that still have it.
>
>     4. How prevalent is HTML <meta charset> (or the <meta http-equiv>
>     variant) in a message? How often do these parts have no charset
>     declaration in their Content-Type? How often are they different,
>     and who wins?
>
>
> They do differ, but we just let the auto-detection take care of it
> instead of trying to parse it directly.  Yahoo used to also generate a
> lot of html entity references even in text/plain parts, not on
> purpose, but if you submit an html textdata form with unicode in it
> when the page is in a non-utf8 charset, the browser does the
> conversion using entity refs.

Since I really don't want to implement autodetection, I guess that means
I have to do the HTML <meta> scan. :-/ I was really, really hoping to
not have to write yet another (streaming) HTML parser just to figure out
what charset the message is supposed to be.
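
For illustration only, a cheap approximation of such a scan can be done
with a regular expression over the first bytes of the part instead of a
real HTML parser. This is a sketch in Python; sniff_meta_charset and the
1024-byte limit are assumptions, and a regex like this will be fooled by
comments, scripts, and other markup that a proper streaming parser would
handle.

```python
import re

# Finds charset=... inside a <meta ...> tag. Covers <meta charset="x"> and
# the http-equiv form, content="text/html; charset=x".
_META_CHARSET = re.compile(
    rb"<meta\b[^>]*?\bcharset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)",
    re.IGNORECASE,
)


def sniff_meta_charset(html: bytes, limit: int = 1024):
    """Return the lowercased charset label from a <meta> tag, or None."""
    match = _META_CHARSET.search(html[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()
```

The result is only a label; it still has to be checked against whatever
set of charsets the decoder accepts.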
>
>     5. How prevalent are non-UTF-8 8-bit headers?
>
>
> UTF8 is the first or second most prevalent, but there were many others
> we saw (koi8-r is probably the next highest).

The algorithm I was looking at implementing was to try decoding as UTF-8
and fall back to the message body's charset if that fails. Do you have
any sense of how well that heuristic would work?
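
Sketched in Python, the heuristic looks roughly like this. The function
name decode_raw_header is hypothetical, and the final step (UTF-8 with
replacement characters when both attempts fail) is an added assumption,
not part of the heuristic as described.

```python
from typing import Optional


def decode_raw_header(raw: bytes, body_charset: Optional[str]) -> str:
    """Decode an unencoded 8-bit header value."""
    # Step 1: strict UTF-8. Non-UTF-8 8-bit text rarely validates as UTF-8.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    # Step 2: fall back to the charset declared for the message body.
    if body_charset:
        try:
            return raw.decode(body_charset)
        except (UnicodeDecodeError, LookupError):
            pass
    # Step 3 (assumption): never fail; mark undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")
```

For example, a KOI8-R subject fails the strict UTF-8 attempt and is then
decoded using the body's declared koi8-r charset.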
>
>     6. How prevalent are confused charsets (e.g., text that is labeled
>     as ISO 8859-1 that needs to be decoded as Windows-1252)?
>
>
> Prevalent enough that we basically ignore when people say they're
> 8859-1 and don't provide it as a hint to our detector.  That's not
> just for 1252, though, there are a lot of bad web forms to email
> gateways out there which do a lot of garbage.

Do you happen to have a list of confusions?
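
For the one confusion named above (text labeled ISO 8859-1 that is really
Windows-1252), handling it amounts to a relabeling table consulted before
decoding. A minimal Python sketch; LABEL_OVERRIDES and
decode_with_overrides are hypothetical names, and the table holds only
that single entry because no other confusions have been listed here.

```python
# Hypothetical override table: declared label -> charset actually used.
LABEL_OVERRIDES = {"iso-8859-1": "windows-1252"}


def decode_with_overrides(data: bytes, label: str) -> str:
    """Decode data, substituting the charset for known-confused labels."""
    name = label.strip().lower()
    actual = LABEL_OVERRIDES.get(name, name)
    # Windows-1252 leaves a few bytes undefined in Python's codec, so use
    # errors="replace" rather than raising on them.
    return data.decode(actual, errors="replace")
```

With this in place, bytes 0x93 and 0x94 in a part labeled ISO-8859-1 come
out as curly quotes rather than C1 control characters.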


--
Beware of bugs in the above code; I have only proved it correct, not tried it. -- Donald E. Knuth
