wasm-demo/demo/ermis-f/imap-protocol/cur/1600095020.22648.mbox:2,S

MBOX-Line: From blong at google.com Tue Jan 14 17:08:47 2014
To: imap-protocol@u.washington.edu
From: Brandon Long <blong@google.com>
Date: Fri Jun 8 12:34:51 2018
Subject: [Imap-protocol] Email charset statistics
In-Reply-To: <52D5D84D.6070208@verizon.net>
References: <52D5D84D.6070208@verizon.net>
Message-ID: <CABa8R6tmm4eBkjMfcmpLkFvvkLB==ua3+h7P97DG3wkvCBj54g@mail.gmail.com>
I don't have exact numbers for you, but we recently did some large (>10M
messages) investigations because we were making some invasive changes to our
MIME parser.
And the answer is, it sucks. See also
http://cr.yp.to/smtp/8bitmime.html. It also makes the downgrade part of
RFC5738 look pretty silly, as I
imagine almost every IMAP server is basically doing the same thing today:
sending 8-bit data blindly, assuming the headers were 7-bit clean. It seems like
implementing RFC5738 makes your server break more messages than it fixes.
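To make the "7-bit clean" assumption concrete, here is a minimal sketch of how one might test a raw message for it. The function name `non_ascii_header_lines` is made up for illustration, and this is only an assumption about how such a check could look, not a description of any particular server.

```python
import re
from typing import List


def non_ascii_header_lines(raw_message: bytes) -> List[bytes]:
    """Return the header lines that contain bytes outside 7-bit ASCII."""
    # The header block ends at the first empty line (CRLF or bare LF).
    head = re.split(rb"\r?\n\r?\n", raw_message, maxsplit=1)[0]
    # Any byte above 0x7F means the line is not 7-bit clean.
    return [line for line in head.splitlines() if any(b > 0x7F for b in line)]


raw = b"Subject: caf\xe9\r\nFrom: a@example.com\r\n\r\nbody \xe9\r\n"
print(non_ascii_header_lines(raw))
```

An empty result means the headers are 7-bit clean; the body is deliberately ignored.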
On Tue, Jan 14, 2014 at 4:37 PM, Joshua Cranmer (In a comment) <
Pidgeot18@verizon.net> wrote:
> A recent concern of mine has been attempting to work out the grand
> messiness that is charsets in the context of reading and parsing email
> messages. I am not aware of any prior attempts to assess the practice of
> charsets in email, so I can only offer evidence from personal anecdote and
> culling of bug reports on open-source software, neither of which are a good
> source of information. I was wondering if anyone else on this list had
> access to a larger database of messages that they could check or have more
> specific generalities that are needed.
>
> The particular questions that are useful to me, in order of importance:
> 1. What charsets (and aliases) are needed to support mail?
>
Is there a reason you wouldn't just use some existing library with support
for just about everything?
> 2. How prevalent is unlabeled text in mail?
>
Prevalent enough. Even if less than 1%, if you expect billions of
messages, you'll see too many of them to ignore.
> 3. Do we need to support charset autodetection? Which charsets, and for
> which languages?
>
Well, that depends on what you do with the data. If you don't need to do
conversions and such, I guess it won't matter. We need to make our
messages display by default in UTF-8 in the browser; forcing users to work
around that is pretty poor. If you mostly rely on third-party clients, you
may be able to rely on them to translate your data for you.
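As a rough illustration of converting everything to Unicode for display, here is a sketch of a fallback chain using only the Python standard library. It is not a statistical detector and not the approach described in this message; the function name `decode_best_effort` and the order of fallbacks are assumptions chosen for the example.

```python
import codecs
from typing import Optional, Tuple


def decode_best_effort(data: bytes, declared: Optional[str] = None) -> Tuple[str, str]:
    """Return (text, charset_used) for the raw bytes of a message part."""
    candidates = []
    if declared:
        try:
            # Normalise aliases such as "UTF8" or "KOI8-R" to the codec name.
            candidates.append(codecs.lookup(declared).name)
        except LookupError:
            pass  # unknown or misspelled label: treat the part as unlabeled
    # Strict UTF-8 seldom accepts text in another 8-bit charset by accident.
    candidates.append("utf-8")
    for name in candidates:
        try:
            return data.decode(name), name
        except (UnicodeDecodeError, LookupError):
            continue
    # Last resort: latin-1 maps every byte to a code point, so it cannot fail.
    return data.decode("latin-1"), "latin-1"


text, used = decode_best_effort("Привет".encode("koi8-r"), "KOI8-R")
print(text, used)
```

A real pipeline would likely put a proper detector in place of the final latin-1 step; the sketch only shows where a declared label, a strict UTF-8 attempt, and a catch-all fit relative to each other.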
> 4. How prevalent is HTML <meta charset> (or the <meta http-equiv> variant)
> in a message? How often do these parts have no charset declaration in their
> Content-Type? How often are they different, and who wins?
>
They do differ, but we just let the auto-detection take care of it instead
of trying to parse it directly. Yahoo also used to generate a lot of HTML
entity references even in text/plain parts. That wasn't on purpose: if you
submit an HTML textdata form with Unicode in it when the page is in a
non-UTF-8 charset, the browser does the conversion using entity refs.
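The entity-reference effect is easy to reproduce. The sketch below mimics it with Python's `xmlcharrefreplace` error handler and reverses it with `html.unescape`; both function names are made up for the example, and this only approximates what a browser does on form submission.

```python
import html


def simulate_form_submission(text: str, page_charset: str) -> bytes:
    """Encode text the way a form on a page in page_charset would send it."""
    # Characters the charset cannot represent become numeric references.
    return text.encode(page_charset, errors="xmlcharrefreplace")


def undo_entity_refs(data: bytes, page_charset: str) -> str:
    """Decode the bytes and turn character references back into characters."""
    return html.unescape(data.decode(page_charset))


submitted = simulate_form_submission("Привет", "iso-8859-1")
print(submitted)  # b'&#1055;&#1088;&#1080;&#1074;&#1077;&#1090;'
print(undo_entity_refs(submitted, "iso-8859-1"))  # Привет
```

Unescaping text/plain blindly is lossy in the other direction: a message that really meant to show the literal text `&amp;` would be altered, so this is only safe where such references are known to be accidental.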
> 5. How prevalent are non-UTF-8 8-bit headers?
>
UTF-8 is the first or second most prevalent, but there were many others we
saw (koi8-r is probably the next highest).
> 6. How prevalent are confused charsets (e.g., text that is labeled as ISO
> 8859-1 that needs to be decoded as Windows-1252)?
Prevalent enough that we basically ignore it when people say they're 8859-1
and don't provide it as a hint to our detector. That's not just for 1252,
though; there are a lot of bad web-form-to-email gateways out there which
produce a lot of garbage.
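For anyone without a detector, a simpler and different technique than the one described above is to decode anything labeled ISO 8859-1 as Windows-1252, which is what the WHATWG Encoding Standard does for web content. The sketch below assumes that policy; the function name `decode_latin1_label` is hypothetical.

```python
import codecs


def decode_latin1_label(data: bytes, declared: str) -> str:
    """Decode data, treating an ISO 8859-1 label as Windows-1252."""
    name = codecs.lookup(declared).name
    if name == codecs.lookup("latin-1").name:
        try:
            return data.decode("cp1252")
        except UnicodeDecodeError:
            # Python's cp1252 codec leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D
            # undefined, so fall back to latin-1, which accepts every byte.
            return data.decode("latin-1")
    return data.decode(name)


# 0x93 and 0x94 are curly quotes in Windows-1252 but C1 controls in 8859-1.
print(decode_latin1_label(b"\x93quoted\x94", "ISO-8859-1"))
```

This only covers the 8859-1 versus 1252 confusion; it does nothing for the garbage produced by broken gateways, where the label is unrelated to the actual bytes.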
I realize these answers aren't very exact and in the vein of "implement it
all", but that's basically what it looks like to us.
Brandon
>
>
> --
> Beware of bugs in the above code; I have only proved it correct, not tried
> it. -- Donald E. Knuth
>
> _______________________________________________
> Imap-protocol mailing list
> Imap-protocol@u.washington.edu
> http://mailman2.u.washington.edu/mailman/listinfo/imap-protocol
>