MBOX-Line: From blong at google.com Tue Jan 14 17:08:47 2014
To: imap-protocol@u.washington.edu
From: Brandon Long <blong@google.com>
Date: Fri Jun 8 12:34:51 2018
Subject: [Imap-protocol] Email charset statistics
In-Reply-To: <52D5D84D.6070208@verizon.net>
References: <52D5D84D.6070208@verizon.net>
Message-ID: <CABa8R6tmm4eBkjMfcmpLkFvvkLB==ua3+h7P97DG3wkvCBj54g@mail.gmail.com>

I don't have exact numbers for you, but we recently did some large (>10M
messages) investigations because we were doing some invasive changes to our
mime parser.

And the answer is, it sucks. See also
<http://cr.yp.to/smtp/8bitmime.html>. It also makes the downgrade part of
RFC5738 look pretty silly, as I imagine almost every IMAP server is
basically doing the same thing today: sending 8bits blindly, assuming the
headers were 7bit clean. It seems like implementing RFC5738 makes your
server break more messages than it fixes.

On Tue, Jan 14, 2014 at 4:37 PM, Joshua Cranmer (In a comment)
<Pidgeot18@verizon.net> wrote:

> A recent concern of mine has been attempting to work out the grand
> messiness that is charsets in the context of reading and parsing email
> messages. I am not aware of any prior attempts to assess the practice of
> charsets in email, so I can only offer evidence from personal anecdote and
> culling of bug reports on open-source software, neither of which are a good
> source of information. I was wondering if anyone else on this list had
> access to a larger database of messages that they could check or have more
> specific generalities that are needed.
>
> The particular questions that are useful to me, in order of importance:
> 1. What charsets (and aliases) are needed to support mail?
>

Is there a reason you wouldn't just use some existing library with support
for just about everything?

> 2. How prevalent is unlabeled text in mail?
>

Prevalent enough. Even if less than 1%, if you expect billions of
messages, you'll see too many of them to ignore.

> 3. Do we need to support charset autodetection? Which charsets, and for
> which languages?
>

Well, that depends on what you do with the data. If you don't need to do
conversions and such, I guess it won't matter. We need to make our
messages display by default in UTF8 in the browser; forcing users to work
around that is pretty poor. If you mostly rely on third party clients, you
may be able to rely on them to translate your data for you.

> 4. How prevalent is HTML <meta charset> (or the <meta http-equiv> variant)
> in a message? How often do these parts have no charset declaration in their
> Content-Type? How often are they different, and who wins?
>

They do differ, but we just let the auto-detection take care of it instead
of trying to parse it directly. Yahoo also used to generate a lot of html
entity references even in text/plain parts. That wasn't on purpose: if you
submit an html textdata form with unicode in it when the page is in a
non-utf8 charset, the browser does the conversion using entity refs.

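To make the entity-ref behaviour above concrete, here is a minimal Python
sketch. It is only an illustration, not what any mail provider actually runs:
the function names form_encode and undo_numeric_refs are made up, and Python's
"xmlcharrefreplace" error handler is used as a stand-in for what a browser does
when a form value does not fit the page charset.

```python
import re

# Matches decimal numeric character references such as "&#1087;".
_NUMERIC_REF = re.compile(r"&#([0-9]{1,7});")


def form_encode(text: str, page_charset: str) -> bytes:
    """Stand-in for a browser form submission (assumption): characters that
    page_charset cannot represent are written as numeric entity refs."""
    return text.encode(page_charset, errors="xmlcharrefreplace")


def undo_numeric_refs(body: str) -> str:
    """Turn numeric references in a text/plain body back into characters.

    Lossy by design: a sender who literally typed "&#233;" is rewritten too.
    """
    def replace(match):
        code_point = int(match.group(1))
        is_surrogate = 0xD800 <= code_point <= 0xDFFF
        if code_point == 0 or code_point > 0x10FFFF or is_surrogate:
            return match.group(0)  # leave invalid references untouched
        return chr(code_point)

    return _NUMERIC_REF.sub(replace, body)


# A Latin-1 page receiving mixed Latin and Cyrillic input.
wire = form_encode("café привет", "iso-8859-1")
restored = undo_numeric_refs(wire.decode("iso-8859-1"))
```

Whether to undo the references at all is a judgment call, since the same
bytes could also be text the sender typed on purpose.
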
> 5. How prevalent are non-UTF-8 8-bit headers?
>

UTF8 is the first or second most prevalent, but there were many others we
saw (koi8-r is probably the next highest).

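For raw 8-bit headers, one simple approach (a sketch, not the detector
described in this message) is to try strict UTF-8 first and only then fall
back to a guess. The function name decode_raw_header and the default fallback
list are assumptions; koi8-r is the default only because it is mentioned above.

```python
from typing import Sequence, Tuple


def decode_raw_header(
    raw: bytes, fallbacks: Sequence[str] = ("koi8-r",)
) -> Tuple[str, str]:
    """Return (text, charset_used) for an unencoded 8-bit header value."""
    try:
        # Strict UTF-8 validation is a strong signal: text in a legacy
        # single-byte charset rarely forms valid multi-byte sequences.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    for charset in fallbacks:
        try:
            return raw.decode(charset), charset
        except (UnicodeDecodeError, LookupError):
            continue  # undecodable bytes or unknown charset name
    # Last resort: latin-1 maps every byte, so this never fails.
    return raw.decode("latin-1"), "latin-1"
```

Note that strict decoding cannot tell single-byte charsets apart (almost any
byte string is "valid" koi8-r), so choosing the fallback still needs a real
detector or some other hint such as the sender's domain or language.
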
> 6. How prevalent are confused charsets (e.g., text that is labeled as ISO
> 8859-1 that needs to be decoded as Windows-1252)?

Prevalent enough that we basically ignore when people say they're 8859-1
and don't provide it as a hint to our detector. That's not just for 1252,
though; there are a lot of bad web-form-to-email gateways out there which
produce a lot of garbage.

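If you have no detector, a cheaper alternative to ignoring the label is to
treat an ISO 8859-1 label as windows-1252, which is what the WHATWG Encoding
Standard does for the web. This is a different technique from the one
described above, shown only as a sketch; decode_with_label is a made-up name.

```python
import codecs


def decode_with_label(raw: bytes, label: str) -> str:
    """Decode raw, treating an ISO 8859-1 label as windows-1252."""
    try:
        canonical = codecs.lookup(label).name
    except LookupError:
        # Unknown label (assumption): take the same lenient path as 8859-1.
        canonical = "iso8859-1"
    if canonical == "iso8859-1":
        try:
            # windows-1252 puts curly quotes, dashes, etc. in 0x80-0x9F,
            # where ISO 8859-1 only has C1 control characters.
            return raw.decode("cp1252")
        except UnicodeDecodeError:
            # Python leaves five bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D)
            # undefined in cp1252; latin-1 accepts every byte.
            return raw.decode("latin-1")
    return raw.decode(canonical, errors="replace")
```

This only covers the 8859-1 versus 1252 confusion. It does nothing for
gateways that mislabel some unrelated charset as 8859-1, which is the case
where ignoring the label and running a detector pays off.
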
I realize these answers aren't very exact and are in the vein of "implement
it all", but that's basically what it looks like to us.

Brandon

>
>
> --
> Beware of bugs in the above code; I have only proved it correct, not tried
> it. -- Donald E. Knuth
>
> _______________________________________________
> Imap-protocol mailing list
> Imap-protocol@u.washington.edu
> http://mailman2.u.washington.edu/mailman/listinfo/imap-protocol
>