MBOX-Line: From Pidgeot18 at verizon.net Mon Apr 28 21:24:15 2014
To: imap-protocol@u.washington.edu
From: Joshua Cranmer <Pidgeot18@verizon.net>
Date: Fri Jun 8 12:34:52 2018
Subject: [Imap-protocol] Email charset statistics
In-Reply-To: <52D5D84D.6070208@verizon.net>
References: <52D5D84D.6070208@verizon.net>
Message-ID: <535F296F.9090807@verizon.net>

On 1/14/2014 6:37 PM, Joshua Cranmer wrote:
> A recent concern of mine has been attempting to work out the grand
> messiness that is charsets in the context of reading and parsing email
> messages. I am not aware of any prior attempts to assess the practice
> of charsets in email, so I can only offer evidence from personal
> anecdote and culling of bug reports on open-source software, neither
> of which are a good source of information. I was wondering if anyone
> else on this list had access to a larger database of messages that
> they could check or have more specific generalities that are needed.

In an attempt to put some quantitative numbers on the statistics here, I
ended up testing the largest body of RFC 822-style messages I could
think of that was publicly available: recently-posted Usenet messages.
While Usenet and email aren't the same thing, I'd generally expect
Usenet to be slightly worse at passing around 8-bit messages, so it's at
least a useful proxy to see how bad the situation is for some things,
but not all (e.g., good luck drawing any conclusion about HTML email
charset questions). I've posted my findings on my blog at
<http://quetzalcoatal.blogspot.com/2014/03/understanding-email-charsets.html>,
complete with a list of some recommendations I've gleaned from the data set.

Now, off to make it my personal mission to kill x-mac-croatian. :-)

--
Beware of bugs in the above code; I have only proved it correct, not tried it. -- Donald E. Knuth