MBOX-Line: From Pidgeot18 at verizon.net Tue Jan 14 16:37:33 2014
To: imap-protocol@u.washington.edu
From: Joshua Cranmer (In a comment <Pidgeot18@verizon.net>)
Date: Fri Jun 8 12:34:51 2018
Subject: [Imap-protocol] Email charset statistics
Message-ID: <52D5D84D.6070208@verizon.net>
A recent concern of mine has been attempting to work out the grand
messiness that is charsets in the context of reading and parsing email
messages. I am not aware of any prior attempts to assess how charsets
are used in practice in email, so I can only offer evidence from
personal anecdote and from culling bug reports on open-source software,
neither of which is a good source of information. I was wondering if
anyone else on this list has access to a larger database of messages
that they could check, or could offer something more specific than the
generalities I have.

The particular questions that are useful to me, in order of importance:
1. What charsets (and aliases) are needed to support mail?
2. How prevalent is unlabeled text in mail?
3. Do we need to support charset autodetection? Which charsets, and for
which languages?
4. How prevalent is HTML <meta charset> (or the <meta http-equiv>
variant) in a message? How often do these parts have no charset
declaration in their Content-Type? How often are they different, and who
wins?
5. How prevalent are non-UTF-8 8-bit headers?
6. How prevalent are confused charsets (e.g., text that is labeled as
ISO 8859-1 that needs to be decoded as Windows-1252)?
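
For anyone willing to run something over a mail store, here is a rough
sketch of the kind of per-message check I have in mind for questions 2
and 6. It is only an illustration using Python's standard email package;
the names survey_message and likely_cp1252 are made up for this example,
and the "bytes in 0x80-0x9F imply Windows-1252" test is a heuristic
assumption, not a reliable detector.

```python
import codecs
import email
from email import policy

# Canonical codec name Python uses for ISO 8859-1 and its aliases.
LATIN1 = codecs.lookup("iso-8859-1").name


def is_latin1(label):
    """True if the charset label resolves to ISO 8859-1 (including aliases)."""
    try:
        return codecs.lookup(label).name == LATIN1
    except LookupError:
        # Unknown or misspelled charset label.
        return False


def survey_message(raw):
    """Return one record per text/* part of a raw message (bytes)."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    records = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        declared = part.get_content_charset()  # None when unlabeled
        body = part.get_payload(decode=True) or b""
        # Heuristic (assumption): ISO 8859-1 maps 0x80-0x9F to C1 control
        # characters, which real text rarely contains, so their presence
        # hints that the sender actually meant Windows-1252.
        has_c1 = any(0x80 <= b <= 0x9F for b in body)
        records.append({
            "type": part.get_content_type(),
            "declared": declared,
            "unlabeled": declared is None,
            "likely_cp1252": declared is not None
                             and is_latin1(declared) and has_c1,
        })
    return records
```

Summing the "unlabeled" and "likely_cp1252" flags over a large corpus
would give a first approximation of the prevalence figures asked about
above; it says nothing about autodetection or HTML <meta> declarations.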
--
Beware of bugs in the above code; I have only proved it correct, not tried it. -- Donald E. Knuth