MBOX-Line: From Pidgeot18 at verizon.net Tue Jan 14 16:37:33 2014
To: imap-protocol@u.washington.edu
From: Joshua Cranmer (In a comment <Pidgeot18@verizon.net>)
Date: Fri Jun 8 12:34:51 2018
Subject: [Imap-protocol] Email charset statistics
Message-ID: <52D5D84D.6070208@verizon.net>

A recent concern of mine has been attempting to work out the grand
messiness that is charsets in the context of reading and parsing email
messages. I am not aware of any prior attempts to assess the practice of
charsets in email, so I can only offer evidence from personal anecdote
and culling of bug reports on open-source software, neither of which is
a good source of information. I was wondering if anyone else on this
list has access to a larger database of messages that they could check,
or has more specific information about what is needed.

The particular questions that are useful to me, in order of importance:
1. What charsets (and aliases) are needed to support mail?
2. How prevalent is unlabeled text in mail?
3. Do we need to support charset autodetection? Which charsets, and for
which languages?
4. How prevalent is HTML <meta charset> (or the <meta http-equiv>
variant) in a message? How often do these parts have no charset
declaration in their Content-Type? How often do the two declarations
differ, and which one wins?
5. How prevalent are non-UTF-8 8-bit headers?
6. How prevalent are confused charsets (e.g., text that is labeled as
ISO 8859-1 that needs to be decoded as Windows-1252)?

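For anyone who has a large archive at hand and is willing to check, the
questions above can be turned into a rough tally along the following
lines. This is only a minimal sketch in Python using the standard
library's email package, not a tested survey tool. The function name
charset_stats, the counter keys, and the META_CHARSET pattern are all
made up for illustration, and the ISO 8859-1 check only treats bytes
0x80-0x9F (C1 controls in ISO 8859-1, printable characters in
Windows-1252) as a hint of mislabeling.

```python
import email
import re
from collections import Counter

# Illustrative pattern: matches <meta charset=...> and the
# <meta http-equiv=... content="...; charset=..."> variant.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.I)


def charset_stats(raw_messages):
    """Tally charset observations over an iterable of raw messages (bytes)."""
    stats = Counter()
    for raw in raw_messages:
        # Question 5: 8-bit bytes in the header block (up to the first blank line).
        head = raw.split(b"\r\n\r\n", 1)[0].split(b"\n\n", 1)[0]
        if any(byte >= 0x80 for byte in head):
            try:
                head.decode("utf-8")
                stats["8bit-headers-utf8"] += 1
            except UnicodeDecodeError:
                stats["8bit-headers-non-utf8"] += 1

        msg = email.message_from_bytes(raw)
        for part in msg.walk():
            if part.get_content_maintype() != "text":
                continue

            # Questions 1 and 2: declared charset (lowercased) or none at all.
            declared = part.get_content_charset()
            if declared:
                stats["charset:" + declared] += 1
            else:
                stats["unlabeled-text"] += 1

            body = part.get_payload(decode=True) or b""

            # Question 6: C1-range bytes in text labeled ISO 8859-1.
            # Aliases such as "latin1" are not handled in this sketch.
            if declared == "iso-8859-1" and any(0x80 <= b <= 0x9F for b in body):
                stats["latin1-with-c1-bytes"] += 1

            # Question 4: <meta> charset versus the Content-Type charset.
            if part.get_content_subtype() == "html":
                match = META_CHARSET.search(body)
                if match:
                    stats["html-meta-charset"] += 1
                    meta = match.group(1).decode("ascii").lower()
                    if declared is None:
                        stats["html-meta-only"] += 1
                    elif meta != declared:
                        stats["html-meta-mismatch"] += 1
    return stats
```

Feeding it the raw bytes of each message and printing
stats.most_common() would give a first impression. It does not attempt
autodetection (question 3), and it cannot say which declaration "wins"
in any given client.
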
--
Beware of bugs in the above code; I have only proved it correct, not tried it. -- Donald E. Knuth