MBOX-Line: From Pidgeot18 at verizon.net  Tue Jan 14 16:37:33 2014
To: imap-protocol@u.washington.edu
From: Joshua Cranmer (In a comment <Pidgeot18@verizon.net>)
Date: Fri Jun  8 12:34:51 2018
Subject: [Imap-protocol] Email charset statistics
Message-ID: <52D5D84D.6070208@verizon.net>

A recent concern of mine has been trying to work out the grand
messiness that is charsets in the context of reading and parsing email
messages. I am not aware of any prior attempts to assess how charsets
are actually used in email, so I can only offer evidence from personal
anecdote and from culling bug reports on open-source software, neither
of which is a good source of information. I was wondering if anyone else
on this list has access to a larger database of messages that they could
check, or has more specific knowledge of what is needed.

The particular questions that are useful to me, in order of importance:
1. What charsets (and aliases) are needed to support mail?
2. How prevalent is unlabeled text in mail?
3. Do we need to support charset autodetection? Which charsets, and for 
which languages?
4. How prevalent is HTML <meta charset> (or the <meta http-equiv>
variant) in a message? How often do these parts have no charset
declaration in their Content-Type? How often do the two declarations
differ, and which one wins?
5. How prevalent are non-UTF-8 8-bit headers?
6. How prevalent are confused charsets (e.g., text that is labeled as 
ISO 8859-1 that needs to be decoded as Windows-1252)?
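
For anyone with a large archive to hand, here is a rough sketch of how
questions 1, 2 and 6 might be measured. It assumes Python's standard
email package; the function names, the LATIN1_LABELS set and the
archive path in the usage comment are hypothetical, and the C1-byte
check is only a heuristic for "probably Windows-1252", not proof.

```python
from collections import Counter

# Hypothetical, incomplete set of labels treated as ISO 8859-1 here.
LATIN1_LABELS = {"iso-8859-1", "iso_8859-1", "latin1"}


def tally_charsets(messages):
    """Count declared charsets of text/* parts; unlabeled parts count under None."""
    counts = Counter()
    for msg in messages:
        for part in msg.walk():
            if part.get_content_maintype() != "text":
                continue
            # None when the part has no Content-Type header or no charset parameter.
            charset = part.get_content_charset()
            counts[charset] += 1
    return counts


def has_c1_bytes(part):
    """True if a part labeled ISO 8859-1 contains bytes in 0x80-0x9F.

    Those bytes are control characters in ISO 8859-1 but printable
    characters (curly quotes, dashes, ...) in Windows-1252.
    """
    if part.get_content_charset() not in LATIN1_LABELS:
        return False
    payload = part.get_payload(decode=True)  # bytes, or None for multipart
    if payload is None:
        return False
    return any(0x80 <= b <= 0x9F for b in payload)


# Possible usage over a local archive (path is an assumption):
#   import mailbox
#   box = mailbox.mbox("archive.mbox", create=False)
#   print(tally_charsets(box).most_common())
```

The tally lumps together "no Content-Type at all" and "Content-Type
without a charset parameter"; separating the two would need an extra
check on the raw header.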

-- 
Beware of bugs in the above code; I have only proved it correct, not tried it. -- Donald E. Knuth