MBOX-Line: From Pidgeot18 at verizon.net Tue Jan 14 16:37:33 2014
To: imap-protocol@u.washington.edu
From: Joshua Cranmer
Date: Fri Jun 8 12:34:51 2018
Subject: [Imap-protocol] Email charset statistics
Message-ID: <52D5D84D.6070208@verizon.net>

A recent concern of mine has been attempting to work out the grand
messiness that is charsets in the context of reading and parsing email
messages. I am not aware of any prior attempts to assess the practice of
charsets in email, so I can only offer evidence from personal anecdote and
culling of bug reports on open-source software, neither of which is a good
source of information. I was wondering if anyone else on this list had
access to a larger database of messages that they could check or have more
specific generalities that are needed.

The particular questions that are useful to me, in order of importance:

1. What charsets (and aliases) are needed to support mail?
2. How prevalent is unlabeled text in mail?
3. Do we need to support charset autodetection? Which charsets, and for
   which languages?
4. How prevalent is HTML (or the variant) in a message? How often do these
   parts have no charset declaration in their Content-Type? How often are
   they different, and who wins?
5. How prevalent are non-UTF-8 8-bit headers?
6. How prevalent are confused charsets (e.g., text that is labeled as
   ISO 8859-1 that needs to be decoded as Windows-1252)?

-- 
Beware of bugs in the above code; I have only proved it correct, not tried
it. -- Donald E. Knuth
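
For anyone with a mail corpus at hand, questions 1 and 2 could be approached with a simple tally of declared charsets. The following is only a rough sketch using Python's standard `email` and `mailbox` modules, not a vetted survey tool. The function names `tally_charsets` and `tally_mbox` and the `(unlabeled)` bucket are made up for illustration, and it assumes the corpus is stored in mbox format. It counts only what each text part declares in its Content-Type; it does not attempt autodetection or check whether the label is correct.

```python
import mailbox
from collections import Counter
from email import message_from_bytes
from email.message import Message
from typing import Iterable

UNLABELED = "(unlabeled)"  # hypothetical bucket name for parts with no charset


def tally_charsets(messages: Iterable[Message]) -> Counter:
    """Count the declared charset of every text/* part in the given messages."""
    counts: Counter = Counter()
    for msg in messages:
        for part in msg.walk():
            # Parts without a Content-Type default to text/plain, so they
            # are included here and end up in the unlabeled bucket.
            if part.get_content_maintype() != "text":
                continue
            # get_content_charset() returns the lowercased charset
            # parameter, or None when the parameter is absent.
            charset = part.get_content_charset()
            counts[charset if charset is not None else UNLABELED] += 1
    return counts


def tally_mbox(path: str) -> Counter:
    """Tally charsets over an existing mbox file (path supplied by the caller)."""
    box = mailbox.mbox(path, create=False)
    try:
        return tally_charsets(box)
    finally:
        box.close()


if __name__ == "__main__":
    # Two in-memory messages stand in for a real corpus.
    labeled = message_from_bytes(
        b"Content-Type: text/plain; charset=ISO-8859-1\n\nhello\n"
    )
    unlabeled = message_from_bytes(b"Subject: test\n\nno charset here\n")
    for name, count in tally_charsets([labeled, unlabeled]).most_common():
        print(name, count)
```

Charset names are tallied as written (lowercased), so aliases such as `latin1` and `iso-8859-1` show up as separate entries, which is itself useful input for question 1.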