MBOX-Line: From brong at fastmail.fm Tue Apr 7 17:39:29 2015
To: imap-protocol@u.washington.edu
From: Bron Gondwana
Date: Fri Jun 8 12:34:54 2018
Subject: [Imap-protocol] SEARCH semantics
In-Reply-To: <55246A71.27553.323D151D@David.Harris.pmail.gen.nz>
References: <55246A71.27553.323D151D@David.Harris.pmail.gen.nz>
Message-ID: <1428453569.748713.250513633.3671D3CF@webmail.messagingengine.com>

On Wed, Apr 8, 2015, at 09:38 AM, David Harris wrote:
> I'm in the process of completely rewriting the SEARCH logic in my IMAP server -
> the old code was done in a hurry and was, quite frankly, ridiculously bad, but that's
> another story.
>
> As I get into testing cases, I've come across a number of areas where RFC3501
> and the various sub-documents that I know about are... uh, "vague". I'd like to get a
> take on how other implementors view them.
>
> 1: BODY: When a SEARCH BODY expression is issued, how should "BODY" be
> interpreted? Is there an assumption that the server should choose the best
> candidate for a displayable message body, parse and normalize it, then search
> that? Or should it simply be taken as a raw scan of the message? How much
> unarmouring and character set normalization is assumed?

Cyrus streams each body part through decoding (qp/base64) and charset
handling (generates a stream of int32 unicode codepoints) - which then
feeds into the search engine to look for matches. If any part matches,
then the message matches.

> 2: Headers: when any of the header search expressions is issued, is the
> assumption that the raw header should be searched, or should RFC2047
> encoded-words be reduced and normalized before attempting the comparison?

Likewise - there's a header parser which generates the unicode points
for search.

> 3: The following search is valid, according to the syntax in RFC3501:
>
>     xx SEARCH OR OR
>
> and allows an OR expression to cover three terms instead of just two.
> As such, it seems quite useful, but it would certainly have mystified my old
> search code (it was rubbish, as I've pointed out), and I was wondering how
> generally safe it would be to use this type of expression?

Very. That's totally standard, and anything which doesn't support it is
totally bogus.

> 4: I'm pretty sure I'm right on this one, but the following expression:
>
>     xx SEARCH OR ( ) exp4
>
> will only result in a match if either is a match, or ALL of ,
> and are a match. Could someone wiser than me confirm this? I'm
> assuming there is no way to perform a search with a long list of OR conditions
> without doing a lot of calisthenics on the search string (multiple OR conditions
> strung together).

It's hardly calisthenics, it's just prefix notation. You can just as
well do

OR A OR B OR C D

depending whether you want the tree to bias right or bias left. Even
this is valid

OR OR A OR B C D

As is obvious when you write it out as a tree.

OR
- OR
= - A
= - OR
= = - B
= = - C
- D

> I apologize if any of these are dealt with in RFCs outside RFC3501 - I struggle to
> keep track of all the various sub-documents relating to the protocol these days.
>
> Thanks in advance for any advice.

Cheers,

Bron.

--
  Bron Gondwana
  brong@fastmail.fm
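
[Illustration of the prefix-notation point in the reply to question 4. This is a minimal sketch in Python, not Cyrus code: parse_prefix, parse_expr and render are hypothetical helpers, and search keys are simplified to single tokens such as A, B, C, D. Real IMAP search keys take arguments, and parenthesized lists are ANDed, which this sketch does not handle.]

```python
def parse_prefix(tokens):
    """Consume one prefix expression from an iterator of tokens.

    OR always takes exactly two operands, each of which may itself be an
    OR expression. Any other token is treated as a leaf search key
    (an assumption made to keep the sketch small).
    """
    tok = next(tokens, None)
    if tok is None:
        raise ValueError("unexpected end of search expression")
    if tok == "OR":
        left = parse_prefix(tokens)
        right = parse_prefix(tokens)
        return ("OR", left, right)
    return tok


def parse_expr(text):
    """Parse a whole expression and reject trailing tokens."""
    tokens = iter(text.split())
    tree = parse_prefix(tokens)
    leftover = list(tokens)
    if leftover:
        raise ValueError("trailing tokens: " + " ".join(leftover))
    return tree


def render(tree, depth=0):
    """Return the tree as a list of indented lines, one node per line."""
    pad = "  " * depth
    if isinstance(tree, tuple):
        _, left, right = tree
        return [pad + "OR"] + render(left, depth + 1) + render(right, depth + 1)
    return [pad + tree]


if __name__ == "__main__":
    for expr in ("OR A OR B OR C D", "OR OR A OR B C D"):
        print(expr)
        print("\n".join(render(parse_expr(expr))))
        print()
```

Both strings from the reply parse without lookahead or backtracking, because each OR simply consumes the next two complete expressions. The only difference between them is which side of the tree nests deeper.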