MBOX-Line: From imap at maclean.com Tue Apr 7 17:58:01 2015
To: imap-protocol@u.washington.edu
From: Pete Maclean
Date: Fri Jun 8 12:34:54 2018
Subject: [Imap-protocol] SEARCH semantics
In-Reply-To: <55246A71.27553.323D151D@David.Harris.pmail.gen.nz>
References: <55246A71.27553.323D151D@David.Harris.pmail.gen.nz>
Message-ID:

David,

I went through what you are going through a couple of years ago. My original SEARCH implementation was also a shambles, and it was only the coming of a major new customer that prompted me to rework it. I never had a single complaint about the original code, though, which I put down to the lack of support for SEARCH in clients. Today I think it is much more important to have good SEARCH, especially with IMAP servers more and more fronting email archives in addition to conventional email servers.

I am well aware that the specification is fuzzy and I know that some implementations take great liberties. Some servers, for example, treat text searches as word-based while the RFC demands that they be string-based. An excuse for making them word-based is that the data just happens to be word-indexed. While I cannot support this, I suppose it is not too terrible, because users these days are so accustomed to word-based searches (that is what Web search engines do) that they might be surprised at the results of a string-based search.

Let me now tell you how I implement things:

1. BODY. I thread through all the MIME parts in the message and select only those that have a Content-Type of "text" or "message". I convert each such part to Unicode and then apply the search criteria. I make no attempt to search parts that would typically be considered attachments. If, in an HTML part, a phrase being searched for is broken up by tags, it will not be found. Likewise if it contains entities. I could do better in this regard, and your bringing up the subject may prompt me to review a number of my own choices.

2. Headers.
I unfold headers and normalize everything to Unicode before searching.

3. xx SEARCH OR OR . I have no idea how safe it is to use such an expression but my server handles it beautifully.

4. xx SEARCH OR ( ) exp4. I share your understanding of this expression.

I also added support for ESEARCH when I did my revamp but have little idea of how much it gets used.

Pete

At 07:38 PM 4/7/2015, David Harris wrote:
>I'm in the process of completely rewriting the SEARCH logic in my
>IMAP server - the old code was done in a hurry and was, quite frankly,
>ridiculously bad, but that's another story.
>
>As I get into testing cases, I've come across a number of areas where
>RFC3501 and the various sub-documents that I know about are... uh,
>"vague". I'd like to get a take on how other implementors view them.
>
>1: BODY: When a SEARCH BODY expression is issued, how should "BODY" be
>interpreted? Is there an assumption that the server should choose the
>best candidate for a displayable message body, parse and normalize it,
>then search that? Or should it simply be taken as a raw scan of the
>message? How much unarmouring and character set normalization is
>assumed?
>
>2: Headers: when any of the header search expressions is issued, is the
>assumption that the raw header should be searched, or should RFC2047
>encoded-words be reduced and normalized before attempting the
>comparison?
>
>3: The following search is valid, according to the syntax in RFC3501:
>
>    xx SEARCH OR OR
>
>and allows an OR expression to cover three terms instead of just two.
>As such, it seems quite useful, but it would certainly have mystified
>my old search code (it was rubbish, as I've pointed out), and I was
>wondering how generally safe it would be to use this type of
>expression?
>
>4: I'm pretty sure I'm right on this one, but the following expression:
>
>    xx SEARCH OR ( ) exp4
>
>will only result in a match if either is a match, or ALL of , and are
>a match. Could someone wiser than me confirm this?
>I'm assuming there is no way to perform a search with a long list of
>OR conditions without doing a lot of calisthenics on the search string
>(multiple OR conditions strung together).
>
>I apologize if any of these are dealt with in RFCs outside RFC3501 -
>I struggle to keep track of all the various sub-documents relating to
>the protocol these days.
>
>Thanks in advance for any advice.
>
>Cheers!
>
>-- David --
>
>------------------ David Harris -+- Pegasus Mail ----------------------
>Box 5451, Dunedin, New Zealand | e-mail: David.Harris@pmail.gen.nz
>           Phone: +64 3 453-6880 | Fax: +64 3 453-6612
>
>Schoolboy howler for the day:
>    "A census taker is the man who goes from home to home
>     increasing the population."
>
>
>_______________________________________________
>Imap-protocol mailing list
>Imap-protocol@u.washington.edu
>http://mailman13.u.washington.edu/mailman/listinfo/imap-protocol
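
For illustration, here is a minimal Python sketch of the approach Pete describes in his points 1 and 2: walk the MIME parts, keep the text ones, convert each to Unicode, and apply a string-based (not word-based), case-insensitive match; for headers, unfold and decode RFC 2047 encoded-words before comparing. This is not Pete's code. The function names are hypothetical, it relies on Python's standard email package, and skipping parts marked "Content-Disposition: attachment" is only one assumed reading of "parts that would typically be considered attachments". Like the implementation described, it makes no attempt to strip HTML tags or expand entities.

```python
import re
from email import message_from_bytes
from email.header import decode_header, make_header


def searchable_text(msg):
    """Yield the Unicode text of each text/* part of a parsed message.

    walk() also descends into message/rfc822 parts, so text inside an
    embedded message is covered; every other media type is skipped.
    """
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        # Assumption: an explicit "attachment" disposition means "do not search".
        if part.get_content_disposition() == "attachment":
            continue
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if payload is None:
            continue
        charset = part.get_content_charset() or "us-ascii"
        try:
            yield payload.decode(charset, errors="replace")
        except LookupError:  # unknown charset label: fall back to a total decoder
            yield payload.decode("latin-1")


def body_contains(msg, needle):
    """String-based, case-insensitive match over the searchable text parts."""
    needle = needle.casefold()
    return any(needle in text.casefold() for text in searchable_text(msg))


def header_contains(msg, name, needle):
    """Unfold the named header, decode encoded-words, then substring-match."""
    needle = needle.casefold()
    for raw in msg.get_all(name, []):
        unfolded = re.sub(r"\r?\n(?=[ \t])", "", str(raw))  # drop folding line breaks
        decoded = str(make_header(decode_header(unfolded)))
        if needle in decoded.casefold():
            return True
    return False


# Hypothetical sample message: an encoded-word Subject, a folded X-Note header,
# a base64 text part ("Hello World") and a non-text part that is not searched.
SAMPLE = (
    b"Subject: =?utf-8?b?Q2Fmw6k=?= menu\r\n"
    b"X-Note: first\r\n second\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/mixed; boundary="b"\r\n'
    b"\r\n"
    b"--b\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    b"SGVsbG8gV29ybGQ=\r\n"
    b"--b\r\n"
    b"Content-Type: application/octet-stream\r\n"
    b"\r\n"
    b"Hello binary\r\n"
    b"--b--\r\n"
)
sample_msg = message_from_bytes(SAMPLE)
print(body_contains(sample_msg, "lo wo"), header_contains(sample_msg, "Subject", "CAFÉ"))
```

Note that "lo wo" matches because the comparison is on strings rather than words, which is the distinction Pete draws above; a word-indexed server would not find it.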