MBOX-Line: From imap at maclean.com Tue Apr 7 17:58:01 2015
To: imap-protocol@u.washington.edu
From: Pete Maclean <imap@maclean.com>
Date: Fri Jun 8 12:34:54 2018
Subject: [Imap-protocol] SEARCH semantics
In-Reply-To: <55246A71.27553.323D151D@David.Harris.pmail.gen.nz>
References: <55246A71.27553.323D151D@David.Harris.pmail.gen.nz>
Message-ID: <mailman.21.1528486494.22076.imap-protocol@mailman13.u.washington.edu>

David, I went through what you are going through a couple of years
ago. My original SEARCH implementation was also a shambles, and it
was only the coming of a major new customer that prompted me to
rework it. I never had a single complaint about the original code,
though, which I put down to the lack of support for SEARCH in
clients. Today I think it is much more important to have good SEARCH,
especially with IMAP servers more and more fronting email archives in
addition to conventional email servers.

I am well aware that the specification is fuzzy, and I know that some
implementations take great liberties. Some servers, for example,
treat text searches as word-based while the RFC demands that they be
string-based. An excuse for making them word-based is that the data
just happens to be word-indexed. While I cannot support this, I
suppose it is not too terrible, because users these days are so
accustomed to word-based searches (because that is what Web search
engines do) that they might be surprised at the results of a
string-based search.

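To make the difference concrete, here is a minimal Python sketch of
the two behaviours. It is only an illustration: the function names
are made up, and casefold() is a stand-in for the case-insensitive
substring comparison RFC 3501 asks for, not a full treatment of
character sets.

```python
import re


def substring_match(needle: str, text: str) -> bool:
    """String-based match: needle may occur anywhere in text."""
    return needle.casefold() in text.casefold()


def word_match(needle: str, text: str) -> bool:
    """Word-based match: every word of needle must be a whole word of text."""
    words = set(re.findall(r"\w+", text.casefold()))
    return all(w in words for w in re.findall(r"\w+", needle.casefold()))


text = "Please reschedule the meeting"
print(substring_match("schedule", text))  # True: found inside "reschedule"
print(word_match("schedule", text))       # False: no whole word "schedule"
```

A word-indexed server would miss the message here, while a
string-based search returns it.
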
Let me now tell you how I implement things:

1. BODY. I thread through all the MIME parts in the message and
select only those that have a Content-Type of "text" or "message". I
convert each such part to Unicode and then apply the search
criteria. I make no attempt to search parts that would typically be
considered attachments. If, in an HTML part, a phrase being searched
for is broken up by tags, it will not be found; likewise if it
contains entities. I could do better in this regard, and your
bringing up the subject may prompt me to review a number of my own
choices.

2. Headers. I unfold headers and normalize everything to Unicode
before searching.

3. xx SEARCH OR OR <exp1> <exp2> <exp3>. I have no idea how safe it
is to use such an expression, but my server handles it beautifully.

4. xx SEARCH OR (<exp1> <exp2> <exp3>) <exp4>. I share your
understanding of this expression.

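The nesting in items 3 and 4 can be checked with a small prefix
evaluator. What follows is a sketch under assumptions, not code from
any real server: it knows only OR, NOT and parenthesized lists, treats
every other token (exp1, exp2, ...) as a hypothetical placeholder key
whose result is looked up in a dict, and does no error handling for
unbalanced parentheses.

```python
def tokenize(s):
    """Split a search string into tokens, isolating parentheses."""
    return s.replace("(", " ( ").replace(")", " ) ").split()


def parse_key(tokens, pos):
    """Parse one search key at tokens[pos]; return (tree, next_pos)."""
    tok = tokens[pos]
    if tok == "OR":  # OR takes exactly two search keys
        left, pos = parse_key(tokens, pos + 1)
        right, pos = parse_key(tokens, pos)
        return ("or", left, right), pos
    if tok == "NOT":  # NOT takes exactly one search key
        inner, pos = parse_key(tokens, pos + 1)
        return ("not", inner), pos
    if tok == "(":  # a parenthesized list is an AND of its keys
        keys, pos = [], pos + 1
        while tokens[pos] != ")":
            key, pos = parse_key(tokens, pos)
            keys.append(key)
        return ("and", *keys), pos + 1
    return ("term", tok), pos + 1  # placeholder key such as exp1


def parse(s):
    """Parse a whole search string; top-level keys are ANDed together."""
    tokens, keys, pos = tokenize(s), [], 0
    while pos < len(tokens):
        key, pos = parse_key(tokens, pos)
        keys.append(key)
    return ("and", *keys)


def evaluate(tree, truth):
    """Evaluate a parsed tree against a dict of placeholder results."""
    op = tree[0]
    if op == "term":
        return truth[tree[1]]
    if op == "not":
        return not evaluate(tree[1], truth)
    if op == "or":
        return evaluate(tree[1], truth) or evaluate(tree[2], truth)
    return all(evaluate(t, truth) for t in tree[1:])


# Item 3: OR OR a b c parses as OR(OR(a, b), c), so any one term suffices.
print(evaluate(parse("OR OR exp1 exp2 exp3"),
               {"exp1": False, "exp2": False, "exp3": True}))  # True

# Item 4: the parenthesized list is an AND, so two out of three is not enough.
print(evaluate(parse("OR (exp1 exp2 exp3) exp4"),
               {"exp1": True, "exp2": True, "exp3": False, "exp4": False}))  # False
```

Because OR always consumes exactly two keys, a long list of
alternatives has to be written as a chain of nested ORs, as in item 3.
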
I also added support for ESEARCH when I did my revamp but have little
idea of how much it gets used.

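Going back to items 1 and 2, the approach can be sketched with
Python's standard email package. This is an illustration under
assumptions and not the code of the server described above: the
function names are invented, text inside an embedded message part is
reached by walking into it, a part counts as an attachment only if its
Content-Disposition says so, and unknown charsets are not handled.
Like the implementation described in item 1, it does nothing about
HTML tags or entities.

```python
from email import message_from_string, policy


def text_parts(msg):
    """Yield the decoded text of each text/* leaf part, skipping attachments."""
    for part in msg.walk():  # walk() also descends into embedded messages
        if part.is_multipart():
            continue
        if part.get_content_maintype() != "text":
            continue
        if part.get_content_disposition() == "attachment":
            continue
        yield part.get_content()  # transfer decoding and charset applied


def body_matches(msg, needle):
    """Case-insensitive substring search over the decoded text parts."""
    needle = needle.casefold()
    return any(needle in text.casefold() for text in text_parts(msg))


def header_matches(msg, name, needle):
    """Case-insensitive substring search over the decoded header values."""
    needle = needle.casefold()
    # With policy.default, header values come back unfolded and with
    # RFC 2047 encoded-words already decoded to Unicode.
    return any(needle in str(v).casefold() for v in msg.get_all(name, []))


raw = (
    "Subject: =?utf-8?q?Caf=C3=A9_menu?=\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "Lunch at the caf=C3=A9 today\n"
)
msg = message_from_string(raw, policy=policy.default)
print(header_matches(msg, "Subject", "caf\u00e9"))  # searches "Café menu"
print(body_matches(msg, "CAF\u00c9 TODAY"))         # searches the decoded body
```

Searching the raw bytes instead would miss both matches here, since
the accented character appears on the wire only as =C3=A9.
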
Pete

At 07:38 PM 4/7/2015, David Harris wrote:
>I'm in the process of completely rewriting the SEARCH logic in my
>IMAP server - the old code was done in a hurry and was, quite
>frankly, ridiculously bad, but that's another story.
>
>As I get into testing cases, I've come across a number of areas where
>RFC3501 and the various sub-documents that I know about are... uh,
>"vague". I'd like to get a take on how other implementors view them.
>
>1: BODY: When a SEARCH BODY expression is issued, how should "BODY" be
>interpreted? Is there an assumption that the server should choose the best
>candidate for a displayable message body, parse and normalize it, then search
>that? Or should it simply be taken as a raw scan of the message? How much
>unarmouring and character set normalization is assumed?
>
>2: Headers: when any of the header search expressions is issued, is the
>assumption that the raw header should be searched, or should RFC2047
>encoded-words be reduced and normalized before attempting the comparison?
>
>3: The following search is valid, according to the syntax in RFC3501:
>
>    xx SEARCH OR OR <exp1> <exp2> <exp3>
>
>and allows an OR expression to cover three terms instead of just two.
>As such, it seems quite useful, but it would certainly have mystified
>my old search code (it was rubbish, as I've pointed out), and I was
>wondering how generally safe it would be to use this type of
>expression?
>
>4: I'm pretty sure I'm right on this one, but the following expression:
>
>    xx SEARCH OR (<exp1> <exp2> <exp3>) <exp4>
>
>will only result in a match if either <exp4> is a match, or ALL of
><exp1>, <exp2> and <exp3> are a match. Could someone wiser than me
>confirm this? I'm assuming there is no way to perform a search with a
>long list of OR conditions without doing a lot of calisthenics on the
>search string (multiple OR conditions strung together).
>
>I apologize if any of these are dealt with in RFCs outside RFC3501 -
>I struggle to keep track of all the various sub-documents relating to
>the protocol these days.
>
>Thanks in advance for any advice.
>
>Cheers!
>
>-- David --
>
>------------------ David Harris -+- Pegasus Mail ----------------------
>Box 5451, Dunedin, New Zealand | e-mail: David.Harris@pmail.gen.nz
> Phone: +64 3 453-6880 | Fax: +64 3 453-6612
>
>Schoolboy howler for the day:
> "A census taker is the man who goes from home to home
> increasing the population."
>
>
>_______________________________________________
>Imap-protocol mailing list
>Imap-protocol@u.washington.edu
>http://mailman13.u.washington.edu/mailman/listinfo/imap-protocol