MBOX-Line: From mrc at CAC.Washington.EDU Fri Apr 14 11:30:20 2006
To: imap-protocol@u.washington.edu
From: Mark Crispin <mrc@CAC.Washington.EDU>
Date: Fri Jun 8 12:34:37 2018
Subject: [Imap-protocol] SEARCH
In-Reply-To: <1145029156.10727.81.camel@hurina>
References: <1145029156.10727.81.camel@hurina>
Message-ID: <Pine.OSX.4.64.0604141039280.15258@pangtzu.panda.com>

On Fri, 14 Apr 2006, Timo Sirainen wrote:
> If I read this correctly, searching logic works like this:
>
> If search charset is US-ASCII, server either
> a) simply does a substring match for the entire message, or
> b) decodes MIME parts based on Content-Transfer-Encoding header and
> decodes the MIME headers themselves, and then does substring matching
>
> If search charset is not US-ASCII, only b) is allowed.

My personal opinion (not "Mr. IMAP Protocol"):

(a) is the IMAP2 interpretation and (b) is the IMAP4 interpretation, and
a full IMAP4 server always does (b).

However, since US-ASCII is the only mandatory-to-implement search charset,
an IMAP4 server which only implements US-ASCII and only does (a) remains
compliant. This is a compatibility-with-the-past wart, and new server
implementations should not do this (except as a development step).

At a minimum, UTF-8 SHOULD be supported as a search charset.

(a) has known problems. It has false positives and false negatives
because it does not decode the content. This occurs even with ASCII.
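
To make the false positive/negative point concrete, here is a minimal
Python sketch. The sample message and the names raw_search and
decoded_search are invented for illustration; this is not how any
particular server does it, only a picture of (a) versus (b) on an
all-ASCII message that happens to be base64-encoded.

```python
import email
from email import policy

# Hypothetical sample: the text is plain ASCII, but transfer-encoded as base64.
RAW = (
    b"From: a@example.com\r\n"
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=us-ascii\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    b"SGVsbG8gd29ybGQ=\r\n"  # base64 of "Hello world"
)


def raw_search(raw: bytes, key: str) -> bool:
    """Interpretation (a): substring match on the undecoded octets."""
    return key.lower().encode("ascii") in raw.lower()


def decoded_search(raw: bytes, key: str) -> bool:
    """Interpretation (b), body only: undo the transfer encoding first."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    for part in msg.walk():
        if part.is_multipart() or part.get_content_maintype() != "text":
            continue
        if key.lower() in part.get_content().lower():
            return True
    return False
```

With this message, raw_search(RAW, "hello") is False although the body
says "Hello world" (a false negative), and raw_search(RAW, "sgvs") is
True although that string appears nowhere in the decoded text (a false
positive). decoded_search gives the opposite answer in both cases.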

> If the search key is invalid for the given character set, should server
> return BAD error to client? Are non-ASCII characters in search key
> invalid for US-ASCII charset?

I'm not certain what you mean by "invalid".

Do you mean "contain a codepoint that is not in that charset"? If so, I
think a failed match is better than a BAD, since it may be that the server
has an obsolete version of that charset's definition.
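
One way to picture "failed match rather than BAD" is the sketch below.
It assumes the server already holds the decoded message texts as
Python strings; search_substring and its parameters are hypothetical
names, and an unknown charset name is deliberately left unhandled
here.

```python
from typing import List


def search_substring(texts: List[str], key: bytes,
                     charset: str = "us-ascii") -> List[int]:
    """Return 1-based indexes of the texts that contain the key.

    If the key holds octets that are not valid in the stated charset,
    nothing matches; the command itself is not rejected.
    """
    try:
        needle = key.decode(charset)
    except UnicodeDecodeError:
        # Key is not representable in this charset: a failed match.
        return []
    needle = needle.casefold()
    return [i for i, text in enumerate(texts, 1) if needle in text.casefold()]
```

For example, with texts ["Hello", "café"], the key b"caf\xc3\xa9"
matches nothing when the charset is US-ASCII and matches the second
text when the charset is UTF-8.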

> What about if search key contains non-ASCII characters but no charset
> parameter is given? Currently I assume this means just doing a substring
> search from messages without doing any charset conversions (i;octet
> comparator).

It can mean whatever you want, although perhaps a failed match is best.
Or maybe a BAD in this case, because the specification does denounce use
of 8-bit strings without a charset identification in section 4.3.1.

It's not defined, and in such cases servers can do as they want. As with
other undefined situations, it may be defined in the future for servers
that advertise an extension.

Clients which do undefined things are broken.

> I don't see it clearly mentioned how searching MIME parts should work,
> but since it only talks about substring matching, I assume that it
> shouldn't really care about MIME parts that much.

I don't understand this comment. You should apply content transfer
decoding. Most people canonicalize search keys and MIME parts into UTF-8
prior to matching. Also look into case coercion and decomposition,
although the IMAP i18n/stringprep specification will specify this (and
more) in detail so maybe you want to hold off.
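
A rough sketch of that canonicalization, under stated assumptions: the
part has already had its content transfer encoding removed, Python's
Unicode str stands in for "canonicalize into UTF-8", and NFD plus
casefold() stands in for whatever decomposition and case coercion the
i18n work ends up specifying. The names canonicalize and matches are
made up.

```python
import unicodedata


def canonicalize(text: str) -> str:
    """Decompose, fold case, then decompose again (folding can recompose)."""
    folded = unicodedata.normalize("NFD", text).casefold()
    return unicodedata.normalize("NFD", folded)


def matches(key: bytes, key_charset: str,
            part: bytes, part_charset: str) -> bool:
    """Substring match after bringing key and part into one canonical form."""
    needle = canonicalize(key.decode(key_charset))
    haystack = canonicalize(part.decode(part_charset, errors="replace"))
    return needle in haystack
```

With this, a UTF-8 key "CAFÉ" matches an ISO-8859-1 part containing
"un café", and so does a key that spells the é as "e" followed by a
combining acute accent.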

> Especially BODY searching talks about searching from message bodies. Are
> MIME part headers part of a message body? I guess not, because UW-IMAP
> skips them.

TEXT searches search the entire message including RFC 2822 and MIME
metadata.

BODY searches omit RFC 2822 and MIME metadata.
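
The TEXT/BODY distinction can be sketched as follows. This is only an
illustration built on Python's email package, not a description of
UW-IMAP or any other server; the sample message and the names
_haystacks, text_search and body_search are hypothetical, and only
text parts are searched.

```python
import email
from email import policy

# Hypothetical sample message with one text part.
RAW = (
    b"Subject: Trip report\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/mixed; boundary="XYZ"\r\n'
    b"\r\n"
    b"--XYZ\r\n"
    b"Content-Type: text/plain; charset=us-ascii\r\n"
    b"\r\n"
    b"See you in Paris\r\n"
    b"--XYZ--\r\n"
)


def _haystacks(raw: bytes, include_metadata: bool):
    """Yield the strings a search looks at, one per header or text part."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    for part in msg.walk():
        if include_metadata:
            # RFC 2822 headers and MIME part headers, already decoded.
            for name, value in part.items():
                yield f"{name}: {value}"
        if not part.is_multipart() and part.get_content_maintype() == "text":
            yield part.get_content()


def text_search(raw: bytes, key: str) -> bool:
    """TEXT: headers, MIME metadata, and decoded part content."""
    return any(key.casefold() in h.casefold() for h in _haystacks(raw, True))


def body_search(raw: bytes, key: str) -> bool:
    """BODY: decoded part content only."""
    return any(key.casefold() in h.casefold() for h in _haystacks(raw, False))
```

Here text_search(RAW, "trip report") is True while
body_search(RAW, "trip report") is False, and both find "paris".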

> More interesting are MIME footer and trailer sections. Should they be
> searched? UW-IMAP skips them.

I consider these not to be part of a message at all for any MIME-savvy
application.

> What about MIME boundary lines? UW-IMAP
> searches these, but not if you include its "--" prefix in search key.

Are you certain that you aren't confusing BODY and TEXT searches? A TEXT
search would find them, because they appear in the MIME header.

> Is "Header: value" searching required to work? I think it is, and works
> with UW-IMAP.

What do you mean by this? If you're talking about a TEXT search, then it
may or may not work depending upon the octets in a message. You should be
using a "HEADER Header: value" search instead.
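
As an illustration of why a HEADER search is more dependable than a
TEXT search for this, consider a Subject that is stored as an RFC 2047
encoded-word. On the wire the key has the form
HEADER <field-name> <string>. The sketch below is a guess at the
matching step only, with a made-up message and a hypothetical
header_search function.

```python
import email
from email import policy

# Hypothetical sample: the Subject is an RFC 2047 encoded-word.
RAW = (
    b"From: a@example.com\r\n"
    b"Subject: =?UTF-8?Q?Caf=C3=A9_menu?=\r\n"
    b"\r\n"
    b"See attached.\r\n"
)


def header_search(raw: bytes, field: str, key: str) -> bool:
    """Substring match against the decoded value(s) of one header field.

    An empty key matches any message that has the field at all.
    """
    msg = email.message_from_bytes(raw, policy=policy.default)
    values = msg.get_all(field, [])
    return any(key.casefold() in str(value).casefold() for value in values)
```

header_search(RAW, "Subject", "café") is True here, whereas the octets
of the message never contain "café" literally, so an undecoded
substring match on "Subject: café" would miss it.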

> Is "line\r\nline2" (as literal of course with real CR+LF)
> searching required to work in message body? Again, I think so and works
> with UW-IMAP.

Yes, it should in a TEXT search. But see below.

> But then is "Header: value\r\nHeader2: value2" searching
> required to work? I don't see why not, but this doesn't work anymore
> with UW-IMAP.

Once again, I'd like to understand what you mean by this.

If you're talking about a TEXT search, I don't see why it shouldn't work,
although it might be that you have a mailbox format that uses UNIX-style
newlines and the data was not CRLF-converted.

    HEADER Header: {....}
    value\r\nHeader2: value2
will certainly not work.

I don't think that it is useful for a client to have newlines in a search
key. Some servers try to do fuzzy matching, so for example if you search
for "Joe's trip to Paris" there will be a match even if it was broken by a
newline.
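
One simple way such fuzzy matching could work is to collapse every run
of whitespace, line breaks included, to a single space on both sides
before comparing. This is only a sketch of the idea, not a claim about
what any particular server does; fold_ws and fuzzy_contains are
hypothetical names.

```python
import re


def fold_ws(s: str) -> str:
    """Collapse each run of whitespace (including CR/LF) to one space."""
    return re.sub(r"\s+", " ", s)


def fuzzy_contains(text: str, key: str) -> bool:
    """Case-insensitive substring match that ignores where lines break."""
    return fold_ws(key).casefold() in fold_ws(text).casefold()
```

With text "About Joe's trip to\r\nParis last year", the key
"Joe's trip to Paris" matches under fuzzy_contains although an exact
substring test fails.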

-- Mark --

http://panda.com/mrc
Democracy is two wolves and a sheep deciding what to eat for lunch.
Liberty is a well-armed sheep contesting the vote.