MBOX-Line: From mrc+imap at panda.com Tue Nov 1 09:33:40 2011
To: imap-protocol@u.washington.edu
From: Mark Crispin <mrc+imap@panda.com>
Date: Fri Jun 8 12:34:47 2018
Subject: [Imap-protocol] Cyrus and RFC5255
In-Reply-To: <1320148200.20407.140660993144221@webmail.messagingengine.com>
References: <8F0DA5FA-FB07-4AFA-9C58-8F0927998343@ghoti.org><alpine.OSX.2.00.1110301823210.9034@hsinghsing.panda.com><CABa8R6uyrHJ6AqQoGfMcLztG-zovaVK4FEFep8ibpYc2-6b7ig@mail.gmail.com><alpine.OSX.2.00.1110311352000.9034@hsinghsing.panda.com><CABa8R6ugaAADsrvaS1Nnau5bi7k9728b+x00NcG1Nbxfk9xwsA@mail.gmail.com><75E51A44-91E4-48D1-BA82-789C82583ABC@iki.fi><alpine.OSX.2.00.1110311532190.9034@hsinghsing.panda.com>
	<1320102709.31835.140660992941501@webmail.messagingengine.com>
	<alpine.OSX.2.00.1110311633050.9034@hsinghsing.panda.com>
	<1320148200.20407.140660993144221@webmail.messagingengine.com>
Message-ID: <alpine.OSX.2.00.1111010850250.9034@hsinghsing.panda.com>

On Tue, 1 Nov 2011, Bron Gondwana wrote:
>> RFC 5255 explicitly requires that you apply i;unicode-casemap in searches
>> as part of level 1 compliance.
> The response when I mentioned it to our project manager was "it's often nice
> not to worry about a vs ? when searching - and have it find both".

Careful.

i;unicode-casemap is designed to be a simple collator/comparator that even
a baby programmer can implement correctly. It is not intended to be
something from which people can fork off all sorts of random
non-interoperable variants.

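To give a feel for how small such a comparator is, here is a rough Python
sketch in the spirit of i;unicode-casemap as defined in RFC 5051: map case
per character, decompose, then compare the UTF-8 octets. It is only an
approximation. Python's str.title() and NFKD normalization stand in for the
titlecase and decomposition properties the RFC refers to, and the function
names casemap_key, casemap_equal and casemap_contains are made up for this
example.

```python
import unicodedata


def casemap_key(s: str) -> bytes:
    """Approximate comparison key: per-character titlecase, NFKD, then UTF-8.

    Assumption: str.title() and NFKD are close stand-ins for the Unicode
    properties used by the real comparator; this is not a conforming
    implementation.
    """
    titled = "".join(ch.title() for ch in s)
    decomposed = unicodedata.normalize("NFKD", titled)
    return decomposed.encode("utf-8")


def casemap_equal(a: str, b: str) -> bool:
    """Equality is a plain octet comparison of the two keys."""
    return casemap_key(a) == casemap_key(b)


def casemap_contains(haystack: str, needle: str) -> bool:
    """Substring search on the keys (UTF-8 is self-synchronizing)."""
    return casemap_key(needle) in casemap_key(haystack)
```

Note what this does and does not do: "Hello" matches "hELLO", and a
precomposed accented letter matches its decomposed spelling, but "a" does
not match an accented "a". Diacriticals are preserved, not ignored.
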
It also formalized, and moderately amended, what Cyrus has done from its
inception in searching Unicode strings.

You will probably need to define a different comparator for that purpose
(e.g., i;unicode-casemap-ignore-diacriticals). Beyond that, you will
quickly find yourself in a swamp filled with alligators (or crocodiles if
you prefer). Even the modest step of an "ignore-diacriticals" comparator
will get you wet above the knee.

If you want to get into the type of matching you are talking about, you
will wind up needing to do a full-fledged implementation of i18n collation
and comparison, which more likely than not includes locale sensitivity.
This is not something to be half-assed or hackish about. There are
standards and rules; and in some cases these are enforced in national laws.

I strongly urge you, BEFORE embarking upon such a project, to get involved
with the various groups involved with i18n collation and comparison and
seek their advice.

I did not do i;unicode-casemap in a vacuum; I sought their advice and
after their screams of anguished horror, these guys gave good advice which
I took seriously and acted upon. One of the things that was important to
them was that, while (reluctantly) accepting the "we need something that
even a baby programmer can implement", they wanted to draw the line and
say "do this, or do it right."

With this said, I don't particularly object to ignore-diacriticals
searching; but I also note that the concept is locale-dependent. In some
languages, the diacritical form indicates accent or sound; in others it is
a completely unrelated character (and the latter group already is
infuriated by i;unicode-casemap).

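A naive "ignore-diacriticals" comparison might look like the Python sketch
below: decompose, drop the nonspacing marks, casefold, compare. The names
strip_marks and loose_equal are invented for this example, and nothing here
is a registered comparator. It is shown mainly to make the locale problem
concrete.

```python
import unicodedata


def strip_marks(s: str) -> str:
    """Decompose (NFD) and drop nonspacing combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def loose_equal(a: str, b: str) -> bool:
    """Hypothetical locale-blind comparison ignoring case and diacriticals."""
    return strip_marks(a).casefold() == strip_marks(b).casefold()
```

This makes "résumé" match "resume", which is the behavior the project
manager asked for. It also makes "å" match "a", which is wrong for a
Swedish reader, for whom these are distinct letters. And it does not make
"ø" match "o", because U+00F8 has no decomposition, so the result is not
even consistent across similar-looking letters.
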
CJK is another part of the swamp. For example, U+5FB3 徳 and U+5FB7 德
are fundamentally the same character; they have the same meaning and
differ only by an added stroke in the Chinese/Korean form that the
Japanese form lacks. Yet at least one Chinese character set has both
forms. Adult CJK native speakers would say that the two should match in
search; and many would have to have that one stroke difference pointed out
to them before they'd notice it.

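The Python sketch below shows why this case falls outside anything a
generic comparator gives you. It takes the pair to be the Japanese form 徳
(U+5FB3) and the Chinese/Korean form 德 (U+5FB7), which is an assumption
about the characters intended. Both are ordinary unified ideographs, so
Unicode normalization and case folding leave them distinct; matching them
needs an explicit variant table. VARIANT_FOLD, fold_variants and
generic_key are hypothetical names, and the one-entry table is a toy.

```python
import unicodedata

JAPANESE_FORM = "\u5fb3"        # form without the extra stroke
CHINESE_KOREAN_FORM = "\u5fb7"  # form with the extra stroke

# Hypothetical hand-maintained variant table; a real one would be large
# and would need review by people who know the languages involved.
VARIANT_FOLD = {JAPANESE_FORM: CHINESE_KOREAN_FORM}


def generic_key(s: str) -> str:
    """Key built only from generic Unicode machinery (NFKC plus casefold)."""
    return unicodedata.normalize("NFKC", s).casefold()


def fold_variants(s: str) -> str:
    """Replace each character by its table entry, if it has one."""
    return "".join(VARIANT_FOLD.get(ch, ch) for ch in s)
```

With generic_key the two forms stay different; only after fold_variants do
they compare equal.
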
But that's just a simple case. CJK is full of these, and most are far more
complicated. There are lots of cases where the equivalency is one way;
that is, A is equivalent to B, but B is NOT equivalent to A (or worse is
SOMETIMES equivalent to A). At this point, the swamp reptiles are over
your head.

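A one-way equivalence cannot be expressed by folding both sides to a common
key; the table has to be directed, from what was typed to what it may
match. The Python sketch below illustrates that under stated assumptions.
The example pair is an illustration added here, not taken from the thread:
simplified 发 corresponds to both traditional 發 and 髮, so a query for 发
may reasonably match either, while a query for 髮 should not match 發.
MATCHES, char_matches and contains are hypothetical names.

```python
# Hypothetical directed table: query character -> characters it may match.
# Characters without an entry match only themselves.
MATCHES = {
    "发": {"发", "發", "髮"},
}


def char_matches(query_ch: str, text_ch: str) -> bool:
    """True if text_ch is an acceptable match for query_ch (not symmetric)."""
    return text_ch in MATCHES.get(query_ch, {query_ch})


def contains(text: str, query: str) -> bool:
    """Naive substring search using the directed character relation."""
    n = len(query)
    return any(
        all(char_matches(q, t) for q, t in zip(query, text[i:i + n]))
        for i in range(len(text) - n + 1)
    )
```

Here contains("頭髮", "发") is true but contains("头发", "髮") is false.
The "SOMETIMES equivalent" cases are worse still, since a static table like
this cannot express them at all.
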
The bottom line is that, whatever you do, seek the advice of the language
folks. Your implementation will have to be tempered by realism; but at
least you can avoid a mistake. Undoing a mistake is far more costly.

-- Mark --

http://panda.com/mrc
Democracy is two wolves and a sheep deciding what to eat for lunch.
Liberty is a well-armed sheep contesting the vote.