From: sweeting at neuronet.com.my (sweeting at neuronet.com.my)
Date: Sun, 11 Apr 1999 05:03:50 GMT
Subject: Search Engines, Chinese and Python.
References: <7eo32a$tbr$1@nnrp1.dejanews.com>
Message-ID: <7epafj$qlg$1@nnrp1.dejanews.com>
Content-Length: 3475
X-UID: 500

I have a sneaky feeling that somebody on the Python list first mentioned
this URL a couple of years back, but I've just rediscovered this great
resource on CJK processing (I knew I bookmarked it for a reason):

http://www.ora.com/people/authors/lunde/cjk_inf.html

So that answers the second half of my own question.

Now if only the Infoseek guys were interested in porting their
engine to the most widely-spoken language in the world ;-)

chas

*just as an aside, it was due to Infoseek that I first looked at Python;
I'd always thought it was the best search engine, and read that they used
Python just as I was despairing over another P-language... haven't looked
back since. :)

sweeting at neuronet.com.my wrote:
> The "how do you build a search engine in Python?" question has been
> asked and answered enough times, so I'll spare you all the agony. Given
> the choice, I'd use Ultraseek*, WAIS or something similar rather than
> rebuild this myself; but I need this to work in Chinese. Foreseeing a
> great demand for this (for myself and in general) and failing to find a
> decent ready-made solution, I figured that I may as well have a stab at
> it. (If all else fails, at least I may improve my Mandarin.)
>
> Snipping from a thread last December:
>
> [snip]
> >Richard Jones <richard.jones at fulcrum.com.au> wrote:
> >: The short answer is "you don't".
> >
> The big answer might be (this is not a Gadfly answer, but hey):
>
> (This uses indexing on item submission to speed fetching.)
>
> A pair of (g)dbm's: one that stores your entries under some unique
> per-item id.
> Churn through each new item looking for words. "Stop and stem" this
> list (i.e. kill "and", "at", "the"; standardise case; collapse "runner",
> "running" -> "run" or whatever. Can be tricky :-) You can automate
> the stop-list just by counting word occurrences.)
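That stop-and-stem step might look something like this -- the stop-list and
suffix rules below are made-up illustrations of the idea, not code from the
original thread, and certainly not a real stemmer:

```python
# Crude "stop and stem" pass: drop stop-words, lowercase, strip suffixes.
STOP_WORDS = {"and", "at", "the", "a", "of", "to"}
SUFFIXES = ("ning", "ner", "ing", "ed", "s")  # illustrative only

def stem(word):
    # Collapse "runner"/"running" -> "run" by stripping a common suffix,
    # but only when enough of the word would remain.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def stop_and_stem(text):
    words = []
    for raw in text.lower().split():
        word = raw.strip('.,!?";:()')
        if word and word not in STOP_WORDS:
            words.append(stem(word))
    return words

print(stop_and_stem("The runner was running at the track"))
# -> ['run', 'was', 'run', 'track']
```

Counting how often each surviving word appears across the whole corpus is then
enough to grow the stop-list automatically, as the post suggests.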
> The HTML parser and rfc822 modules et al. could also be used to pull
> out details for searches like "url:www.host.com".
>
> The second file holds a relation between words and the documents that
> contain them, i.e. an inverted list.
>
> Query comes in: search for "python programming":
> list1 = db2["python"]
> list2 = db2["programming"]
>
> The intersection of the lists is the documents that contain both
> words. As things get big, you may need to overload __getitem__
> to return a smaller list and store things like the number of times
> the word appears in an item (you can then sort the inverted lists
> on this attribute).
>
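Fleshed out, the two-file scheme could be sketched like this. I've used plain
dicts for db1/db2 to keep it self-contained; a real version would open a pair
of (g)dbm files as the post describes, and the sample documents are invented:

```python
db1 = {}   # doc id -> document text
db2 = {}   # inverted list: word -> set of doc ids containing the word

def add_document(doc_id, text):
    # Index on submission, so queries are just lookups.
    db1[doc_id] = text
    for word in text.lower().split():
        db2.setdefault(word, set()).add(doc_id)

def search(query):
    # Intersect the inverted lists for every word in the query.
    results = None
    for word in query.lower().split():
        postings = db2.get(word, set())
        results = postings if results is None else results & postings
    return results or set()

add_document(1, "python programming for search engines")
add_document(2, "programming in perl")
add_document(3, "python for beginners")
print(search("python programming"))   # -> {1}
```

Storing per-document term counts alongside each doc id (rather than a bare
set) is what lets you rank the intersected results later.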
> Hope this helps.
>
> (Start small!)
> -- James Preston, waiting for Godot.
> [/snip]
>
> Is it really as easy as that?
>
> It seems that the real work is in the indexing, and this is going to be
> even more of a chore with Chinese because words aren't separated by
> spaces - so we'll also have to build a parsing engine to work that out :(
>
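One common starting point for that parsing engine is greedy "forward maximum
matching": at each position, take the longest dictionary word that matches.
The tiny lexicon below is a made-up example -- a real segmenter needs a large
wordlist and still stumbles on genuinely ambiguous strings:

```python
# Toy forward-maximum-matching segmenter for Chinese text.
LEXICON = {"中国", "人民", "中", "国", "人", "民"}
MAX_WORD_LEN = max(len(w) for w in LEXICON)

def segment(text):
    words = []
    i = 0
    while i < len(text):
        # Try the longest possible match first; fall back to one character.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in LEXICON or length == 1:
                words.append(candidate)
                i += length
                break
    return words

print(segment("中国人民"))   # -> ['中国', '人民']
```

Once the text is segmented into words, the same stop/stem/invert pipeline
described above applies unchanged.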
> If anybody has worked with Chinese text and has any caveats with regard
> to the above project, or programming double-byte characters in general,
> I'm all ears. I'm still struggling with getting my servers/scripts to
> write Chinese to the screen of Chinese-Windows machines, let alone
> programming this into a database.
>
> Thank you very much,
>
> chas
>
-----------== Posted via Deja News, The Discussion Network ==----------
http://www.dejanews.com/       Search, Read, Discuss, or Start Your Own