From: Brian at digicool.com (Brian Lloyd)
Date: Fri, 23 Apr 1999 14:34:45 GMT
Subject: Python too slow for real world
Message-ID: <613145F79272D211914B0020AFF6401914DAD8@gandalf.digicool.com>
Content-Length: 4724
X-UID: 430

> Hi All,
>
> first of all: Sorry for that slightly provoking subject ;-) ...
>
> I just switched from Perl to Python because I think Python makes life
> easier in bigger software projects. However, I found out that Perl is
> more than 10 times faster than Python in solving the following problem:
>
> I've got a file (130 MB) with ~300000 datasets of the form:
>
> >px0034 hypothetical protein or whatever description
> LSADQISTVQASFDKVKGDPVGILYAVFKADPSIMAKFTQFAGKDLESIKGTAPFETHAN
> RIVGFFSKIIGELPNIEADVNTFVASHKPRGVTHDQLNNFRAGFVSYMKAHTDFAGAEAA
> WGATLDTFFGMIFSKM
>
> The word following the '>' is an identifier, and the uppercase letters
> in the lines following the identifier are the data. Now I want to read
> and write the contents of that file, excluding some entries (given by
> a dictionary of identifiers, e.g. 'px0034').
>
> The following Python code does the job:
>
> from re import *
> from sys import *
>
> def read_write(i, o, exclude):
>     name = compile('^>(\S+)')  # regex to fetch the identifier
>     l = i.readline()
>     while l:
>         if l[0] == '>':  # are we in a new dataset?
>             m = name.search(l)
>             if m and exclude.has_key(m.group(1)):  # excluding current dataset?
>                 l = i.readline()
>                 while l and l[0] != '>':  # skip this dataset
>                     l = i.readline()
>                 pass
>         o.write(l)
>         l = i.readline()
>
> f = open('my_very_big_data_file','r')  # datafile with ~300000 records
> read_write(f, stdout, {})  # for a simple test I don't exclude anything!
>
> It took 503.90 sec on an SGI Power Challenge (R10000 CPU). An
> appropriate Perl script does the same job in 32 sec (same method, same
> loop structure)!
>
> Since I have to call this routine about 1500 times, it's a very big
> difference in time and not really acceptable.
>
> I'd really like to know why Python is so slow (or Perl is so fast?)
> and what I can do to improve the speed of that routine.
>
> I don't want to switch back to Perl - but honestly, is Python the
> right language to process such huge amounts of data?
>
> ...
>
> Please do convince me that being a Python programmer does not mean
> being slow ;-)
>
> Thanks very much for any help,
>
> Arne

Arne,

While I'm not going to go near comparing Python to Perl, I will comment
that different languages are just that - different. As such, the
approach you would take in one language may not be the most appropriate
(or comparable in speed or efficiency) to the approach you would take
in another.

The question here (IMHO) is not Python's appropriateness for processing
large datasets (a fair number of scientist-types do this all the time),
or even the speed of Python in general, but using the most appropriate
algorithms in the context of the language in use.

For example, Perl is very regex-centric, so your example Perl
implementation is probably perfectly appropriate for Perl. Python tends
to be more optimized for the general case, and if it were _me_, I
wouldn't bother with using regular expressions in this case. Since you
have a predictable file format, there are more specific (and efficient)
Python tools that you could use here.

There are also some general optimizations that can be used in places
where speed is an issue, such as avoiding repeated attribute lookups
(esp. in loops). This version of your read_write function uses the same
basic algorithm, but forgoes re for more specific tools (slicing,
string.split) and has some examples of optimizations to minimize
attribute lookups. I haven't timed it or anything, but I'd be surprised
if it wasn't noticeably faster.

Hope this helps!

import sys, string

def read_write(input, output, exclude):

    # These assignments will save us a lot of attribute
    # lookups over the course of the big loop...
    ignore = exclude.has_key
    readline = input.readline
    write = output.write
    split = string.split

    line = readline()
    while line:
        if line[0] == '>':
            # knowing that the first char is a '>' and that
            # the rest of the chars up to the first space are
            # the id, we can avoid using re here...
            key = split(line)[0][1:]
            if ignore(key):
                line = readline()
                while line and line[0] != '>':
                    # skip this record
                    line = readline()
                continue
        write(line)
        line = readline()

file = open('my_very_big_data_file', 'r')  # datafile with ~300000 records
read_write(file, sys.stdout, {})
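
If it turns out that the per-call overhead of readline itself is what's
eating the time, another thing worth trying is pulling lines in big
batches with readlines() and a size hint. The version below is just an
untested sketch - the 64K buffer size is an arbitrary choice of mine,
and I've used a string method for splitting where on 1.5.2 you'd go
through the string module as above. It also replaces the inner skip
loop with a simple flag, which has the side benefit of handling
back-to-back excluded records:

```python
def read_write_batched(input, output, exclude):
    # Pull lines in large batches to cut down on per-line call
    # overhead; readlines() with a size hint reads roughly that many
    # bytes' worth of whole lines at a time.
    write = output.write
    skipping = 0
    while 1:
        lines = input.readlines(65536)
        if not lines:
            break
        for line in lines:
            if line[:1] == '>':
                # the identifier runs from just after the '>' up to
                # the first whitespace
                key = line[1:].split()[0]
                skipping = exclude.get(key, 0)
            if not skipping:
                write(line)
```

The flag approach means each header line decides the fate of every
line up to the next header, so consecutive excluded records fall out
naturally.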

Brian Lloyd          brian at digicool.com
Software Engineer    540.371.6909
Digital Creations    http://www.digicool.com
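
P.S. Whatever variant you try, time them all the same way. A crude
wall-clock timer like the one below (the name time_call is mine, just
for illustration) is plenty of resolution when the difference you're
chasing is 30 sec vs. 500 sec:

```python
import time

def time_call(fn, *args):
    # wall-clock seconds for a single call; coarse, but fine for
    # comparing runs that take tens or hundreds of seconds
    start = time.time()
    result = fn(*args)
    return time.time() - start, result
```

e.g. elapsed, _ = time_call(read_write, file, sys.stdout, {}) for each
version, on the same input file.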