From: gjohnson at showmaster.com (Tony Johnson)
Date: Fri, 23 Apr 1999 16:03:57 GMT
Subject: Python too slow for real world
In-Reply-To: <372068E6.16A4A90@icrf.icnet.uk>
References: <372068E6.16A4A90@icrf.icnet.uk>
Message-ID: <000401be8da2$e5172430$7153cccf@showmaster.com>
Content-Length: 3295
X-UID: 38

I find Python's syntax less taxing than Perl's (i.e., fewer lines). You may
need to check your Python code and see how you can optimize it further...

Tony Johnson
System Administrator
Demand Publishing Inc.

-----Original Message-----
From: python-list-request at cwi.nl [mailto:python-list-request at cwi.nl]On
Behalf Of Arne Mueller
Sent: Friday, April 23, 1999 7:35 AM
To: python-list at cwi.nl
Subject: Python too slow for real world

Hi All,

first of all: sorry for the slightly provoking subject ;-) ...

I just switched from Perl to Python because I think Python makes life
easier in bigger software projects. However, I found out that Perl is
more than 10 times faster than Python at solving the following problem:

I've got a file (130 MB) with ~300000 datasets of the form:

>px0034 hypothetical protein or whatever description
LSADQISTVQASFDKVKGDPVGILYAVFKADPSIMAKFTQFAGKDLESIKGTAPFETHAN
RIVGFFSKIIGELPNIEADVNTFVASHKPRGVTHDQLNNFRAGFVSYMKAHTDFAGAEAA
WGATLDTFFGMIFSKM

The word following the '>' is an identifier, and the uppercase letters in
the lines following the identifier are the data. Now I want to read and
write the contents of that file, excluding some entries (given by a
dictionary of identifiers, e.g. 'px0034').

The following Python code does the job:

import re
import sys

def read_write(i, o, exclude):
    name = re.compile(r'^>(\S+)')  # regex to fetch the identifier
    l = i.readline()
    while l:
        if l[0] == '>':  # are we at a new dataset?
            m = name.search(l)
            if m and m.group(1) in exclude:  # excluding current dataset?
                l = i.readline()
                while l and l[0] != '>':  # skip this dataset
                    l = i.readline()
                continue  # re-check the header that ended the skip
        o.write(l)
        l = i.readline()

f = open('my_very_big_data_file', 'r')  # datafile with ~300000 records
read_write(f, sys.stdout, {})  # for a simple test I don't exclude anything!

It took 503.90 sec on an SGI Power Challenge (R10000 CPU). An equivalent
Perl script does the same job in 32 sec (same method, same loop
structure)!

Since I have to call this routine about 1500 times, that's a very big
difference in time and not really acceptable.

I'd really like to know why Python is so slow (or Perl is so fast?) and
what I can do to improve the speed of that routine.

I don't want to switch back to Perl - but honestly, is Python the right
language to process such huge amounts of data?

If you want to generate a test set, you could use the following lines to
print 10000 datasets to stdout:

for i in xrange(1, 10001):
    print '>px%05d\nLSADQISTVQASFDKVKGDPVGILYAVFKADPSIMAKFTQFAGKDLESIKGTAPFETHAN\n\
RIVGFFSKIIGELPNIEADVNTFVASHKPRGVTHDQLNNFRAGFVSYMKAHTDFAGAEAA\n\
WGATLDTFFGMIFSKM\n' % i

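To put rough numbers on the read loop itself, the same records can be
built and timed in memory (a sketch using the standard io and time
modules; the 10000-record count matches the generator above, everything
else is illustrative):

```python
import io
import time

# build the same 10000 records in memory (4 lines each)
record = ('LSADQISTVQASFDKVKGDPVGILYAVFKADPSIMAKFTQFAGKDLESIKGTAPFETHAN\n'
          'RIVGFFSKIIGELPNIEADVNTFVASHKPRGVTHDQLNNFRAGFVSYMKAHTDFAGAEAA\n'
          'WGATLDTFFGMIFSKM\n')
data = ''.join(['>px%05d\n%s' % (n, record) for n in range(1, 10001)])

f = io.StringIO(data)
t0 = time.time()
count = 0
l = f.readline()
while l:  # same readline loop shape as the filter routine
    count = count + 1
    l = f.readline()
print('%d lines in %.3f sec' % (count, time.time() - t0))
```
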
And if you don't believe me that Perl does the job quicker, you can try
the Perl code below:

#!/usr/local/bin/perl -w
open(IN, "test.dat");
my %ex = ();
read_write(%ex);

sub read_write {
    $l = <IN>;
    OUTER: while( defined $l ){
        if( (($x) = $l =~ /^>(\S+)/) ){
            if( exists $ex{$x} ){
                $l = <IN>;
                while( defined $l && !($l =~ /^>(\S+)/) ){
                    $l = <IN>;
                }
                next OUTER;
            }
        }
        print $l;
        $l = <IN>;
    }
}

Please do convince me that being a Python programmer does not mean being
slow ;-)

Thanks very much for any help,

Arne