153 lines
4.4 KiB
Plaintext
From: jmrober1 at ingr.com (Joseph Robertson)
Date: Fri, 23 Apr 1999 08:32:53 -0500
Subject: Python too slow for real world
References: <372068E6.16A4A90@icrf.icnet.uk>
Message-ID: <37207685.F29BE1AB@ingr.com>
Content-Length: 4202
X-UID: 1605

Hi,

For what you state here, you don't even really need to read the 'data' at
all. Just read your descriptors, and store the offset and length of each
record's data in a dictionary (i.e. index it).

readline
if first char == '>'
    get id
    get current position using the tell method
    store id, pos in dict
# for each id, we now have its byte position in the file

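A minimal sketch of that indexing pass, written for a current Python 3 rather than the Python of this thread. The name build_index and the in-memory sample are made up for illustration. The file is handled in binary mode so that tell() returns plain byte offsets.

```python
import io

def build_index(f):
    """Return {record id: byte offset of the first data line after its header}."""
    index = {}
    while True:
        line = f.readline()
        if not line:                  # end of file
            break
        if line.startswith(b">"):     # descriptor line
            fields = line[1:].split()
            if fields:
                # tell() now points just past the header, i.e. at the data
                index[fields[0].decode("ascii")] = f.tell()
    return index

# Hypothetical two-record sample standing in for the real data file.
sample = io.BytesIO(b">px0001 first\nAAAA\nCCCC\n>px0002 second\nGGGG\n")
index = build_index(sample)
```

With a real file, something like open(path, 'rb') would take the place of the BytesIO object. Only the ids and offsets stay in memory, never the sequence data.
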
Then have a filter method which keeps or discards the records by criteria.

for each key in dict
    if key passes filter test
        store key in filtered dict

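The filter step could look roughly like this in Python 3. The name filter_index, the offsets and the exclude set are all invented for the example. The test is passed in as a function so different 'views' can reuse the same index.

```python
def filter_index(index, keep):
    """Return a new dict holding only the entries whose id passes keep(id)."""
    return {key: pos for key, pos in index.items() if keep(key)}

# Made-up index and exclusion set, for illustration only.
index = {"px0001": 14, "px0034": 39, "px0100": 80}
exclude = {"px0034"}
filtered = filter_index(index, lambda key: key not in exclude)
```

The original index is left untouched, so several filtered dictionaries can coexist cheaply.
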
Then only at the time you really need that data do you go get it.
for each in filtered_dict
    use seek to position
    read data until next line with > at 0

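A sketch of the fetch step, again in Python 3 and with made-up names (read_record, sample, filtered). It assumes the stored offset points at the first data line of a record, and that the file is open in binary mode so seek() takes byte offsets.

```python
import io

def read_record(f, pos):
    """Return the data lines starting at byte offset pos, up to the next '>' line."""
    f.seek(pos)
    lines = []
    while True:
        line = f.readline()
        if not line or line.startswith(b">"):   # end of file or next record
            break
        lines.append(line)
    return b"".join(lines)

# Hypothetical sample file and the offsets an index pass would have stored.
sample = io.BytesIO(b">px0001 first\nAAAA\nCCCC\n>px0002 second\nGGGG\n")
filtered = {"px0001": 14, "px0002": 39}
records = {key: read_record(sample, pos) for key, pos in filtered.items()}
```

If the length of each record were stored alongside its offset, a single f.read(length) could replace the line loop.
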
This way you can create views on your data without actually trying to load it
all. The tradeoff, of course, is memory for file access time, but I found
file access to be faster than doing all the work 'up front'. Besides, my
project reached the point where we often ran out of memory; some datasets are
on 8+ CD-ROMs!

Hope that was relevant, but maybe I misunderstood the question.
Joe Robertson,
jmrobert at ro.com

Arne Mueller wrote:

> Hi All,
>
> first of all: Sorry for that slightly provoking subject ;-) ...
>
> I just switched from perl to python because I think python makes life
> easier in bigger software projects. However I found out that perl is
> more than 10 times faster than python in solving the following problem:
>
> I've got a file (130 MB) with ~ 300000 datasets of the form:
>
> >px0034 hypothetical protein or whatever description
> LSADQISTVQASFDKVKGDPVGILYAVFKADPSIMAKFTQFAGKDLESIKGTAPFETHAN
> RIVGFFSKIIGELPNIEADVNTFVASHKPRGVTHDQLNNFRAGFVSYMKAHTDFAGAEAA
> WGATLDTFFGMIFSKM
>
> The word following the '>' is an identifier, the uppercase letters in the
> lines following the identifier are the data. Now I want to read and
> write the contents of that file excluding some entries (given by a
> dictionary with identifiers, e.g. 'px0034').
>
> The following python code does the job:
>
> from re import *
> from sys import *
>
> def read_write(i, o, exclude):
>     name = compile('^>(\S+)') # regex to fetch the identifier
>     l = i.readline()
>     while l:
>         if l[0] == '>': # are we in new dataset?
>             m = name.search(l)
>             if m and exclude.has_key(m.group(1)): # excluding current dataset?
>                 l = i.readline()
>                 while l and l[0] != '>': # skip this dataset
>                     l = i.readline()
>                 pass
>         o.write(l)
>         l = i.readline()
>
> f = open('my_very_big_data_file','r') # datafile with ~300000 records
> read_write(f, stdout, {}) # for a simple test I don't exclude anything!
>
> It took 503.90 sec on a SGI Power Challenge (R10000 CPU). An appropriate
> perl script does the same job in 32 sec (Same method, same loop
> structure)!
>
> Since I've to call this routine about 1500 times it's a very big
> difference in time and not really acceptable.
>
> I'd really like to know why python is so slow (or perl is so fast?) and
> what I can do to improve speed of that routine.
>
> I don't want to switch back to perl - but honestly, is python the right
> language to process such a huge amount of data?
>
> If you want to generate a test set you could use the following lines to
> print 10000 datasets to stdout:
>
> for i in xrange(1, 10001):
>     print '>px%05d\nLSADQISTVQASFDKVKGDPVGILYAVFKADPSIMAKFTQFAGKDLESIKGTAPFETHAN\n\
> RIVGFFSKIIGELPNIEADVNTFVASHKPRGVTHDQLNNFRAGFVSYMKAHTDFAGAEAA\n\
> WGATLDTFFGMIFSKM\n' % i
>
> And if you don't believe me that perl does the job quicker you can try
> the perl code below:
>
> #!/usr/local/bin/perl -w
> open(IN,"test.dat");
> my %ex = ();
> read_write(%ex);
>
> sub read_write{
>
>     $l = <IN>;
>     OUTER: while( defined $l ){
>         if( (($x) = $l =~ /^>(\S+)/) ){
>             if( exists $ex{$x} ){
>                 $l = <IN>;
>                 while( defined $l && !($l =~ /^>(\S+)/) ){
>                     $l = <IN>;
>                 }
>                 next OUTER;
>             }
>         }
>         print $l;
>         $l = <IN>;
>     }
> }
>
> Please do convince me being a python programmer does not mean being slow
> ;-)
>
> Thanks very much for any help,
>
> Arne