From: jmrober1 at ingr.com (Joseph Robertson)
Date: Fri, 23 Apr 1999 08:32:53 -0500
Subject: Python too slow for real world
References: <372068E6.16A4A90@icrf.icnet.uk>
Message-ID: <37207685.F29BE1AB@ingr.com>
Content-Length: 4202
X-UID: 1605

Hi,

For what you describe here, you don't really need to read the 'data' at
all.
Just read your descriptors, and store the offsets and lengths of the data in a
dictionary (i.e. index it).

readline
if first char == >
    get id
    get current position using tell method
    store id, pos in dict
    #for each id, we now have its byte position in the file
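
The indexing pass above could be sketched roughly as follows. This is only an
illustration in present-day Python 3, not code from the original project: the
name build_index and the HEADER_ID pattern are made up here, and the file is
opened in binary mode so that the positions reported by tell() are plain byte
offsets.

```python
import re

# Hypothetical pattern: the identifier is the first word after '>'.
HEADER_ID = re.compile(rb">(\S+)")

def build_index(path):
    """Map each record id to the byte offset where its data lines start."""
    index = {}
    with open(path, "rb") as f:
        while True:
            line = f.readline()
            if not line:            # end of file
                break
            m = HEADER_ID.match(line)
            if m:
                # tell() now points just past the header line,
                # i.e. at the first data line of this record.
                index[m.group(1).decode("ascii")] = f.tell()
    return index
```

Only the header lines are inspected; the sequence data is read past but never
stored, so memory use grows with the number of records rather than with the
file size.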

Then have a filter method which keeps or discards the records by criteria.

for each key in dict
    if key passes filter test
        store key in filtered dict
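
A minimal sketch of that filter step, again in present-day Python 3. The
function name filter_index, the keep callback and the offsets in the example
are assumptions made for illustration; the criterion shown is the exclusion
set from the original question.

```python
def filter_index(index, keep):
    """Return a new dict holding only the ids for which keep(id) is true."""
    return {rid: pos for rid, pos in index.items() if keep(rid)}

# Example: drop the ids listed in an exclusion set.
exclude = {"px00034"}
index = {"px00034": 15, "px00035": 41}   # made-up offsets for illustration
filtered = filter_index(index, lambda rid: rid not in exclude)
```

Since only ids and offsets are copied, several differently filtered views can
be kept around at once without touching the data file.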

Then only at the time you really need that data do you go get it.

for each key in filtered dict
    use seek to position
    read data until next line with > at 0
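
Fetching one record on demand might look like the sketch below (present-day
Python 3; the name read_data is made up here). It assumes the file object was
opened in binary mode and that pos is the byte offset of the record's first
data line, as in the pseudocode above.

```python
def read_data(f, pos):
    """Return the data lines starting at byte offset pos, up to the next '>' line."""
    f.seek(pos)
    chunks = []
    while True:
        line = f.readline()
        # Stop at end of file or at the next record's header line.
        if not line or line.startswith(b">"):
            break
        chunks.append(line)
    return b"".join(chunks)
```

If the lengths are stored in the index as well, the loop can be replaced by a
single f.read(length) after the seek.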

This way you can create views on your data without actually trying to load it
all.  The tradeoff, of course, is memory for file access time, but I found
file access to be faster than doing all the work 'up front'.  Besides, my
project reached the point where we often ran out of memory; some datasets are
on 8+ CD-ROMs!

Hope that was relevant, but maybe I misunderstood the question.
Joe Robertson,
jmrobert at ro.com


Arne Mueller wrote:

> Hi All,
>
> first of all: Sorry for that slightly provocative subject ;-) ...
>
> I just switched from perl to python because I think python makes life
> easier in bigger software projects. However I found out that perl is
> more than 10 times faster than python in solving the following problem:
>
> I've got a file (130 MB) with ~ 300000 datasets of the form:
>
> >px0034 hypothetical protein or whatever description
> LSADQISTVQASFDKVKGDPVGILYAVFKADPSIMAKFTQFAGKDLESIKGTAPFETHAN
> RIVGFFSKIIGELPNIEADVNTFVASHKPRGVTHDQLNNFRAGFVSYMKAHTDFAGAEAA
> WGATLDTFFGMIFSKM
>
> The word following the '>' is an identifier, the uppercase letters in the
> lines following the identifier are the data. Now I want to read and
> write the contents of that file excluding some entries (given by a
> dictionary with identifiers, e.g. 'px0034').
>
> The following python code does the job:
>
> from re import *
> from sys import *
>
> def read_write(i, o, exclude):
>     name = compile('^>(\S+)') # regex to fetch the identifier
>     l = i.readline()
>     while l:
>         if l[0] == '>': # are we in new dataset?
>             m = name.search(l)
>             if m and exclude.has_key(m.group(1)): # excluding current dataset?
>                 l = i.readline()
>                 while l and l[0] != '>':  # skip this dataset
>                     l = i.readline()
>                     pass
>         o.write(l)
>         l = i.readline()
>
> f = open('my_very_big_data_file','r') # datafile with ~300000 records
> read_write(f, stdout, {}) # for a simple test I don't exclude anything!
>
> It took 503.90 sec on an SGI Power Challenge (R10000 CPU). An equivalent
> perl script does the same job in 32 sec (Same method, same loop
> structure)!
>
> Since I have to call this routine about 1500 times, it's a very big
> difference in time and not really acceptable.
>
> I'd really like to know why python is so slow (or perl is so fast?) and
> what I can do to improve the speed of that routine.
>
> I don't want to switch back to perl - but honestly, is python the right
> language to process such a huge amount of data?
>
> If you want to generate a test set you could use the following lines to
> print 10000 datasets to stdout:
>
> for i in xrange(1, 10001):
>     print '>px%05d\nLSADQISTVQASFDKVKGDPVGILYAVFKADPSIMAKFTQFAGKDLESIKGTAPFETHAN\n\
> RIVGFFSKIIGELPNIEADVNTFVASHKPRGVTHDQLNNFRAGFVSYMKAHTDFAGAEAA\n\
> WGATLDTFFGMIFSKM\n' % i
>
> And if you don't believe me that perl does the job quicker you can try
> the perl code below:
>
> #!/usr/local/bin/perl -w
> open(IN,"test.dat");
> my %ex = ();
> read_write(%ex);
>
> sub read_write{
>
>   $l = <IN>;
>  OUTER: while( defined $l ){
>     if( (($x) = $l =~ /^>(\S+)/) ){
>       if( exists $ex{$x} ){
>         $l = <IN>;
>         while( defined $l && !($l =~ /^>(\S+)/) ){
>           $l = <IN>;
>         }
>         next OUTER;
>       }
>     }
>     print $l;
>     $l = <IN>;
>   }
> }
>
> Please do convince me that being a python programmer does not mean being slow
> ;-)
>
>         Thanks very much for any help,
>
>         Arne