From: gjohnson at showmaster.com (Tony Johnson)
Date: Fri, 23 Apr 1999 16:03:57 GMT
Subject: Python too slow for real world
In-Reply-To: <372068E6.16A4A90@icrf.icnet.uk>
References: <372068E6.16A4A90@icrf.icnet.uk>
Message-ID: <000401be8da2$e5172430$7153cccf@showmaster.com>
Content-Length: 3295
X-UID: 38

I find Python syntax less taxing than Perl's (i.e., fewer lines). You may
need to check your Python code and see how you can optimize it further...
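
For example, one direction worth a try (just a rough, untested sketch - the
name read_write2 and the 1 MB chunk size are purely illustrative) is to read
the file in big chunks, drop the regex, and keep a flag that says whether the
current record is excluded:

import string

def read_write2(i, o, exclude):
    # sketch only: same job as the read_write in the quoted message below,
    # minus the re module and the per-line readline() calls
    write = o.write
    has_key = exclude.has_key
    skipping = 0
    while 1:
        lines = i.readlines(1024 * 1024)   # read up to ~1 MB of lines at once
        if not lines:
            break
        for l in lines:
            if l[0] == '>':
                # the identifier is simply the first word after '>'
                skipping = has_key(string.split(l[1:])[0])
            if not skipping:
                write(l)

That keeps the per-line work down to little more than a comparison and a
write; whether it actually closes the gap to Perl is something only a timing
run on the real data can tell.
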
Tony Johnson
System Administrator
Demand Publishing Inc.
-----Original Message-----
From: python-list-request at cwi.nl [mailto:python-list-request at cwi.nl] On
Behalf Of Arne Mueller
Sent: Friday, April 23, 1999 7:35 AM
To: python-list at cwi.nl
Subject: Python too slow for real world

Hi All,

first of all: sorry for that slightly provoking subject ;-) ...

I just switched from Perl to Python because I think Python makes life
easier in bigger software projects. However, I found out that Perl is
more than 10 times faster than Python at solving the following problem:
I've got a file (130 MB) with ~ 300000 datasets of the form:
>px0034 hypothetical protein or whatever description
LSADQISTVQASFDKVKGDPVGILYAVFKADPSIMAKFTQFAGKDLESIKGTAPFETHAN
RIVGFFSKIIGELPNIEADVNTFVASHKPRGVTHDQLNNFRAGFVSYMKAHTDFAGAEAA
WGATLDTFFGMIFSKM

The word following the '>' is an identifier; the uppercase letters in the
lines following the identifier are the data. Now I want to read and
write the contents of that file, excluding some entries (given by a
dictionary of identifiers, e.g. 'px0034').

The following Python code does the job:

from re import *
from sys import *

def read_write(i, o, exclude):
    name = compile('^>(\S+)')          # regex to fetch the identifier
    l = i.readline()
    while l:
        if l[0] == '>':                # are we in a new dataset?
            m = name.search(l)
            if m and exclude.has_key(m.group(1)):  # exclude current dataset?
                l = i.readline()
                while l and l[0] != '>':           # skip this dataset
                    l = i.readline()
        o.write(l)
        l = i.readline()

f = open('my_very_big_data_file', 'r')  # datafile with ~300000 records
read_write(f, stdout, {})  # for a simple test I don't exclude anything!

It took 503.90 sec on an SGI Power Challenge (R10000 CPU). An equivalent
Perl script does the same job in 32 sec (same method, same loop
structure)!

Since I have to call this routine about 1500 times, that's a very big
difference in runtime and not really acceptable.

I'd really like to know why Python is so slow (or Perl is so fast?) and
what I can do to improve the speed of that routine.

I don't want to switch back to Perl - but honestly, is Python the right
language to process such a huge amount of data?

If you want to generate a test set, you could use the following lines to
print 10000 datasets to stdout:

for i in xrange(1, 10001):
    print '>px%05d\nLSADQISTVQASFDKVKGDPVGILYAVFKADPSIMAKFTQFAGKDLESIKGTAPFETHAN\n\
RIVGFFSKIIGELPNIEADVNTFVASHKPRGVTHDQLNNFRAGFVSYMKAHTDFAGAEAA\n\
WGATLDTFFGMIFSKM\n' % i
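
(If you redirect that output into a file called test.dat, the Perl script
below can read it as-is - that's the filename it opens.)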

And if you don't believe me that Perl does the job quicker, you can try
the Perl code below:

#!/usr/local/bin/perl -w

open(IN, "test.dat");
my %ex = ();
read_write(%ex);

sub read_write {
    $l = <IN>;
    OUTER: while( defined $l ){
        if( (($x) = $l =~ /^>(\S+)/) ){
            if( exists $ex{$x} ){
                $l = <IN>;
                while( defined $l && !($l =~ /^>(\S+)/) ){
                    $l = <IN>;
                }
                next OUTER;
            }
        }
        print $l;
        $l = <IN>;
    }
}

Please do convince me that being a Python programmer does not mean being
slow ;-)

Thanks very much for any help,
Arne