From: mlh at idt.ntnu.no (Magnus L. Hetland)
Date: 23 Apr 1999 15:55:53 +0200
Subject: Python too slow for real world
References: <372068E6.16A4A90@icrf.icnet.uk>

Arne Mueller writes:

> Hi All,
>
> first off all: Sorry for that slightly provoking subject ;-) ...

[...]

> The following python code does the job:

[...]

>     f = open('my_very_big_data_file','r') # datafile with ~300000 records
>     read_write(f, stdout, {}) # for a simple test I don't exclude
>                               # anything!

Well -- re is known to be slow. If you have to be fast, maybe you
should try not to use regular expressions; you could perhaps use
something from the string module (several options there), or maybe
even consider fixed-length fields for the identifiers, which should
speed things up a bit.

> It took 503.90 sec on a SGI Power Challange (R10000 CPU). An appropiate
> perl script does the same job in 32 sec (Same method, same loop
> structure)!

Hm. Perl probably has a more efficient implementation of Perl regexes
than Python does, naturally enough...

> I'd realy like to know why python is so slow (or perl is so fast?) and
> what I can do to improove speed of that routine.

Well -- at least I have made one suggestion... Though it may not
explain it all...

> I don't want to switch back to perl - but honestly, is python the right
> language to process souch huge amount of data?
>
> If you want to generate a test set you could use the following lines to
> print 10000 datasets to stdout:
>
> for i in xrange(1, 10001):
>     print \
> '>px%05d\nLSADQISTVQASFDKVKGDPVGILYAVFKADPSIMAKFTQFAGKDLESIKGTAPFETHAN\n\
> RIVGFFSKIIGELPNIEADVNTFVASHKPRGVTHDQLNNFRAGFVSYMKAHTDFAGAEAA\n\
> WGATLDTFFGMIFSKM\n' % i
>
> And if you don't believe me that perl does the job quicker you can try
> the perl code below:

[...]

OK. Using your test set, I tried the following program (it may not
work exactly like your script...). I have made the assumption that all
the ids have a constant length of 7.
----------
import fileinput

exclude = {'px00003': 1}
skip = 0
for line in fileinput.input():
    if line[0] == '>':
        id = line[1:8]
        if exclude.has_key(id):
            skip = 1
        else:
            skip = 0
    if not skip:
        print line,
-----------

It took about 12 seconds.

> Please do convince me being a python programmer does not mean being slow
> ;-)

At least I tried...

> Thanks very much for any help,
>
>     Arne

--
  Hi!  I'm the signature virus 99!            Magnus
  Copy me into your signature and             Lie
  join the fun!                               Hetland
  http://arcadia.laiv.org
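[Editor's note: the fixed-length-id filter above, rewritten as a
self-contained function in current Python syntax. This is a sketch,
not the original code; the 7-character id and the '>'-prefixed header
lines are assumptions carried over from the test set in the post.]

```python
import sys

def filter_records(lines, exclude, id_len=7):
    """Yield lines, skipping whole records whose id is in `exclude`.

    A record starts with a '>' header line; the id is assumed to be
    the fixed-length field immediately after the '>' (7 characters
    in the test set above).
    """
    skip = 0
    for line in lines:
        if line[:1] == '>':
            # Membership test against a set/dict replaces the regex match.
            skip = line[1:1 + id_len] in exclude
        if not skip:
            yield line

if __name__ == '__main__':
    # Filter stdin to stdout, excluding the record 'px00003'.
    sys.stdout.writelines(filter_records(sys.stdin, {'px00003'}))
```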