From: Brian at digicool.com (Brian Lloyd)
Date: Fri, 23 Apr 1999 14:34:45 GMT
Subject: Python too slow for real world
Message-ID: <613145F79272D211914B0020AFF6401914DAD8@gandalf.digicool.com>

> Hi All,
>
> first of all: sorry for that slightly provoking subject ;-)
> ...
>
> I just switched from Perl to Python because I think Python makes life
> easier in bigger software projects. However, I found out that Perl is
> more than 10 times faster than Python in solving the following
> problem:
>
> I've got a file (130 MB) with ~300000 datasets of the form:
>
> >px0034 hypothetical protein or whatever description
> LSADQISTVQASFDKVKGDPVGILYAVFKADPSIMAKFTQFAGKDLESIKGTAPFETHAN
> RIVGFFSKIIGELPNIEADVNTFVASHKPRGVTHDQLNNFRAGFVSYMKAHTDFAGAEAA
> WGATLDTFFGMIFSKM
>
> The word following the '>' is an identifier, and the uppercase letters
> in the lines following the identifier are the data. Now I want to read
> and write the contents of that file, excluding some entries (given by
> a dictionary of identifiers, e.g. 'px0034').
>
> The following Python code does the job:
>
> from re import *
> from sys import *
>
> def read_write(i, o, exclude):
>     name = compile('^>(\S+)')   # regex to fetch the identifier
>     l = i.readline()
>     while l:
>         if l[0] == '>':         # are we in a new dataset?
>             m = name.search(l)
>             if m and exclude.has_key(m.group(1)):  # excluding current dataset?
>                 l = i.readline()
>                 while l and l[0] != '>':           # skip this dataset
>                     l = i.readline()
>         o.write(l)
>         l = i.readline()
>
> f = open('my_very_big_data_file', 'r')  # datafile with ~300000 records
> read_write(f, stdout, {})  # for a simple test I don't exclude anything!
>
> It took 503.90 sec on an SGI Power Challenge (R10000 CPU). An
> appropriate Perl script does the same job in 32 sec (same method, same
> loop structure)!
>
> Since I have to call this routine about 1500 times, it's a very big
> difference in time and not really acceptable.
>
> I'd really like to know why Python is so slow (or Perl is so fast?)
> and what I can do to improve the speed of that routine.
>
> I don't want to switch back to Perl - but honestly, is Python the
> right language to process such huge amounts of data?
>
> ...
> Please do convince me that being a Python programmer does not mean
> being slow ;-)
>
> Thanks very much for any help,
>
> Arne

Arne,

While I'm not going to go near comparing Python to Perl, I will comment
that different languages are just that - different. As such, the
approach you would take in one language may not be the most appropriate
(or comparable in speed or efficiency) to the approach you would take
in another.

The question here (IMHO) is not Python's appropriateness for processing
large datasets (a fair number of scientist-types do this all the time),
or even the speed of Python in general, but using the most appropriate
algorithms in the context of the language in use.

For example, Perl is very regex-centric, so your example Perl
implementation is probably perfectly appropriate for Perl. Python tends
to be more optimized for the general case, and if it were _me_, I
wouldn't bother with regular expressions in this case. Since you have a
predictable file format, there are more specific (and efficient) Python
tools that you could use here. There are also some general
optimizations that can be used in places where speed is an issue, such
as avoiding repeated attribute lookups (especially in loops); a small
sketch of that idea follows.
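To make the attribute-lookup point concrete before the full rewrite,
here is a minimal, hypothetical timing sketch - the Null class, the
fake data, and the function names are invented for illustration and
have nothing to do with your dataset. Binding out.write to a local name
once, outside the loop, saves Python a method lookup on every
iteration:

import time

class Null:
    def write(self, s):
        pass                     # swallow output so we time only the loop

def copy_slow(lines, out):
    for line in lines:
        out.write(line)          # attribute lookup on every iteration

def copy_fast(lines, out):
    write = out.write            # hoist the lookup out of the loop
    for line in lines:
        write(line)              # plain local-name call

lines = ['X' * 60 + '\n'] * 200000
out = Null()
for func in (copy_slow, copy_fast):
    start = time.time()
    func(lines, out)
    print func.__name__, 'took', time.time() - start, 'seconds'

The same trick applies to exclude.has_key, input.readline and so on,
which is exactly what the version below does.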
This version of your read_write function uses the same basic algorithm,
but forgoes re in favor of more specific tools (slicing, string.split)
and has some examples of optimizations to minimize attribute lookups. I
haven't timed it or anything, but I'd be surprised if it wasn't
noticeably faster. Hope this helps!

import sys, string

def read_write(input, output, exclude):
    # These assignments will save us a lot of attribute
    # lookups over the course of the big loop...
    ignore   = exclude.has_key
    readline = input.readline
    write    = output.write
    split    = string.split

    line = readline()
    while line:
        if line[0] == '>':
            # Knowing that the first char is a '>' and that
            # the rest of the chars up to the first space are
            # the id, we can avoid using re here...
            key = split(line)[0][1:]
            if ignore(key):
                line = readline()
                while line and line[0] != '>':   # skip this record
                    line = readline()
                continue
        write(line)
        line = readline()

f = open('my_very_big_data_file', 'r')   # datafile with ~300000 records
read_write(f, sys.stdout, {})

Brian Lloyd                      brian at digicool.com
Software Engineer                540.371.6909
Digital Creations                http://www.digicool.com