From: Brian at digicool.com (Brian Lloyd)
Date: Fri, 23 Apr 1999 14:34:45 GMT
Subject: Python too slow for real world
Message-ID: <613145F79272D211914B0020AFF6401914DAD8@gandalf.digicool.com>

> Hi All,
>
> first of all: sorry for that slightly provoking subject ;-)
> ...
>
> I just switched from Perl to Python because I think Python makes life
> easier in bigger software projects. However, I found out that Perl is
> more than 10 times faster than Python in solving the following
> problem:
>
> I've got a file (130 MB) with ~300000 datasets of the form:
>
> >px0034 hypothetical protein or whatever description
> LSADQISTVQASFDKVKGDPVGILYAVFKADPSIMAKFTQFAGKDLESIKGTAPFETHAN
> RIVGFFSKIIGELPNIEADVNTFVASHKPRGVTHDQLNNFRAGFVSYMKAHTDFAGAEAA
> WGATLDTFFGMIFSKM
>
> The word following the '>' is an identifier, and the uppercase letters
> in the lines following the identifier are the data. Now I want to read
> and write the contents of that file, excluding some entries (given by
> a dictionary of identifiers, e.g. 'px0034').
>
> The following Python code does the job:
>
> from re import *
> from sys import *
>
> def read_write(i, o, exclude):
>     name = compile('^>(\S+)')   # regex to fetch the identifier
>     l = i.readline()
>     while l:
>         if l[0] == '>':         # are we in a new dataset?
>             m = name.search(l)
>             if m and exclude.has_key(m.group(1)):  # excluding current dataset?
>                 l = i.readline()
>                 while l and l[0] != '>':           # skip this dataset
>                     l = i.readline()
>         o.write(l)
>         l = i.readline()
>
> f = open('my_very_big_data_file', 'r')  # datafile with ~300000 records
> read_write(f, stdout, {})  # for a simple test I don't exclude anything!
>
> It took 503.90 sec on an SGI Power Challenge (R10000 CPU). An
> appropriate Perl script does the same job in 32 sec (same method, same
> loop structure)!
>
> Since I have to call this routine about 1500 times, it's a very big
> difference in time and not really acceptable.
>
> I'd really like to know why Python is so slow (or Perl is so fast?)
> and what I can do to improve the speed of that routine.
>
> I don't want to switch back to Perl - but honestly, is Python the
> right language to process such huge amounts of data?
>
> ...
> Please do convince me that being a Python programmer does not mean
> being slow ;-)
>
> Thanks very much for any help,
>
> Arne

Arne,

While I'm not going to go near comparing Python to Perl, I will comment
that different languages are just that - different. As such, the
approach you would take in one language may not be the most appropriate
(or comparable in speed or efficiency) to the approach you would take
in another.

The question here (IMHO) is not Python's appropriateness for processing
large datasets (a fair number of scientist-types do this all the time),
or even the speed of Python in general, but using the most appropriate
algorithms in the context of the language in use.

For example, Perl is very regex-centric, so your example Perl
implementation is probably perfectly appropriate for Perl. Python tends
to be more optimized for the general case, and if it were _me_, I
wouldn't bother with regular expressions in this case. Since you have a
predictable file format, there are more specific (and efficient) Python
tools that you could use here. There are also some general
optimizations that can be used in places where speed is an issue, such
as avoiding repeated attribute lookups (especially in loops); a small
sketch of that idea follows.
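To make the attribute-lookup point concrete before the full rewrite,
here is a minimal, hypothetical timing sketch - the Null class, the
fake data, and the function names are invented for illustration and
have nothing to do with your dataset. Binding out.write to a local name
once, outside the loop, saves Python a method lookup on every
iteration:

import time

class Null:
    def write(self, s):
        pass                     # swallow output so we time only the loop

def copy_slow(lines, out):
    for line in lines:
        out.write(line)          # attribute lookup on every iteration

def copy_fast(lines, out):
    write = out.write            # hoist the lookup out of the loop
    for line in lines:
        write(line)              # plain local-name call

lines = ['X' * 60 + '\n'] * 200000
out = Null()
for func in (copy_slow, copy_fast):
    start = time.time()
    func(lines, out)
    print func.__name__, 'took', time.time() - start, 'seconds'

The same trick applies to exclude.has_key, input.readline and so on,
which is exactly what the version below does.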
This version of your read_write function uses the same basic algorithm,
but forgoes re in favor of more specific tools (slicing, string.split)
and has some examples of optimizations to minimize attribute lookups. I
haven't timed it or anything, but I'd be surprised if it wasn't
noticeably faster. Hope this helps!

import sys, string

def read_write(input, output, exclude):
    # These assignments will save us a lot of attribute
    # lookups over the course of the big loop...
    ignore   = exclude.has_key
    readline = input.readline
    write    = output.write
    split    = string.split

    line = readline()
    while line:
        if line[0] == '>':
            # Knowing that the first char is a '>' and that
            # the rest of the chars up to the first space are
            # the id, we can avoid using re here...
            key = split(line)[0][1:]
            if ignore(key):
                line = readline()
                while line and line[0] != '>':   # skip this record
                    line = readline()
                continue
        write(line)
        line = readline()

f = open('my_very_big_data_file', 'r')   # datafile with ~300000 records
read_write(f, sys.stdout, {})

Brian Lloyd                      brian at digicool.com
Software Engineer                540.371.6909
Digital Creations                http://www.digicool.com