From: gjohnson at showmaster.com (Tony Johnson)
Date: Fri, 23 Apr 1999 16:03:57 GMT
Subject: Python too slow for real world
In-Reply-To: <372068E6.16A4A90@icrf.icnet.uk>
References: <372068E6.16A4A90@icrf.icnet.uk>
Message-ID: <000401be8da2$e5172430$7153cccf@showmaster.com>

I find Python's syntax less taxing than Perl's (i.e. fewer lines).

You may need to check your Python code and see how you can optimize it further...

Tony Johnson
System Administrator
Demand Publishing Inc.

-----Original Message-----
From: python-list-request at cwi.nl [mailto:python-list-request at cwi.nl] On Behalf Of Arne Mueller
Sent: Friday, April 23, 1999 7:35 AM
To: python-list at cwi.nl
Subject: Python too slow for real world

Hi All,

first of all: sorry for the slightly provoking subject ;-) ...

I just switched from Perl to Python because I think Python makes life easier in bigger software projects. However, I found out that Perl is more than 10 times faster than Python at solving the following problem:

I've got a file (130 MB) with ~300000 datasets of the form:

>px0034 hypothetical protein or whatever description
LSADQISTVQASFDKVKGDPVGILYAVFKADPSIMAKFTQFAGKDLESIKGTAPFETHAN
RIVGFFSKIIGELPNIEADVNTFVASHKPRGVTHDQLNNFRAGFVSYMKAHTDFAGAEAA
WGATLDTFFGMIFSKM

The word following the '>' is an identifier, and the uppercase letters in the lines after the identifier are the data. Now I want to read and write the contents of that file, excluding some entries (given by a dictionary of identifiers, e.g. 'px0034'). The following Python code does the job:

from re import *
from sys import *

def read_write(i, o, exclude):
    name = compile('^>(\S+)')                      # regex to fetch the identifier
    l = i.readline()
    while l:
        if l[0] == '>':                            # are we in a new dataset?
            m = name.search(l)
            if m and exclude.has_key(m.group(1)):  # excluding current dataset?
                l = i.readline()
                while l and l[0] != '>':           # skip this dataset
                    l = i.readline()
                continue                           # re-check the new '>' line instead of writing it
        o.write(l)
        l = i.readline()

f = open('my_very_big_data_file', 'r')  # datafile with ~300000 records
read_write(f, stdout, {})               # for a simple test I don't exclude anything!

It took 503.90 sec on an SGI Power Challenge (R10000 CPU). An equivalent Perl script does the same job in 32 sec (same method, same loop structure)! Since I have to call this routine about 1500 times, it's a very big difference in time and not really acceptable.

I'd really like to know why Python is so slow (or why Perl is so fast?) and what I can do to improve the speed of that routine. I don't want to switch back to Perl - but honestly, is Python the right language to process such a huge amount of data?

If you want to generate a test set, you could use the following lines to print 10000 datasets to stdout:

for i in xrange(1, 10001):
    print '>px%05d\nLSADQISTVQASFDKVKGDPVGILYAVFKADPSIMAKFTQFAGKDLESIKGTAPFETHAN\n\
RIVGFFSKIIGELPNIEADVNTFVASHKPRGVTHDQLNNFRAGFVSYMKAHTDFAGAEAA\n\
WGATLDTFFGMIFSKM\n' % i

And if you don't believe me that Perl does the job quicker, you can try the Perl code below:

#!/usr/local/bin/perl -w

open(IN, "test.dat");
my %ex = ();
read_write(%ex);

sub read_write {
    $l = <IN>;
    OUTER: while( defined $l ){
        if( (($x) = $l =~ /^>(\S+)/) ){
            if( exists $ex{$x} ){
                $l = <IN>;
                while( defined $l && !($l =~ /^>(\S+)/) ){
                    $l = <IN>;
                }
                next OUTER;
            }
        }
        print $l;
        $l = <IN>;
    }
}

Please do convince me that being a Python programmer does not mean being slow ;-)

Thanks very much for any help,

    Arne
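[Editor's sketch, not code from this thread.] One common way to speed up a loop like read_write above, in the spirit of the reply's "optimize it further", is to cut down the number of readline() calls and drop the per-header regex. The sketch below assumes file.readlines() accepts a size hint (as it does in Python 1.5.2-era interpreters that this thread targets), assumes every '>' header carries an identifier as its first word, and uses a made-up chunk size of 500000 bytes and the invented name read_write_fast; it is an illustration under those assumptions, not the author's or respondent's code.

import string
from sys import stdout

def read_write_fast(i, o, exclude):
    skipping = 0                       # are we inside a dataset that should be skipped?
    while 1:
        # Read roughly 500 KB worth of complete lines per call
        # instead of one line per readline() call.
        lines = i.readlines(500000)
        if not lines:
            break
        for l in lines:
            if l[0] == '>':
                # The identifier is the first word after '>'; no regex needed.
                ident = string.split(l[1:])[0]
                skipping = exclude.has_key(ident)
            if not skipping:
                o.write(l)

f = open('my_very_big_data_file', 'r')
read_write_fast(f, stdout, {})

Like the corrected loop above (and the Perl "next OUTER" version), this withholds both the '>' line and the data lines of an excluded record; whether the chunked reads close the gap to Perl on the full 130 MB file would have to be measured.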