From: mlh at idt.ntnu.no (Magnus L. Hetland)
Date: 23 Apr 1999 15:55:53 +0200
Subject: Python too slow for real world
References: <372068E6.16A4A90@icrf.icnet.uk>

Arne Mueller writes:

> Hi All,
>
> first off all: Sorry for that slightly provoking subject ;-) ...

[...]

> The following python code does the job:

[...]

>     f = open('my_very_big_data_file','r') # datafile with ~300000 records
>     read_write(f, stdout, {}) # for a simple test I don't exclude
>                               # anything!

Well -- re is known to be slow. If you have to be fast, maybe you
should try not to use regular expressions; you could perhaps use
something from the string module (several options there), or maybe
even consider fixed-length fields for the identifiers, which should
speed things up a bit.

> It took 503.90 sec on a SGI Power Challange (R10000 CPU). An appropiate
> perl script does the same job in 32 sec (Same method, same loop
> structure)!

Hm. Perl probably has a more efficient implementation of Perl regexes
than Python does, naturally enough...

> I'd realy like to know why python is so slow (or perl is so fast?) and
> what I can do to improove speed of that routine.

Well -- at least I have made one suggestion... Though it may not
explain it all...

> I don't want to switch back to perl - but honestly, is python the right
> language to process souch huge amount of data?
>
> If you want to generate a test set you could use the following lines to
> print 10000 datasets to stdout:
>
> for i in xrange(1, 10001):
>     print \
> '>px%05d\nLSADQISTVQASFDKVKGDPVGILYAVFKADPSIMAKFTQFAGKDLESIKGTAPFETHAN\n\
> RIVGFFSKIIGELPNIEADVNTFVASHKPRGVTHDQLNNFRAGFVSYMKAHTDFAGAEAA\n\
> WGATLDTFFGMIFSKM\n' % i
>
> And if you don't believe me that perl does the job quicker you can try
> the perl code below:

[...]

OK. Using your test set, I tried the following program (it may not
work exactly like your script...). I have made the assumption that all
the ids have a constant length of 7.
----------
import fileinput

exclude = {'px00003': 1}
skip = 0
for line in fileinput.input():
    if line[0] == '>':
        id = line[1:8]
        if exclude.has_key(id):
            skip = 1
        else:
            skip = 0
    if not skip:
        print line,
-----------

It took about 12 seconds.

> Please do convince me being a python programmer does not mean being slow
> ;-)

At least I tried...

> Thanks very much for any help,
>
>     Arne

--
  Hi!  I'm the signature virus 99!            Magnus
  Copy me into your signature and             Lie
  join the fun!                               Hetland
  http://arcadia.laiv.org
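[Editor's note: the fixed-length-id filter above, rewritten as a
self-contained function in current Python syntax. This is a sketch,
not the original code; the 7-character id and the '>'-prefixed header
lines are assumptions carried over from the test set in the post.]

```python
import sys

def filter_records(lines, exclude, id_len=7):
    """Yield lines, skipping whole records whose id is in `exclude`.

    A record starts with a '>' header line; the id is assumed to be
    the fixed-length field immediately after the '>' (7 characters
    in the test set above).
    """
    skip = 0
    for line in lines:
        if line[:1] == '>':
            # Membership test against a set/dict replaces the regex match.
            skip = line[1:1 + id_len] in exclude
        if not skip:
            yield line

if __name__ == '__main__':
    # Filter stdin to stdout, excluding the record 'px00003'.
    sys.stdout.writelines(filter_records(sys.stdin, {'px00003'}))
```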