wasm-demo/demo/ermis-f/python_m/cur/1571

120 lines
3.6 KiB
Plaintext

From: dalke at bioreason.com (Andrew Dalke)
Date: Thu, 15 Apr 1999 17:35:51 -0600
Subject: for in benchmark interested
References: <37157CE7.EFB470CA@inka.de>
Message-ID: <371677D7.EACDA7B7@bioreason.com>
Content-Length: 3368
X-UID: 1571
Friedrich Dominicus <Friedrich.Dominicus at inka.de> said:
> Some time ago I have opened a thread about optimization in Python. I
> have now done a simple benchmark and used Python for that too. If
> s.o is interested, take a look at: http://edt.vol.cz:81/bench
How about a unix command line implementation, like
cat file | tr " \t\r" "\n\n\n" | sort | uniq -c | tail +2
It's a bit slower than the C version, but only by a factor of
two or so, and a lot easier to write.
Another solution might be:
echo "" | spellin > myhlist
spell -d myhlist < file | sort | uniq -c
but that strips punctuation so doesn't give the same results.
A third is:
deroff -w file | sort | uniq -c
which also doesn't give exactly what you want. Still, it is
arguable that both these cases are better than the other solutions
given different definitions of the word "word."
BTW, the C code has some problems:
*** add (char *)
"count_words.l", line 34: error(3232): a value of type "void *"
cannot be used to initialize an entity of type "char *"
char* str = malloc(yyleng+1); strcpy(str, yytext);
^
There are a few problems with "inline" that my non-gcc C compiler
doesn't like. I'm not sure that inline is a C keyword.
I also get a core dump when I run it:
bioreason8> ./count_words_c ./count_words.c
Bus error (core dumped)
bioreason8> dbx ./count_words_c core
dbx version 7.1 Dec 3 1996 17:03:19
Core from signal SIGBUS: Bus error
(dbx) where
> 0 hash(string = 0x100170e9 = "line") ["/home/u05/dalke/tmp/hash.c":104, 0x100037c8]
1 search(table = 0x10017010, key = 0x100170e8 = "#line")
["/home/u05/dalke/tmp/hash.c":181, 0x10003b00]
2 run() ["/home/u05/dalke/tmp/count_words.l":35, 0x100033e0]
3 yylex() ["/home/u05/dalke/tmp/count_words.l":27, 0x1000198c]
4 main(argc = 2, argv = 0x7fff2f24) ["/home/u05/dalke/tmp/count_words.l":59,
0x100034fc]
5 __start()
["/vince/6.2-mar09/work/irix/lib/libc/libc_n32_M3/csu/crt1text.s":166,
0x100015e8]
(dbx) list 92
92 /*
93 Hashes a string to produce an unsigned short, which should be
94 sufficient for most purposes.
95 */
96
97 unsigned hash(char *string)
98 {
99 unsigned int ret_val = 0;
100 int i;
101
(dbx) l
102 while (*string)
103 {
* 104 i = *( int *)string;
105 ret_val ^= i;
106 ret_val <<= 1;
107 string ++;
108 }
109 return ret_val;
110 }
111
The conversion on line 104 is really wrong. It should likely be
i = (int)(*string)
Also, I tried it (with the fix) against a core dump:
bioreason8> ls -l /usr/tmp/core
-rw-r--r-- 1 root sys 5357568 Apr 8 00:45 /usr/tmp/core
bioreason8> ./count_words_c /usr/tmp/core > /dev/null
After about 6 minutes I killed the job. Probably doesn't like the
embedded NULs. This is an example of "fuzz" testing. See
http://www.cs.wisc.edu/Dienst/UI/2.0/Describe/ncstrl.uwmadison/CS-TR-95-1268
Finally, I see the lex code is processed with:
count_words.c: count_words.l
flex -B -ocount_words.c count_words.l
You should add a "-f" to that. That gave me a 15% speedup over just -B, against
the code at http://caml.inria.fr/caml-list/0899.html, from whence
you based this thread originally.
Andrew
dalke at acm.org