120 lines
3.6 KiB
Plaintext
120 lines
3.6 KiB
Plaintext
From: dalke at bioreason.com (Andrew Dalke)
|
|
Date: Thu, 15 Apr 1999 17:35:51 -0600
|
|
Subject: for in benchmark interested
|
|
References: <37157CE7.EFB470CA@inka.de>
|
|
Message-ID: <371677D7.EACDA7B7@bioreason.com>
|
|
Content-Length: 3368
|
|
X-UID: 1571
|
|
|
|
Friedrich Dominicus <Friedrich.Dominicus at inka.de> said:
|
|
> Some time ago I have opened a thread about optimization in Python. I
|
|
> have now done a simple benchmark and used Python for that too. If
|
|
> s.o is interested, take a look at: http://edt.vol.cz:81/bench
|
|
|
|
How about a unix command line implementation, like
|
|
|
|
cat file | tr " \t\r" "\n\n\n" | sort | uniq -c | tail +2
|
|
|
|
It's a bit slower than the C version, but only by a factor of
|
|
two or so, and a lot easier to write.
|
|
|
|
|
|
Another solution might be:
|
|
echo "" | spellin > myhlist
|
|
spell -d myhlist < file | sort | uniq -c
|
|
|
|
but that strips punctuation so doesn't give the same results.
|
|
|
|
A third is:
|
|
deroff -w file | sort | uniq -c
|
|
|
|
which also doesn't give exactly what you want. Still, it is
|
|
arguable that both these cases are better than the other solutions
|
|
given different definitions of the word "word."
|
|
|
|
|
|
BTW, the C code has some problems:
|
|
|
|
*** add (char *)
|
|
|
|
"count_words.l", line 34: error(3232): a value of type "void *"
|
|
cannot be used to initialize an entity of type "char *"
|
|
char* str = malloc(yyleng+1); strcpy(str, yytext);
|
|
^
|
|
|
|
There are a few problems with "inline" that my non-gcc C compiler
|
|
doesn't like. I'm not sure that inline is a C keyword.
|
|
|
|
I also get a core dump when I run it:
|
|
|
|
bioreason8> ./count_words_c ./count_words.c
|
|
Bus error (core dumped)
|
|
bioreason8> dbx ./count_words_c core
|
|
dbx version 7.1 Dec 3 1996 17:03:19
|
|
Core from signal SIGBUS: Bus error
|
|
(dbx) where
|
|
> 0 hash(string = 0x100170e9 = "line") ["/home/u05/dalke/tmp/hash.c":104, 0x100037c8]
|
|
1 search(table = 0x10017010, key = 0x100170e8 = "#line")
|
|
["/home/u05/dalke/tmp/hash.c":181, 0x10003b00]
|
|
2 run() ["/home/u05/dalke/tmp/count_words.l":35, 0x100033e0]
|
|
3 yylex() ["/home/u05/dalke/tmp/count_words.l":27, 0x1000198c]
|
|
4 main(argc = 2, argv = 0x7fff2f24) ["/home/u05/dalke/tmp/count_words.l":59,
|
|
0x100034fc]
|
|
5 __start()
|
|
["/vince/6.2-mar09/work/irix/lib/libc/libc_n32_M3/csu/crt1text.s":166,
|
|
0x100015e8]
|
|
(dbx) list 92
|
|
92 /*
|
|
93 Hashes a string to produce an unsigned short, which should be
|
|
94 sufficient for most purposes.
|
|
95 */
|
|
96
|
|
97 unsigned hash(char *string)
|
|
98 {
|
|
99 unsigned int ret_val = 0;
|
|
100 int i;
|
|
101
|
|
(dbx) l
|
|
102 while (*string)
|
|
103 {
|
|
* 104 i = *( int *)string;
|
|
105 ret_val ^= i;
|
|
106 ret_val <<= 1;
|
|
107 string ++;
|
|
108 }
|
|
109 return ret_val;
|
|
110 }
|
|
111
|
|
|
|
|
|
The conversion on line 104 is really wrong. It should likely be
|
|
|
|
i = (int)(*string)
|
|
|
|
Also, I tried it (with the fix) against a core dump:
|
|
bioreason8> ls -l /usr/tmp/core
|
|
-rw-r--r-- 1 root sys 5357568 Apr 8 00:45 /usr/tmp/core
|
|
bioreason8> ./count_words_c /usr/tmp/core > /dev/null
|
|
|
|
After about 6 minutes I killed the job. Probably doesn't like the
|
|
embedded NULs. This is an example of "fuzz" testing. See
|
|
http://www.cs.wisc.edu/Dienst/UI/2.0/Describe/ncstrl.uwmadison/CS-TR-95-1268
|
|
|
|
|
|
Finally, I see the lex code is processed with:
|
|
|
|
count_words.c: count_words.l
|
|
flex -B -ocount_words.c count_words.l
|
|
|
|
You should add a "-f" to that. That gave me a 15% speedup over just -B, against
|
|
the code at http://caml.inria.fr/caml-list/0899.html, from whence
|
|
you based this thread originally.
|
|
|
|
|
|
Andrew
|
|
dalke at acm.org
|
|
|
|
|
|
|
|
|