On Thu, 18 Dec 2003 00:13:48 +0000 (UTC), A.Sloman@cs.bham.ac.uk
wrote:
>That reminds me: I have been thinking of increasing the size of the
>pop11 dictionary. Currently it is 1023, and when poplog has compiled
>quite a lot of code, many of the hash buckets in the dictionary have
>more than 10 items and some over 20, which would slow down consword.
>
>You can check this by looking at the output of dic_distrib(); which
>prints out bucket sizes.

Groan! Why oh why doesn't it leave them on the stack, so that
[% dic_distrib() %]
would give a list that could be analysed?

>Since memory is now so plentiful and cheap, I was thinking it should be
>increased to at least 5 times its size, maybe 5119 (a prime number).
>
>Any comments?

If it's easy to do (change one constant?) it would be interesting to
see some before-and-after benchmarks. Presumably compiling a large
library would be an appropriate realistic benchmark?

Reading a file of natural language text as words would be a good
non-realistic benchmark (although you need to do that to construct
a concordance).

Incidentally, the best size for a hash table is about 20% bigger
(give or take a huge margin) than the number of entries. Then most
buckets are empty or have only one entry. So if 5 times is big
enough, maybe the hash function could be improved? IIRC, it was
very simple: some combination of first, second and last character
with the length of the word, or was that long ago and far away?

If it is still based on something that simple, then "ved_foo_x"
and "ved_baz_x" would hash to the same bucket ... since quite
a lot of words begin "ved" or "sys" and have the same length, it
would mean the hash was essentially just on the last character,
and increasing the hash table size would not help as much as
you'd think.
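
To make the worry concrete, here is a small Python sketch (not Pop-11, and not the real Poplog hash, which I have not checked). The function simple_hash and its multipliers are invented purely for illustration; the only thing it shares with my recollection is that it looks at the first, second and last characters plus the length. Under that assumption, every word of the form ved_???_x lands in one bucket whatever the table size:

```python
# HYPOTHETICAL hash, loosely modelled on "first, second and last character
# plus length". This is an assumption for illustration, not Poplog's code.
def simple_hash(word: str, table_size: int) -> int:
    """Hash a non-empty word using only three characters and its length."""
    first = ord(word[0])
    second = ord(word[1 % len(word)])  # falls back to word[0] for 1-char words
    last = ord(word[-1])
    # The multipliers 31, 131 and 7 are arbitrary choices.
    return (first + 31 * second + 131 * last + 7 * len(word)) % table_size


def bucket_sizes(words, table_size, hash_fn=simple_hash):
    """Return a list giving the number of words that fall in each bucket."""
    sizes = [0] * table_size
    for w in words:
        sizes[hash_fn(w, table_size)] += 1
    return sizes


# 1000 made-up identifiers that differ only in the middle three characters.
letters = "abcdefghij"
words = [f"ved_{a}{b}{c}_x" for a in letters for b in letters for c in letters]

for size in (1023, 5119):
    sizes = bucket_sizes(words, size)
    used = sum(1 for n in sizes if n > 0)
    print(f"table size {size}: {used} bucket(s) used, largest holds {max(sizes)}")
```

A hash that mixes in every character would spread these words out; the point is only that a bigger table cannot separate keys the hash function cannot tell apart.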

I need to reboot into Linux to look at the source files ...

Eyeballing dic_distrib() in Windows Poplog I can't tell if
there are too many big numbers: how to know? (The numbers ought
to be in a Poisson distribution I think. Help! Any statisticians
here?)

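
Pending a real statistician, a rough Python sketch of the comparison I have in mind. It assumes an ideal hash that spreads n words uniformly over m buckets, in which case each bucket size is binomial and approximately Poisson with mean n/m. The function names are mine, and the bucket sizes would have to be obtained from dic_distrib() output somehow (I am assuming a plain list of integers):

```python
import math


def expected_bucket_counts(n_words: int, n_buckets: int, max_size: int):
    """Expected number of buckets holding 0, 1, ..., max_size-1 words under
    a Poisson model; the final entry lumps together 'max_size or more'."""
    lam = n_words / n_buckets  # mean words per bucket (the load factor)
    expected = []
    for k in range(max_size):
        p = math.exp(-lam) * lam**k / math.factorial(k)
        expected.append(n_buckets * p)
    # Whatever probability is left over belongs to the tail.
    expected.append(n_buckets - sum(expected))
    return expected


def observed_bucket_counts(sizes, max_size: int):
    """Tally how many buckets have each size, lumping the tail the same way."""
    counts = [0] * (max_size + 1)
    for s in sizes:
        counts[min(s, max_size)] += 1
    return counts


# A table 20% bigger than the number of entries (illustrative numbers).
exp = expected_bucket_counts(1000, 1200, 5)
frac_small = (exp[0] + exp[1]) / 1200
print(f"fraction of buckets with 0 or 1 entries: {frac_small:.3f}")
```

With the table 20% bigger than the number of entries, the model puts about 80% of buckets at zero or one entry. If the observed tallies show far more big buckets than expected_bucket_counts predicts for the same n and m, that points at the hash function rather than the table size.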
Jonathan
--
Use jlc at address, not spam.