On Thu, 18 Dec 2003 14:14:49 +0000 (UTC), A.Sloman@cs.bham.ac.uk
wrote:
>Jonathan wrote:
>
>[AS]
>> >That reminds me: I have been thinking of increasing the size of the
>> >pop11 dictionary. Currently it is 1023, and when poplog has compiled
>> >quite a lot of code, many of the hash buckets in the dictionary have
>> >more than 10 items and some over 20, which would slow down consword.
>> >
[JLC]
>> >You can check this by looking at the output of dic_distrib(); which
>> >prints out bucket sizes.
>>
>> Groan! Why oh why doesn't it leave them on the stack, so that
>> [% dic_distrib() %]
>> would give a list that could be analysed?
>
>It was originally just a debugging tool for the system developers.
>But I agree that a list would have been useful.
It's *always* easier to print a list than it is to manipulate
output; therefore "system developers" should be chastised severely.
>This will do it:
Thanks ... saves me doing it. I may have a quick play with the
numbers now ...
>> IIRC, it was
>> very simple: some combination of first, second and last character
>> with the length of the word, or was that long ago and far away?
>
>Your memory is almost right. Here's the code in syspop11 in
[snip]
Yes, you were too quick! I've just rebooted back into Windows
after having a look at it. (I must try out some Linux
newsreaders ...)
You will also have noticed the comment which says that if you change
the size of the dictionary, you also need to edit something else that
poplink uses. So it's worth looking at the code just for that comment.
>> If it is still based on something that simple, then "ved_foo_x"
>> and "ved_baz_x" would hash to the same bucket ... since quite
>It uses the length and middle character also, specifically to
>reduce this problem. I recall discussing this with John Gibson
>20 years ago or so.
I recall a similar conversation with him. The reason he didn't
use all the characters (or more of them) in a string was for
speed, but I'm not sure he quite convinced me that it wouldn't
have been better.
At present "picket", "packet" and "pocket" will all hash to the
same bucket (and "bucket" will hash to a different bucket);
similarly "mate", "mete", "mite", "mote" and "mute".
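
As a rough illustration of those collisions, here is a minimal Python
sketch. It is not the syspop11 code (that was snipped above): toy_hash,
the way the three quantities are mixed, and the choice of len(word) // 2
as the "middle" character are all assumptions made for the example. The
only things taken from the discussion are which inputs are used (first,
middle and last character, with the length added to the first) and the
current dictionary size of 1023.

```python
DICT_SIZE = 1023  # current pop11 dictionary size mentioned earlier in the thread


def hash_inputs(word):
    """Return the only parts of a word that the toy hash looks at."""
    first = ord(word[0])
    middle = ord(word[len(word) // 2])  # assumed definition of "middle"
    last = ord(word[-1])
    # The length is added to the first character, as described in the post.
    return (first + len(word), middle, last)


def toy_hash(word, size=DICT_SIZE):
    """Hypothetical mixing step; the real combination is not shown here."""
    a, b, c = hash_inputs(word)
    return (a * 31 * 31 + b * 31 + c) % size


for group in (["picket", "packet", "pocket", "bucket"],
              ["mate", "mete", "mite", "mote", "mute"]):
    for w in group:
        print(w, hash_inputs(w), toy_hash(w))
```

Words that differ only in characters the hash never looks at get identical
inputs, so no amount of clever mixing afterwards can separate them.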
For a more quantitative argument, most variables will use
lower case letters, possibly ending in a digit. The length is
added to the first char, so that would give 30ish * 20ish * 30ish
possible hash results, or about 18,000 (ish**3) "likely"
hash results.
That's a deliberate underestimate of the *possible* hash values,
but intended to be an overestimate of the *likely* hash values.
So it's probably ok with a hash table up to about 20,000 buckets
in size. Someone working with a larger lexicon (or a multilingual
application with several languages' dictionaries sharing the
same hash table) probably wants to hash on at least four
characters ... the same hash function is used for strings.
I guess it's fine for the pop11 dictionary though.
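
The back-of-envelope figure above can be written out in a few lines of
Python. The three factors are just the rough "30ish * 20ish * 30ish"
guesses from the estimate, not measurements, and likely_hash_inputs and
the table sizes tried are names and values chosen for this sketch only.

```python
# Rough counts of distinct values for the three hashed quantities,
# taken from the "30ish * 20ish * 30ish" estimate in the post.
FACTORS = (30, 20, 30)


def likely_hash_inputs(factors=FACTORS):
    """Multiply the per-quantity counts to get the number of likely inputs."""
    total = 1
    for f in factors:
        total *= f
    return total


likely = likely_hash_inputs()
print("likely distinct hash inputs:", likely)

# A table cannot use more buckets than there are distinct inputs,
# however well those inputs are mixed.
for table_size in (1023, 20000, 100000):
    reachable = min(table_size, likely)
    print(table_size, "buckets ->", reachable, "reachable at most")
```

With these guesses the 1023-bucket dictionary is nowhere near the limit,
while a table much larger than about 18,000 buckets could not be filled
evenly by typical identifiers.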
>> I need to reboot into Linux to look at the source files ...
>
>You may find it easier to browse this directory:
>
> http://www.cs.bham.ac.uk/research/poplog/src/master/C.all/src
Ok, yes. The comment I mentioned says:
    If the size of the dictionary or the hashing algorithm is changed,
    the corresponding procedure word_dict_cell in src/syscomp/w_util.p
    must be changed also (since poplink sets up the initial dictionary)
Jonathan
--
Use jlc at address, not spam.