Jonathan wrote:
[AS]
> >That reminds me: I have been thinking of increasing the size of the
> >pop11 dictionary. Currently it is 1023, and when poplog has compiled
> >quite a lot of code, many of the hash buckets in the dictionary have
> >more than 10 items and some over 20, which would slow down consword.
> >
[JLC]
> >You can check this by looking at the output of dic_distrib(); which
> >prints out bucket sizes.
>
> Groan! Why oh why doesn't it leave them on the stack, so that
> [% dic_distrib() %]
> would give a list that could be analysed?
It was originally just a debugging tool for the system developers.
But I agree that a list would have been useful.
This will do it:
define dic_numbers() -> list;
    ;;; stack printed characters, turning `.` into `0`
    define dlocal cucharout(char);
        if char == `.` then `0` else char endif
    enddefine;
    ;;; make a string of pop11 text for constructing a list
    lvars string = consstring( #| `[`, dic_distrib(), `]` |# );
    ;;; compile the string to make the list
    pop11_compile(stringin(string)) -> list;
enddefine;
or compress the last two code lines to this near-incomprehensible
(except for experienced pop11 hackers) one-liner:
pop11_compile(stringin(consstring( #| `[`, dic_distrib(), `]` |# ))) -> list;
Another alternative would be to have two processes, one producing
characters and the other consuming them. Then you would not need a
string. That's rather more tricky to do in pop11. The producer-consumer
interaction could be driven by cucharout.
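To illustrate the consumer idea outside Poplog, here is a minimal sketch in Python rather than pop11. It assumes the printed output is a run of numbers separated by whitespace, with `.` standing for an empty bucket. The names `NumberCollector` and `fake_dic_distrib` are hypothetical, and `fake_dic_distrib` is only a stand-in for the real dic_distrib. The character sink plays the role of the redefined cucharout and builds the list directly, with no intermediate string.

```python
import contextlib


class NumberCollector:
    """Character sink that turns printed digits into a list of integers."""

    def __init__(self):
        self.numbers = []
        self._digits = []

    def write(self, text):
        # Called by print(); consume the output one character at a time.
        for ch in text:
            if ch == ".":
                ch = "0"  # same mapping as the pop11 version above
            if ch.isdigit():
                self._digits.append(ch)
            else:
                self.finish()
        return len(text)

    def flush(self):
        # Present only so the object is an acceptable stand-in for a stream.
        pass

    def finish(self):
        # Close off the number being accumulated, if there is one.
        if self._digits:
            self.numbers.append(int("".join(self._digits)))
            self._digits = []


def fake_dic_distrib():
    # Hypothetical stand-in: prints bucket sizes, "." for an empty bucket.
    print("3 . 12 7")
    print(". 1")


collector = NumberCollector()
with contextlib.redirect_stdout(collector):
    fake_dic_distrib()
collector.finish()
print(collector.numbers)
```

The design point is only that the consumer keeps a small amount of state between characters, which is what a cucharout-driven process would have to do as well.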
> >Since memory is now so plentiful and cheap, I was thinking it should be
> >increased to at least 5 times its size, maybe 5119 (a prime number).
> >
> >Any comments?
>
> If it's easy to do (change one constant?) it would be interesting to
> see some before-and-after benchmarks. Presumably compiling a large
> library would be an appropriate realistic benchmark?
>
> Reading a file of natural language text as words would be a good
> non-realistic benchmark (although you need to do that to construct
> a concordance).
For things like that I've suggested that it would be good to have a
version of incharitem that produced strings instead of word records
(in the dictionary).
It should not be hard to implement using the stuff already in
$popsrc/item.p
> Incidentally, the best size for a hash table is about 20% bigger
> (give or take a huge margin) than the number of entries.
In that case since I get
countwords() =>
** 8127
in the instance of pop11 in which I am typing this message, my
suggestion is too small. Maybe something over 10000 would be better. (I
don't know if it could be changed dynamically -- perhaps not without a
lot of other changes). On a 32-bit machine that would require a table
of about 40Kbytes in size. On a 64-bit machine (e.g. Alpha) that would
grow to 80Kbytes. In 1976 that was more than the size of a whole pop11
system. But nowadays memory is available and cheap, so if there was
a significant speed gain for compilation and other applications it
might be worth it.
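The arithmetic behind those figures can be checked with a small sketch (in Python, for convenience). It assumes one machine word per bucket, which is what the 40Kbyte and 80Kbyte estimates imply, and it takes the "about 20% bigger" rule of thumb literally. The function names are made up for illustration.

```python
def suggested_buckets(entries, headroom_percent=20):
    """Table size that is headroom_percent larger than the entry count, rounded up."""
    return (entries * (100 + headroom_percent) + 99) // 100


def table_bytes(buckets, bytes_per_word):
    """Size of the bucket vector, assuming one machine word per bucket."""
    return buckets * bytes_per_word


words_in_dictionary = 8127  # the countwords() result quoted above

print(suggested_buckets(words_in_dictionary))  # minimum size by the 20% rule
print(table_bytes(10000, 4))                   # 32-bit machine
print(table_bytes(10000, 8))                   # 64-bit machine
```

By that rule the table would need a little under 10000 buckets, which is where "something over 10000" comes from.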
However, compiling the whole of pop11 prolog on my 1GHz AMD Athlon
takes under a second:
time pop11 :"lib prolog" > /dev/null
0.695u 0.050s 0:00.74 100.0% 0+0k 0+0io 573pf+0w
so maybe it is not worth it.
Even Common Lisp compiles in about 4 seconds:
time pop11 :"lib clisp" > /dev/null
3.601u 0.130s 0:03.73 100.0% 0+0k 0+0io 592pf+0w
(The second time. The first time takes about 1 second longer.
Presumably the second time I benefit from Linux's disk cache.)
> Then most
> buckets are empty or have only one entry. So if 5 times is big
> enough, maybe the hash function could be improved?
I misled you about 5 times being big enough. It would not meet the
criterion you quoted.
> IIRC, it was
> very simple: some combination of first, second and last character
> with the length of the word, or was that long ago and far away?
Your memory is almost right. Here's the code in syspop11 in
$popsrc/vec_generic.p
define :inline lconstant CHAR_HASH(T);
    if _len _sgr _2 then
        ;;; use first, middle and last chars and length
        _cptr!(T)[_0] _add _len
            _add _shift(_cptr!(T)[_len _sub _1] _add _len, _3)
            _add _shift(_cptr!(T)[_shift(_len,_-1)] _add _len, _6) -> _len;
        _shift(_len, _-10) _add _len
    elseif _len == _2 then
        ;;; use first and last chars
        _cptr!(T)[_0] _add _shift(_cptr!(T)[_1], _3)
    elseif _len == _1 then
        ;;; just use first character
        _cptr!(T)[_0]
    else
        _0
    endif
enddefine;
> If it is still based on something that simple, then "ved_foo_x"
> and "ved_baz_x" would hash to the same bucket ... since quite
> a lot of words begin "ved" or "sys" and have the same length, it
> would mean the hash was essentially just on the last character,
It uses the length and middle character also, specifically to
reduce this problem. I recall discussing this with John Gibson
20 years ago or so.
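To see the effect on the two example words, here is a rough Python model of CHAR_HASH. It is a transcription for experimenting, not the real syspop11 code. It assumes one byte per character, and it assumes the bucket index is the hash value modulo the table size, which the code quoted above does not show.

```python
def char_hash(word):
    """Python model of the CHAR_HASH macro quoted above."""
    chars = word.encode("latin-1")  # assumption: one byte per character
    n = len(chars)
    if n > 2:
        # First, last and middle characters, each mixed with the length.
        h = (chars[0] + n) + ((chars[n - 1] + n) << 3) + ((chars[n >> 1] + n) << 6)
        return (h >> 10) + h
    if n == 2:
        # First and last characters.
        return chars[0] + (chars[1] << 3)
    if n == 1:
        return chars[0]
    return 0


TABLE_SIZE = 1023  # current dictionary size mentioned earlier in the thread


def bucket(word, size=TABLE_SIZE):
    # Assumption: the index is the hash modulo the table size.
    return char_hash(word) % size


for w in ("ved_foo_x", "ved_baz_x", "ved_fab_x"):
    print(w, char_hash(w), bucket(w))
```

In this model "ved_foo_x" and "ved_baz_x" land in different buckets because their middle characters differ, whereas "ved_foo_x" and "ved_fab_x" collide, since they share first, middle and last characters and length.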
> I need to reboot into Linux to look at the source files ...
You may find it easier to browse this directory:
http://www.cs.bham.ac.uk/research/poplog/src/master/C.all/src
which includes the common code in $popsrc (as of some time in 1999).
I must install the recent changes.
> Eyeballing dic_distrib() in Windows Poplog I can't tell if
> there are too many big numbers: how to know? (The numbers ought
> to be in a Poisson distribution I think. Help! Any statisticians
> here?)
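One rough way to judge the numbers, assuming an ideal hash that spreads words uniformly, is to compare them with the Poisson approximation you mention. The sketch below (Python, with hypothetical names) uses the 8127 words and 1023 buckets quoted earlier in this message. It is only a yardstick, not a statistical test.

```python
import math


def expected_bucket_counts(words, buckets, max_size):
    """Expected number of buckets holding k words, for k = 0..max_size,
    using the Poisson approximation with mean words / buckets."""
    mean = words / buckets
    return [
        buckets * math.exp(-mean) * mean ** k / math.factorial(k)
        for k in range(max_size + 1)
    ]


WORDS = 8127    # countwords() result quoted above
BUCKETS = 1023  # current dictionary size

counts = expected_bucket_counts(WORDS, BUCKETS, 60)
over_10 = sum(counts[11:])

print("mean words per bucket:", round(WORDS / BUCKETS, 2))
print("expected buckets with more than 10 words:", round(over_10))
for size in range(0, 21, 5):
    print(size, round(counts[size], 1))
```

With a mean of nearly 8 words per bucket, a fair number of buckets holding more than 10 words is expected even from a perfect hash function, so big numbers alone do not show that the hash is poor. What would count against it is many more large buckets, or many more empty ones, than these expected counts.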
Aaron