Date: Mon Dec 18 14:14:49 2003 
Subject: Re: Comparing Garbage Collectors 
From: A. Sloman 
Volume-ID: 1031218.05 

Jonathan wrote:

[AS]
> >That reminds me: I have been thinking of increasing the size of the
> >pop11 dictionary. Currently it is 1023, and when poplog has compiled
> >quite a lot of code, many of the hash buckets in the dictionary have
> >more than 10 items and some over 20, which would slow down consword.
> >
[JLC]
> >You can check this by looking at the output of dic_distrib(); which
> >prints out bucket sizes.
>
> Groan! Why oh why doesn't it leave them on the stack, so that
>  [% dic_distrib() %]
> would give a list that could be analysed?

It was originally just a debugging tool for the system developers.
But I agree that a list would have been useful.

This will do it:

define dic_numbers() -> list;

     ;;; divert printing: each character is left on the stack instead
     define dlocal cucharout(char);
        if char == `.` then `0` else char endif
     enddefine;

     ;;; make a string of pop11 text for constructing a list
     lvars string = consstring( #| `[`, dic_distrib(), `]` |# );

     ;;; compile the string to make the list
     pop11_compile(stringin(string)) -> list;

enddefine;

or compress the last two code lines to this near-incomprehensible
(except for experienced pop11 hackers) one-liner:

     pop11_compile(stringin(consstring( #| `[`, dic_distrib(), `]` |# ))) -> list;
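
Either way, calling

    dic_numbers() =>

prints the bucket sizes as an ordinary list, which can then be sorted,
summed or otherwise analysed.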

Another alternative would be to have two processes, one producing
characters and the other consuming them. Then you would not need a
string. That's rather more tricky to do in pop11; the produce-consume
interaction could be driven by cucharout.
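
Something along these lines should work (a sketch only, untested; I am
quoting the process routines -- consproc, runproc, suspend -- from
memory, so check REF PROCESS for the exact argument conventions):

    define dic_numbers_via_process() -> list;
        lvars proc;

        ;;; producer: run dic_distrib in a process whose cucharout
        ;;; suspends, handing each printed character straight to the
        ;;; consumer, so no intermediate string is built
        define lvars producer();
            define dlocal cucharout(char);
                suspend(if char == `.` then `0` else char endif, 1)
            enddefine;
            dic_distrib();
            suspend(termin, 1)      ;;; mark the end of the characters
        enddefine;

        consproc(0, producer) -> proc;

        ;;; character repeater for the consumer: resume the producer
        ;;; to get the next character
        define lvars charrep();
            runproc(0, proc)
        enddefine;

        ;;; itemise the character stream and collect the numbers
        lvars items = incharitem(charrep), item;
        [% until (items() ->> item) == termin do item enduntil %] -> list;
    enddefine;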

> >Since memory is now so plentiful and cheap, I was thinking it should be
> >increased to at least 5 times its size, maybe 5119 (a prime number).
> >
> >Any comments?
>
> If it's easy to do (change one constant?) it would be interesting to
> see some before-and-after benchmarks. Presumably compiling a large
> library would be an appropriate realistic benchmark?
>
> Reading a file of natural language text as words would be a good
> non-realistic benchmark (although you need to do that to construct
> a concordance).

For things like that I've suggested that it would be good to have a
version of incharitem that instead of producing word records (in the
dictionary) produced strings.

It should not be hard to implement using the stuff already in
$popsrc/item.p
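
In the meantime a user-level approximation is easy, though it defeats
the point for benchmarking, since the words still get consed into the
dictionary before being turned into strings. Roughly (untested):

    ;;; an item repeater that returns strings instead of words
    define string_items(charrep) -> itemrep;
        lvars nextitem = incharitem(charrep);
        define lvars repeater();
            lvars item = nextitem();
            if isword(item) then word_string(item) else item endif
        enddefine;
        repeater -> itemrep;
    enddefine;

Doing it properly inside item.p would build the strings directly and
never touch the dictionary.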

> Incidentally, the best size for a hash table is about 20% bigger
> (give or take a huge margin) than the number of entries.

In that case since I get

    countwords() =>
    ** 8127

in the instance of pop11 in which I am typing this message, my
suggestion is too small: 8127 entries with 20% headroom would need
roughly 9750 buckets. Maybe something over 10000 would be better. (I
don't know if it could be changed dynamically -- perhaps not without a
lot of other changes). On a 32-bit machine that would require a table
of about 40Kbytes in size. On a 64-bit machine (e.g. Alpha) that would
grow to 80Kbytes. In 1976 that was more than the size of a whole pop11
system. But nowadays memory is available and cheap, so if there was
a significant speed gain for compilation and other applications it
might be worth it.

However, compiling the whole of pop11's Prolog on my 1GHz AMD Athlon
takes under a second:

    time pop11 :"lib prolog" > /dev/null
    0.695u 0.050s 0:00.74 100.0%    0+0k 0+0io 573pf+0w

so maybe it is not worth it.

Even Common Lisp compiles in about 4 seconds:

    time pop11 :"lib clisp" > /dev/null
    3.601u 0.130s 0:03.73 100.0%    0+0k 0+0io 592pf+0w

(That was the second run; the first takes about 1 second longer,
presumably because the second run benefits from Linux's disk cache.)

> Then most
> buckets are empty or have only one entry. So if 5 times is big
> enough, maybe the hash function could be improved?

I misled you about 5 times being big enough. It would not meet the
criterion you quoted.

> IIRC, it was
> very simple: some combination of first, second and last character
> with the length of the word, or was that long ago and far away?

Your memory is almost right. Here's the code, in syspop11, in

    $popsrc/vec_generic.p

    define :inline lconstant CHAR_HASH(T);
        if _len _sgr _2 then
            ;;; use first, middle and last chars and length
            _cptr!(T)[_0] _add _len
            _add _shift(_cptr!(T)[_len _sub _1] _add _len, _3)
            _add _shift(_cptr!(T)[_shift(_len,_-1)] _add _len, _6) -> _len;
            _shift(_len, _-10) _add _len
        elseif _len == _2 then
            ;;; use first and last chars
            _cptr!(T)[_0] _add _shift(_cptr!(T)[_1], _3)
        elseif _len == _1 then
            ;;; just use first character
            _cptr!(T)[_0]
        else
            _0
        endif
    enddefine;
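
In ordinary pop11 terms (a rough paraphrase only -- the real code works
on the raw byte representation, and _shift(x, _-n) is a right shift),
the calculation on a string is something like:

    define char_hash(s) -> hash;
        lvars len = datalength(s);
        if len > 2 then
            ;;; combine first, middle and last characters with length
            subscrs(1, s) + len
                + ((subscrs(len, s) + len) << 3)
                + ((subscrs((len >> 1) + 1, s) + len) << 6) -> hash;
            (hash >> 10) + hash -> hash
        elseif len == 2 then
            ;;; first and last characters only
            subscrs(1, s) + (subscrs(2, s) << 3) -> hash
        elseif len == 1 then
            subscrs(1, s) -> hash
        else
            0 -> hash
        endif
    enddefine;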

> If it is still based on something that simple, then "ved_foo_x"
> and "ved_baz_x" would hash to the same bucket ... since quite
> a lot of words begin "ved" or "sys" and have the same length, it
> would mean the hash was essentially just on the last character,

It uses the length and middle character also, specifically to
reduce this problem. I recall discussing this with John Gibson
20 years ago or so.

> I need to reboot into Linux to look at the source files ...

You may find it easier to browse this directory:

    http://www.cs.bham.ac.uk/research/poplog/src/master/C.all/src

which includes the common code in $popsrc (as of some time in 1999).

I must install the recent changes.

> Eyeballing dic_distrib() in Windows Poplog I can't tell if
> there are too many big numbers: how to know? (The numbers ought
> to be in a Poisson distribution I think. Help! Any statisticians
> here?)
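
A rough way to check is to compare the observed bucket sizes with the
Poisson expectation: with n words in m buckets and a uniform hash, the
expected number of buckets holding exactly k entries is

    m * exp(-lambda) * lambda**k / k!      where lambda = n/m

Something like this (a sketch, untested) prints the expected counts; to
compare, tally how many of the sizes returned by dic_numbers above are
equal to each k:

    define poisson_expected(n, m, maxk) -> list;
        lvars lambda = n / m, k, term;
        m * exp(-lambda) -> term;       ;;; expected count for k = 0
        [% for k from 0 to maxk do
               term;
               term * lambda / (k + 1) -> term
           endfor %] -> list;
    enddefine;

    poisson_expected(countwords(), 1023, 20) ==>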


Aaron