Clive Souter poses this interesting problem:
> I'm having a problem trying to load a large lexicon into pop. By large,
> I mean natural language type large, ie almost 60,000 lines. As a relatively
> naive pop-user, I thought I would make the external (unix) file look like
> this:
>
> [[....] ...... [....]]->lexicon;
>
[A lot of discussion deleted]
> Incidentally, the reason I'm trying to load
> the lexicon is to create a pop11 datafile structure from it, but I believe
> you need to load it first to create one.
As a preliminary, I strongly suggest that such large data structures
are kept on secondary store. This lexicon will consume approximately
2Mb of store (estimated using an average of 8 long words per word;
assuming 4-byte long words, 60,000 x 8 x 4 bytes is about 1.9Mb). This
will significantly impact the overall performance of the program -- in
terms of garbage collection, page and memory cache performance and so on.
Furthermore, this 2Mb is just what is needed for the words. When you
add in the properties associated with the words (as you are likely
to want to) an in-store solution is unattractive.
So, rule of thumb, keep your large data structures on disc.
Secondly, definitely don't use -datafile-. Yes, it is very convenient
but that convenience is bought at the expense of excessive store
consumption, opacity of representation, and sluggishness of performance.
It isn't really intended for this kind of job. And besides, -datafile-
doesn't quite work as advertised and I don't think that one should be
getting into those kinds of issues.
If you are looking for a packaged solution, I suggest getting hold
of Jon Meyer's interface to "gdbm" which will do just about exactly what you
want. Jon works for Integral Solutions Ltd and would (I think) be
happy to make his work available. For all I know it already is!
gdbm is a package for creating and accessing large lookup tables
of strings.
The rest of this note is relevant if you are prepared to roll your own
solution.
> Another solution, which would involve me getting my hands dirty is to use the
> lexicon as an external file, with an external look-up program written in say
> C, which was then called using pop's external functions. However, judging
> by the number of queries on pop-forum about external functions, i'm not
> sure this is a desirable solution.
This is certainly an option. Don't be put off by any discussion of
external functions -- they are a cinch in normal use. The recent spate
of complaints is related to the considerable complexity that is
introduced by the interface to the X toolkit. And they are largely
caused by the short-sightedness of the designers of X and the toolkit
who are under the dubious impression that C is a good language to
develop graphical applications in. Pah.
However, why bother with C? Pop11 has a whole bunch of perfectly
decent file i/o primitives that are just what the doctor ordered.
Look at REF SYSIO. If you are happy to hack in C you'll be perfectly
at home here.
From my own efforts at manipulating large dictionaries (250,000 words
and upwards) I recommend the following technique. Prepare the lexicon
as a file of words + any properties you want. e.g.
cat (noun)
jump (verb)
quickly (adverb)
Sort this on the first field, using the "sort" facility under UNIX.
[This is an absolutely indispensable program, incidentally, and gets
four stars.] Now this is a free-format file -- easy to read but
not so easy to manipulate as a random access file.
Then write an indexing program. This program will generate a binary index
from the lexicon. The index consists of the starting addresses of
all the words in the lexicon. [This index *can* be loaded into Pop11
as an array. It is only 60,000 words in size. However, make sure it
gets loaded as an integer-only vector (an intvec). Integer only vectors
are NOT scanned by the garbage collector and therefore don't make as
much impact on the performance.]
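
To make the indexing step concrete, here is a minimal sketch. It is
written in Python purely for illustration (it is not Pop11 and not the
program I actually used), and the names build_index and load_index are
made up for the example. It assumes one entry per line and stores each
line's starting byte offset as a fixed-size integer; Python's array
type plays roughly the role that an intvec would play in Pop11.

```python
import array

def build_index(lexicon_path, index_path):
    """Write the starting byte offset of every lexicon line to a binary file.

    Returns the number of entries indexed.
    """
    offsets = array.array("q")  # packed signed 64-bit integers
    with open(lexicon_path, "rb") as lexicon:
        position = 0
        for line in lexicon:
            offsets.append(position)   # this line starts here
            position += len(line)      # next line starts after it
    with open(index_path, "wb") as index:
        offsets.tofile(index)
    return len(offsets)

def load_index(index_path):
    """Read the binary index back into a packed integer array."""
    offsets = array.array("q")
    with open(index_path, "rb") as index:
        offsets.frombytes(index.read())
    return offsets
```

The file is opened in binary mode so that the offsets are true byte
positions that can later be handed to a seek.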
To look up a word in the lexicon you can use binary chop or prepare other
more sophisticated indexes. For 60,000 elements, I would tend to construct
other indices (e.g. a first letter index, that tells you where each
letter of the alphabet starts in the lexicon) and use a search that
makes a slightly better first guess at where to look than plain binary
chop.
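
A sketch of the lookup, again in Python for illustration only and with
made-up names (line_offsets, word_at, first_letter_index, lookup). It
assumes the lexicon is sorted in plain byte order (for instance by
running "sort" in the C locale) and that each line is a word, a space,
then its properties. The first letter index narrows the range, and
binary chop over the offsets does the rest, seeking into the file for
each probe rather than holding the words in store.

```python
import array

def line_offsets(path):
    """Starting byte offset of each line (stands in for the loaded index)."""
    offsets = array.array("q")
    with open(path, "rb") as f:
        position = 0
        for line in f:
            offsets.append(position)
            position += len(line)
    return offsets

def word_at(lexicon, offsets, i):
    """Seek to entry i and return (word, properties) as bytes."""
    lexicon.seek(offsets[i])
    line = lexicon.readline().rstrip(b"\n")
    word, _, properties = line.partition(b" ")
    return word, properties

def first_letter_index(lexicon, offsets):
    """Map each first byte to the half-open range [lo, hi) of its entries."""
    ranges = {}
    for i in range(len(offsets)):
        word, _ = word_at(lexicon, offsets, i)
        key = word[:1]
        lo, _ = ranges.get(key, (i, i))
        ranges[key] = (lo, i + 1)
    return ranges

def lookup(lexicon, offsets, ranges, word):
    """Binary chop within the first-letter range; None if the word is absent."""
    lo, hi = ranges.get(word[:1], (0, 0))
    while lo < hi:
        mid = (lo + hi) // 2
        found, properties = word_at(lexicon, offsets, mid)
        if found == word:
            return properties
        if found < word:
            lo = mid + 1
        else:
            hi = mid
    return None
```

Here the first letter index is built by scanning the lexicon once; in
practice the indexing program could just as well write it out alongside
the offsets.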
Steve