Date: Mon Feb 28 11:59:27 1994
Subject: Re: a heretical suggestion regarding Pop's lexical rules
From: jlc (Jonathan Cunningham)
Volume-ID: 940228.02

> From: "A.Sloman" <A.Sloman@cs.bham.ac.uk>
> Subject: a heretical suggestion regarding Pop's lexical rules

> I began to wonder whether it would be best to abandon the distinction
> between sign characters and alphanumeric characters, and instead, like
> Lisp, allow any space-delimited, or separator-delimited, mixture of
> characters to be accepted by the lexical analyser. Similarly, words
> could be allowed to start with numerals, except for the special cases
> currently used where there's a single letter, e.g. 3.5e5, 3.5s5 (single
> precision float), 3.5d5 (double precision float).
>
> We could rule out leading numerals followed by "_", so as to avoid
> ambiguity with ratios and complex numbers e.g.  "10_/3", "2_+:3",
> "1.5_+:6.9"
>
Why not reform some of the number syntax at the same time? I prefer the
Common Lisp notation for rationals, i.e.
   1/2 is the rational number 1/2
instead of having to type the ugly 1_/2.

The underscore was only necessary because, with separate sign
characters, 1/2 would not previously be parsed as a single item.

Given the speed of machines nowadays, I think the simplest rule would be
to say that a token is any sequence of characters excluding separators
and spaces (i.e. what you are proposing, and roughly what you get in
Lisp), and that any token which isn't a number is a word. I talk about
"speed of machines" because the obvious (simple) way to implement this
is to always make a string corresponding to a token, and discard it if
you discover it can be interpreted as a number. (Actually, if you used a
fixed buffer, you needn't even create garbage when you get numbers.)
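
A minimal sketch of that rule, in Python rather than Pop, just to make it
concrete. The names SEPARATORS, tokens and classify are made up for the
illustration, the separator set is an assumption (Pop's real table is
larger), and only plain integers, decimals and Lisp-style ratios are
treated as numbers here.

```python
import re
from fractions import Fraction

# Assumed separator set, for illustration only.
SEPARATORS = set(",;()[]")

INTEGER = re.compile(r"[0-9]+")
RATIO = re.compile(r"([0-9]+)/([0-9]+)")
DECIMAL = re.compile(r"[0-9]+\.[0-9]+")

def tokens(text):
    """Yield tokens: runs of characters containing no spaces or separators.
    Each separator character is yielded as a token of its own."""
    buf = []  # reused for every token, standing in for the fixed buffer
    for ch in text:
        if ch.isspace() or ch in SEPARATORS:
            if buf:
                yield "".join(buf)
                buf.clear()
            if ch in SEPARATORS:
                yield ch
        else:
            buf.append(ch)
    if buf:
        yield "".join(buf)

def classify(tok):
    """Return a number if the whole token reads as one, else the token
    itself, which then counts as a word (or separator)."""
    if INTEGER.fullmatch(tok):
        return int(tok)
    m = RATIO.fullmatch(tok)
    if m and int(m.group(2)) != 0:
        return Fraction(int(m.group(1)), int(m.group(2)))
    if DECIMAL.fullmatch(tok):
        return float(tok)
    return tok
```

With this sketch, "x+y,x-y" yields the three tokens x+y , and x-y, while
1cat stays a word and 1/2 becomes a ratio.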

> Then, for example, x+y would be a single identifier, not three, whereas
> x+y,x-y would be three identifiers, since "," would remain a separator,
> and 1cat 2cat 3cat would be acceptable identifiers.
>
> This would have several advantages:
[deleted] - yes.
>
> Disadvantages:
>
> (1) the compiler and things like valof would have to dig into words to
> see if they start with "$-" instead of being given "$-" as a separate
> item to test using "==". This will complicate and slow down compilation
> where sections are used. However, some other aspects of lexical analysis
> would presumably be slightly faster and certainly simpler.

We could make a rule for such things similar to the one for numbers,
i.e. a token which can be split into particular significant pieces is
split into those pieces, so that $-baz$-grum is parsed into four words:
$-, baz, $-, and grum.
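
As a rough illustration of splitting $-baz$-grum (Python again, not Pop;
split_sections and SECTION_SEP are hypothetical names, and this is only
one way the post-processing step might be done):

```python
import re

SECTION_SEP = "$-"

# The capturing group makes re.split keep the "$-" pieces in the result.
_SPLITTER = re.compile("(" + re.escape(SECTION_SEP) + ")")

def split_sections(tok):
    """Split a token around every "$-", keeping each "$-" as its own word."""
    return [piece for piece in _SPLITTER.split(tok) if piece]
```

A token with no "$-" in it comes back unchanged as a one-element list, so
the step can be applied to every word token.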

> (2) The lexical analyser would have a harder job deciding whether it was
> dealing with a number or a word. 123ee5 would be a word and 123e5 a
> number.
>
I don't think so. Using the "token buffer" approach, I think it would be
no harder than it is now. You might have to take a different approach,
that is all.
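
For instance, once the whole token is sitting in the buffer, the
123e5/123ee5 case comes down to one whole-token match. A sketch in Python,
covering only the float forms quoted above (the pattern and the name
is_number are my assumptions, not the real Pop number syntax):

```python
import re

# Digits, an optional fraction part, then at most ONE exponent letter
# (e, s or d) followed by an optionally signed exponent.
FLOAT_LIKE = re.compile(r"[0-9]+(\.[0-9]+)?([esd][+-]?[0-9]+)?")

def is_number(tok):
    """True if the entire token matches the assumed number pattern."""
    return FLOAT_LIKE.fullmatch(tok) is not None
```

Anything that fails the match, such as 123ee5 or 1cat, is simply kept as a
word.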

> (3) many expressions using infix operators will become more verbose, as
> they will need more spaces, as in Lisp. (However some of us prefer to
> use spaces in those contexts in any case. I often (not always) add
> spaces to code obtained from others, e.g. after commas and around "+",
> "-", "*", "=", and "/", and possibly even "->", for the sake of clarity.
> Steve Knight goes even further and puts spaces after "(" and before ")".

Perhaps this should be moved to the list of advantages :-).

> (4) it is TOTALLY backward non-compatible, though I think it would not

I don't think translating existing code is the answer. What you could do
is make a new lexical analyser available as an alternative. Why not
adapt the existing Lisp lexer - make it case-sensitive and change the
tables that define separators etc.? You could try it out by using
popval(pdtolist(newincharitem(rep))) - I can't think of any reason why
that wouldn't work on a new source file. Make a new pop94 compiler! :-).

Another disadvantage which you didn't mention (but it obviously doesn't
bother the Lisp community) is the danger of very small typos leaving
the program syntactically legal (remember the dot instead of a comma
which turned a Fortran "do" loop into an assignment statement). There is an
argument for syntactic redundancy, so that typing errors cause syntax
errors (rather than run-time errors). You lose some syntactic
redundancy if you increase the number of legal identifiers (obviously -
there is no way to avoid it). OTOH, the need for more spaces partially
offsets this.

Cheers,
Jonathan