Computers need to recognise whether a sequence of objects forms a valid statement in a language.
Consider
x := x + 1
the cat eats the canary.
A parser is a Scheme function which recognises a phrase in a language.
For English, a phrase is some well-formed part of a sentence.
"the cat" is a phrase of the English language.
Why do we focus on recognising phrases?
A sentence is complicated: it is built up out of phrases.
For English, we create parsers to recognise
nouns,
determiners (the words "a" and "the"),
then use these to build a parser for recognising noun-phrases.
Finally we build a parser for sentences out of noun- and verb-phrase parsers.
Languages are defined by grammars.
Grammars are written using productions. The production
sentence -> noun_phrase verb_phrase
says that a sentence is a noun-phrase followed by a verb-phrase.
In Scheme terms, the sentence:
'(the cat eats the canary)
is made by appending the noun-phrase the cat
to the verb-phrase eats the canary:
(append '(the cat) '(eats the canary))
Terminal symbols are those entities in the grammar which cannot be rewritten.
They are words (or "tokens", or "lexemes") of the language.
Those entities in the grammar which can be rewritten are called "non-terminal symbols".
We shall use the notation that we have already been using informally to describe Scheme, in which non-terminals are written using bold typewriter characters, while terminals are written using plain typewriter characters.
sentence -> noun_phrase verb_phrase
noun_phrase -> determiner noun
determiner -> the
determiner -> a
noun -> cat
noun -> dog
noun -> canary
verb_phrase -> verb noun_phrase
verb -> eats
verb -> likes
For convenience, the vertical bar can be used to indicate alternative right-hand sides:
verb -> eats | likes
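The productions above can themselves be written down as Scheme data, which makes the distinction between terminals and non-terminals easy to test. The following is only a sketch: the list layout and the names grammar and non_terminal? are assumptions made for this illustration, not part of the course library.

```scheme
;; Assumed layout: (left-hand-side alternative ...),
;; where each alternative is a list of symbols.
(define grammar
  '((sentence    (noun_phrase verb_phrase))
    (noun_phrase (determiner noun))
    (verb_phrase (verb noun_phrase))
    (determiner  (the) (a))
    (noun        (cat) (dog) (canary))
    (verb        (eats) (likes))))

;; Hypothetical helper: a symbol is a non-terminal if some production
;; rewrites it; otherwise it is a terminal.
(define (non_terminal? symbol)
  (if (assq symbol grammar) #t #f))

(non_terminal? 'noun_phrase)   ; #t
(non_terminal? 'cat)           ; #f
```

Here cat is a terminal because no production has it on the left-hand side.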
The earliest convention for writing productions used by computer
scientists is Backus-Naur Form (BNF), in which non-terminal
symbols are enclosed in angle brackets. For example
<sentence> -> <noun_phrase> <verb_phrase>
<determiner> -> the | a
Another convention that is used is to quote terminal symbols, thus:
sentence -> noun_phrase verb_phrase
determiner -> "the" | "a"
A parser is a function which recognises that a particular sequence of terminal symbols is a legal statement in a language.
Practical applications of parsers usually require that some kind of parse-tree is produced to represent the logical structure of the statement.
A compiler needs a parse-tree in order to generate code.
A natural language query system requires a parse tree in order to "understand" a question put to it and to generate an answer.
In computer science we call the terminal symbols fed to a parser "tokens".
We use Scheme expressions to represent parse-trees. So, a Pascal parser might parse:
'(x + 2 * y)
as the Scheme structure
'(+ x (* 2 y))
Suppose we define a parser as a function that takes a list of tokens and returns a parse-tree.
This raises two problems.
First, we have to find some way of signalling that a list of tokens is not in fact a sentence of the language. Returning #f is the obvious choice, but can we be sure that #f can't be mistaken for an actual parse-tree?
Second, compositionality.
It is hopeless to try to write one huge function which will perform a parse of a complex language.
We have to build a parser out of smaller functions that we write.
What sort of functions?
We build big parsers out of smaller ones.
sentence -> noun_phrase verb_phrase
We want a parser for a sentence - so build it out of a parser for the grammatical class
noun_phrase
and a parser for the class verb_phrase.
Suppose we want our sentence-parser to parse this sentence:
(the cat eats the canary)
What text are the noun_phrase parser and the verb_phrase parser going to have to work on?
The noun_phrase-parser is going to work on the whole list:
(the cat eats the canary)
recognising (the cat) as being a noun_phrase.
The verb_phrase-parser is going to have to work on (eats the canary).
We have to provide some way of giving it its correct argument.
How about the cddr of the original list?
We would be making the assumption that each noun phrase consists of exactly two tokens, which fails for
(the furry cat eats the canary).
It is a bad mistake when creating parsers to assume that any given grammatical class has a fixed length.
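The difficulty can be seen directly at the Scheme prompt. In this small illustration the variable names short_sentence and long_sentence are introduced only for the example.

```scheme
(define short_sentence '(the cat eats the canary))
(define long_sentence  '(the furry cat eats the canary))

;; With a two-token noun phrase, cddr happens to give the verb phrase.
(cddr short_sentence)   ; (eats the canary)

;; With a three-token noun phrase, cddr cuts the noun phrase in half.
(cddr long_sentence)    ; (cat eats the canary)
```

In the second case the verb-phrase parser would be handed a list beginning with cat rather than eats.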
Better: require a parser to return as its result a record which contains
the parse-tree for the phrase it recognised, and
the part of the original list of tokens which remains unparsed.
That way we can slot the two parsers for noun phrases and verb phrases together (or any other two).
We can think of the first parser as "eating" tokens until it is satisfied, when what remains is given to the second parser.
If we are using UMASS Scheme we can define a record class for parses as follows:
(define class_parse (record-class 'parse '(full full)))
(define cons_parse (car class_parse))
(define sel_parse (caddr class_parse))
(define tree_parse (car sel_parse))
(define rest_parse (cadr sel_parse))
Now, a parser can signal failure by returning #f without risk of confusion, since if it succeeds it will always return one of these parse-records.
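For readers without UMASS Scheme, a parse-record can be approximated with a tagged list. This is only a sketch: the tagged-list representation and the predicate is_parse? are assumptions of this example, while the names cons_parse, tree_parse and rest_parse mirror those defined above.

```scheme
;; Portable stand-in for the UMASS record class (assumption: a tagged list).
(define (cons_parse tree rest) (list 'parse tree rest))
(define (tree_parse p) (cadr p))
(define (rest_parse p) (caddr p))

;; Hypothetical helper: true for a parse-record, false for the failure value #f.
(define (is_parse? p)
  (and (pair? p) (eq? (car p) 'parse)))

;; Even a parse whose tree happens to be #f is a record, not #f itself.
(define p (cons_parse #f '(eats the canary)))

(is_parse? p)    ; #t, the parse succeeded
(is_parse? #f)   ; #f, the parse failed
```

The point is that the failure value and the success value are of different kinds, so a caller can test the result with a plain if.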
Thus we can regard a parser as a function which takes as argument a list of tokens and returns as result either #f or a parse-record holding a parse-tree and the list of tokens which remain unparsed.
For example, we might write a parser which recognised expressions in the Pascal language and converted them to an internal form.
The Pascal expression x+2*y would be represented by the Scheme expression
(+ x (* 2 y))
Thus suppose we have a parser for Pascal
(define p (parse_pascal '(x + 2 * y ; z := 4;)) )
then the Scheme variable p will have the value which prints as:
<parse (begin (+ x (* 2 y)) (:= z 4)) '()>
if we choose an obvious way to represent Pascal as Scheme.
To make a parser for the whole English language is hard.
Even to make a parser for a programming language is quite hard.
However, let us make a start with these productions:
noun_phrase -> determiner noun
determiner -> the
determiner -> a
That is to say, a determiner is the word "a" or "the".
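Let's write a parser for determiner.
The function takes a list of tokens as argument.
If (1) the list of tokens begins with the symbol 'a or with the symbol 'the, we have a determiner,
so (2) we create a parse-record whose (3) tree is that symbol and whose (4) unparsed list is the rest of the tokens. Otherwise (5) the parser fails by returning #f.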
(define parse_determiner
  (lambda (list_of_tokens)
    (if (member? (car list_of_tokens) '(a the)) ; (1) determiner there?
        (cons_parse                             ; (2) yes! make parse
          (car list_of_tokens)                  ; (3) tree
          (cdr list_of_tokens))                 ; (4) unparsed list
        #f)))                                   ; (5) fail
We will need the definition of member? from Lecture 5:
(define (member? x list)
  (if (null? list)
      #f
      (if (equal? x (car list))
          #t
          (member? x (cdr list)))))
We can now try this out.
(parse_determiner '(the cat eats the canary))
returns the parse-record:
<parse the (cat eats the canary)>
The tree_parse component of the parse record is the symbol 'the;
the rest_parse component is '(cat eats the canary), that is, the original list with the determiner removed.
While if there is no determiner:
(parse_determiner '(eats the canary))
we get
#f
Note that there is a bug in this function: we have not allowed for the possibility that the list of tokens might be empty. We'll mend this later.
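The lecture's own repair comes later, but one minimal way to guard against the empty list might look like the sketch below. It makes several assumptions: the name parse_determiner_safe is invented for this illustration, parse-records are approximated by tagged lists rather than the UMASS record class, and the standard procedure memq stands in for member? to keep the block self-contained.

```scheme
;; Stand-in parse-records (assumption: tagged lists).
(define (cons_parse tree rest) (list 'parse tree rest))
(define (tree_parse p) (cadr p))
(define (rest_parse p) (caddr p))

;; Hypothetical guarded variant: check that there is a first token
;; before looking at it with car.
(define parse_determiner_safe
  (lambda (list_of_tokens)
    (if (and (pair? list_of_tokens)
             (memq (car list_of_tokens) '(a the)))
        (cons_parse (car list_of_tokens) (cdr list_of_tokens))
        #f)))

(parse_determiner_safe '())           ; #f, fails cleanly on no tokens
(parse_determiner_safe '(the cat))    ; a parse-record: tree the, rest (cat)
```

Because and stops at the first false condition, car is never applied to an empty list.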
Likewise we could define a noun as a member of a list of words known to be nouns. Now there are tens if not hundreds of thousands of nouns in the English language, so this would be a long list (and expensive to look through), but we can restrict our vocabulary.
(define noun '(cat dog child woman man bone cabbage canary))

(define parse_noun
  (lambda (list_of_tokens)
    (if (member? (car list_of_tokens) noun)
        (cons_parse (car list_of_tokens) (cdr list_of_tokens))
        #f)))
Now, can we write a parser for a noun_phrase? Make use of our two existing parsers, rather than trying to write a function that sees if the list_of_tokens has a determiner as a first word and a noun as second. This is NOT what we should do:
(define parse_noun_phrase
  (lambda (list_of_tokens)              ; DONT DO THIS
    (if (and (member? (car list_of_tokens) '(a the))
             (member? (cadr list_of_tokens) noun))
        (cons_parse
          (list 'noun_phrase (car list_of_tokens) (cadr list_of_tokens))
          (cddr list_of_tokens))
        #f)))
Consider our grammar
noun_phrase -> determiner noun
There's no way we could extend the kind of parser we see above to handle a language with a complex grammar. Instead we should rely on our little parsers being designed to work together to make a big parser.
(define parse_noun_phrase
  (lambda (list_of_tokens)
    (let ((p_det (parse_determiner list_of_tokens)))   ; (1)
      (if p_det                                        ; (2)
          (let ((p_n (parse_noun (rest_parse p_det)))) ; (3)
            (if p_n                                    ; (4)
                (cons_parse                            ; (5)
                  (list 'noun_phrase                   ; (6)
                        (tree_parse p_det)             ; (7)
                        (tree_parse p_n))              ; (8)
                  (rest_parse p_n))                    ; (9)
                #f))                                   ; (10)
          #f))))                                       ; (11)
Now we can try this out
(parse_noun_phrase '(the cat eats the canary))
obtaining
<parse (noun_phrase the cat) (eats the canary)>
The tree_parse component of the parse record is
'(noun_phrase the cat),
and the rest_parse component is '(eats the canary),
the original list with the noun-phrase removed.
and, if the parse fails:
(parse_noun_phrase '(eats the canary))
we get
#f
Also if we have a determiner first, but no noun second:
(parse_noun_phrase '(the the canary))
we get
#f
Verbs are handled in the same way as nouns:
(define verb '(likes eats hugs))

(define (parse_verb list_of_tokens)
  (if (member? (car list_of_tokens) verb)
      (cons_parse (car list_of_tokens) (cdr list_of_tokens))
      #f))
We are going to get fed up with changing the names of the p_... variables, so let us call them p1 and p2.
(define parse_verb_phrase
  (lambda (list_of_tokens)
    (let ((p1 (parse_verb list_of_tokens)))
      (if p1
          (let ((p2 (parse_noun_phrase (rest_parse p1))))
            (if p2
                (cons_parse
                  (list 'verb_phrase (tree_parse p1) (tree_parse p2))
                  (rest_parse p2))
                #f))
          #f))))
(example '(parse_verb_phrase '(eats the canary))
         (cons_parse '(verb_phrase eats (noun_phrase the canary))
                     '()))
Now we can define a parser for a sentence of the English language, following the production
sentence -> noun_phrase verb_phrase
(define parse_sentence
  (lambda (list_of_tokens)
    (let ((p1 (parse_noun_phrase list_of_tokens)))
      (if p1
          (let ((p2 (parse_verb_phrase (rest_parse p1))))
            (if p2
                (cons_parse
                  (list 'sentence (tree_parse p1) (tree_parse p2))
                  (rest_parse p2))
                #f))
          #f))))
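The three phrase parsers above share one shape: run a first parser, run a second on what remains, and combine the two trees under a label. A sketch of how that shape might be factored out is given below. The helpers parse_seq and make_word_parser are hypothetical names introduced for this illustration, parse-records are approximated by tagged lists rather than the UMASS record class, and the standard procedure memq replaces member?.

```scheme
;; Stand-in parse-records (assumption: tagged lists).
(define (cons_parse tree rest) (list 'parse tree rest))
(define (tree_parse p) (cadr p))
(define (rest_parse p) (caddr p))

;; Hypothetical helper: a parser accepting one token drawn from words.
(define (make_word_parser words)
  (lambda (list_of_tokens)
    (if (and (pair? list_of_tokens)
             (memq (car list_of_tokens) words))
        (cons_parse (car list_of_tokens) (cdr list_of_tokens))
        #f)))

;; Hypothetical helper: run parser1, then parser2 on the tokens parser1
;; left over, and label the combined tree.
(define (parse_seq label parser1 parser2)
  (lambda (list_of_tokens)
    (let ((p1 (parser1 list_of_tokens)))
      (if p1
          (let ((p2 (parser2 (rest_parse p1))))
            (if p2
                (cons_parse (list label (tree_parse p1) (tree_parse p2))
                            (rest_parse p2))
                #f))
          #f))))

(define parse_determiner (make_word_parser '(a the)))
(define parse_noun
  (make_word_parser '(cat dog child woman man bone cabbage canary)))
(define parse_verb (make_word_parser '(likes eats hugs)))

(define parse_noun_phrase (parse_seq 'noun_phrase parse_determiner parse_noun))
(define parse_verb_phrase (parse_seq 'verb_phrase parse_verb parse_noun_phrase))
(define parse_sentence    (parse_seq 'sentence parse_noun_phrase parse_verb_phrase))

(tree_parse (parse_sentence '(the cat eats the canary)))
;; (sentence (noun_phrase the cat) (verb_phrase eats (noun_phrase the canary)))
```

Each grammar production of the form x -> y z then corresponds to a single call of parse_seq, which keeps the parser close to the grammar it implements.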
Now we can try out our complete sentence-parser. If we try it on the sentence '(the cat eats the canary) we see that we obtain a parse:
(example '(parse_sentence '(the cat eats the canary))
         (cons_parse
           '(sentence (noun_phrase the cat)                        ; parse-tree
                      (verb_phrase eats (noun_phrase the canary))) ; end of parse-tree
           '()))                                                   ; unparsed
If we give it a non-sentence like '(canary the cat eats) we get:
(example '(parse_sentence '(canary the cat eats)) #f)
that is, the parse fails.
Now give it a sentence with nonsense at the end.
We can use the example capability to compare the result of the parse (1) with what we expect (2).
Run the example - it works! The trailing tokens are simply returned as the unparsed remainder.
(example '(parse_sentence '(the dog eats the bone 4 5 6))   ; (1)
         (cons_parse                                        ; (2)
           '(sentence (noun_phrase the dog)
                      (verb_phrase eats (noun_phrase the bone)))
           '(4 5 6)))
(example '(parse_sentence '(the cat eats the canary with the yellow feathers))
         (cons_parse
           '(sentence (noun_phrase the cat)
                      (verb_phrase eats (noun_phrase the canary)))
           '(with the yellow feathers)))
That is, the prepositional phrase '(with the yellow feathers) is left unrecognised. Note that this form of sentence also poses a problem of ambiguity: does the canary have yellow feathers, or does the cat use yellow feathers as an instrument with which to eat the unfortunate bird? We know that canaries are yellow-feathered birds and that cats are not given to using tools, but that is semantic knowledge that cannot readily be built into syntax. The parallel sentence "the man eats the turkey with the knife and fork" has the opposite structure.
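Both of the last two examples return a parse-record even though tokens were left over. Where a complete sentence is required, a wrapper might insist that nothing remains unparsed. The following is a sketch under stated assumptions: parse_all and the toy parser parse_the are hypothetical names used only here, and parse-records are approximated by tagged lists rather than the UMASS record class.

```scheme
;; Stand-in parse-records (assumption: tagged lists).
(define (cons_parse tree rest) (list 'parse tree rest))
(define (tree_parse p) (cadr p))
(define (rest_parse p) (caddr p))

;; Hypothetical wrapper: succeed only if parser consumes every token.
(define (parse_all parser list_of_tokens)
  (let ((p (parser list_of_tokens)))
    (if (and p (null? (rest_parse p)))
        p
        #f)))

;; Toy parser for the demonstration: accepts the single token the.
(define (parse_the list_of_tokens)
  (if (and (pair? list_of_tokens) (eq? (car list_of_tokens) 'the))
      (cons_parse 'the (cdr list_of_tokens))
      #f))

(parse_all parse_the '(the))       ; a parse-record, nothing left over
(parse_all parse_the '(the cat))   ; #f, because (cat) is left unparsed
```

The same wrapper could be applied to parse_sentence, which would then reject the inputs ending in 4 5 6 or in the unrecognised prepositional phrase.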
The parser will also accept nonsense, provided it is grammatical:
(example '(parse_sentence '(the cabbage eats the man))
         (cons_parse
           '(sentence (noun_phrase the cabbage)
                      (verb_phrase eats (noun_phrase the man)))
           '()))
Distinguishing grammatically correct sense from grammatically correct nonsense is an issue of semantics.
We can make good use of the trace function to help us debug our parsers.
(trace parse_verb_phrase)
(trace parse_verb)
(trace parse_noun_phrase)
(parse_verb_phrase '(eats the canary))
> parse_verb_phrase [eats the canary ] 1
!> parse_verb [eats the canary ] 1
!< parse_verb [eats the canary ]
!> parse_noun_phrase [the canary ] 1
!< parse_noun_phrase [[noun_phrase the canary ]]
< parse_verb_phrase [[verb_phrase eats [noun_phrase the canary ]]]
((verb_phrase eats (noun_phrase the canary)))