\documentstyle[11pt]{article}
\title{TP Lex and Yacc -- The Compiler Writer's Tools for Turbo Pascal\\
Version 4.1 User Manual}
\author{Albert Gr\"af\\
Department of Musicinformatics\\
Johannes Gutenberg-University Mainz\\
\\
ag@muwiinfa.geschichte.uni-mainz.de}
\date{April 1998}
\setlength{\topmargin}{0cm}
\setlength{\oddsidemargin}{0cm}
\setlength{\evensidemargin}{0cm}
\setlength{\textwidth}{16cm}
\setlength{\textheight}{21cm}
%\setlength{\parindent}{0pt}
\parskip=4pt plus 1pt minus 1pt
\itemsep=0pt
\renewcommand{\baselinestretch}{1.1}
\unitlength=1mm
%\tolerance=500
%\parskip=0.1cm
\leftmargini 1.5em
\leftmarginii 1.5em \leftmarginiii 1.5em \leftmarginiv 1.5em \leftmarginv 1.5em
\leftmarginvi 1.5em
\begin{document}
\maketitle
\tableofcontents
\section{Introduction}
This document describes the TP Lex and Yacc compiler generator toolset.
These tools are designed especially to help you prepare compilers and
similar programs like text processing utilities and command language
interpreters with the Turbo Pascal (TM) programming language.
TP Lex and Yacc are Turbo Pascal adaptations of the well-known UNIX (TM)
utilities Lex and Yacc, which were written by M.E. Lesk and S.C. Johnson
at Bell Laboratories, and are used with the C programming language. TP Lex
and Yacc are intended to be approximately ``compatible'' with these programs.
However, they are an independent development of the author, based on the
techniques described in the famous ``dragon book'' of Aho, Sethi and Ullman
(Aho, Sethi, Ullman: {\em Compilers: Principles, Techniques, and Tools,\/}
Reading (Mass.), Addison-Wesley, 1986).
Version 4.1 of TP Lex and Yacc works with all recent flavours of Turbo/Borland
Pascal, including Delphi, and with the Free Pascal Compiler, a free Turbo
Pascal-compatible compiler which currently runs on DOS and Linux (other ports
are under development). Recent information about TP Lex/Yacc, and the sources
are available from the TPLY homepage:
\begin{quote}\begin{verbatim}
http://www.musikwissenschaft.uni-mainz.de/~ag/tply
\end{verbatim}\end{quote}
For information about the Free Pascal Compiler, please refer to:
\begin{quote}\begin{verbatim}
http://www.freepascal.org
\end{verbatim}\end{quote}
TP Lex and Yacc, like any other tools of this kind, are not intended for
novices or casual programmers; they require extensive programming experience
as well as a thorough understanding of the principles of parser design and
implementation to be put to work successfully. But if you are a seasoned
Turbo Pascal programmer with some background in compiler design and formal
language theory, you will almost certainly find TP Lex and Yacc to be a
powerful extension of your Turbo Pascal toolset.
This manual tells you how to get started with the TP Lex and Yacc programs
and provides a short description of these programs. Some knowledge about
the C versions of Lex and Yacc will be useful, although not strictly
necessary. For further reading, you may also refer to:
\begin{itemize}
\item
Aho, Sethi and Ullman: {\em Compilers: Principles, Techniques, and Tools.\/} Reading (Mass.), Addison-Wesley, 1986.
\item
Johnson, S.C.: {\em Yacc -- yet another compiler-compiler.\/} CSTR-32,
Bell Telephone Laboratories, 1974.
\item
Lesk, M.E.: {\em Lex -- a lexical analyser generator.\/} CSTR-39, Bell
Telephone Laboratories, 1975.
\item
Schreiner, Friedman: {\em Introduction to compiler construction with
UNIX.\/} Prentice-Hall, 1985.
\item
The Unix Programmer's Manual, Sections `Lex' and `Yacc'.
\end{itemize}
\subsection*{Credits}
I would like to thank Berend de Boer (berend@pobox.com), who adapted TP Lex
and Yacc to take advantage of the large memory models in Borland Pascal 7.0
and Delphi, and Michael Van Canneyt (Michael.VanCanneyt@fys.kuleuven.ac.be),
the maintainer of the Linux version of the Free Pascal compiler, who is
responsible for the Free Pascal port. And of course thanks are due to the many
TP Lex/Yacc users all over the world for their support and comments which
helped to improve these programs.
\subsection*{Getting Started}
Instructions on how to compile and install TP Lex and Yacc on all supported
platforms can be found in the \verb"README" file contained in the
distribution.
Once you have installed TP Lex and Yacc on your system, you can compile your
first TP Lex and Yacc program \verb"expr". \verb"Expr" is a simple desktop
calculator program contained in the distribution, which consists of a lexical
analyzer in the TP Lex source file \verb"exprlex.l" and the parser and main
program in the TP Yacc source file \verb"expr.y". To compile these programs,
issue the commands
\begin{quote}\begin{verbatim}
lex exprlex
yacc expr
\end{verbatim}\end{quote}
That's it! You now have the Turbo Pascal sources (\verb"exprlex.pas" and
\verb"expr.pas") for the \verb"expr" program. Use the Turbo Pascal
compiler to compile these programs as follows:
\begin{quote}\begin{verbatim}
tpc expr
\end{verbatim}\end{quote}
(Of course, the precise compilation command depends on the type of compiler
you are using. Thus you may have to replace \verb"tpc" with \verb"bpc",
\verb"dcc" or \verb"dcc32", depending on the version of the
Turbo/Borland/Delphi compiler you have, and with \verb"ppc386" for the Free
Pascal compiler. If you are using TP Lex and Yacc with Free Pascal under
Linux, the corresponding commands are:
\begin{quote}\begin{verbatim}
plex exprlex
pyacc expr
ppc386 expr
\end{verbatim}\end{quote}
Note that in the Linux version, the programs are named \verb"plex" and
\verb"pyacc" to avoid name clashes with the corresponding UNIX utilities.)
Having compiled \verb"expr.pas", you can execute the \verb"expr" program and
type some expressions to see it work (terminate the program with an empty
line). There are a number of other sample TP Lex and Yacc programs (\verb".l"
and \verb".y" files) in the distribution, including a TP Yacc cross reference
utility and a complete parser for Standard Pascal.
The TP Lex and Yacc programs recognize some options which may be specified
anywhere on the command line. E.g.,
\begin{quote}\begin{verbatim}
lex -o exprlex
\end{verbatim}\end{quote}
runs TP Lex with ``DFA optimization'' and
\begin{quote}\begin{verbatim}
yacc -v expr
\end{verbatim}\end{quote}
runs TP Yacc in ``verbose'' mode (TP Yacc generates a readable description
of the generated parser).
The TP Lex and Yacc programs use the following default filename extensions:
\begin{itemize}
\item \verb".l": TP Lex input files
\item \verb".y": TP Yacc input files
\item \verb".pas": TP Lex and Yacc output files
\end{itemize}
As usual, you may override the default filename extensions by explicitly
specifying suffixes.
If you ever forget how to run TP Lex and Yacc, you can issue the command
\verb"lex" or \verb"yacc" (resp.\ \verb"plex" or \verb"pyacc")
without arguments to get a short summary of the command line syntax.
\section{TP Lex}
This section describes the TP Lex lexical analyzer generator.
\subsection{Usage}
\begin{quote}\begin{verbatim}
lex [options] lex-file[.l] [output-file[.pas]]
\end{verbatim}\end{quote}
\subsection{Options}
\begin{description}
\item[\verb"-v"]
``Verbose:'' Lex generates a readable description of the generated
lexical analyzer, written to \verb"lex-file" with new extension \verb".lst".
\item[\verb"-o"]
``Optimize:'' Lex optimizes DFA tables to produce a minimal DFA.
\end{description}
\subsection{Description}
TP Lex is a program generator that is used to generate the Turbo Pascal
source code for a lexical analyzer subroutine from the specification
of an input language by a regular expression grammar.
TP Lex parses the source grammar contained in \verb"lex-file" (with default
suffix \verb".l") and writes the constructed lexical analyzer subroutine
to the specified \verb"output-file" (with default suffix \verb".pas"); if no
output file is specified, output goes to \verb"lex-file" with new suffix
\verb".pas". If any errors are found during compilation, error messages are
written to the list file (\verb"lex-file" with new suffix \verb".lst").
The generated output file contains a lexical analyzer routine, \verb"yylex",
implemented as:
\begin{quote}\begin{verbatim}
function yylex : Integer;
\end{verbatim}\end{quote}
This routine has to be called by your main program to execute the lexical
analyzer. The return value of the \verb"yylex" routine usually denotes the
number of a token recognized by the lexical analyzer (see the \verb"return"
routine in the \verb"LexLib" unit). At end-of-file the \verb"yylex" routine
normally returns \verb"0".
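For illustration, a main program which simply reports the token numbers
returned by \verb"yylex" until end-of-file might be supplied in the auxiliary
procedures section of the Lex file as sketched below (the variable name is,
of course, arbitrary):
\begin{quote}\begin{verbatim}
var tok : Integer;
begin
  repeat
    tok := yylex;
    if tok <> 0 then writeln('token: ', tok)
  until tok = 0
end.
\end{verbatim}\end{quote}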
The code template for the \verb"yylex" routine may be found in the
\verb"yylex.cod" file. This file is needed by TP Lex when it constructs the
output file. It must be present either in the current directory or in the
directory from which TP Lex was executed (TP Lex searches these directories in
the indicated order). (NB: For the Linux/Free Pascal version, the code
template is searched for in a directory defined at compile time instead of the
execution path, usually \verb"/usr/lib/fpc/lexyacc".)
The TP Lex library (\verb"LexLib") unit is required by programs using
Lex-generated lexical analyzers; you will therefore have to put an appropriate
\verb"uses" clause into your program or unit that contains the lexical
analyzer routine. The \verb"LexLib" unit also provides various useful utility
routines; see the file \verb"lexlib.pas" for further information.
\subsection{Lex Source}
A TP Lex program consists of three sections separated with the \verb"%%"
delimiter:
\begin{quote}\begin{verbatim}
definitions
%%
rules
%%
auxiliary procedures
\end{verbatim}\end{quote}
All sections may be empty. The TP Lex language is line-oriented; definitions
and rules are separated by line breaks. There is no special notation for
comments, but (Turbo Pascal style) comments may be included as Turbo Pascal
fragments (see below).
The definitions section may contain the following elements:
\begin{itemize}
\item
regular definitions in the format:
\begin{quote}\begin{verbatim}
name substitution
\end{verbatim}\end{quote}
which serve to abbreviate common subexpressions. The \verb"{name}"
notation causes the corresponding substitution from the definitions
section to be inserted into a regular expression. The name must be
a legal identifier (letter followed by a sequence of letters and digits;
the underscore counts as a letter; upper- and lowercase are distinct).
Regular definitions must be non-recursive.
\item
start state definitions in the format:
\begin{quote}\begin{verbatim}
%start name ...
\end{verbatim}\end{quote}
which are used in specifying start conditions on rules (described
below). The \verb"%start" keyword may also be abbreviated as \verb"%s"
or \verb"%S".
\item
Turbo Pascal declarations enclosed between \verb"%{" and \verb"%}".
These will be inserted into the output file (at global scope). Also,
any line that does not look like a Lex definition (e.g., starts with
blank or tab) will be treated as Turbo Pascal code. (In particular,
this also allows you to include Turbo Pascal comments in your Lex
program.)
\end{itemize}
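As a brief sketch, a definitions section combining these elements might read
as follows (the names are merely illustrative):
\begin{quote}\begin{verbatim}
%{
var count : Integer; (* Turbo Pascal declarations *)
%}
digit   [0-9]
letter  [a-zA-Z_]
%start comment
\end{verbatim}\end{quote}
Here, \verb"{digit}" and \verb"{letter}" may then be used as abbreviations
in the regular expressions of the rules section.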
The rules section of a TP Lex program contains the actual specification of
the lexical analyzer routine. It may be thought of as a big \verb"CASE"
statement discriminating over the different patterns to be matched and listing the
corresponding statements (actions) to be executed. Each rule consists of a
regular expression describing the strings to be matched in the input, and a
corresponding action, a Turbo Pascal statement to be executed when the
expression matches. Expression and statement are delimited with whitespace
(blanks and/or tabs). Thus the format of a Lex grammar rule is:
\begin{quote}\begin{verbatim}
expression statement;
\end{verbatim}\end{quote}
Note that the action must be a single Turbo Pascal statement terminated
with a semicolon (use \verb"begin ... end" for compound statements). The
statement may span multiple lines if the successor lines are indented with
at least one blank or tab. The action may also be replaced by the \verb"|"
character, indicating that the action for this rule is the same as that for
the next one.
The TP Lex library unit provides various variables and routines which are
useful in the programming of actions. In particular, the \verb"yytext" string
variable holds the text of the matched string, and the \verb"yyleng" Byte
variable its length.
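For instance, a rule recognizing unsigned integer literals might be written
as follows (\verb"number" is an assumed token constant, not predefined by
TP Lex):
\begin{quote}\begin{verbatim}
[0-9]+    begin
            writeln('integer: ', yytext, ' (', yyleng, ' chars)');
            return(number);
          end;
\end{verbatim}\end{quote}
Note the indented continuation lines of the compound statement and the
terminating semicolon.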
Regular expressions are used to describe the strings to be matched in a
grammar rule. They are built from the usual constructs describing character
classes and sequences, and operators specifying repetitions and alternatives.
The precise format of regular expressions is described in the next section.
The rules section may also start with some Turbo Pascal declarations
(enclosed in \verb"%{ %}") which are treated as local declarations of the
actions routine.
Finally, the auxiliary procedures section may contain arbitrary Turbo
Pascal code (such as supporting routines or a main program) which is
simply tacked on to the end of the output file. The auxiliary procedures
section is optional.
\subsection{Regular Expressions}
Table \ref{tab1} summarizes the format of the regular expressions
recognized by TP Lex (also compare Aho, Sethi, Ullman 1986, fig.\ 3.48).
$c$ stands for a single character, $s$ for a string, $r$ for a regular
expression, and $n,m$ for nonnegative integers.
\begin{table*}\centering
\begin{tabular}{c|c|c}
\hline\hline
{\sc Expression}& {\sc Matches}& {\sc Example}\\
\hline
$c$& any non-operator character $c$& \verb"a"\\
\verb"\"$c$& character $c$ literally& \verb"\*"\\
\verb'"'$s$\verb'"'& string $s$ literally& \verb'"**"'\\
\verb"."& any character but newline& \verb"a.*b"\\
\verb"^"& beginning of line& \verb"^abc"\\
\verb"$"& end of line& \verb"abc$"\\
\verb"["$s$\verb"]"& any character in $s$& \verb"[abc]"\\
\verb"[^"$s$\verb"]"& any character not in $s$& \verb"[^abc]"\\
$r$\verb"*"& zero or more $r$'s& \verb"a*"\\
$r$\verb"+"& one or more $r$'s& \verb"a+"\\
$r$\verb"?"& zero or one $r$& \verb"a?"\\
$r$\verb"{"$m$\verb","$n$\verb"}"& $m$ to $n$ occurrences of $r$& \verb"a{1,5}"\\
$r$\verb"{"$m$\verb"}"& $m$ occurrences of $r$& \verb"a{5}"\\
$r_1r_2$& $r_1$ then $r_2$& \verb"ab"\\
$r_1$\verb"|"$r_2$& $r_1$ or $r_2$& \verb"a|b"\\
\verb"("$r$\verb")"& $r$& \verb"(a|b)"\\
$r_1$\verb"/"$r_2$& $r_1$ when followed by $r_2$& \verb"a/b"\\
\verb"<"$x$\verb">"$r$& $r$ when in start condition $x$& \verb"<x>abc"\\
\hline
\end{tabular}
\caption{Regular expressions.}
\label{tab1}
\end{table*}
The operators \verb"*", \verb"+", \verb"?" and \verb"{}" have highest
precedence, followed by concatenation. The \verb"|" operator has lowest
precedence. Parentheses \verb"()" may be used to group expressions and
override the default precedences. The \verb"<>" and \verb"/" operators may only
occur once in an expression.
The usual C-like escapes are recognized:
\begin{itemize}
\item \verb"\n" denotes newline
\item \verb"\r" denotes carriage return
\item \verb"\t" denotes tab
\item \verb"\b" denotes backspace
\item \verb"\f" denotes form feed
\item \verb"\"$nnn$ denotes character no.\ $nnn$ in octal base
\end{itemize}
You can also use the \verb"\" character to quote characters which would
otherwise be interpreted as operator symbols. In character classes, you may
use the \verb"-" character to denote ranges of characters. For instance,
\verb"[a-z]" denotes the class of all lowercase letters.
The expressions in a TP Lex program may be ambiguous, i.e.\ there may be inputs
which match more than one rule. In such a case, the lexical analyzer prefers
the longest match and, if it still has the choice between different rules,
it picks the first of these. If no rule matches, the lexical analyzer
executes a default action which consists of copying the input character
to the output unchanged. Thus, if the purpose of a lexical analyzer is
to translate some parts of the input, and leave the rest unchanged, you
only have to specify the patterns which have to be treated specially. If,
however, the lexical analyzer has to absorb its whole input, you will have
to provide rules that match everything. E.g., you might use the rules
\begin{quote}\begin{verbatim}
. |
\n ;
\end{verbatim}\end{quote}
which match ``any other character'' (and ignore it).
Sometimes certain patterns have to be analyzed differently depending on some
amount of context in which the pattern appears. In such a case the \verb"/"
operator is useful. For instance, the expression \verb"a/b" matches \verb"a",
but only if followed by \verb"b". Note that the \verb"b" does not belong to
the match; rather, the lexical analyzer, when matching an \verb"a", will look
ahead in the input to see whether it is followed by a \verb"b", before it
declares that it has matched an \verb"a". Such lookahead may be arbitrarily
complex (up to the size of the \verb"LexLib" input buffer). E.g., the pattern
\verb"a/.*b" matches an \verb"a" which is followed by a \verb"b" somewhere on
the same input line. TP Lex also has a means to specify left context which is
described in the next section.
\subsection{Start Conditions}
TP Lex provides some features which make it possible to handle left context.
The \verb"^" character at the beginning of a regular expression may be used
to denote the beginning of the line. More distant left context can be described
conveniently by using start conditions on rules.
Any rule which is prefixed with the \verb"<>" construct is only valid if the
lexical analyzer is in the denoted start state. For instance, the expression
\verb"<x>a" can only be matched if the lexical analyzer is in start state
\verb"x". You can have multiple start states in a rule; e.g., \verb"<x,y>a"
can be matched in start states \verb"x" or \verb"y".
Start states have to be declared in the definitions section by means of
one or more start state definitions (see above). The lexical analyzer enters
a start state through a call to the \verb"LexLib" routine \verb"start". E.g.,
you may write:
\begin{quote}\begin{verbatim}
%start x y
%%
<x>a start(y);
<y>b start(x);
%%
begin
start(x); if yylex=0 then ;
end.
\end{verbatim}\end{quote}
Upon initialization, the lexical analyzer is put into state \verb"x". It then
proceeds in state \verb"x" until it matches an \verb"a" which puts it into
state \verb"y". In state \verb"y" it may match a \verb"b" which puts it into
state \verb"x" again, etc.
Start conditions are useful when certain constructs have to be analyzed
differently depending on some left context (such as a special character
at the beginning of the line), and if multiple lexical analyzers have to
work in concert. If a rule is not prefixed with a start condition, it is
valid in all user-defined start states, as well as in the lexical analyzer's
default start state.
\subsection{Lex Library}
The TP Lex library (\verb"LexLib") unit provides various variables and
routines which are used by Lex-generated lexical analyzers and application
programs. It provides the input and output streams and other internal data
structures used by the lexical analyzer routine, and supplies some variables
and utility routines which may be used by actions and application programs.
Refer to the file \verb"lexlib.pas" for a closer description.
You can also modify the Lex library unit (and/or the code template in the
\verb"yylex.cod" file) to customize TP Lex to your target applications. E.g.,
you might wish to optimize the code of the lexical analyzer for some
special application, make the analyzer read from/write to memory instead
of files, etc.
\subsection{Implementation Restrictions}
Internal table sizes and the main memory available limit the complexity of
source grammars that TP Lex can handle. There is currently no possibility to
change internal table sizes (apart from modifying the sources of TP Lex
itself), but the maximum table sizes provided by TP Lex seem to be large
enough to handle most realistic applications. The actual table sizes depend on
the particular implementation (they are much larger than the defaults if TP
Lex has been compiled with one of the 32 bit compilers such as Delphi 2 or
Free Pascal), and are shown in the statistics printed by TP Lex when a
compilation is finished. The units given there are ``p'' (positions, i.e.
items in the position table used to construct the DFA), ``s'' (DFA states) and
``t'' (transitions of the generated DFA).
As implemented, the generated DFA table is stored as a typed array constant
which is inserted into the \verb"yylex.cod" code template. The transitions in
each state are stored in order. Of course it would have been more efficient to
generate a big \verb"CASE" statement instead, but I found that this may cause
problems with the encoding of large DFA tables because Turbo Pascal has
a quite rigid limit on the code size of individual procedures. I decided to
use a scheme in which transitions on different symbols to the same state are
merged into one single transition (specifying a character set and the
corresponding next state). This keeps the number of transitions in each state
quite small and still allows a fairly efficient access to the transition
table.
The TP Lex program has an option (\verb"-o") to optimize DFA tables. This
causes a minimal DFA to be generated, using the algorithm described in Aho,
Sethi, Ullman (1986). Although the absolute limit on the number of DFA states
that TP Lex can handle is at least 300, TP Lex poses an additional restriction
(100) on the number of states in the initial partition of the DFA optimization
algorithm. Thus, you may get a fatal \verb"integer set overflow" message when
using the \verb"-o" option even when TP Lex is able to generate an unoptimized
DFA. In such cases you will just have to be content with the unoptimized DFA.
(Hopefully, this will be fixed in a future version. Anyhow, using the merged
transitions scheme described above, TP Lex usually constructs unoptimized
DFAs which are not far from being optimal, and thus in most cases DFA
optimization won't have a great impact on DFA table sizes.)
\subsection{Differences from UNIX Lex}
Major differences between TP Lex and UNIX Lex are listed below.
\begin{itemize}
\item
TP Lex produces output code for Turbo Pascal, rather than for C.
\item
Character tables (\verb"%T") are not supported; neither are any
directives to determine internal table sizes (\verb"%p", \verb"%n",
etc.).
\item
Library routines are named differently from the UNIX version (e.g.,
the \verb"start" routine takes the place of the \verb"BEGIN" macro of
UNIX Lex), and, of course, all macros of UNIX Lex (\verb"ECHO",
\verb"REJECT", etc.) had to be implemented as procedures.
\item
The TP Lex library unit starts counting line numbers at 0, incrementing
the count {\em before\/} a line is read (in contrast, UNIX Lex
initializes \verb"yylineno" to 1 and increments it {\em after\/} the
line end has been read). This is motivated by the way in which TP Lex
maintains the current line, and will not affect your programs unless you
explicitly reset the \verb"yylineno" value (e.g., when opening a new
input file). In such a case you should set \verb"yylineno" to 0 rather
than 1.
\end{itemize}
\section{TP Yacc}
This section describes the TP Yacc compiler compiler.
\subsection{Usage}
\begin{quote}\begin{verbatim}
yacc [options] yacc-file[.y] [output-file[.pas]]
\end{verbatim}\end{quote}
\subsection{Options}
\begin{description}
\item[\verb"-v"]
``Verbose:'' TP Yacc generates a readable description of the generated
parser, written to \verb"yacc-file" with new extension \verb".lst".
\item[\verb"-d"]
``Debug:'' TP Yacc generates a parser with debugging output.
\end{description}
\subsection{Description}
TP Yacc is a program that lets you prepare parsers from the description
of input languages by BNF-like grammars. You simply specify the grammar
for your target language, augmented with the Turbo Pascal code necessary
to process the syntactic constructs, and TP Yacc translates your grammar
into the Turbo Pascal code for a corresponding parser subroutine named
\verb"yyparse".
TP Yacc parses the source grammar contained in \verb"yacc-file" (with default
suffix \verb".y") and writes the constructed parser subroutine to the
specified \verb"output-file" (with default suffix \verb".pas"); if no output
file is specified, output goes to \verb"yacc-file" with new suffix
\verb".pas". If any errors are found during compilation, error messages are
written to the list file (\verb"yacc-file" with new suffix \verb".lst").
The generated parser routine, \verb"yyparse", is declared as:
\begin{quote}\begin{verbatim}
function yyparse : Integer;
\end{verbatim}\end{quote}
This routine may be called by your main program to execute the parser.
The return value of the \verb"yyparse" routine denotes success or failure of
the parser (possible return values: 0 = success, 1 = unrecoverable syntax
error or parse stack overflow).
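Thus a minimal main program, which might be supplied, e.g., in the auxiliary
procedures section of the Yacc file, could check the result of the parse as
follows:
\begin{quote}\begin{verbatim}
begin
  if yyparse = 0 then
    writeln('parse succeeded')
  else
    writeln('parse failed')
end.
\end{verbatim}\end{quote}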
Similar to TP Lex, the code template for the \verb"yyparse" routine may be
found in the \verb"yyparse.cod" file. The rules for locating this file are
analogous to those of TP Lex (see Section {\em TP Lex\/}).
The TP Yacc library (\verb"YaccLib") unit is required by programs using
Yacc-generated parsers; you will therefore have to put an appropriate \verb"uses"
clause into your program or unit that contains the parser routine. The
\verb"YaccLib" unit also provides some routines which may be used to control
the actions of the parser. See the file \verb"yacclib.pas" for further
information.
\subsection{Yacc Source}
A TP Yacc program consists of three sections separated with the \verb"%%"
delimiter:
\begin{quote}\begin{verbatim}
definitions
%%
rules
%%
auxiliary procedures
\end{verbatim}\end{quote}
The TP Yacc language is free-format: whitespace (blanks, tabs and newlines)
is ignored, except if it serves as a delimiter. Comments have the C-like
format \verb"/* ... */". They are treated as whitespace. Grammar symbols are
denoted by identifiers which have the usual form (letter, including
underscore, followed by a sequence of letters and digits; upper- and
lowercase are distinct). The TP Yacc language also has some keywords which
always start with the \verb"%" character. Literals are denoted by characters
enclosed in single quotes. The usual C-like escapes are recognized:
\begin{itemize}
\item \verb"\n" denotes newline
\item \verb"\r" denotes carriage return
\item \verb"\t" denotes tab
\item \verb"\b" denotes backspace
\item \verb"\f" denotes form feed
\item \verb"\"$nnn$ denotes character no.\ $nnn$ in octal base
\end{itemize}
\subsection{Definitions}
The first section of a TP Yacc grammar serves to define the symbols used in
the grammar. It may contain the following types of definitions:
\begin{itemize}
\item
start symbol definition: A definition of the form
\begin{quote}\begin{verbatim}
%start symbol
\end{verbatim}\end{quote}
declares the start nonterminal of the grammar (if this definition is
omitted, TP Yacc assumes the left-hand side nonterminal of the first
grammar rule as the start symbol of the grammar).
\item
terminal definitions: Definitions of the form
\begin{quote}\begin{verbatim}
%token symbol ...
\end{verbatim}\end{quote}
are used to declare the terminal symbols (``tokens'') of the target
language. Any identifier not introduced in a \verb"%token" definition
will be treated as a nonterminal symbol.
As far as TP Yacc is concerned, tokens are atomic symbols which do not
have an inner structure. A lexical analyzer must be provided which
takes on the task of tokenizing the input stream and returning the
individual tokens and literals to the parser (see Section {\em Lexical
Analysis\/}).
\item
precedence definitions: Operator symbols (terminals) may be associated
with a precedence by means of a precedence definition which may have
one of the following forms
\begin{quote}\begin{verbatim}
%left symbol ...
%right symbol ...
%nonassoc symbol ...
\end{verbatim}\end{quote}
which are used to declare left-, right- and nonassociative operators,
respectively. Each precedence definition introduces a new precedence
level, lowest precedence first. E.g., you may write:
\begin{quote}\begin{verbatim}
%nonassoc '<' '>' '=' GEQ LEQ NEQ
/* relational operators */
%left '+' '-' OR
/* addition operators */
%left '*' '/' AND
/* multiplication operators */
%right NOT UMINUS
/* unary operators */
\end{verbatim}\end{quote}
A terminal identifier introduced in a precedence definition may, but
need not, appear in a \verb"%token" definition as well.
\item
type definitions: Any (terminal or nonterminal) grammar symbol may be
associated with a type identifier which is used in the processing of
semantic values. Type tags of the form \verb"<name>" may be used in
token and precedence definitions to declare the type of a terminal
symbol, e.g.:
\begin{quote}\begin{verbatim}
%token <Real> NUM
%left <AddOp> '+' '-'
\end{verbatim}\end{quote}
To declare the type of a nonterminal symbol, use a type definition of
the form:
\begin{quote}\begin{verbatim}
%type <name> symbol ...
\end{verbatim}\end{quote}
e.g.:
\begin{quote}\begin{verbatim}
%type <Real> expr
\end{verbatim}\end{quote}
In a \verb"%type" definition, you may also omit the nonterminals, i.e.
you may write:
\begin{quote}\begin{verbatim}
%type <name>
\end{verbatim}\end{quote}
This is useful when a given type is only used with type casts (see
Section {\em Grammar Rules and Actions\/}), and is not associated with
a specific nonterminal.
\item
Turbo Pascal declarations: You may also include arbitrary Turbo Pascal
code in the definitions section, enclosed in \verb"%{ %}". This code
will be inserted as global declarations into the output file, unchanged.
\end{itemize}
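To summarize, the following sketch shows a definitions section combining the
various kinds of definitions discussed above (it is not taken from a real
grammar; all identifiers are made up for illustration):
\begin{quote}\begin{verbatim}
%{
var LineNo : Integer;   (* global Turbo Pascal declarations *)
%}
%start program          /* parse starts at nonterminal program */
%token IDENT            /* symbolic terminals */
%token <Real> NUM       /* NUM carries a Real semantic value */
%left '+' '-'           /* precedence levels, lowest first */
%left '*' '/'
%type <Real> expr       /* nonterminal expr has type Real */
\end{verbatim}\end{quote}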
\subsection{Grammar Rules and Actions}
The second part of a TP Yacc grammar contains the grammar rules for the
target language. Grammar rules have the format
\begin{quote}\begin{verbatim}
name : symbol ... ;
\end{verbatim}\end{quote}
The left-hand side of a rule must be an identifier (which denotes a
nonterminal symbol). The right-hand side may be an arbitrary (possibly
empty) sequence of nonterminal and terminal symbols (including literals
enclosed in single quotes). The terminating semicolon may also be omitted.
Different rules for the same left-hand side symbol may be written using
the \verb"|" character to separate the different alternatives:
\begin{quote}\begin{verbatim}
name : symbol ...
| symbol ...
...
;
\end{verbatim}\end{quote}
For instance, to specify a simple grammar for arithmetic expressions, you
may write:
\begin{quote}\begin{verbatim}
%left '+' '-'
%left '*' '/'
%token NUM
%%
expr : expr '+' expr
| expr '-' expr
| expr '*' expr
| expr '/' expr
| '(' expr ')'
| NUM
;
\end{verbatim}\end{quote}
(The \verb"%left" definitions at the beginning of the grammar are needed to
specify the precedence and associativity of the operator symbols. This will be
discussed in more detail in Section {\em Ambiguous Grammars\/}.)
Grammar rules may contain actions -- Turbo Pascal statements enclosed in
\verb"{ }" -- to be executed as the corresponding rules are recognized.
Furthermore, rules may return values, and access values returned by other
rules. These ``semantic'' values are written as \verb"$$" (value of the
left-hand side nonterminal) and \verb"$i" (value of the $i$th right-hand
side symbol). They are kept on a special value stack which is maintained
automatically by the parser.
Values associated with terminal symbols must be set by the lexical analyzer
(more about this in Section {\em Lexical Analysis\/}). Actions of the form
\verb"$$ := $1" can frequently be omitted, since it is the default action
assumed by TP Yacc for any rule that does not have an explicit action.
By default, the semantic value type provided by Yacc is \verb"Integer". You
can also put a declaration like
\begin{quote}\begin{verbatim}
%{
type YYSType = Real;
%}
\end{verbatim}\end{quote}
into the definitions section of your Yacc grammar to change the default value
type. However, if you have different value types, the preferred method is to
use type definitions as discussed in Section {\em Definitions\/}. When such
type definitions are given, TP Yacc handles all the necessary details of the
\verb"YYSType" definition and also provides a fair amount of type checking
which makes it easier to find type errors in the grammar.
For instance, we may declare the symbols \verb"NUM" and \verb"expr" in the
example above to be of type \verb"Real", and then use these values to
evaluate an expression as it is parsed.
\begin{quote}\begin{verbatim}
%left '+' '-'
%left '*' '/'
%token <Real> NUM
%type <Real> expr
%%
expr : expr '+' expr { $$ := $1+$3; }
| expr '-' expr { $$ := $1-$3; }
| expr '*' expr { $$ := $1*$3; }
| expr '/' expr { $$ := $1/$3; }
| '(' expr ')' { $$ := $2; }
| NUM
;
\end{verbatim}\end{quote}
(Note that we omitted the action of the last rule. The ``copy action''
\verb"$$ := $1" required by this rule is automatically added by TP Yacc.)
Actions may not only appear at the end, but also in the middle of a rule,
which is useful for performing some processing before the rule is fully parsed.
Such actions inside a rule are treated as special nonterminals which are
associated with an empty right-hand side. Thus, a rule like
\begin{quote}\begin{verbatim}
x : y { action; } z
\end{verbatim}\end{quote}
will be treated as:
\begin{quote}\begin{verbatim}
x : y $act z
$act : { action; }
\end{verbatim}\end{quote}
Actions inside a rule may also access values to the left of the action,
and may return values by assigning to the \verb"$$" value. The value returned
by such an action can then be accessed by other actions using the usual
\verb"$i" notation. E.g., we may write:
\begin{quote}\begin{verbatim}
x : y { $$ := 2*$1; } z { $$ := $2+$3; }
\end{verbatim}\end{quote}
which has the effect of setting the value of \verb"x" to
\begin{quote}\begin{verbatim}
2*(the value of y)+(the value of z).
\end{verbatim}\end{quote}
Sometimes it is desirable to access values in enclosing rules. This can be
done using the notation \verb"$i" with $i\leq 0$. \verb"$0" refers to the
first value ``to the left'' of the current rule, \verb"$-1" to the second,
and so on. Note that in this case the referenced value depends on the actual
contents of the parse stack, so you have to make sure that the requested
values are always where you expect them.
There are some situations in which TP Yacc cannot easily determine the
type of values (when a typed parser is used). This is true, in particular,
for values in enclosing rules and for the \verb"$$" value in an action inside
a rule. In such cases you may use a type cast to explicitly specify the type
of a value. The format for such type casts is \verb"$<name>$" (for left-hand
side values) and \verb"$<name>i" (for right-hand side values) where
\verb"name" is a type identifier (which must occur in a \verb"%token",
precedence or \verb"%type" definition).
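For instance, the value of an action inside a rule may be typed as in the
following contrived sketch (the type identifier \verb"Real" must occur in a
\verb"%token", precedence or \verb"%type" definition, as it does here):
\begin{quote}\begin{verbatim}
%token <Real> NUM
%type <Real> sum
%%
sum : NUM { $<Real>$ := $1; } NUM
        { $$ := $<Real>2 + $3; }
    ;
\end{verbatim}\end{quote}
Here the action inside the rule assigns its value through the cast
\verb"$<Real>$", and the final action retrieves that value as \verb"$<Real>2"
(the action counts as the second right-hand side symbol).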
\subsection{Auxiliary Procedures}
The third section of a TP Yacc program is optional. If it is present, it
may contain any Turbo Pascal code (such as supporting routines or a main
program) which is tacked on to the end of the output file.
\subsection{Lexical Analysis}
For any TP Yacc-generated parser, the programmer must supply a lexical
analyzer routine named \verb"yylex" which performs the lexical analysis for
the parser. This routine must be declared as
\begin{quote}\begin{verbatim}
function yylex : Integer;
\end{verbatim}\end{quote}
The \verb"yylex" routine may either be prepared by hand, or by using the
lexical analyzer generator TP Lex (see Section {\em TP Lex\/}).
The lexical analyzer must be included in your main program after the
parser subroutine (the \verb"yyparse" code template includes a forward
definition of the \verb"yylex" routine such that the parser can access the
lexical analyzer). For instance, you may put the lexical analyzer
routine into the auxiliary procedures section of your TP Yacc grammar,
either directly, or by using the Turbo Pascal include directive
(\verb"$I").
The parser repeatedly calls the \verb"yylex" routine to tokenize the input
stream and obtain the individual lexical items in the input. For any
literal character, the \verb"yylex" routine has to return the corresponding
character code. For the other, symbolic, terminals of the input language,
the lexical analyzer must return corresponding integer codes. These are
assigned automatically by TP Yacc in the order in which token definitions
appear in the definitions section of the source grammar. The lexical
analyzer can access these values through corresponding integer constants
which are declared by TP Yacc in the output file.
For instance, if
\begin{quote}\begin{verbatim}
%token NUM
\end{verbatim}\end{quote}
is the first definition in the Yacc grammar, then TP Yacc will create
a corresponding constant declaration
\begin{quote}\begin{verbatim}
const NUM = 257;
\end{verbatim}\end{quote}
in the output file (TP Yacc automatically assigns symbolic token numbers
starting at 257; codes 1 through 255 are reserved for character literals, 0 denotes
end-of-file, and 256 is reserved for the special error token which will be
discussed in Section {\em Error Handling\/}). This definition may then be
used, e.g., in a corresponding TP Lex program as follows:
\begin{quote}\begin{verbatim}
[0-9]+ return(NUM);
\end{verbatim}\end{quote}
You can also explicitly assign token numbers in the grammar. For this
purpose, the first occurrence of a token identifier in the definitions
section may be followed by an unsigned integer. E.g. you may write:
\begin{quote}\begin{verbatim}
%token NUM 299
\end{verbatim}\end{quote}
Besides the return value of \verb"yylex", the lexical analyzer routine may
also return an additional semantic value for the recognized token. This value
is assigned to a variable named \verb"yylval" and may then be accessed in
actions through the \verb"$i" notation (see above, Section {\em Grammar
Rules and Actions\/}). The \verb"yylval" variable is of type \verb"YYSType"
(the semantic value type, \verb"Integer" by default); its declaration may be
found in the \verb"yyparse.cod" file.
For instance, to assign an \verb"Integer" value to a \verb"NUM" token in the
above example, we may write:
\begin{quote}\begin{verbatim}
[0-9]+ begin
val(yytext, yylval, code);
return(NUM);
end;
\end{verbatim}\end{quote}
This assigns \verb"yylval" the value of the \verb"NUM" token (using the Turbo
Pascal standard procedure \verb"val").
If a parser uses tokens of different types (via a \verb"%token <name>"
definition), then the \verb"yylval" variable will not be of type
\verb"Integer", but instead of a corresponding variant record type which is
capable of holding all the different value types declared in the TP Yacc
grammar. In this case, the lexical analyzer must assign a semantic value to
the corresponding record component which is named \verb"yy"{\em name\/}
(where {\em name\/} stands for the corresponding type identifier).
E.g., if token \verb"NUM" is declared \verb"Real":
\begin{quote}\begin{verbatim}
%token <Real> NUM
\end{verbatim}\end{quote}
then the value for token \verb"NUM" must be assigned to \verb"yylval.yyReal".
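In a TP Lex program this might be done as follows (a sketch along the lines
of the earlier \verb"NUM" example; \verb"code" again denotes a local
\verb"Integer" variable for \verb"val"'s error position):
\begin{quote}\begin{verbatim}
[0-9]+("."[0-9]+)?   begin
                       val(yytext, yylval.yyReal, code);
                       return(NUM);
                     end;
\end{verbatim}\end{quote}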
\subsection{How The Parser Works}
TP Yacc uses the LALR(1) technique, based on the LR parsing method of
Donald E.\ Knuth as refined by F.\ DeRemer, to construct a simple,
efficient, non-backtracking bottom-up
parser for the source grammar. The LALR parsing technique is described
in detail in Aho/Sethi/Ullman (1986). It is quite instructive to take a
look at the parser description TP Yacc generates from a small sample
grammar, to get an idea of how the LALR parsing algorithm works. We
consider the following simplified version of the arithmetic expression
grammar:
\begin{quote}\begin{verbatim}
%token NUM
%left '+'
%left '*'
%%
expr : expr '+' expr
| expr '*' expr
| '(' expr ')'
| NUM
;
\end{verbatim}\end{quote}
When run with the \verb"-v" option on the above grammar, TP Yacc generates
the parser description listed below.
\begin{quote}\begin{verbatim}
state 0:
$accept : _ expr $end
'(' shift 2
NUM shift 3
. error
expr goto 1
state 1:
$accept : expr _ $end
expr : expr _ '+' expr
expr : expr _ '*' expr
$end accept
'*' shift 4
'+' shift 5
. error
state 2:
expr : '(' _ expr ')'
'(' shift 2
NUM shift 3
. error
expr goto 6
state 3:
expr : NUM _ (4)
. reduce 4
state 4:
expr : expr '*' _ expr
'(' shift 2
NUM shift 3
. error
expr goto 7
state 5:
expr : expr '+' _ expr
'(' shift 2
NUM shift 3
. error
expr goto 8
state 6:
expr : '(' expr _ ')'
expr : expr _ '+' expr
expr : expr _ '*' expr
')' shift 9
'*' shift 4
'+' shift 5
. error
state 7:
expr : expr '*' expr _ (2)
expr : expr _ '+' expr
expr : expr _ '*' expr
. reduce 2
state 8:
expr : expr '+' expr _ (1)
expr : expr _ '+' expr
expr : expr _ '*' expr
'*' shift 4
$end reduce 1
')' reduce 1
'+' reduce 1
. error
state 9:
expr : '(' expr ')' _ (3)
. reduce 3
\end{verbatim}\end{quote}
Each state of the parser corresponds to a certain prefix of the input
which has already been seen. The parser description lists the grammar
rules which are parsed in each state, and indicates the portion of each
rule which has already been parsed by an underscore. In state 0, the
start state of the parser, the parsed rule is
\begin{quote}\begin{verbatim}
$accept : expr $end
\end{verbatim}\end{quote}
This is not an actual grammar rule, but a starting rule automatically
added by TP Yacc. In general, it has the format
\begin{quote}\begin{verbatim}
$accept : X $end
\end{verbatim}\end{quote}
where \verb"X" is the start nonterminal of the grammar, and \verb"$end" is
a pseudo token denoting end-of-input (the \verb"$end" symbol is used by the
parser to determine when it has successfully parsed the input).
The description of the start rule in state 0,
\begin{quote}\begin{verbatim}
$accept : _ expr $end
\end{verbatim}\end{quote}
with the underscore positioned before the \verb"expr" symbol, indicates that
we are at the beginning of the parse and are ready to parse an expression
(nonterminal \verb"expr").
The parser maintains a stack to keep track of states visited during the
parse. There are two basic kinds of actions in each state: {\em shift\/},
which reads an input symbol and pushes the corresponding next state on top of
the stack, and {\em reduce\/} which pops a number of states from the stack
(corresponding to the number of right-hand side symbols of the rule used
in the reduction) and consults the {\em goto\/} entries of the uncovered
state to find the transition corresponding to the left-hand side symbol of the
reduced rule.
In each step of the parse, the parser is in a given state (the state on
top of its stack) and may consult the current {\em lookahead symbol\/}, the
next symbol in the input, to determine the parse action -- shift or reduce --
to perform. The parser terminates as soon as it reaches state 1 and reads
in the endmarker, indicated by the {\em accept\/} action on \verb"$end" in
state 1.
Sometimes the parser may also carry out an action without inspecting the
current lookahead token. This is the case, e.g., in state 3 where the
only action is reduction by rule 4:
\begin{quote}\begin{verbatim}
. reduce 4
\end{verbatim}\end{quote}
The default action in a state can also be {\em error\/} indicating that any
other input represents a syntax error. (In case of such an error the
parser will start syntactic error recovery, as described in Section
{\em Error Handling\/}.)
Now let us see how the parser responds to a given input. We consider the
input string \verb"2+5*3" which is presented to the parser as the token
sequence:
\begin{quote}\begin{verbatim}
NUM + NUM * NUM
\end{verbatim}\end{quote}
Table \ref{tab2} traces the corresponding actions of the parser. We also
show the current state in each move, and the remaining states on the stack.
\begin{table*}\centering
\begin{tabular}{l|l|l|p{8cm}}
\hline\hline
{\sc State}& {\sc Stack}& {\sc Lookahead}& {\sc Action}\\
\hline
0 & & \verb"NUM" & shift state 3\\
3 & 0 & & reduce rule 4 (pop 1 state, uncovering state
0, then goto state 1 on symbol \verb"expr")\\
1 & 0 & \verb"+" & shift state 5\\
5 & 1 0 & \verb"NUM" & shift state 3\\
3 & 5 1 0 & & reduce rule 4 (pop 1 state, uncovering state
5, then goto state 8 on symbol \verb"expr")\\
8 & 5 1 0 & \verb"*" & shift state 4\\
4 & 8 5 1 0 & \verb"NUM" & shift state 3\\
3 & 4 8 5 1 0 & & reduce rule 4 (pop 1 state, uncovering state
4, then goto state 7 on symbol \verb"expr")\\
7 & 4 8 5 1 0 & & reduce rule 2 (pop 3 states, uncovering state
5, then goto state 8 on symbol \verb"expr")\\
8 & 5 1 0 & \verb"$end" & reduce rule 1 (pop 3 states, uncovering state
0, then goto state 1 on symbol \verb"expr")\\
1 & 0 & \verb"$end" & accept\\
\hline
\end{tabular}
\caption{Parse of \protect\verb"NUM + NUM * NUM".}
\label{tab2}
\end{table*}
It is also instructive to see how the parser responds to illegal inputs.
E.g., you may try to figure out what the parser does when confronted with:
\begin{quote}\begin{verbatim}
NUM + )
\end{verbatim}\end{quote}
or:
\begin{quote}\begin{verbatim}
( NUM * NUM
\end{verbatim}\end{quote}
You will find that the parser, sooner or later, will always run into an
error action when confronted with erroneous inputs. An LALR parser will
never shift an invalid symbol and thus will always detect syntax errors as
early as possible during a left-to-right scan of the input.
TP Yacc provides a debugging option (\verb"-d") that may be used to trace
the actions performed by the parser. When a grammar is compiled with the
\verb"-d" option, the generated parser will print out the actions as it
parses its input.
\subsection{Ambiguous Grammars}
There are situations in which TP Yacc will not produce a valid parser for
a given input language. LALR(1) parsers are restricted to one-symbol
lookahead on which they have to base their parsing decisions. If a
grammar is ambiguous, or cannot be parsed unambiguously using one-symbol
lookahead, TP Yacc will generate parsing conflicts when constructing the
parse table. There are two types of such conflicts: {\em shift/reduce
conflicts\/} (when there is both a shift and a reduce action for a given
input symbol in a given state), and {\em reduce/reduce\/} conflicts (if
there is more than one reduce action for a given input symbol in a given
state). Note that there never will be a shift/shift conflict.
When a grammar generates parsing conflicts, TP Yacc prints out the number
of shift/reduce and reduce/reduce conflicts it encountered when constructing
the parse table. However, TP Yacc will still generate the output code for the
parser. To resolve parsing conflicts, TP Yacc uses the following built-in
disambiguating rules:
\begin{itemize}
\item
in a shift/reduce conflict, TP Yacc chooses the shift action.
\item
in a reduce/reduce conflict, TP Yacc chooses reduction of the first
grammar rule.
\end{itemize}
The shift/reduce disambiguating rule correctly resolves a type of
ambiguity known as the ``dangling-else ambiguity'' which arises in the
syntax of conditional statements of many programming languages (as in
Pascal):
\begin{quote}\begin{verbatim}
%token IF THEN ELSE
%%
stmt : IF expr THEN stmt
| IF expr THEN stmt ELSE stmt
;
\end{verbatim}\end{quote}
This grammar is ambiguous, because a nested construct like
\begin{quote}\begin{verbatim}
IF expr-1 THEN IF expr-2 THEN stmt-1
ELSE stmt-2
\end{verbatim}\end{quote}
can be parsed two ways, either as:
\begin{quote}\begin{verbatim}
IF expr-1 THEN ( IF expr-2 THEN stmt-1
ELSE stmt-2 )
\end{verbatim}\end{quote}
or as:
\begin{quote}\begin{verbatim}
IF expr-1 THEN ( IF expr-2 THEN stmt-1 )
ELSE stmt-2
\end{verbatim}\end{quote}
The first interpretation makes an \verb"ELSE" belong to the last unmatched
\verb"IF" which also is the interpretation chosen in most programming
languages. This is also the way that a TP Yacc-generated parser will parse
the construct since the shift/reduce disambiguating rule has the effect of
neglecting the reduction of \verb"IF expr-2 THEN stmt-1"; instead, the parser
will shift the \verb"ELSE" symbol which eventually leads to the reduction of
\verb"IF expr-2 THEN stmt-1 ELSE stmt-2".
The reduce/reduce disambiguating rule is used to resolve conflicts that
arise when there is more than one grammar rule matching a given construct.
Such ambiguities are often caused by ``special case constructs'' which may be
given priority by simply listing the more specific rules ahead of the more
general ones.
For instance, the following is an excerpt from the grammar describing the
input language of the UNIX equation formatter EQN:
\begin{quote}\begin{verbatim}
%right SUB SUP
%%
expr : expr SUB expr SUP expr
| expr SUB expr
| expr SUP expr
;
\end{verbatim}\end{quote}
Here, the \verb"SUB" and \verb"SUP" operator symbols denote sub- and
superscript, respectively. The rationale behind this example is that
an expression involving both sub- and superscript is often set differently
from a subscripted expression that is then superscripted (compare $x_i^n$ to ${x_i}^n$).
This special case is therefore caught by the first rule in the above example
which causes a reduce/reduce conflict with rule 3 in expressions like
\verb"expr-1 SUB expr-2 SUP expr-3". The conflict is resolved in favour of
the first rule.
In both cases discussed above, the ambiguities could also be eliminated
by rewriting the grammar accordingly (although this yields more complicated
and less readable grammars). This may not always be the case. Often
ambiguities are also caused by design errors in the grammar. Hence, if
TP Yacc reports any parsing conflicts when constructing the parser, you
should use the \verb"-v" option to generate the parser description
(\verb".lst" file) and check whether TP Yacc resolved the conflicts correctly.
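As an example of such a rewrite, the dangling-else ambiguity discussed above
can be eliminated by distinguishing between matched and unmatched statements
(a standard construction; \verb"other_stmt" is a placeholder for the remaining
statement forms):
\begin{quote}\begin{verbatim}
stmt      : matched
          | unmatched
          ;
matched   : IF expr THEN matched ELSE matched
          | other_stmt
          ;
unmatched : IF expr THEN stmt
          | IF expr THEN matched ELSE unmatched
          ;
\end{verbatim}\end{quote}
This grammar forces each \verb"ELSE" to pair with the nearest unmatched
\verb"IF", at the cost of two extra nonterminals.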
There is one type of syntactic construct for which one often deliberately
uses an ambiguous grammar as a more concise representation for a language
that could also be specified unambiguously: the syntax of expressions.
For instance, the following is an unambiguous grammar for simple arithmetic
expressions:
\begin{quote}\begin{verbatim}
%token NUM
%%
expr : term
| expr '+' term
;
term : factor
| term '*' factor
;
factor : '(' expr ')'
| NUM
;
\end{verbatim}\end{quote}
You may check yourself that this grammar gives \verb"*" a higher precedence
than \verb"+" and makes both operators left-associative. The same effect can
be achieved with the following ambiguous grammar using precedence definitions:
\begin{quote}\begin{verbatim}
%token NUM
%left '+'
%left '*'
%%
expr : expr '+' expr
| expr '*' expr
| '(' expr ')'
| NUM
;
\end{verbatim}\end{quote}
Without the precedence definitions, this is an ambiguous grammar causing
a number of shift/reduce conflicts. The precedence definitions are used
to correctly resolve these conflicts (conflicts resolved using precedence
will not be reported by TP Yacc).
Each precedence definition introduces a new precedence level (lowest
precedence first) and specifies whether the corresponding operators
should be left-, right- or nonassociative (nonassociative operators
cannot be combined at all; example: relational operators in Pascal).
TP Yacc uses precedence information to resolve shift/reduce conflicts as
follows. Precedences are associated with each terminal occurring in a
precedence definition. Furthermore, each grammar rule is given the
precedence of its rightmost terminal (this default choice can be
overridden using a \verb"%prec" tag; see below). To resolve a shift/reduce
conflict using precedence, both the symbol and the rule involved must
have been assigned precedences. TP Yacc then chooses the parse action
as follows:
\begin{itemize}
\item
If the symbol has higher precedence than the rule: shift.
\item
If the rule has higher precedence than the symbol: reduce.
\item
If symbol and rule have the same precedence, the associativity of the
symbol determines the parse action: if the symbol is left-associative:
reduce; if the symbol is right-associative: shift; if the symbol is
non-associative: error.
\end{itemize}
To give you an idea of how this works, let us consider our ambiguous
arithmetic expression grammar (without precedences):
\begin{quote}\begin{verbatim}
%token NUM
%%
expr : expr '+' expr
| expr '*' expr
| '(' expr ')'
| NUM
;
\end{verbatim}\end{quote}
This grammar generates four shift/reduce conflicts. The description
of state 8 reads as follows:
\begin{quote}\begin{verbatim}
state 8:
*** conflicts:
shift 4, reduce 1 on '*'
shift 5, reduce 1 on '+'
expr : expr '+' expr _ (1)
expr : expr _ '+' expr
expr : expr _ '*' expr
'*' shift 4
'+' shift 5
$end reduce 1
')' reduce 1
. error
\end{verbatim}\end{quote}
In this state, we have successfully parsed a \verb"+" expression (rule 1).
When the next symbol is \verb"+" or \verb"*", we have the choice between the
reduction and shifting the symbol. Using the default shift/reduce
disambiguating rule, TP Yacc has resolved these conflicts in favour of shift.
Now let us assume the above precedence definition:
\begin{quote}\begin{verbatim}
%left '+'
%left '*'
\end{verbatim}\end{quote}
which gives \verb"*" higher precedence than \verb"+" and makes both operators
left-associative. The rightmost terminal in rule 1 is \verb"+". Hence, given
these precedence definitions, the first conflict will be resolved in favour
of shift (\verb"*" has higher precedence than \verb"+"), while the second one
is resolved in favour of reduce (\verb"+" is left-associative).
Similar conflicts arise in state 7:
\begin{quote}\begin{verbatim}
state 7:
*** conflicts:
shift 4, reduce 2 on '*'
shift 5, reduce 2 on '+'
expr : expr '*' expr _ (2)
expr : expr _ '+' expr
expr : expr _ '*' expr
'*' shift 4
'+' shift 5
$end reduce 2
')' reduce 2
. error
\end{verbatim}\end{quote}
Here, we have successfully parsed a \verb"*" expression which may be followed
by another \verb"+" or \verb"*" operator. Since \verb"*" is left-associative
and has higher precedence than \verb"+", both conflicts will be resolved in
favour of reduce.
Of course, you can also have different operators on the same precedence
level. For instance, consider the following extended version of the
arithmetic expression grammar:
\begin{quote}\begin{verbatim}
%token NUM
%left '+' '-'
%left '*' '/'
%%
expr : expr '+' expr
| expr '-' expr
| expr '*' expr
| expr '/' expr
| '(' expr ')'
| NUM
;
\end{verbatim}\end{quote}
This puts all ``addition'' operators on the first and all ``multiplication''
operators on the second precedence level. All operators are left-associative;
for instance, \verb"5+3-2" will be parsed as \verb"(5+3)-2".
By default, TP Yacc assigns each rule the precedence of its rightmost
terminal. This is a sensible decision in most cases. Occasionally, it
may be necessary to override this default choice and explicitly assign
a precedence to a rule. This can be done by putting a precedence tag
of the form
\begin{quote}\begin{verbatim}
%prec symbol
\end{verbatim}\end{quote}
at the end of the corresponding rule which gives the rule the precedence
of the specified symbol. For instance, to extend the expression grammar
with a unary minus operator, giving it highest precedence, you may write:
\begin{quote}\begin{verbatim}
%token NUM
%left '+' '-'
%left '*' '/'
%right UMINUS
%%
expr : expr '+' expr
| expr '-' expr
| expr '*' expr
| expr '/' expr
| '-' expr %prec UMINUS
| '(' expr ')'
| NUM
;
\end{verbatim}\end{quote}
Note the use of the \verb"UMINUS" token which is not an actual input symbol
but whose sole purpose is to give unary minus its proper precedence. If
we omitted the precedence tag, both unary and binary minus would have the
same precedence because they are represented by the same input symbol.
\subsection{Error Handling}
Syntactic error handling is a difficult area in the design of user-friendly
parsers. Usually, you will not want the parser to give up at the
first occurrence of an erroneous input symbol. Instead, the parser should
recover from a syntax error, that is, it should try to find a place in the
input where it can resume the parse.
TP Yacc provides a general mechanism to implement parsers with error
recovery. A special predefined \verb"error" token may be used in grammar rules
to indicate positions where syntax errors might occur. When the parser runs
into an error action (i.e., reads an erroneous input symbol) it prints out
an error message and starts error recovery by popping its stack until it
uncovers a state in which there is a shift action on the \verb"error" token.
If there is no such state, the parser terminates with return value 1,
indicating an unrecoverable syntax error. If there is such a state, the
parser takes the shift on the \verb"error" token (pretending it has seen
an imaginary \verb"error" token in the input), and resumes parsing in a
special ``error mode.''
While in error mode, the parser quietly skips symbols until it can again
perform a legal shift action. To prevent a cascade of error messages, the
parser returns to its normal mode of operation only after it has seen
and shifted three legal input symbols. Any additional error found after
the first shifted symbol restarts error recovery, but no error message
is printed. The TP Yacc library routine \verb"yyerrok" may be used to reset
the parser to its normal mode of operation explicitly.

For a simple example, consider the rule
\begin{quote}\begin{verbatim}
stmt : error ';' { yyerrok; }
\end{verbatim}\end{quote}
and assume a syntax error occurs while a statement (nonterminal \verb"stmt")
is parsed. The parser prints an error message, then pops its stack until it
can shift the token \verb"error" of the error rule. Proceeding in error mode,
it will skip symbols until it finds a semicolon, then reduces by the error
rule. The call to \verb"yyerrok" tells the parser that we have recovered from
the error and that it should proceed with the normal parse. This kind of
``panic mode'' error recovery scheme works well when statements are always
terminated with a semicolon. The parser simply skips the ``bad'' statement
and then resumes the parse.
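For instance, the error rule might be embedded in a statement list as
follows (a sketch; \verb"stmt_list" and the assignment alternative are made
up for illustration):
\begin{quote}\begin{verbatim}
stmt_list : stmt
          | stmt_list stmt
          ;
stmt      : ID ':=' expr ';'
          | error ';'          { yyerrok; }
          ;
\end{verbatim}\end{quote}
When a statement is malformed, the parser pops back to a state in which a
\verb"stmt" may start, skips ahead to the next semicolon, reduces by the
error rule and then continues with the remaining statements; only the bad
statement is lost.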
Implementing a good error recovery scheme can be a difficult task; see
Aho/Sethi/Ullman (1986) for a more comprehensive treatment of this topic.
Schreiner and Friedman have developed a systematic technique to implement
error recovery with Yacc which I found quite useful (I used it myself
to implement error recovery in the TP Yacc parser); see Schreiner/Friedman
(1985).
\subsection{Yacc Library}
The TP Yacc library (\verb"YaccLib") unit provides some global declarations
used by the parser routine \verb"yyparse", and some variables and utility
routines which may be used to control the actions of the parser and to
implement error recovery. See the file \verb"yacclib.pas" for a description
of these variables and routines.

You can also modify the Yacc library unit (and/or the code template in the
\verb"yyparse.cod" file) to customize TP Yacc to your target applications.
\subsection{Other Features}
TP Yacc supports all the additional language elements listed as ``Old Features
Supported But not Encouraged'' in the UNIX manual, which are provided for
backward compatibility with older versions of (UNIX) Yacc:
\begin{itemize}
\item
literals delimited by double quotes.
\item
multiple-character literals. Note that these are not treated as
character sequences but represent single tokens which are given a
symbolic integer code just like any other token identifier. However,
they will not be declared in the output file, so it is up to you to
ensure that the lexical analyzer returns the correct codes for these
symbols. E.g., you might explicitly assign token numbers by using a
definition like
\begin{quote}\begin{verbatim}
%token ':=' 257
\end{verbatim}\end{quote}
at the beginning of the Yacc grammar.
\item
\verb"\" may be used instead of \verb"%", i.e. \verb"\\" means
\verb"%%", \verb"\left" is the same as \verb"%left", etc.
\item
other synonyms:
\begin{itemize}
\item \verb"%<" for \verb"%left"
\item \verb"%>" for \verb"%right"
\item \verb"%binary" or \verb"%2" for \verb"%nonassoc"
\item \verb"%term" or \verb"%0" for \verb"%token"
\item \verb"%=" for \verb"%prec"
\end{itemize}
\item
actions may also be written as \verb"= { ... }" or
\verb"= single-statement;"
\item
Turbo Pascal declarations (\verb"%{ ... %}") may be put at the
beginning of the rules section. They will be treated as local
declarations of the actions routine.
\end{itemize}
\subsection{Implementation Restrictions}
As with TP Lex, internal table sizes and the main memory available limit the
complexity of source grammars that TP Yacc can handle. However, the maximum
table sizes provided by TP Yacc are large enough to handle quite complex
grammars (such as the Pascal grammar in the TP Yacc distribution). The actual
table sizes are shown in the statistics printed by TP Yacc when a compilation
is finished. The given figures are \verb"s" (states), \verb"i" (LR(0) kernel
items), \verb"t" (shift and goto transitions) and \verb"r" (reductions).

The default stack size of the generated parsers is \verb"yymaxdepth = 1024",
as declared in the TP Yacc library unit. This should be sufficient for any
average application, but you can change the stack size by including a
corresponding declaration in the definitions part of the Yacc grammar
(or change the value in the \verb"YaccLib" unit). Note that right-recursive
grammar rules may increase stack space requirements, so it is a good
idea to use left-recursive rules wherever possible.
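To see why, compare the two ways of writing a comma-separated list (a
sketch; the symbols \verb"list" and \verb"item" are made up for
illustration):
\begin{quote}\begin{verbatim}
list : item                /* left-recursive: each item can be
     | list ',' item          reduced as soon as it is read */
     ;

list : item                /* right-recursive: all items pile up
     | item ',' list          on the stack before the first
     ;                        reduction */
\end{verbatim}\end{quote}
With the left-recursive version the stack depth stays constant, whereas
the right-recursive version needs stack space proportional to the length
of the list.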
\subsection{Differences from UNIX Yacc}
Major differences between TP Yacc and UNIX Yacc are listed below.
\begin{itemize}
\item
TP Yacc produces output code for Turbo Pascal, rather than for C.
\item
TP Yacc does not support \verb"%union" definitions. Instead, a value
type is declared by specifying the type identifier itself as the tag of
a \verb"%token" or \verb"%type" definition. TP Yacc will automatically
generate an appropriate variant record type (\verb"YYSType") which is
capable of holding values of any of the types used in \verb"%token" and
\verb"%type".

Type checking is very strict. If you use type definitions, then
any symbol referred to in an action must have a type introduced
in a type definition. Either the symbol must have been assigned a
type in the definitions section, or the \verb"$<type-identifier>"
notation must be used. The syntax of the \verb"%type" definition has
been changed slightly to allow definitions of the form
\begin{quote}\begin{verbatim}
%type <type-identifier>
\end{verbatim}\end{quote}
(omitting the nonterminals) which may be used to declare types which
are not assigned to any grammar symbol, but are used with the
\verb"$<...>" construct.
\item
The parse tables constructed by this Yacc version are slightly larger
than those constructed by UNIX Yacc, since a reduce action will only be
chosen as the default action if it is the only action in the state.
In contrast, UNIX Yacc chooses a reduce action as the default action
whenever it is the only reduce action of the state (even if there are
other shift actions).

This avoids a bug in UNIX Yacc that makes the generated parser start
error recovery too late with certain types of error productions (see
also Schreiner/Friedman, {\em Introduction to compiler construction with
UNIX,\/} 1985). Also, errors will be caught sooner in most cases where
UNIX Yacc would carry out an additional (default) reduction before
detecting the error.
\item
Library routines are named differently from the UNIX version (e.g.,
the \verb"yyerrlab" routine takes the place of the \verb"YYERROR"
macro of UNIX Yacc), and, of course, all macros of UNIX Yacc
(\verb"YYERROR", \verb"YYACCEPT", etc.) had to be implemented as
procedures.
\end{itemize}
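As an illustration of the type-tagging scheme described above, here is a
sketch of a typed grammar fragment (the \verb"Real" and \verb"Boolean" tags
and all symbol names are chosen for illustration only):
\begin{quote}\begin{verbatim}
%token <Real> NUM
%type  <Real> expr
%type  <Boolean>
%left  '+' '-'
%%
expr : expr '+' expr   { $$ := $1 + $3; }
     | expr '-' expr   { $$ := $1 - $3; }
     | NUM
     ;
\end{verbatim}\end{quote}
From these definitions TP Yacc would generate a \verb"YYSType" variant
record with variants for \verb"Real" and \verb"Boolean"; the bare
\verb"%type <Boolean>" definition makes \verb"Boolean" available for use
with the \verb"$<Boolean>" construct without binding it to any grammar
symbol.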
\end{document}