1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
|
\input texinfo.tex @c -*-texinfo-*-
@setfilename silex.info
@settitle SILex
@setchapternewpage on
@footnotestyle end
@paragraphindent 3
@syncodeindex fn cp
@syncodeindex vr cp
@syncodeindex ky cp
@syncodeindex pg cp
@syncodeindex tp cp
@c ---------- Info copyright ----------
@ifinfo
This file documents the version 1.0 of SILex, a Scheme
Implementation of Lex.
Copyright @copyright{} 2001 Danny Dub@'e
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
@end ifinfo
@c ---------- Title & copyright pages (printed) ----------
@titlepage
@title SILex
@subtitle A Scheme Implementation of Lex
@subtitle Documentation for SILex version 1.0
@author Danny Dub@'e
@page
@vskip 0pt plus 1filll
Copyright @copyright{} 2001 Danny Dub@'e.
This is the first edition of the SILex documentation. It
documents the version 1.0 of SILex.
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
@end titlepage
@headings double
@c ---------- Top node ----------
@ifinfo
@node Top, Overview, (dir), (dir)
@c Node, Next, Prev, Up
@top
This document describes SILex. ``SILex'' stands for ``Scheme
Implementation of Lex''. It generates a Scheme lexical analyser from a
Lex-like specification file.
This document is the first edition and describes SILex version
1.0.
@menu
* Overview:: A general description of SILex.
* Syntax:: The look of a specification file.
* Semantics:: The meaning of a specification file.
* Generating:: How to generate and use lexical analysers.
* Interface:: With a Scheme LALR(1) parser generator.
* Index:: Concepts and commands relating to SILex.
* Acknowledgements::
@end menu
@end ifinfo
@c ---------- 1st: overview ----------
@node Overview, Syntax, Top, Top
@c Node, Next, Prev, Up
@chapter Overview
SILex is a lexical analyser generator similar to the Lex and
Flex programs, but for Scheme. ``SILex'' stands for ``Scheme
Implementation of Lex''.
SILex has many similarities with the C programs, but has many
differences, too. The syntax of the specification files for SILex is
close to that of Lex and Flex. Of course, the actions must be written
in Scheme and not in C. The set of regular expressions is mostly the
same. An important difference is relative to the multiple start states
in the C analysers. SILex replaces them by allowing multiple analysers
to take their input from the same source. Different inputs can be
analysed at the same time, possibly with different instances of one or
more lexical analysers. The analysers are created dynamically.
SILex provides many other features. The designer of a lexical
analyser can specify the actions to be taken when the end of file is
reached or when an error occurs. The analyser can keep track of the
position in the input in terms of the number of the line, column and
offset. An analyser can take its input from an input port, a string or
a function. SILex is portable; it does not depend on a particular
character set. It can generate analysers that are portable, too.
Finally, the table encoding the behavior of the analyser can be compiled
to Scheme code. The fastest lexical analysers can be produced this way.
@ignore
Lex-like
Scheme program
Driver for Scheme
One or more Scheme actions
Facilities for error handling
Line & column counting
Multiple inputs
Multiple scanners per input
Different input methods
Different table codings
Portable
@end ignore
@c ---------- 2nd: syntax of a specification file ----------
@node Syntax, Semantics, Overview, Top
@c Node, Next, Prev, Up
@chapter Syntax of the specification file
@cindex Syntax of the specification file
@cindex Specification file
@cindex Comment
@cindex White space
A specification file for a lexical analyser contains two parts:
the @dfn{macro definitions part} and the @dfn{rules part}. The two
parts are separated by the mark @code{%%}. The first part is used to
define @dfn{macros}; that is, to give names to some regular expressions.
The second part is used to indicate the regular expressions with which
the input will have to match and the @dfn{actions} associated with each
expression.
Comments can be inserted any place where white space is allowed
and is considered as white space itself. The syntax of the comments is
the same as in Scheme. That is, it begins with a semicolon @samp{;} and
extends up to the end of a line. The semicolon is a valid token in many
languages, so you should take care not to comment out an entire line
when you write a regular expression matching a semicolon.
The syntax of each part is presented, except for the regular
expressions, which are described apart. A small example follows.
@ignore
Macros part %% rules part
Comments & whitespace
@end ignore
@menu
* Macros part:: Syntax of the macro definitions.
* Rules part:: Syntax of the rule-action pairs.
* Regular expressions:: How to build regular expressions.
* Sample file:: Shows some frequent mistakes.
@end menu
@node Macros part, Rules part, Syntax, Syntax
@section Macro definitions part
@cindex Macro
@cindex Macro definitions part
@cindex Scope of a macro definition
The first part of a specification file contains zero or more
macro definitions. A definition consists of a name and a regular
expression, separated by white space. It looks better when each
definition is written on a separate line.
The syntax for a macro name is that of a Scheme symbol. The
case of the letters is not significant. For example, @code{abcd},
@code{+}, @code{...}, @code{Digit} and @code{digit} are all valid macro
names; the last two being the same. You cannot write two macro
definitions with the same name.
The defined macro can be referenced in regular expressions using
the syntax @code{@{@var{name}@}} (@pxref{Regular expressions}). The
scope of a macro definition includes the remaining definitions and the
rules part of the file. It is analogous to the @code{let*} is Scheme,
where the macro definitions correspond to the bindings and the rules
part correspond to the body.
End the macro definitions part with @code{%%}.
@ignore
Names = Scheme symbols
Case insensitive
End with %%
Order of macros
@end ignore
@node Rules part, Regular expressions, Macros part, Syntax
@section Rules part
@cindex Rules part
@cindex Pattern
@cindex Action
@cindex Indentation in actions
The rules part contains the rules up to the end of the
specification file. Each rule is a @dfn{pattern} optionally followed by
an @dfn{action}. The pattern is a regular expression. The action, if
there is one, is formed of one or more Scheme expressions.
The actions can span over several lines. To distinguish between
the remaining of the current action and the start of a new rule, SILex
checks the indentation. A new rule must start at the beginning of the
line. That is, the action starts right after the pattern and contains
all the following lines that start with white space.
SILex does not parse the actions. It simply captures the text
up to the start of the next rule. So a syntax error in an action is not
detected by SILex.
Nevertheless, SILex is able to detect that an action has been
omitted. In that case, a default action is supplied.
@ignore
Action = one or more Scheme expressions
Action are taken verbatim
Indentation is significant
Default actions
End of file
@end ignore
@node Regular expressions, Sample file, Rules part, Syntax
@section Regular expressions
@cindex Regular expression
@cindex Atomic regular expression
@cindex Ordinary character
@cindex Dot
@cindex Wild card
@findex .
@cindex Backslash
@cindex Protecting a character
@findex \n
@findex \@var{integer}
@findex \@var{c}
@cindex Macro reference
@findex @{@var{name}@}
@cindex String
@findex "@var{some text}"
@cindex Character class
@findex [@var{list of characters}]
We first describe the atomic regular expressions. Then, we show
how to build more complex regular expressions from simpler ones.
Finally, the markers are introduced.
The following constructs are regular expressions:
@table @asis
@item @code{@var{c}}
@dfn{Ordinary character}. It is a regular expression that matches the
character @var{c} itself. @var{c} cannot be one of @samp{.}, @samp{\},
@samp{@{}, @samp{"}, @samp{[}, @samp{|}, @samp{?}, @samp{+}, @samp{*},
@samp{(}, @samp{)}, @samp{^}, @samp{$}, @samp{;} or any white space.
@item @code{.}
@dfn{Wild card}. It matches any character except the newline character.
@item @code{\n}
@itemx @code{\@var{integer}}
@itemx @code{\@var{c}}
@dfn{Backslash}. The backslash is used for two things: protect a
character from special meaning; generating non-printable characters.
The expression @code{\n} matches the newline character. The expression
@code{\@var{integer}} matches the character that has number
@var{integer} (in the sense of @code{char->integer}). @var{integer}
must be a valid character number on the implementation that you use. It
may be more than 3 digits long and even negative@footnote{The Scheme
standards do not impose a particular character set, such as @sc{ascii}.
The only requirement is that the function @code{char->integer} returns
an integer.}. The expression @code{\@var{c}} matches the character
@var{c} if @var{c} is not @samp{n}, @samp{-} nor a digit.
@item @code{@{@var{name}@}}
@dfn{Macro reference}. This expression matches the same lexemes as
those matched by the regular expression named @var{name}. You can
imagine that the reference is replaced by the text of the named
expression. However, it works as if parentheses had been added to
protect the substituting expression.
@item @code{"@var{some text}"}
@dfn{String}. A string matches a lexeme identical to its contents. In
a string, the only special characters are @samp{"}, which closes the
string, and @samp{\} which keeps the effect mentioned above.
@item @code{[@var{list of characters}]}
@itemx @code{[]@var{list of characters}]}
@itemx @code{[-@var{list of characters}]}
@itemx @code{[^@var{list of characters}]}
@dfn{Character class}. The expression matches one of the enumerated
characters. For example, the expression @samp{[abc]} matches one of
@samp{a}, @samp{b} and @samp{c}. You can list a range of characters by
writing the first character, the @samp{-} and the last character. For
example, @samp{[A-Za-z]} matches one letter (if the letters are ordered
and contiguous in the character set used by your implementation). The
special characters in a class are @samp{]}, which closes the class,
@samp{-}, which denotes a range of character, and @samp{\}, which keeps
its usual meaning. There is an exception with the first character in a
class. If the first character is @samp{]} or @samp{-}, it loses its
special meaning. If the first character is @samp{^}, the expression
matches one character if it is @emph{not} enumerated in @var{list of
characters}.
@ignore
Ordinary character
Dot
Backslash: with n, with an integer (finir les chiffres), otherwise
Macro reference
String
Character class
@end ignore
@end table
@cindex Union of regular expressions
@cindex Alternatives
@findex |
@cindex Concatenation of regular expressions
@cindex Optional regular expression
@findex ?
@cindex Closure of a regular expression
@cindex Positive closure
@findex +
@cindex Kleene closure
@findex *
@cindex Repetition of a regular expression
@findex @{@var{i},@var{j}@}
@cindex Overriding the precedence
@cindex Grouping regular expressions
@cindex Precedence
@findex ( )
Suppose @var{r} and @var{s} are regular expressions. Then the
following expressions can be built:
@table @asis
@item @code{@var{r}|@var{s}}
@dfn{Union}. This regular expression matches a lexeme if the lexeme is
matched by @var{r} or by @var{s}.
@item @code{@var{r}@var{s}}
@dfn{Concatenation}. This expression matches a lexeme if the lexeme can
be written as the concatenation of a lexeme matched by @var{r} and a
lexeme matched by @var{s}.
@item @code{@var{r}?}
@dfn{Optional expression}. A lexeme matches this expression if it is
the empty lexeme or if it matches @var{r}.
@item @code{@var{r}+}
@dfn{Positive closure}. This expression matches a lexeme that can be
written as the concatenation of one or more lexemes, where each of those
matches @var{r}.
@item @code{@var{r}*}
@dfn{Kleene closure}. A lexeme is matched by this expression if it can
be written as the concatenation of zero or more lexemes, where each of
those matches @var{r}.
@item @code{@var{r}@{@var{i}@}}
@itemx @code{@var{r}@{@var{i},@}}
@itemx @code{@var{r}@{@var{i},@var{j}@}}
@dfn{Power or repetition of an expression}. These expressions allow the
``repetition'' of a regular expression a certain number of times.
@var{i} and @var{j} must be positive integers and @var{j} must be
greater or equal to @var{i}. The first form repeats the expression
@var{r} exactly @var{i} times. The second form repeats @var{r} at least
@var{i} times. The last form repeats @var{r} at least @var{i} times and
at most @var{j} times. You should avoid using large numbers (more than
10), because the finite automaton for @var{r} is copied once for each
repetition. The tables of the analyser may quickly become very large.
You should note that the syntax of these expressions does not conflict
with the syntax of the macro reference.
@item @code{(@var{r})}
@dfn{Parentheses}. This expression matches the same lexemes as @var{r}.
It is used to override the precedence of the operators.
@end table
The building operators are listed in order of increasing
precedence. The @code{?}, @code{+}, @code{*} and repetition operators
have the same precedence.
@ignore
Or
Conc
Question
Plus
Star
Power
Parentheses
@end ignore
@cindex Marker
@cindex Beginning of line marker
@findex ^
@cindex End of line marker
@findex $
@cindex End of file marker
@findex <<EOF>>
@cindex Error marker
@findex <<ERROR>>
The remaining ``expressions'' would better be called
@dfn{markers}. They all match the empty lexeme but require certain
conditions to be respected in the input. They cannot be used in all
regular expressions. Suppose that @var{r} is a regular expression
without markers.
@table @asis
@item @code{^@var{r}}
@itemx @code{@var{r}$}
@dfn{Beginning and end of line}. These markers require that the lexeme
is found at the beginning and at the end of the line, respectively. The
markers lose their special meaning if they are not placed at their end
of the regular expression or if they are used in the first part of the
specification file. In those cases, they are treated as regular
characters.
@item @code{<<EOF>>}
@dfn{End of file}. This marker is matched only when the input is at the
end of file. The marker must be used alone in its pattern, and only in
the second part of the file. There can be at most one rule with this
particular pattern.
@item @code{<<ERROR>>}
@dfn{Error}. This marker is matched only when there is a parsing error.
It can be used under the same conditions as @code{<<EOF>>}.
@ignore
Caret
Dollar
End of file
Error
@end ignore
@end table
@cindex White space in regular expressions
White space ends the regular expressions. In order to include
white space in a regular expression, it must be protected by a backslash
or placed in a string.
@ignore
Ended with white spaces
Examples
@end ignore
@node Sample file, , Regular expressions, Syntax
@section An example of a specification file
Here is an example of a SILex specification file. The file is
syntactically correct from the SILex point of view. However, many
common mistakes are shown. The file is not a useful one.
@example
; This is a syntactically correct but silly file.
partial hel
complete @{partial@}lo ; @r{Backward macro ref. only}
digit [0-9]
letter [a-zA-Z]
%%
-?@{digit@}+ (cons 'integer yytext) ; @r{@code{yytext} contains}
; @r{the lexeme}
-?@{digit@}+\.@{digit@}+[eE][-+]?@{digit@}+
(cons ; @r{A long action}
'float
yytext)
; (list 'semicolon) ; @r{Probably a mistake}
begin )list 'begin( ; @r{No error detected here}
end ; @r{The action is optional}
\73 (list 'bell-3) ; @r{It does not match the}
; @r{char. # 7 followed by @samp{3}}
\0073 (list 'bell-3) ; @r{Neither does it}
(\7)3 (list 'bell-3) ; @r{This does it}
"*()+|@{@}[].? are ordinary but \" and \\ are special"
[^\n] (list 'char) ; @r{Same thing as @samp{.}}
(@{letter@}|_)(@{letter@}|_|@{digit@})* ; @r{A C identifier}
[][] ; @r{One of the square brackets}
Repe(ti)@{2@}on (list 'repetition)
^@{letter@}+: (cons 'label yytext) ; @r{A label placed at the}
; @r{beginning of the line}
$^ ; @r{No special meaning}
<<EOF>> (list 'eof) ; @r{Detection of the end of file}
<<ERROR>> (my-error) ; @r{Error handling}
@end example
@ignore
Subset of Scheme(?)
Example of \73, \0073, (\7)3
@end ignore
@c ---------- 3rd: semantics of the specification file ----------
@node Semantics, Generating, Syntax, Top
@c Node, Next, Prev, Up
@chapter Semantics of the specification file
@cindex Semantics of the specification file
An important part of the semantics of a specification file is
described with the syntax of the regular expressions. The remainder is
presented here. We begin with the role of the actions. Information on
the matching method follows.
@menu
* Action:: What does an action.
* Matching rules:: When does a regular expression matches the input.
@end menu
@node Action, Matching rules, Semantics, Semantics
@section Evaluation of the actions
@findex yycontinue
@findex yygetc
@findex yyungetc
@findex yytext
@findex yyline
@findex yycolumn
@findex yyoffset
@cindex Skipping a lexeme
@cindex Getting characters
@cindex Ungetting characters
@cindex Lexeme
@cindex Line number
@cindex Column number
@cindex Offset
@cindex Default action
The action of a rule is evaluated when the corresponding pattern
is matched. The result of its evaluation is the result that the lexical
analyser returns to its caller.
There are a few local variables that are accessible by the
action when it is evaluated. Those are @code{yycontinue},
@code{yygetc}, @code{yyungetc}, @code{yytext}, @code{yyline},
@code{yycolumn} and @code{yyoffset}. Each one is described here:
@table @code
@item yycontinue
This variable contains the lexical analysis function itself. Use
@code{(yycontinue)} to ask for the next token. Typically, the action
associated with a pattern that matches white space is a call to
@code{yycontinue}; it has the effect of skipping the white space.
@item yygetc
@itemx yyungetc
These variables contain functions to get and unget characters from the
input of the analyser. They take no argument. @code{yygetc} returns a
character or the symbol @samp{eof} if the end of file is reached. They
should be used to read characters instead of accessing directly the
input port because the analyser may have read more characters in order
to have a look-ahead. It is incorrect to try to unget more characters
than has been gotten since @emph{the parsing of the last token}. If
such an attempt is made, @code{yyungetc} silently refuses.
@item yytext
This variable is bound to a string containing the lexeme. This string
is guaranteed not to be mutated. The string is created only if the
action `seems' to need it. The action is considered to need the lexeme
when @samp{yytext} appears somewhere in the text of the action.
@item yyline
@itemx yycolumn
@itemx yyoffset
These variables indicate the position in the input at the beginning of
the lexeme. @code{yyline} is the number of the line; the first line is
the line 1. @code{yycolumn} is the number of the column; the first
column is the column 1. It is important to mention that characters such
as the tabulation generate a variable length output when they are
printed. So it would be more accurate to say that @code{yycolumn} is
the number of the first character of the lexeme, starting at the
beginning of the line. @code{yyoffset} indicates the distance from the
beginning of the input; the first lexeme has offset 0. The three
variables may not all be existant depending on the kind of counting you
want the analyser to do for you (@pxref{Counters}).
@end table
There is a default action that is provided for a rule when its
action is omitted. If the pattern is @samp{<<EOF>>}, the default action
returns the object @samp{(0)}. If the pattern is @samp{<<ERROR>>}, the
default action displays an error message and returns the symbol
@samp{error}@footnote{Note that there is no portable way for the
analyser to end the execution of the program when an error occurs.}.
The default action for the other patterns is to call the analyser again.
It is clearer (and normally more useful) to specify explicitly the
action associated with each rule.
@ignore
An action is executed when its corresp. regexp is matched
Environment of the actions
yycontinue, yygetc, yyungetc, yytext, yyline, yycolumn, yyoffset
yycolumn = number of the character (cause: tabs)
Default actions
@end ignore
@node Matching rules, , Action, Semantics
@section Matching the rules
@cindex Matching method
@cindex Matching conflict
@cindex Conflict between patterns
@cindex Interactive analyser
Each time the analyser is asked to return a token, it tries to
match a prefix of the input with a pattern. There may be more than one
possible match; when it is the case, we say there is a conflict. For
example, suppose we have those regular expressions:
@example
begin
[a-z]*
@end example
@noindent
and the input is @samp{beginning1 @r{@dots{}}}. We have a match with
the first expression and we have many different matches with the second.
To resolve such a conflict, the longest match is chosen. So the chosen
match is the one between the lexeme @samp{beginning} and the second
expression.
Suppose we have the same regular expressions but the input is
@samp{begin+ @r{@dots{}}}. We have @emph{two} longest match. This
conflict is resolved by choosing the first pattern that allows a longest
match. So the chosen match is between the lexeme @samp{begin} and the
first pattern.
The analyser generated by SILex allows the empty lexeme to be
matched if there is no longer match. However, you should take care not
to call the analyser again without consuming at least one character of
the input. It would cause an infinite loop.
The pattern @samp{<<EOF>>} is matched when the analyser is
called and the input is at end of file. In this situation, the marker
is matched even if there is a pattern that matches the empty lexeme.
The analyser can be called again and again and the @samp{<<EOF>>}
pattern will be matched each time, causing its corresponding action to
be evaluated each time, too.
The pattern @samp{<<ERROR>>} is matched when the input is not at
end of file and no other match is possible. Depending on the action
associated with this pattern, your program may choose to stop or choose
to try to recover from the error. To recover from the error, your
program has to read some characters from the input before it can call
the analyser again.
All lexical analysers generated by SILex are interactive. That
is, they read as few characters as possible to get the longest match.
This is a useful property when the input is coming from a terminal. A
lexical analyser is normally based on a finite automaton; it is the case
for the analysers generated by SILex. A non-interactive analyser always
needs an extra character to provoke an invalid transition in the
automaton. The longest match is detected this way. With an interactive
analyser, an extra character is not required when it is impossible to
obtain a longer match.
A lexical analyser generated by SILex does not impose any @i{a
priori} limit on the size of the lexemes. The internal buffer is
extended each time it is necessary.
@ignore
Longest prefix of the input, first matching rule
Warning for matching empty string
^ & $ anchors
End of file anchor
Error anchor
Interactive matching
The lexeme is not limited to a certain length
@end ignore
@c ---------- 4th: generating a lexical analyser ----------
@node Generating, Interface, Semantics, Top
@c Node, Next, Prev, Up
@chapter Generating and using a lexical analyser
@cindex Generating a lexical analyser
@cindex Using a lexical analyser
The most common use of SILex is to generate a single complete
lexical analyser. In some situations however, it is preferable to only
generate the tables describing the analysers and leaving to the program
to build complete analysers at run time. It is the case when the
program has to parse many files simultaneously with the same analyser;
and when a file is to be parsed using many different analysers. After
the description of the two modes, we describe the SILex options and the
different input methods.
@ignore
One or many analysers
Options
Different input methods
@end ignore
@menu
* One analyser:: Generating and using one complete analyser.
* Many analysers:: Dynamic creation of analysers.
* Options:: Line counting, table encoding.
* Input:: Input from a port, a string or a function.
@end menu
@node One analyser, Many analysers, Generating, Generating
@section One complete analyser
The function @code{lex} generates a complete lexical analyser.
We first describe its parameters. Then the interface with the generated
analyser is presented.
@menu
* Lex:: The @code{lex} command.
* Functions:: The functions in the lexical analyser.
* Usage:: Using the lexical analyser.
@end menu
@node Lex, Functions, One analyser, One analyser
@subsection The @code{lex} command
@findex lex
Here is the template of a call to @code{lex}:
@noindent
@code{(lex @var{input-file} @var{output-file} [@var{options} @r{@dots{}}])}
@noindent
@var{input-file} is a string containing the name of the specification
file. @var{output-file} is a string containing the name of the file in
which the lexical analyser is written. For a description of the
options, see @ref{Options}.
This is an example of a call to @code{lex}:
@example
(lex "pascal.l" "pascal.l.scm")
@end example
@ignore
Invocation of lex
@end ignore
@node Functions, Usage, Lex, One analyser
@subsection The functions in the lexical analyser
@findex lexer
@findex lexer-get-line
@findex lexer-get-column
@findex lexer-get-offset
@findex lexer-getc
@findex lexer-ungetc
@findex lexer-init
@cindex Name convention
The file generated by @code{lex} contains a few global
definitions. A program using the analyser needs only the following
functions: @code{lexer}, @code{lexer-get-line}, @code{lexer-get-column},
@code{lexer-get-offset}, @code{lexer-getc}, @code{lexer-ungetc} and
@code{lexer-init}.
@table @code
@item lexer
The lexical analysis function.
@item lexer-get-line
@itemx lexer-get-column
@itemx lexer-get-offset
Functions to obtain the current position in the input.
@item lexer-getc
@itemx lexer-ungetc
Reading and returning characters. These functions have the advantage of
being accessible from outside the actions.
@item lexer-init
Initializing the analyser with the input source.
@end table
To avoid name conflicts, these variables and others that we did
not mention all begin with @samp{lexer@r{@dots{}}}.
@ignore
List of variables
Name convention (lexer...)
@end ignore
@node Usage, , Functions, One analyser
@subsection Using the lexical analyser
@cindex Initialization of the analyser
@cindex Token
The first function that must be called is the initialization
function. It is necessary to give to the analyser its source of
characters. Here is the template of a call to this function:
@noindent
@code{(lexer-init @var{input-type} @var{input})}
@noindent
The values @var{input-type} and @var{input} are described in
@ref{Input}.
Once the initialization is done, the program can get
@dfn{tokens} from the analyser by calling the lexical analysing
function:
@example
(lexer)
@end example
@noindent
The token is the result of the evaluation of the action corresponding to
the matched pattern. The current position can be obtained with:
@example
(lexer-get-line)
(lexer-get-column)
(lexer-get-offset)
@end example
@noindent
As is described in @ref{Options}, some or all of these functions may not
be available. Characters can be gotten and ungotten from the input this
way:
@example
(lexer-getc)
(lexer-ungetc)
@end example
@noindent
It is important to note that the analyser remembers the characters
previously gotten. Your program does not have to keep those itself.
Even after the end of file has been reached or an error has
occured, the @code{lexer} function can be called again. Its behavior
depends on the remaining characters in the input.
The analyser can be reinitialized in any time with a new input.
@ignore
How to use it
Can be called many times at end-of-file
@end ignore
@node Many analysers, Options, One analyser, Generating
@section Many analysers
There are applications where it is necessary to have more than
one lexical analyser parsing more than one file at a time. For example:
@itemize @minus
@item
The parsing of a C file (with cpp) may cause the parsing of other files
recursively because of the @code{#include} commands.
@item
An interactive compiler has to be able to compile a file without closing
the communication with the standard input.
@item
SILex itself parses the macro names, the regular expressions, the
interior of a string, @dots{}, with different sets of patterns.
@end itemize
We first begin with an overview on how SILex allows the
programmer to create multiple lexical analysers. We continue with a
description of the function @code{lex-tables}. We end the explanations
with the functions used to creat analysers dynamically.
@menu
* Dynamic style:: It is possible to parse many files
with many analysers.
* Lex-tables:: The @code{lex-tables} command.
* Usage2:: Building and using lexical analysers dynamically.
@end menu
@node Dynamic style, Lex-tables, Many analysers, Many analysers
@subsection Creating analysers dynamically
@cindex Dynamic creation of analysers
@cindex Input system
It is quite easy to create new analysers at run-time. Suppose
there is an input that you want to analyse. There are just two steps to
make.
@itemize @bullet
@item
Create an @dfn{input system} from the input. An input system provides
the buffering, the line counting and similar low level services.
@item
Create one or more analysers from the input system and the analyser
tables. The tables are generated by the function @code{lex-tables} from
a specification file. A table contains all the necessary information to
build up an analyser. Normally, you have to use more than one analyser
per input when you expect the syntax to vary greatly in the input.
@end itemize
The following example shows a typical organization for a
multi-analyser lexical analysis. Note that one table may have been used
to produce many instances of analysers. Those analysers would simply be
connected to different input systems@footnote{It would make no sense to
create two instances coming from the same table and being connected to
the same input system. They would both have exactly the same
behavior.}.
@example
Input1 Input2 Input3
| | |
| | |
IS1 IS2 IS3
| | |
+-------+-------+ | +--+---+
| | | | | |
An1.1 An1.2 An1.3 An2.1 An3.1 An3.2
@end example
There is no @i{a priori} limit on the number of input systems
and analysers that you can create dynamically.
@ignore
Input systems & dynamic lexical analysers
@end ignore
@node Lex-tables, Usage2, Dynamic style, Many analysers
@subsection The @code{lex-tables} command
@findex lex-tables
The function @code{lex-tables} produces a table describing an
analyser from a specification file. A call to @code{lex-tables} looks
like:
@noindent
@code{(lex-tables @var{input-file} @var{table-name} @var{output-file} [@var{options} @r{@dots{}}])}
@noindent
@var{input-file} must be a string containing the name of the
specification file. @var{output-file} is a string containing the name
in which the result is printed. A definition is written in the output
file. @var{table-name} must be a string and it is the name appearing in
the definition. The options are defined in @ref{Options}.
This is an example of a call to @code{lex-tables}:
@example
(lex-tables "c.l" "c-table" "c.l.scm")
@end example
@ignore
Invocation of lex-tables
@end ignore
@node Usage2, , Lex-tables, Many analysers
@subsection Building and using lexical analysers dynamically
@cindex Building an analyser dynamically
@pindex multilex.scm
@cindex Name convention
@findex lexer-make-IS
@findex lexer-get-func-line
@findex lexer-get-func-column
@findex lexer-get-func-offset
@findex lexer-get-func-getc
@findex lexer-get-func-ungetc
@findex lexer-make-lexer
In order to be able to create dynamically the analysers the
program needs, the files containing the tables and the file
@file{multilex.scm} must be loaded as part of the program. The name
convention is the following: all definitions in @file{multilex.scm}
introduce names beginning with @samp{lexer@r{@dots{}}} and the
definitions in the other files introduce names that are specified by the
programmer. This way, it is easy to avoid name conflicts.
Input systems are created with the function
@code{lexer-make-IS}. A call to this function looks like:
@noindent
@code{(lexer-make-IS @var{input-type} @var{input} [@var{counters}])}
@noindent
The values @var{input-type} and @var{input} are described in
@ref{Input}. The value of @var{counters} determines which counters the
input system should maintain. This is discussed in @ref{Input}. Input
systems are associative lists that cannot be used directly.
Useful functions can be extracted from an input system. The
following calls return functions that allows the program to interact
with the input system:
@example
(lexer-get-func-line @var{input-system})
(lexer-get-func-column @var{input-system})
(lexer-get-func-offset @var{input-system})
(lexer-get-func-getc @var{input-system})
(lexer-get-func-ungetc @var{input-system})
@end example
Lexical analysers are created with the function
@code{lexer-make-lexer}. The template of a call to this function is:
@noindent
@code{(lexer-make-lexer @var{table} @var{input-system})}
@noindent
@var{table} is a table generated by SILex. @var{input-system} is the
input system from which the analyser will take its input. The result of
the call is the analysis function. The analysis function takes no
argument and returns tokens.
This example summarizes all the step in the creation of an
analyser:
@example
(let* ((my-port (open-input-file "my-file"))
(my-IS (lexer-make-IS 'port my-port))
(my-get-line (lexer-get-func-line IS))
(my-get-column (lexer-get-func-column IS))
(my-get-offset (lexer-get-func-offset IS))
(my-getc (lexer-get-func-getc IS))
(my-ungetc (lexer-get-func-ungetc IS))
(my-analyser (lexer-make-lexer my-table IS)))
(let loop ((tok (my-analyser)))
(cond ((eq? tok 'eof)
@r{@dots{}}
@end example
@ignore
File lex-rt.scm
How to use it: lex-rt.scm, lexer-make-IS, lexer-make-lexer & cie
Can be called many times at end-of-file
Name convention (lexer...)
@end ignore
@node Options, Input, Many analysers, Generating
@section Options at generation time
@cindex Options
We describe the options that can be passed to @code{lex} and
@code{lex-tables}. They indicate which counters (line, column and
offset) the actions need; which table encoding should be used; and
whether the tables should be pretty-printed.
@menu
* Counters:: Keeping the position in the input.
* Tables encoding:: Encodings of the tables of an analyser.
* Pretty print:: Pretty printing the tables.
@end menu
@node Counters, Tables encoding, Options, Options
@subsection Line, column and offset counters
@cindex Counters
@vindex none
@vindex line
@vindex all
There are three different counting modes: no counter, line
counter and all counters. The more counters the input system maintains,
the more it is slowed down. The default is the line counting.
This option is specified when the program calls the functions
@code{lex}, @code{lex-tables} and @code{lexer-make-IS}. The three modes
are represented by the symbols @samp{none}, @samp{line} and @samp{all}.
When one of the first two functions is called the mode must be preceded
by the symbol @samp{counters}. These examples illustrate the use of the
option:
@example
(lex "html.l" "html.l.scm" 'counters 'none)
(lex-tables "cobol.l" "cobol-table" "cobol.l.scm" 'counters 'line)
(lexer-make-IS 'port my-port 'all)
@end example
You should be careful when you build analysers dynamically. The
mode specified at the input system creation must be consistent with the
mode specified at the tables creation.
@ignore
counters
@end ignore
@node Tables encoding, Pretty print, Counters, Options
@subsection Encoding of the table of an analyser
@cindex Encoding of the table
@vindex portable
@vindex code
@cindex Portability
@cindex Fast analyser
SILex provides three different encodings of the tables: the
default encoding, the portable encoding and the ``compilation'' to
Scheme code.
With the default encoding, the finite automaton of the analyser
is represented with data structures that contain the @emph{numbers} of
the characters (in the sense of @code{char->integer}). Since the
numbers associated with the characters may depend on the Scheme
implementation, an analyser generated with an implementation can be
safely used only with the same implementation. An analyser encoded in
the default style is not portable. But this representation is the most
compact.
With the portable encoding, the data structures describing the
automaton contain characters directly. If the automaton, as generated,
contains a transition from state @var{s} to state @var{t} on character
@var{c}, then somewhere in the table there is the Scheme character
@samp{#\@var{c}}. When the file containing the analyser is loaded in
any implementation, the character is read as is, and not as the number
@samp{(char->integer #\@var{c})} as evaluated by the original
implementation. As long as the implementation using the analyser
recognizes the characters mentionned in it, there is no problem.
So this encoding is portable. However, it is less compact.
This is because something like @samp{(65 90)} is more compact than
something like @samp{(#\A #\B @r{@dots{}} #\Y #\Z)} to represent
@samp{[A-Z]}. The construction of an analyser from a portable table
takes more time than the construction from a default table. But, once
built, the performance of the analyser is the same in both cases.
It is important to note that in some character sets, the letters
or the digits are not contiguous. So, in those cases, the regular
expression @samp{[A-Z]} does not necessarily accept only the uppercase
letters.
The last encoding is the compilation to Scheme code. This
produces a fast lexical analyser. Instead of containing data structures
representing the behavior of the automaton, the table contains Scheme
code that ``hard-codes'' the automaton. This encoding often generates
big tables. Such an analyser is not portable.
The encoding of the tables can be specified as an option when
@code{lex} and @code{lex-tables} are called. The symbols
@samp{portable} and @samp{code} are used to specify that the table must
be portable and that the table must be compiled, respectively. For
example, these calls illustrate the use of the options:
@example
(lex "c.l" "c.l.scm") ; @r{Default encoding}
(lex "c.l" "c.l.scm" 'portable) ; @r{Portable encoding}
(lex "c.l" "c.l.scm" 'code) ; @r{Compilation of the automaton}
@end example
@ignore
portable / code
@end ignore
@node Pretty print, , Tables encoding, Options
@subsection Pretty printing the tables
@cindex Pretty-printing the tables
The pretty-print option (specified with the symbol @samp{pp})
tells SILex to pretty-print the contents of the table. Normally, the
table is displayed as a compact mass of characters fitting in about 75
columns. The option is useful only for a developer of SILex. The
Scheme code generated with the @samp{code} option is always
pretty-printed.
@ignore
pp
@end ignore
@node Input, , Options, Generating
@section Input methods
@cindex Input
@cindex Input port, input from an
@cindex String, input from a
@cindex Function, input from a
An analyser can take its input from three different objects: an
input port, a string or a function. The type of input and the input
itself must be passed when an analyser is initialized and when an input
system is created. The input type is specified using one of the three
symbols: @samp{port}, @samp{string} or @samp{procedure}. For example:
@example
(lexer-init 'port (current-input-port))
(lexer-make-IS 'string "Input string.")
@end example
When an input port is used by an analyser, the program should
avoid reading characters directly from the port. This is because the
analyser may have needed a look-ahead to do the analysis of the
preceding token. The program would not find what it expects on the
port. The analyser provides safe functions to get characters from the
input. The analyser never closes itself the port it has received, this
task is left to the program.
When the analyser is initialized with a string, it takes a copy
of it. This way, eventual mutations of the string do not affect the
analysis.
The use of a function as character source allows the analyser to
parse any character stream, no matter how it is obtained. For example,
the characters may come from the decompression or decryption of a huge
file, the task being done lazily in order to save space. The function
must take no argument and return a character each time it is called.
When the end of file (or its logical equivalent) is reached, the
function must return an object that is not a character (for example, the
symbol @samp{eof}). After the function has returned an end of file
indicator, it is not called again.
@ignore
port / string / function
Copy of the string
The function at end-of-file is not called again
@end ignore
@c ---------- Interfacing with lalr.scm ----------
@node Interface, Acknowledgements, Generating, Top
@c Node, Next, Prev, Up
@appendix Interfacing with an @sc{lalr}(1) parser
@cindex Dominique Boucher
@cindex @sc{lalr}(1) parser generator
A nice @sc{lalr}(1) parser generator for Scheme has been written
by Dominique Boucher. The generator is accessible at the Scheme
Repository at @code{ftp://ftp.cs.indiana.edu} in the file
@file{/pub/scheme-repository/code/lang/lalr-scm.tar.gz}.
The parsers that are generated need two functions to operate: a
lexical analysis function and an error function. The analysis function
must take no argument and return a token each time it is called. This
is exactly the behavior of the lexical analysis functions created by
SILex.
The @sc{lalr}(1) parsers expect that the tokens are pairs with a
number in the @sc{car}, the token number, and any value in the @sc{cdr},
the token attribute. It is easy to respect this convention with a SILex
lexical analyser since the actions can be any Scheme expressions.
Furthermore, the file created by the @sc{lalr}(1) parser generator
contains definitions that give names to the number of the tokens. A
lexical analyser can use those names in its actions in order to simplify
the coordination between the two analysers.
@c ---------- Acknowledgements ----------
@node Acknowledgements, Index, Interface, Top
@c Node, Next, Prev, Up
@chapheading Acknowledgements
I would like to thank my comrades of the laboratory for their
support in this project. Especially Martin Larose and Marc Feeley for
their numerous suggestions.
I hope SILex will be useful for many Scheme programmers.
If you find a bug, please let me know at
@code{mailto:dube@@iro.umontreal.ca}.
@c ---------- Index & tables of contents ----------
@node Index, , Acknowledgements, Top
@c Node, Next, Prev, Up
@unnumbered Index
@printindex cp
@contents
@bye
Memos:
Verifier si des trucs comme #\^L sont portables
|