\input texinfo.tex @c -*-texinfo-*- @setfilename silex.info @settitle SILex @setchapternewpage on @footnotestyle end @paragraphindent 3 @syncodeindex fn cp @syncodeindex vr cp @syncodeindex ky cp @syncodeindex pg cp @syncodeindex tp cp @c ---------- Info copyright ---------- @ifinfo This file documents the version 1.0 of SILex, a Scheme Implementation of Lex. Copyright @copyright{} 2001 Danny Dub@'e This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. @end ifinfo @c ---------- Title & copyright pages (printed) ---------- @titlepage @title SILex @subtitle A Scheme Implementation of Lex @subtitle Documentation for SILex version 1.0 @author Danny Dub@'e @page @vskip 0pt plus 1filll Copyright @copyright{} 2001 Danny Dub@'e. This is the first edition of the SILex documentation. It documents the version 1.0 of SILex. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. @end titlepage @headings double @c ---------- Top node ---------- @ifinfo @node Top, Overview, (dir), (dir) @c Node, Next, Prev, Up @top This document describes SILex. ``SILex'' stands for ``Scheme Implementation of Lex''. It generates a Scheme lexical analyser from a Lex-like specification file. This document is the first edition and describes SILex version 1.0. @menu * Overview:: A general description of SILex. * Syntax:: The look of a specification file. * Semantics:: The meaning of a specification file. * Generating:: How to generate and use lexical analysers. * Interface:: With a Scheme LALR(1) parser generator. * Index:: Concepts and commands relating to SILex. * Acknowledgements:: @end menu @end ifinfo @c ---------- 1st: overview ---------- @node Overview, Syntax, Top, Top @c Node, Next, Prev, Up @chapter Overview SILex is a lexical analyser generator similar to the Lex and Flex programs, but for Scheme. ``SILex'' stands for ``Scheme Implementation of Lex''. SILex has many similarities with the C programs, but has many differences, too. The syntax of the specification files for SILex is close to that of Lex and Flex. Of course, the actions must be written in Scheme and not in C. The set of regular expressions is mostly the same. An important difference is relative to the multiple start states in the C analysers. SILex replaces them by allowing multiple analysers to take their input from the same source. Different inputs can be analysed at the same time, possibly with different instances of one or more lexical analysers. The analysers are created dynamically. SILex provides many other features. The designer of a lexical analyser can specify the actions to be taken when the end of file is reached or when an error occurs. The analyser can keep track of the position in the input in terms of the number of the line, column and offset. An analyser can take its input from an input port, a string or a function. SILex is portable; it does not depend on a particular character set. It can generate analysers that are portable, too. Finally, the table encoding the behavior of the analyser can be compiled to Scheme code. The fastest lexical analysers can be produced this way. @ignore Lex-like Scheme program Driver for Scheme One or more Scheme actions Facilities for error handling Line & column counting Multiple inputs Multiple scanners per input Different input methods Different table codings Portable @end ignore @c ---------- 2nd: syntax of a specification file ---------- @node Syntax, Semantics, Overview, Top @c Node, Next, Prev, Up @chapter Syntax of the specification file @cindex Syntax of the specification file @cindex Specification file @cindex Comment @cindex White space A specification file for a lexical analyser contains two parts: the @dfn{macro definitions part} and the @dfn{rules part}. The two parts are separated by the mark @code{%%}. The first part is used to define @dfn{macros}; that is, to give names to some regular expressions. The second part is used to indicate the regular expressions with which the input will have to match and the @dfn{actions} associated with each expression. Comments can be inserted any place where white space is allowed and is considered as white space itself. The syntax of the comments is the same as in Scheme. That is, it begins with a semicolon @samp{;} and extends up to the end of a line. The semicolon is a valid token in many languages, so you should take care not to comment out an entire line when you write a regular expression matching a semicolon. The syntax of each part is presented, except for the regular expressions, which are described apart. A small example follows. @ignore Macros part %% rules part Comments & whitespace @end ignore @menu * Macros part:: Syntax of the macro definitions. * Rules part:: Syntax of the rule-action pairs. * Regular expressions:: How to build regular expressions. * Sample file:: Shows some frequent mistakes. @end menu @node Macros part, Rules part, Syntax, Syntax @section Macro definitions part @cindex Macro @cindex Macro definitions part @cindex Scope of a macro definition The first part of a specification file contains zero or more macro definitions. A definition consists of a name and a regular expression, separated by white space. It looks better when each definition is written on a separate line. The syntax for a macro name is that of a Scheme symbol. The case of the letters is not significant. For example, @code{abcd}, @code{+}, @code{...}, @code{Digit} and @code{digit} are all valid macro names; the last two being the same. You cannot write two macro definitions with the same name. The defined macro can be referenced in regular expressions using the syntax @code{@{@var{name}@}} (@pxref{Regular expressions}). The scope of a macro definition includes the remaining definitions and the rules part of the file. It is analogous to the @code{let*} is Scheme, where the macro definitions correspond to the bindings and the rules part correspond to the body. End the macro definitions part with @code{%%}. @ignore Names = Scheme symbols Case insensitive End with %% Order of macros @end ignore @node Rules part, Regular expressions, Macros part, Syntax @section Rules part @cindex Rules part @cindex Pattern @cindex Action @cindex Indentation in actions The rules part contains the rules up to the end of the specification file. Each rule is a @dfn{pattern} optionally followed by an @dfn{action}. The pattern is a regular expression. The action, if there is one, is formed of one or more Scheme expressions. The actions can span over several lines. To distinguish between the remaining of the current action and the start of a new rule, SILex checks the indentation. A new rule must start at the beginning of the line. That is, the action starts right after the pattern and contains all the following lines that start with white space. SILex does not parse the actions. It simply captures the text up to the start of the next rule. So a syntax error in an action is not detected by SILex. Nevertheless, SILex is able to detect that an action has been omitted. In that case, a default action is supplied. @ignore Action = one or more Scheme expressions Action are taken verbatim Indentation is significant Default actions End of file @end ignore @node Regular expressions, Sample file, Rules part, Syntax @section Regular expressions @cindex Regular expression @cindex Atomic regular expression @cindex Ordinary character @cindex Dot @cindex Wild card @findex . @cindex Backslash @cindex Protecting a character @findex \n @findex \@var{integer} @findex \@var{c} @cindex Macro reference @findex @{@var{name}@} @cindex String @findex "@var{some text}" @cindex Character class @findex [@var{list of characters}] We first describe the atomic regular expressions. Then, we show how to build more complex regular expressions from simpler ones. Finally, the markers are introduced. The following constructs are regular expressions: @table @asis @item @code{@var{c}} @dfn{Ordinary character}. It is a regular expression that matches the character @var{c} itself. @var{c} cannot be one of @samp{.}, @samp{\}, @samp{@{}, @samp{"}, @samp{[}, @samp{|}, @samp{?}, @samp{+}, @samp{*}, @samp{(}, @samp{)}, @samp{^}, @samp{$}, @samp{;} or any white space. @item @code{.} @dfn{Wild card}. It matches any character except the newline character. @item @code{\n} @itemx @code{\@var{integer}} @itemx @code{\@var{c}} @dfn{Backslash}. The backslash is used for two things: protect a character from special meaning; generating non-printable characters. The expression @code{\n} matches the newline character. The expression @code{\@var{integer}} matches the character that has number @var{integer} (in the sense of @code{char->integer}). @var{integer} must be a valid character number on the implementation that you use. It may be more than 3 digits long and even negative@footnote{The Scheme standards do not impose a particular character set, such as @sc{ascii}. The only requirement is that the function @code{char->integer} returns an integer.}. The expression @code{\@var{c}} matches the character @var{c} if @var{c} is not @samp{n}, @samp{-} nor a digit. @item @code{@{@var{name}@}} @dfn{Macro reference}. This expression matches the same lexemes as those matched by the regular expression named @var{name}. You can imagine that the reference is replaced by the text of the named expression. However, it works as if parentheses had been added to protect the substituting expression. @item @code{"@var{some text}"} @dfn{String}. A string matches a lexeme identical to its contents. In a string, the only special characters are @samp{"}, which closes the string, and @samp{\} which keeps the effect mentioned above. @item @code{[@var{list of characters}]} @itemx @code{[]@var{list of characters}]} @itemx @code{[-@var{list of characters}]} @itemx @code{[^@var{list of characters}]} @dfn{Character class}. The expression matches one of the enumerated characters. For example, the expression @samp{[abc]} matches one of @samp{a}, @samp{b} and @samp{c}. You can list a range of characters by writing the first character, the @samp{-} and the last character. For example, @samp{[A-Za-z]} matches one letter (if the letters are ordered and contiguous in the character set used by your implementation). The special characters in a class are @samp{]}, which closes the class, @samp{-}, which denotes a range of character, and @samp{\}, which keeps its usual meaning. There is an exception with the first character in a class. If the first character is @samp{]} or @samp{-}, it loses its special meaning. If the first character is @samp{^}, the expression matches one character if it is @emph{not} enumerated in @var{list of characters}. @ignore Ordinary character Dot Backslash: with n, with an integer (finir les chiffres), otherwise Macro reference String Character class @end ignore @end table @cindex Union of regular expressions @cindex Alternatives @findex | @cindex Concatenation of regular expressions @cindex Optional regular expression @findex ? @cindex Closure of a regular expression @cindex Positive closure @findex + @cindex Kleene closure @findex * @cindex Repetition of a regular expression @findex @{@var{i},@var{j}@} @cindex Overriding the precedence @cindex Grouping regular expressions @cindex Precedence @findex ( ) Suppose @var{r} and @var{s} are regular expressions. Then the following expressions can be built: @table @asis @item @code{@var{r}|@var{s}} @dfn{Union}. This regular expression matches a lexeme if the lexeme is matched by @var{r} or by @var{s}. @item @code{@var{r}@var{s}} @dfn{Concatenation}. This expression matches a lexeme if the lexeme can be written as the concatenation of a lexeme matched by @var{r} and a lexeme matched by @var{s}. @item @code{@var{r}?} @dfn{Optional expression}. A lexeme matches this expression if it is the empty lexeme or if it matches @var{r}. @item @code{@var{r}+} @dfn{Positive closure}. This expression matches a lexeme that can be written as the concatenation of one or more lexemes, where each of those matches @var{r}. @item @code{@var{r}*} @dfn{Kleene closure}. A lexeme is matched by this expression if it can be written as the concatenation of zero or more lexemes, where each of those matches @var{r}. @item @code{@var{r}@{@var{i}@}} @itemx @code{@var{r}@{@var{i},@}} @itemx @code{@var{r}@{@var{i},@var{j}@}} @dfn{Power or repetition of an expression}. These expressions allow the ``repetition'' of a regular expression a certain number of times. @var{i} and @var{j} must be positive integers and @var{j} must be greater or equal to @var{i}. The first form repeats the expression @var{r} exactly @var{i} times. The second form repeats @var{r} at least @var{i} times. The last form repeats @var{r} at least @var{i} times and at most @var{j} times. You should avoid using large numbers (more than 10), because the finite automaton for @var{r} is copied once for each repetition. The tables of the analyser may quickly become very large. You should note that the syntax of these expressions does not conflict with the syntax of the macro reference. @item @code{(@var{r})} @dfn{Parentheses}. This expression matches the same lexemes as @var{r}. It is used to override the precedence of the operators. @end table The building operators are listed in order of increasing precedence. The @code{?}, @code{+}, @code{*} and repetition operators have the same precedence. @ignore Or Conc Question Plus Star Power Parentheses @end ignore @cindex Marker @cindex Beginning of line marker @findex ^ @cindex End of line marker @findex $ @cindex End of file marker @findex <<EOF>> @cindex Error marker @findex <<ERROR>> The remaining ``expressions'' would better be called @dfn{markers}. They all match the empty lexeme but require certain conditions to be respected in the input. They cannot be used in all regular expressions. Suppose that @var{r} is a regular expression without markers. @table @asis @item @code{^@var{r}} @itemx @code{@var{r}$} @dfn{Beginning and end of line}. These markers require that the lexeme is found at the beginning and at the end of the line, respectively. The markers lose their special meaning if they are not placed at their end of the regular expression or if they are used in the first part of the specification file. In those cases, they are treated as regular characters. @item @code{<<EOF>>} @dfn{End of file}. This marker is matched only when the input is at the end of file. The marker must be used alone in its pattern, and only in the second part of the file. There can be at most one rule with this particular pattern. @item @code{<<ERROR>>} @dfn{Error}. This marker is matched only when there is a parsing error. It can be used under the same conditions as @code{<<EOF>>}. @ignore Caret Dollar End of file Error @end ignore @end table @cindex White space in regular expressions White space ends the regular expressions. In order to include white space in a regular expression, it must be protected by a backslash or placed in a string. @ignore Ended with white spaces Examples @end ignore @node Sample file, , Regular expressions, Syntax @section An example of a specification file Here is an example of a SILex specification file. The file is syntactically correct from the SILex point of view. However, many common mistakes are shown. The file is not a useful one. @example ; This is a syntactically correct but silly file. partial hel complete @{partial@}lo ; @r{Backward macro ref. only} digit [0-9] letter [a-zA-Z] %% -?@{digit@}+ (cons 'integer yytext) ; @r{@code{yytext} contains} ; @r{the lexeme} -?@{digit@}+\.@{digit@}+[eE][-+]?@{digit@}+ (cons ; @r{A long action} 'float yytext) ; (list 'semicolon) ; @r{Probably a mistake} begin )list 'begin( ; @r{No error detected here} end ; @r{The action is optional} \73 (list 'bell-3) ; @r{It does not match the} ; @r{char. # 7 followed by @samp{3}} \0073 (list 'bell-3) ; @r{Neither does it} (\7)3 (list 'bell-3) ; @r{This does it} "*()+|@{@}[].? are ordinary but \" and \\ are special" [^\n] (list 'char) ; @r{Same thing as @samp{.}} (@{letter@}|_)(@{letter@}|_|@{digit@})* ; @r{A C identifier} [][] ; @r{One of the square brackets} Repe(ti)@{2@}on (list 'repetition) ^@{letter@}+: (cons 'label yytext) ; @r{A label placed at the} ; @r{beginning of the line} $^ ; @r{No special meaning} <<EOF>> (list 'eof) ; @r{Detection of the end of file} <<ERROR>> (my-error) ; @r{Error handling} @end example @ignore Subset of Scheme(?) Example of \73, \0073, (\7)3 @end ignore @c ---------- 3rd: semantics of the specification file ---------- @node Semantics, Generating, Syntax, Top @c Node, Next, Prev, Up @chapter Semantics of the specification file @cindex Semantics of the specification file An important part of the semantics of a specification file is described with the syntax of the regular expressions. The remainder is presented here. We begin with the role of the actions. Information on the matching method follows. @menu * Action:: What does an action. * Matching rules:: When does a regular expression matches the input. @end menu @node Action, Matching rules, Semantics, Semantics @section Evaluation of the actions @findex yycontinue @findex yygetc @findex yyungetc @findex yytext @findex yyline @findex yycolumn @findex yyoffset @cindex Skipping a lexeme @cindex Getting characters @cindex Ungetting characters @cindex Lexeme @cindex Line number @cindex Column number @cindex Offset @cindex Default action The action of a rule is evaluated when the corresponding pattern is matched. The result of its evaluation is the result that the lexical analyser returns to its caller. There are a few local variables that are accessible by the action when it is evaluated. Those are @code{yycontinue}, @code{yygetc}, @code{yyungetc}, @code{yytext}, @code{yyline}, @code{yycolumn} and @code{yyoffset}. Each one is described here: @table @code @item yycontinue This variable contains the lexical analysis function itself. Use @code{(yycontinue)} to ask for the next token. Typically, the action associated with a pattern that matches white space is a call to @code{yycontinue}; it has the effect of skipping the white space. @item yygetc @itemx yyungetc These variables contain functions to get and unget characters from the input of the analyser. They take no argument. @code{yygetc} returns a character or the symbol @samp{eof} if the end of file is reached. They should be used to read characters instead of accessing directly the input port because the analyser may have read more characters in order to have a look-ahead. It is incorrect to try to unget more characters than has been gotten since @emph{the parsing of the last token}. If such an attempt is made, @code{yyungetc} silently refuses. @item yytext This variable is bound to a string containing the lexeme. This string is guaranteed not to be mutated. The string is created only if the action `seems' to need it. The action is considered to need the lexeme when @samp{yytext} appears somewhere in the text of the action. @item yyline @itemx yycolumn @itemx yyoffset These variables indicate the position in the input at the beginning of the lexeme. @code{yyline} is the number of the line; the first line is the line 1. @code{yycolumn} is the number of the column; the first column is the column 1. It is important to mention that characters such as the tabulation generate a variable length output when they are printed. So it would be more accurate to say that @code{yycolumn} is the number of the first character of the lexeme, starting at the beginning of the line. @code{yyoffset} indicates the distance from the beginning of the input; the first lexeme has offset 0. The three variables may not all be existant depending on the kind of counting you want the analyser to do for you (@pxref{Counters}). @end table There is a default action that is provided for a rule when its action is omitted. If the pattern is @samp{<<EOF>>}, the default action returns the object @samp{(0)}. If the pattern is @samp{<<ERROR>>}, the default action displays an error message and returns the symbol @samp{error}@footnote{Note that there is no portable way for the analyser to end the execution of the program when an error occurs.}. The default action for the other patterns is to call the analyser again. It is clearer (and normally more useful) to specify explicitly the action associated with each rule. @ignore An action is executed when its corresp. regexp is matched Environment of the actions yycontinue, yygetc, yyungetc, yytext, yyline, yycolumn, yyoffset yycolumn = number of the character (cause: tabs) Default actions @end ignore @node Matching rules, , Action, Semantics @section Matching the rules @cindex Matching method @cindex Matching conflict @cindex Conflict between patterns @cindex Interactive analyser Each time the analyser is asked to return a token, it tries to match a prefix of the input with a pattern. There may be more than one possible match; when it is the case, we say there is a conflict. For example, suppose we have those regular expressions: @example begin [a-z]* @end example @noindent and the input is @samp{beginning1 @r{@dots{}}}. We have a match with the first expression and we have many different matches with the second. To resolve such a conflict, the longest match is chosen. So the chosen match is the one between the lexeme @samp{beginning} and the second expression. Suppose we have the same regular expressions but the input is @samp{begin+ @r{@dots{}}}. We have @emph{two} longest match. This conflict is resolved by choosing the first pattern that allows a longest match. So the chosen match is between the lexeme @samp{begin} and the first pattern. The analyser generated by SILex allows the empty lexeme to be matched if there is no longer match. However, you should take care not to call the analyser again without consuming at least one character of the input. It would cause an infinite loop. The pattern @samp{<<EOF>>} is matched when the analyser is called and the input is at end of file. In this situation, the marker is matched even if there is a pattern that matches the empty lexeme. The analyser can be called again and again and the @samp{<<EOF>>} pattern will be matched each time, causing its corresponding action to be evaluated each time, too. The pattern @samp{<<ERROR>>} is matched when the input is not at end of file and no other match is possible. Depending on the action associated with this pattern, your program may choose to stop or choose to try to recover from the error. To recover from the error, your program has to read some characters from the input before it can call the analyser again. All lexical analysers generated by SILex are interactive. That is, they read as few characters as possible to get the longest match. This is a useful property when the input is coming from a terminal. A lexical analyser is normally based on a finite automaton; it is the case for the analysers generated by SILex. A non-interactive analyser always needs an extra character to provoke an invalid transition in the automaton. The longest match is detected this way. With an interactive analyser, an extra character is not required when it is impossible to obtain a longer match. A lexical analyser generated by SILex does not impose any @i{a priori} limit on the size of the lexemes. The internal buffer is extended each time it is necessary. @ignore Longest prefix of the input, first matching rule Warning for matching empty string ^ & $ anchors End of file anchor Error anchor Interactive matching The lexeme is not limited to a certain length @end ignore @c ---------- 4th: generating a lexical analyser ---------- @node Generating, Interface, Semantics, Top @c Node, Next, Prev, Up @chapter Generating and using a lexical analyser @cindex Generating a lexical analyser @cindex Using a lexical analyser The most common use of SILex is to generate a single complete lexical analyser. In some situations however, it is preferable to only generate the tables describing the analysers and leaving to the program to build complete analysers at run time. It is the case when the program has to parse many files simultaneously with the same analyser; and when a file is to be parsed using many different analysers. After the description of the two modes, we describe the SILex options and the different input methods. @ignore One or many analysers Options Different input methods @end ignore @menu * One analyser:: Generating and using one complete analyser. * Many analysers:: Dynamic creation of analysers. * Options:: Line counting, table encoding. * Input:: Input from a port, a string or a function. @end menu @node One analyser, Many analysers, Generating, Generating @section One complete analyser The function @code{lex} generates a complete lexical analyser. We first describe its parameters. Then the interface with the generated analyser is presented. @menu * Lex:: The @code{lex} command. * Functions:: The functions in the lexical analyser. * Usage:: Using the lexical analyser. @end menu @node Lex, Functions, One analyser, One analyser @subsection The @code{lex} command @findex lex Here is the template of a call to @code{lex}: @noindent @code{(lex @var{input-file} @var{output-file} [@var{options} @r{@dots{}}])} @noindent @var{input-file} is a string containing the name of the specification file. @var{output-file} is a string containing the name of the file in which the lexical analyser is written. For a description of the options, see @ref{Options}. This is an example of a call to @code{lex}: @example (lex "pascal.l" "pascal.l.scm") @end example @ignore Invocation of lex @end ignore @node Functions, Usage, Lex, One analyser @subsection The functions in the lexical analyser @findex lexer @findex lexer-get-line @findex lexer-get-column @findex lexer-get-offset @findex lexer-getc @findex lexer-ungetc @findex lexer-init @cindex Name convention The file generated by @code{lex} contains a few global definitions. A program using the analyser needs only the following functions: @code{lexer}, @code{lexer-get-line}, @code{lexer-get-column}, @code{lexer-get-offset}, @code{lexer-getc}, @code{lexer-ungetc} and @code{lexer-init}. @table @code @item lexer The lexical analysis function. @item lexer-get-line @itemx lexer-get-column @itemx lexer-get-offset Functions to obtain the current position in the input. @item lexer-getc @itemx lexer-ungetc Reading and returning characters. These functions have the advantage of being accessible from outside the actions. @item lexer-init Initializing the analyser with the input source. @end table To avoid name conflicts, these variables and others that we did not mention all begin with @samp{lexer@r{@dots{}}}. @ignore List of variables Name convention (lexer...) @end ignore @node Usage, , Functions, One analyser @subsection Using the lexical analyser @cindex Initialization of the analyser @cindex Token The first function that must be called is the initialization function. It is necessary to give to the analyser its source of characters. Here is the template of a call to this function: @noindent @code{(lexer-init @var{input-type} @var{input})} @noindent The values @var{input-type} and @var{input} are described in @ref{Input}. Once the initialization is done, the program can get @dfn{tokens} from the analyser by calling the lexical analysing function: @example (lexer) @end example @noindent The token is the result of the evaluation of the action corresponding to the matched pattern. The current position can be obtained with: @example (lexer-get-line) (lexer-get-column) (lexer-get-offset) @end example @noindent As is described in @ref{Options}, some or all of these functions may not be available. Characters can be gotten and ungotten from the input this way: @example (lexer-getc) (lexer-ungetc) @end example @noindent It is important to note that the analyser remembers the characters previously gotten. Your program does not have to keep those itself. Even after the end of file has been reached or an error has occured, the @code{lexer} function can be called again. Its behavior depends on the remaining characters in the input. The analyser can be reinitialized in any time with a new input. @ignore How to use it Can be called many times at end-of-file @end ignore @node Many analysers, Options, One analyser, Generating @section Many analysers There are applications where it is necessary to have more than one lexical analyser parsing more than one file at a time. For example: @itemize @minus @item The parsing of a C file (with cpp) may cause the parsing of other files recursively because of the @code{#include} commands. @item An interactive compiler has to be able to compile a file without closing the communication with the standard input. @item SILex itself parses the macro names, the regular expressions, the interior of a string, @dots{}, with different sets of patterns. @end itemize We first begin with an overview on how SILex allows the programmer to create multiple lexical analysers. We continue with a description of the function @code{lex-tables}. We end the explanations with the functions used to creat analysers dynamically. @menu * Dynamic style:: It is possible to parse many files with many analysers. * Lex-tables:: The @code{lex-tables} command. * Usage2:: Building and using lexical analysers dynamically. @end menu @node Dynamic style, Lex-tables, Many analysers, Many analysers @subsection Creating analysers dynamically @cindex Dynamic creation of analysers @cindex Input system It is quite easy to create new analysers at run-time. Suppose there is an input that you want to analyse. There are just two steps to make. @itemize @bullet @item Create an @dfn{input system} from the input. An input system provides the buffering, the line counting and similar low level services. @item Create one or more analysers from the input system and the analyser tables. The tables are generated by the function @code{lex-tables} from a specification file. A table contains all the necessary information to build up an analyser. Normally, you have to use more than one analyser per input when you expect the syntax to vary greatly in the input. @end itemize The following example shows a typical organization for a multi-analyser lexical analysis. Note that one table may have been used to produce many instances of analysers. Those analysers would simply be connected to different input systems@footnote{It would make no sense to create two instances coming from the same table and being connected to the same input system. They would both have exactly the same behavior.}. @example Input1 Input2 Input3 | | | | | | IS1 IS2 IS3 | | | +-------+-------+ | +--+---+ | | | | | | An1.1 An1.2 An1.3 An2.1 An3.1 An3.2 @end example There is no @i{a priori} limit on the number of input systems and analysers that you can create dynamically. @ignore Input systems & dynamic lexical analysers @end ignore @node Lex-tables, Usage2, Dynamic style, Many analysers @subsection The @code{lex-tables} command @findex lex-tables The function @code{lex-tables} produces a table describing an analyser from a specification file. A call to @code{lex-tables} looks like: @noindent @code{(lex-tables @var{input-file} @var{table-name} @var{output-file} [@var{options} @r{@dots{}}])} @noindent @var{input-file} must be a string containing the name of the specification file. @var{output-file} is a string containing the name in which the result is printed. A definition is written in the output file. @var{table-name} must be a string and it is the name appearing in the definition. The options are defined in @ref{Options}. This is an example of a call to @code{lex-tables}: @example (lex-tables "c.l" "c-table" "c.l.scm") @end example @ignore Invocation of lex-tables @end ignore @node Usage2, , Lex-tables, Many analysers @subsection Building and using lexical analysers dynamically @cindex Building an analyser dynamically @pindex multilex.scm @cindex Name convention @findex lexer-make-IS @findex lexer-get-func-line @findex lexer-get-func-column @findex lexer-get-func-offset @findex lexer-get-func-getc @findex lexer-get-func-ungetc @findex lexer-make-lexer In order to be able to create dynamically the analysers the program needs, the files containing the tables and the file @file{multilex.scm} must be loaded as part of the program. The name convention is the following: all definitions in @file{multilex.scm} introduce names beginning with @samp{lexer@r{@dots{}}} and the definitions in the other files introduce names that are specified by the programmer. This way, it is easy to avoid name conflicts. Input systems are created with the function @code{lexer-make-IS}. A call to this function looks like: @noindent @code{(lexer-make-IS @var{input-type} @var{input} [@var{counters}])} @noindent The values @var{input-type} and @var{input} are described in @ref{Input}. The value of @var{counters} determines which counters the input system should maintain. This is discussed in @ref{Input}. Input systems are associative lists that cannot be used directly. Useful functions can be extracted from an input system. The following calls return functions that allows the program to interact with the input system: @example (lexer-get-func-line @var{input-system}) (lexer-get-func-column @var{input-system}) (lexer-get-func-offset @var{input-system}) (lexer-get-func-getc @var{input-system}) (lexer-get-func-ungetc @var{input-system}) @end example Lexical analysers are created with the function @code{lexer-make-lexer}. The template of a call to this function is: @noindent @code{(lexer-make-lexer @var{table} @var{input-system})} @noindent @var{table} is a table generated by SILex. @var{input-system} is the input system from which the analyser will take its input. The result of the call is the analysis function. The analysis function takes no argument and returns tokens. This example summarizes all the step in the creation of an analyser: @example (let* ((my-port (open-input-file "my-file")) (my-IS (lexer-make-IS 'port my-port)) (my-get-line (lexer-get-func-line IS)) (my-get-column (lexer-get-func-column IS)) (my-get-offset (lexer-get-func-offset IS)) (my-getc (lexer-get-func-getc IS)) (my-ungetc (lexer-get-func-ungetc IS)) (my-analyser (lexer-make-lexer my-table IS))) (let loop ((tok (my-analyser))) (cond ((eq? tok 'eof) @r{@dots{}} @end example @ignore File lex-rt.scm How to use it: lex-rt.scm, lexer-make-IS, lexer-make-lexer & cie Can be called many times at end-of-file Name convention (lexer...) @end ignore @node Options, Input, Many analysers, Generating @section Options at generation time @cindex Options We describe the options that can be passed to @code{lex} and @code{lex-tables}. They indicate which counters (line, column and offset) the actions need; which table encoding should be used; and whether the tables should be pretty-printed. @menu * Counters:: Keeping the position in the input. * Tables encoding:: Encodings of the tables of an analyser. * Pretty print:: Pretty printing the tables. @end menu @node Counters, Tables encoding, Options, Options @subsection Line, column and offset counters @cindex Counters @vindex none @vindex line @vindex all There are three different counting modes: no counter, line counter and all counters. The more counters the input system maintains, the more it is slowed down. The default is the line counting. This option is specified when the program calls the functions @code{lex}, @code{lex-tables} and @code{lexer-make-IS}. The three modes are represented by the symbols @samp{none}, @samp{line} and @samp{all}. When one of the first two functions is called the mode must be preceded by the symbol @samp{counters}. These examples illustrate the use of the option: @example (lex "html.l" "html.l.scm" 'counters 'none) (lex-tables "cobol.l" "cobol-table" "cobol.l.scm" 'counters 'line) (lexer-make-IS 'port my-port 'all) @end example You should be careful when you build analysers dynamically. The mode specified at the input system creation must be consistent with the mode specified at the tables creation. @ignore counters @end ignore @node Tables encoding, Pretty print, Counters, Options @subsection Encoding of the table of an analyser @cindex Encoding of the table @vindex portable @vindex code @cindex Portability @cindex Fast analyser SILex provides three different encodings of the tables: the default encoding, the portable encoding and the ``compilation'' to Scheme code. With the default encoding, the finite automaton of the analyser is represented with data structures that contain the @emph{numbers} of the characters (in the sense of @code{char->integer}). Since the numbers associated with the characters may depend on the Scheme implementation, an analyser generated with an implementation can be safely used only with the same implementation. An analyser encoded in the default style is not portable. But this representation is the most compact. With the portable encoding, the data structures describing the automaton contain characters directly. If the automaton, as generated, contains a transition from state @var{s} to state @var{t} on character @var{c}, then somewhere in the table there is the Scheme character @samp{#\@var{c}}. When the file containing the analyser is loaded in any implementation, the character is read as is, and not as the number @samp{(char->integer #\@var{c})} as evaluated by the original implementation. As long as the implementation using the analyser recognizes the characters mentionned in it, there is no problem. So this encoding is portable. However, it is less compact. This is because something like @samp{(65 90)} is more compact than something like @samp{(#\A #\B @r{@dots{}} #\Y #\Z)} to represent @samp{[A-Z]}. The construction of an analyser from a portable table takes more time than the construction from a default table. But, once built, the performance of the analyser is the same in both cases. It is important to note that in some character sets, the letters or the digits are not contiguous. So, in those cases, the regular expression @samp{[A-Z]} does not necessarily accept only the uppercase letters. The last encoding is the compilation to Scheme code. This produces a fast lexical analyser. Instead of containing data structures representing the behavior of the automaton, the table contains Scheme code that ``hard-codes'' the automaton. This encoding often generates big tables. Such an analyser is not portable. The encoding of the tables can be specified as an option when @code{lex} and @code{lex-tables} are called. The symbols @samp{portable} and @samp{code} are used to specify that the table must be portable and that the table must be compiled, respectively. For example, these calls illustrate the use of the options: @example (lex "c.l" "c.l.scm") ; @r{Default encoding} (lex "c.l" "c.l.scm" 'portable) ; @r{Portable encoding} (lex "c.l" "c.l.scm" 'code) ; @r{Compilation of the automaton} @end example @ignore portable / code @end ignore @node Pretty print, , Tables encoding, Options @subsection Pretty printing the tables @cindex Pretty-printing the tables The pretty-print option (specified with the symbol @samp{pp}) tells SILex to pretty-print the contents of the table. Normally, the table is displayed as a compact mass of characters fitting in about 75 columns. The option is useful only for a developer of SILex. The Scheme code generated with the @samp{code} option is always pretty-printed. @ignore pp @end ignore @node Input, , Options, Generating @section Input methods @cindex Input @cindex Input port, input from an @cindex String, input from a @cindex Function, input from a An analyser can take its input from three different objects: an input port, a string or a function. The type of input and the input itself must be passed when an analyser is initialized and when an input system is created. The input type is specified using one of the three symbols: @samp{port}, @samp{string} or @samp{procedure}. For example: @example (lexer-init 'port (current-input-port)) (lexer-make-IS 'string "Input string.") @end example When an input port is used by an analyser, the program should avoid reading characters directly from the port. This is because the analyser may have needed a look-ahead to do the analysis of the preceding token. The program would not find what it expects on the port. The analyser provides safe functions to get characters from the input. The analyser never closes itself the port it has received, this task is left to the program. When the analyser is initialized with a string, it takes a copy of it. This way, eventual mutations of the string do not affect the analysis. The use of a function as character source allows the analyser to parse any character stream, no matter how it is obtained. For example, the characters may come from the decompression or decryption of a huge file, the task being done lazily in order to save space. The function must take no argument and return a character each time it is called. When the end of file (or its logical equivalent) is reached, the function must return an object that is not a character (for example, the symbol @samp{eof}). After the function has returned an end of file indicator, it is not called again. @ignore port / string / function Copy of the string The function at end-of-file is not called again @end ignore @c ---------- Interfacing with lalr.scm ---------- @node Interface, Acknowledgements, Generating, Top @c Node, Next, Prev, Up @appendix Interfacing with an @sc{lalr}(1) parser @cindex Dominique Boucher @cindex @sc{lalr}(1) parser generator A nice @sc{lalr}(1) parser generator for Scheme has been written by Dominique Boucher. The generator is accessible at the Scheme Repository at @code{ftp://ftp.cs.indiana.edu} in the file @file{/pub/scheme-repository/code/lang/lalr-scm.tar.gz}. The parsers that are generated need two functions to operate: a lexical analysis function and an error function. The analysis function must take no argument and return a token each time it is called. This is exactly the behavior of the lexical analysis functions created by SILex. The @sc{lalr}(1) parsers expect that the tokens are pairs with a number in the @sc{car}, the token number, and any value in the @sc{cdr}, the token attribute. It is easy to respect this convention with a SILex lexical analyser since the actions can be any Scheme expressions. Furthermore, the file created by the @sc{lalr}(1) parser generator contains definitions that give names to the number of the tokens. A lexical analyser can use those names in its actions in order to simplify the coordination between the two analysers. @c ---------- Acknowledgements ---------- @node Acknowledgements, Index, Interface, Top @c Node, Next, Prev, Up @chapheading Acknowledgements I would like to thank my comrades of the laboratory for their support in this project. Especially Martin Larose and Marc Feeley for their numerous suggestions. I hope SILex will be useful for many Scheme programmers. If you find a bug, please let me know at @code{mailto:dube@@iro.umontreal.ca}. @c ---------- Index & tables of contents ---------- @node Index, , Acknowledgements, Top @c Node, Next, Prev, Up @unnumbered Index @printindex cp @contents @bye Memos: Verifier si des trucs comme #\^L sont portables