\input texinfo.tex   @c -*-texinfo-*-
@setfilename silex.info
@settitle SILex
@setchapternewpage on
@footnotestyle end
@paragraphindent 3

@syncodeindex fn cp
@syncodeindex vr cp
@syncodeindex ky cp
@syncodeindex pg cp
@syncodeindex tp cp


@c ---------- Info copyright ----------
@ifinfo
        This file documents the version 1.0 of SILex, a Scheme
Implementation of Lex.

Copyright @copyright{} 2001  Danny Dub@'e

This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
@end ifinfo


@c ---------- Title & copyright pages (printed) ----------
@titlepage
@title SILex
@subtitle A Scheme Implementation of Lex
@subtitle Documentation for SILex version 1.0
@author Danny Dub@'e

@page
@vskip 0pt plus 1filll
Copyright @copyright{} 2001  Danny Dub@'e.

        This is the first edition of the SILex documentation.  It
documents the version 1.0 of SILex.

This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
@end titlepage
@headings double


@c ---------- Top node ----------
@ifinfo
@node Top,  Overview, (dir), (dir)
@c    Node, Next,     Prev,  Up
@top

        This document describes SILex.  ``SILex'' stands for ``Scheme
Implementation of Lex''.  It generates a Scheme lexical analyser from a
Lex-like specification file.

        This document is the first edition and describes SILex version
1.0.

@menu
* Overview::     A general description of SILex.
* Syntax::       The look of a specification file.
* Semantics::    The meaning of a specification file.
* Generating::   How to generate and use lexical analysers.
* Interface::    With a Scheme LALR(1) parser generator.
* Index::        Concepts and commands relating to SILex.
* Acknowledgements:: 
@end menu
@end ifinfo


@c ---------- 1st: overview ----------
@node Overview, Syntax, Top,  Top
@c    Node,     Next,   Prev, Up
@chapter Overview

        SILex is a lexical analyser generator similar to the Lex and
Flex programs, but for Scheme.  ``SILex'' stands for ``Scheme
Implementation of Lex''.

        SILex has many similarities with the C programs, but has many
differences, too.  The syntax of the specification files for SILex is
close to that of Lex and Flex.  Of course, the actions must be written
in Scheme and not in C.  The set of regular expressions is mostly the
same.  An important difference is relative to the multiple start states
in the C analysers.  SILex replaces them by allowing multiple analysers
to take their input from the same source.  Different inputs can be
analysed at the same time, possibly with different instances of one or
more lexical analysers.  The analysers are created dynamically.

        SILex provides many other features.  The designer of a lexical
analyser can specify the actions to be taken when the end of file is
reached or when an error occurs.  The analyser can keep track of the
position in the input in terms of the number of the line, column and
offset.  An analyser can take its input from an input port, a string or
a function.  SILex is portable; it does not depend on a particular
character set.  It can generate analysers that are portable, too.
Finally, the table encoding the behavior of the analyser can be compiled
to Scheme code.  The fastest lexical analysers can be produced this way.

@ignore
Lex-like
Scheme program
Driver for Scheme
One or more Scheme actions
Facilities for error handling
Line & column counting
Multiple inputs
Multiple scanners per input
Different input methods
Different table codings
Portable
@end ignore


@c ---------- 2nd: syntax of a specification file ----------
@node Syntax, Semantics, Overview, Top
@c    Node,   Next,      Prev,     Up
@chapter Syntax of the specification file

@cindex Syntax of the specification file
@cindex Specification file
@cindex Comment
@cindex White space

        A specification file for a lexical analyser contains two parts:
the @dfn{macro definitions part} and the @dfn{rules part}.  The two
parts are separated by the mark @code{%%}.  The first part is used to
define @dfn{macros}; that is, to give names to some regular expressions.
The second part is used to indicate the regular expressions with which
the input will have to match and the @dfn{actions} associated with each
expression.

        Comments can be inserted any place where white space is allowed
and is considered as white space itself.  The syntax of the comments is
the same as in Scheme.  That is, it begins with a semicolon @samp{;} and
extends up to the end of a line.  The semicolon is a valid token in many
languages, so you should take care not to comment out an entire line
when you write a regular expression matching a semicolon.

        The syntax of each part is presented, except for the regular
expressions, which are described apart.  A small example follows.

@ignore
Macros part %% rules part
Comments & whitespace
@end ignore

@menu
* Macros part::           Syntax of the macro definitions.
* Rules part::            Syntax of the rule-action pairs.
* Regular expressions::   How to build regular expressions.
* Sample file::           Shows some frequent mistakes.
@end menu

@node Macros part, Rules part, Syntax, Syntax
@section Macro definitions part

@cindex Macro
@cindex Macro definitions part
@cindex Scope of a macro definition

        The first part of a specification file contains zero or more
macro definitions.  A definition consists of a name and a regular
expression, separated by white space.  It looks better when each
definition is written on a separate line.

        The syntax for a macro name is that of a Scheme symbol.  The
case of the letters is not significant.  For example, @code{abcd},
@code{+}, @code{...}, @code{Digit} and @code{digit} are all valid macro
names; the last two being the same.  You cannot write two macro
definitions with the same name.

        The defined macro can be referenced in regular expressions using
the syntax @code{@{@var{name}@}} (@pxref{Regular expressions}).  The
scope of a macro definition includes the remaining definitions and the
rules part of the file.  It is analogous to the @code{let*} is Scheme,
where the macro definitions correspond to the bindings and the rules
part correspond to the body.

        End the macro definitions part with @code{%%}.

@ignore
Names = Scheme symbols
Case insensitive
End with %%
Order of macros
@end ignore

@node Rules part, Regular expressions, Macros part, Syntax
@section Rules part

@cindex Rules part
@cindex Pattern
@cindex Action
@cindex Indentation in actions

        The rules part contains the rules up to the end of the
specification file.  Each rule is a @dfn{pattern} optionally followed by
an @dfn{action}.  The pattern is a regular expression.  The action, if
there is one, is formed of one or more Scheme expressions.

        The actions can span over several lines.  To distinguish between
the remaining of the current action and the start of a new rule, SILex
checks the indentation.  A new rule must start at the beginning of the
line.  That is, the action starts right after the pattern and contains
all the following lines that start with white space.

        SILex does not parse the actions.  It simply captures the text
up to the start of the next rule.  So a syntax error in an action is not
detected by SILex.

        Nevertheless, SILex is able to detect that an action has been
omitted.  In that case, a default action is supplied.

@ignore
Action = one or more Scheme expressions
Action are taken verbatim
Indentation is significant
Default actions
End of file
@end ignore

@node Regular expressions, Sample file, Rules part, Syntax
@section Regular expressions

@cindex Regular expression
@cindex Atomic regular expression
@cindex Ordinary character
@cindex Dot
@cindex Wild card
@findex .
@cindex Backslash
@cindex Protecting a character
@findex \n
@findex \@var{integer}
@findex \@var{c}
@cindex Macro reference
@findex @{@var{name}@}
@cindex String
@findex "@var{some text}"
@cindex Character class
@findex [@var{list of characters}]

        We first describe the atomic regular expressions.  Then, we show
how to build more complex regular expressions from simpler ones.
Finally, the markers are introduced.

        The following constructs are regular expressions:

@table @asis
@item @code{@var{c}}
@dfn{Ordinary character}.  It is a regular expression that matches the
character @var{c} itself.  @var{c} cannot be one of @samp{.}, @samp{\},
@samp{@{}, @samp{"}, @samp{[}, @samp{|}, @samp{?}, @samp{+}, @samp{*},
@samp{(}, @samp{)}, @samp{^}, @samp{$}, @samp{;} or any white space.

@item @code{.}
@dfn{Wild card}.  It matches any character except the newline character.

@item @code{\n}
@itemx @code{\@var{integer}}
@itemx @code{\@var{c}}
@dfn{Backslash}.  The backslash is used for two things: protect a
character from special meaning; generating non-printable characters.
The expression @code{\n} matches the newline character.  The expression
@code{\@var{integer}} matches the character that has number
@var{integer} (in the sense of @code{char->integer}).  @var{integer}
must be a valid character number on the implementation that you use.  It
may be more than 3 digits long and even negative@footnote{The Scheme
standards do not impose a particular character set, such as @sc{ascii}.
The only requirement is that the function @code{char->integer} returns
an integer.}.  The expression @code{\@var{c}} matches the character
@var{c} if @var{c} is not @samp{n}, @samp{-} nor a digit.

@item @code{@{@var{name}@}}
@dfn{Macro reference}.  This expression matches the same lexemes as
those matched by the regular expression named @var{name}.  You can
imagine that the reference is replaced by the text of the named
expression.  However, it works as if parentheses had been added to
protect the substituting expression.

@item @code{"@var{some text}"}
@dfn{String}.  A string matches a lexeme identical to its contents.  In
a string, the only special characters are @samp{"}, which closes the
string, and @samp{\} which keeps the effect mentioned above.

@item @code{[@var{list of characters}]}
@itemx @code{[]@var{list of characters}]}
@itemx @code{[-@var{list of characters}]}
@itemx @code{[^@var{list of characters}]}
@dfn{Character class}.  The expression matches one of the enumerated
characters.  For example, the expression @samp{[abc]} matches one of
@samp{a}, @samp{b} and @samp{c}.  You can list a range of characters by
writing the first character, the @samp{-} and the last character.  For
example, @samp{[A-Za-z]} matches one letter (if the letters are ordered
and contiguous in the character set used by your implementation).  The
special characters in a class are @samp{]}, which closes the class,
@samp{-}, which denotes a range of character, and @samp{\}, which keeps
its usual meaning.  There is an exception with the first character in a
class.  If the first character is @samp{]} or @samp{-}, it loses its
special meaning.  If the first character is @samp{^}, the expression
matches one character if it is @emph{not} enumerated in @var{list of
characters}.

@ignore
Ordinary character
Dot
Backslash: with n, with an integer (finir les chiffres), otherwise
Macro reference
String
Character class
@end ignore
@end table

@cindex Union of regular expressions
@cindex Alternatives
@findex |
@cindex Concatenation of regular expressions
@cindex Optional regular expression
@findex ?
@cindex Closure of a regular expression
@cindex Positive closure
@findex +
@cindex Kleene closure
@findex *
@cindex Repetition of a regular expression
@findex @{@var{i},@var{j}@}
@cindex Overriding the precedence
@cindex Grouping regular expressions
@cindex Precedence
@findex ( )

        Suppose @var{r} and @var{s} are regular expressions.  Then the
following expressions can be built:

@table @asis
@item @code{@var{r}|@var{s}}
@dfn{Union}.  This regular expression matches a lexeme if the lexeme is
matched by @var{r} or by @var{s}.

@item @code{@var{r}@var{s}}
@dfn{Concatenation}.  This expression matches a lexeme if the lexeme can
be written as the concatenation of a lexeme matched by @var{r} and a
lexeme matched by @var{s}.

@item @code{@var{r}?}
@dfn{Optional expression}.  A lexeme matches this expression if it is
the empty lexeme or if it matches @var{r}.

@item @code{@var{r}+}
@dfn{Positive closure}.  This expression matches a lexeme that can be
written as the concatenation of one or more lexemes, where each of those
matches @var{r}.

@item @code{@var{r}*}
@dfn{Kleene closure}.  A lexeme is matched by this expression if it can
be written as the concatenation of zero or more lexemes, where each of
those matches @var{r}.

@item @code{@var{r}@{@var{i}@}}
@itemx @code{@var{r}@{@var{i},@}}
@itemx @code{@var{r}@{@var{i},@var{j}@}}
@dfn{Power or repetition of an expression}.  These expressions allow the
``repetition'' of a regular expression a certain number of times.
@var{i} and @var{j} must be positive integers and @var{j} must be
greater or equal to @var{i}.  The first form repeats the expression
@var{r} exactly @var{i} times.  The second form repeats @var{r} at least
@var{i} times.  The last form repeats @var{r} at least @var{i} times and
at most @var{j} times.  You should avoid using large numbers (more than
10), because the finite automaton for @var{r} is copied once for each
repetition.  The tables of the analyser may quickly become very large.
You should note that the syntax of these expressions does not conflict
with the syntax of the macro reference.

@item @code{(@var{r})}
@dfn{Parentheses}.  This expression matches the same lexemes as @var{r}.
It is used to override the precedence of the operators.
@end table

        The building operators are listed in order of increasing
precedence.  The @code{?}, @code{+}, @code{*} and repetition operators
have the same precedence.

@ignore
Or
Conc
Question
Plus
Star
Power
Parentheses
@end ignore

@cindex Marker
@cindex Beginning of line marker
@findex ^
@cindex End of line marker
@findex $
@cindex End of file marker
@findex <<EOF>>
@cindex Error marker
@findex <<ERROR>>

        The remaining ``expressions'' would better be called
@dfn{markers}.  They all match the empty lexeme but require certain
conditions to be respected in the input.  They cannot be used in all
regular expressions.  Suppose that @var{r} is a regular expression
without markers.

@table @asis
@item @code{^@var{r}}
@itemx @code{@var{r}$}
@dfn{Beginning and end of line}.  These markers require that the lexeme
is found at the beginning and at the end of the line, respectively.  The
markers lose their special meaning if they are not placed at their end
of the regular expression or if they are used in the first part of the
specification file.  In those cases, they are treated as regular
characters.

@item @code{<<EOF>>}
@dfn{End of file}.  This marker is matched only when the input is at the
end of file.  The marker must be used alone in its pattern, and only in
the second part of the file.  There can be at most one rule with this
particular pattern.

@item @code{<<ERROR>>}
@dfn{Error}.  This marker is matched only when there is a parsing error.
It can be used under the same conditions as @code{<<EOF>>}.

@ignore
Caret
Dollar
End of file
Error
@end ignore
@end table

@cindex White space in regular expressions

        White space ends the regular expressions.  In order to include
white space in a regular expression, it must be protected by a backslash
or placed in a string.

@ignore
Ended with white spaces
Examples
@end ignore

@node Sample file, , Regular expressions, Syntax
@section An example of a specification file

        Here is an example of a SILex specification file.  The file is
syntactically correct from the SILex point of view.  However, many
common mistakes are shown.  The file is not a useful one.

@example
; This is a syntactically correct but silly file.

partial     hel
complete    @{partial@}lo            ; @r{Backward macro ref. only}
digit       [0-9]
letter      [a-zA-Z]

%%

-?@{digit@}+    (cons 'integer yytext)   ; @r{@code{yytext} contains}
                                       ; @r{the lexeme}
-?@{digit@}+\.@{digit@}+[eE][-+]?@{digit@}+
              (cons                ; @r{A long action}
               'float
               yytext)

;             (list 'semicolon)    ; @r{Probably a mistake}

begin         )list 'begin(        ; @r{No error detected here}
end                                ; @r{The action is optional}

\73           (list 'bell-3)       ; @r{It does not match the}
                                   ; @r{char. # 7 followed by @samp{3}}
\0073         (list 'bell-3)       ; @r{Neither does it}
(\7)3         (list 'bell-3)       ; @r{This does it}

"*()+|@{@}[].? are ordinary but \" and \\ are special"

[^\n]         (list 'char)         ; @r{Same thing as @samp{.}}
(@{letter@}|_)(@{letter@}|_|@{digit@})*  ; @r{A C identifier}
[][]                               ; @r{One of the square brackets}

Repe(ti)@{2@}on   (list 'repetition)

^@{letter@}+:   (cons 'label yytext) ; @r{A label placed at the}
                                   ; @r{beginning of the line}
$^                                 ; @r{No special meaning}
<<EOF>>       (list 'eof)          ; @r{Detection of the end of file}
<<ERROR>>     (my-error)           ; @r{Error handling}
@end example

@ignore
Subset of Scheme(?)
Example of \73, \0073, (\7)3
@end ignore


@c ---------- 3rd: semantics of the specification file ----------
@node Semantics, Generating, Syntax, Top
@c    Node,      Next,       Prev,   Up
@chapter Semantics of the specification file

@cindex Semantics of the specification file

        An important part of the semantics of a specification file is
described with the syntax of the regular expressions.  The remainder is
presented here.  We begin with the role of the actions.  Information on
the matching method follows.

@menu
* Action::           What does an action.
* Matching rules::   When does a regular expression matches the input.
@end menu

@node Action, Matching rules, Semantics, Semantics
@section Evaluation of the actions

@findex yycontinue
@findex yygetc
@findex yyungetc
@findex yytext
@findex yyline
@findex yycolumn
@findex yyoffset
@cindex Skipping a lexeme
@cindex Getting characters
@cindex Ungetting characters
@cindex Lexeme
@cindex Line number
@cindex Column number
@cindex Offset
@cindex Default action

        The action of a rule is evaluated when the corresponding pattern
is matched.  The result of its evaluation is the result that the lexical
analyser returns to its caller.

        There are a few local variables that are accessible by the
action when it is evaluated.  Those are @code{yycontinue},
@code{yygetc}, @code{yyungetc}, @code{yytext}, @code{yyline},
@code{yycolumn} and @code{yyoffset}.  Each one is described here:

@table @code
@item yycontinue
This variable contains the lexical analysis function itself.  Use
@code{(yycontinue)} to ask for the next token.  Typically, the action
associated with a pattern that matches white space is a call to
@code{yycontinue}; it has the effect of skipping the white space.

@item yygetc
@itemx yyungetc
These variables contain functions to get and unget characters from the
input of the analyser.  They take no argument.  @code{yygetc} returns a
character or the symbol @samp{eof} if the end of file is reached.  They
should be used to read characters instead of accessing directly the
input port because the analyser may have read more characters in order
to have a look-ahead.  It is incorrect to try to unget more characters
than has been gotten since @emph{the parsing of the last token}.  If
such an attempt is made, @code{yyungetc} silently refuses.

@item yytext
This variable is bound to a string containing the lexeme.  This string
is guaranteed not to be mutated.  The string is created only if the
action `seems' to need it.  The action is considered to need the lexeme
when @samp{yytext} appears somewhere in the text of the action.

@item yyline
@itemx yycolumn
@itemx yyoffset
These variables indicate the position in the input at the beginning of
the lexeme.  @code{yyline} is the number of the line; the first line is
the line 1.  @code{yycolumn} is the number of the column; the first
column is the column 1.  It is important to mention that characters such
as the tabulation generate a variable length output when they are
printed.  So it would be more accurate to say that @code{yycolumn} is
the number of the first character of the lexeme, starting at the
beginning of the line.  @code{yyoffset} indicates the distance from the
beginning of the input; the first lexeme has offset 0.  The three
variables may not all be existant depending on the kind of counting you
want the analyser to do for you (@pxref{Counters}).
@end table

        There is a default action that is provided for a rule when its
action is omitted.  If the pattern is @samp{<<EOF>>}, the default action
returns the object @samp{(0)}.  If the pattern is @samp{<<ERROR>>}, the
default action displays an error message and returns the symbol
@samp{error}@footnote{Note that there is no portable way for the
analyser to end the execution of the program when an error occurs.}.
The default action for the other patterns is to call the analyser again.
It is clearer (and normally more useful) to specify explicitly the
action associated with each rule.

@ignore
An action is executed when its corresp. regexp is matched
Environment of the actions
yycontinue, yygetc, yyungetc, yytext, yyline, yycolumn, yyoffset
yycolumn = number of the character (cause: tabs)
Default actions
@end ignore

@node Matching rules, , Action, Semantics
@section Matching the rules

@cindex Matching method
@cindex Matching conflict
@cindex Conflict between patterns
@cindex Interactive analyser

        Each time the analyser is asked to return a token, it tries to
match a prefix of the input with a pattern.  There may be more than one
possible match; when it is the case, we say there is a conflict.  For
example, suppose we have those regular expressions:

@example
begin
[a-z]*
@end example

@noindent
and the input is @samp{beginning1 @r{@dots{}}}.  We have a match with
the first expression and we have many different matches with the second.
To resolve such a conflict, the longest match is chosen.  So the chosen
match is the one between the lexeme @samp{beginning} and the second
expression.

        Suppose we have the same regular expressions but the input is
@samp{begin+ @r{@dots{}}}.  We have @emph{two} longest match.  This
conflict is resolved by choosing the first pattern that allows a longest
match.  So the chosen match is between the lexeme @samp{begin} and the
first pattern.

        The analyser generated by SILex allows the empty lexeme to be
matched if there is no longer match.  However, you should take care not
to call the analyser again without consuming at least one character of
the input.  It would cause an infinite loop.

        The pattern @samp{<<EOF>>} is matched when the analyser is
called and the input is at end of file.  In this situation, the marker
is matched even if there is a pattern that matches the empty lexeme.
The analyser can be called again and again and the @samp{<<EOF>>}
pattern will be matched each time, causing its corresponding action to
be evaluated each time, too.

        The pattern @samp{<<ERROR>>} is matched when the input is not at
end of file and no other match is possible.  Depending on the action
associated with this pattern, your program may choose to stop or choose
to try to recover from the error.  To recover from the error, your
program has to read some characters from the input before it can call
the analyser again.

        All lexical analysers generated by SILex are interactive.  That
is, they read as few characters as possible to get the longest match.
This is a useful property when the input is coming from a terminal.  A
lexical analyser is normally based on a finite automaton; it is the case
for the analysers generated by SILex.  A non-interactive analyser always
needs an extra character to provoke an invalid transition in the
automaton.  The longest match is detected this way.  With an interactive
analyser, an extra character is not required when it is impossible to
obtain a longer match.

        A lexical analyser generated by SILex does not impose any @i{a
priori} limit on the size of the lexemes.  The internal buffer is
extended each time it is necessary.

@ignore
Longest prefix of the input, first matching rule
Warning for matching empty string
^ & $ anchors
End of file anchor
Error anchor
Interactive matching
The lexeme is not limited to a certain length
@end ignore


@c ---------- 4th: generating a lexical analyser ----------
@node Generating, Interface, Semantics, Top
@c    Node,       Next,      Prev,      Up
@chapter Generating and using a lexical analyser

@cindex Generating a lexical analyser
@cindex Using a lexical analyser

        The most common use of SILex is to generate a single complete
lexical analyser.  In some situations however, it is preferable to only
generate the tables describing the analysers and leaving to the program
to build complete analysers at run time.  It is the case when the
program has to parse many files simultaneously with the same analyser;
and when a file is to be parsed using many different analysers.  After
the description of the two modes, we describe the SILex options and the
different input methods.

@ignore
One or many analysers
Options
Different input methods
@end ignore

@menu
* One analyser::     Generating and using one complete analyser.
* Many analysers::   Dynamic creation of analysers.
* Options::          Line counting, table encoding.
* Input::            Input from a port, a string or a function.
@end menu

@node One analyser, Many analysers, Generating, Generating
@section One complete analyser

        The function @code{lex} generates a complete lexical analyser.
We first describe its parameters.  Then the interface with the generated
analyser is presented.

@menu
* Lex::         The @code{lex} command.
* Functions::   The functions in the lexical analyser.
* Usage::       Using the lexical analyser.
@end menu

@node Lex, Functions, One analyser, One analyser
@subsection The @code{lex} command

@findex lex

        Here is the template of a call to @code{lex}:

@noindent
@code{(lex @var{input-file} @var{output-file} [@var{options} @r{@dots{}}])}

@noindent
@var{input-file} is a string containing the name of the specification
file.  @var{output-file} is a string containing the name of the file in
which the lexical analyser is written.  For a description of the
options, see @ref{Options}.

        This is an example of a call to @code{lex}:

@example
(lex "pascal.l" "pascal.l.scm")
@end example

@ignore
Invocation of lex
@end ignore

@node Functions, Usage, Lex, One analyser
@subsection The functions in the lexical analyser

@findex lexer
@findex lexer-get-line
@findex lexer-get-column
@findex lexer-get-offset
@findex lexer-getc
@findex lexer-ungetc
@findex lexer-init
@cindex Name convention

        The file generated by @code{lex} contains a few global
definitions.  A program using the analyser needs only the following
functions: @code{lexer}, @code{lexer-get-line}, @code{lexer-get-column},
@code{lexer-get-offset}, @code{lexer-getc}, @code{lexer-ungetc} and
@code{lexer-init}.

@table @code
@item lexer
The lexical analysis function.

@item lexer-get-line
@itemx lexer-get-column
@itemx lexer-get-offset
Functions to obtain the current position in the input.

@item lexer-getc
@itemx lexer-ungetc
Reading and returning characters.  These functions have the advantage of
being accessible from outside the actions.

@item lexer-init
Initializing the analyser with the input source.
@end table

        To avoid name conflicts, these variables and others that we did
not mention all begin with @samp{lexer@r{@dots{}}}.

@ignore
List of variables
Name convention (lexer...)
@end ignore

@node Usage, , Functions, One analyser
@subsection Using the lexical analyser

@cindex Initialization of the analyser
@cindex Token

        The first function that must be called is the initialization
function.  It is necessary to give to the analyser its source of
characters.  Here is the template of a call to this function:

@noindent
@code{(lexer-init @var{input-type} @var{input})}

@noindent
The values @var{input-type} and @var{input} are described in
@ref{Input}.

        Once the initialization is done, the program can get
@dfn{tokens} from the analyser by calling the lexical analysing
function:
@example
(lexer)
@end example
@noindent
The token is the result of the evaluation of the action corresponding to
the matched pattern.  The current position can be obtained with:
@example
(lexer-get-line)
(lexer-get-column)
(lexer-get-offset)
@end example
@noindent
As is described in @ref{Options}, some or all of these functions may not
be available.  Characters can be gotten and ungotten from the input this
way:
@example
(lexer-getc)
(lexer-ungetc)
@end example
@noindent
It is important to note that the analyser remembers the characters
previously gotten.  Your program does not have to keep those itself.

        Even after the end of file has been reached or an error has
occured, the @code{lexer} function can be called again.  Its behavior
depends on the remaining characters in the input.

        The analyser can be reinitialized in any time with a new input.

@ignore
How to use it
Can be called many times at end-of-file
@end ignore

@node Many analysers, Options, One analyser, Generating
@section Many analysers

        There are applications where it is necessary to have more than
one lexical analyser parsing more than one file at a time.  For example:

@itemize @minus
@item
The parsing of a C file (with cpp) may cause the parsing of other files
recursively because of the @code{#include} commands.

@item
An interactive compiler has to be able to compile a file without closing
the communication with the standard input.

@item
SILex itself parses the macro names, the regular expressions, the
interior of a string, @dots{}, with different sets of patterns.
@end itemize

        We first begin with an overview on how SILex allows the
programmer to create multiple lexical analysers.  We continue with a
description of the function @code{lex-tables}.  We end the explanations
with the functions used to creat analysers dynamically.

@menu
* Dynamic style::   It is possible to parse many files
                    with many analysers.
* Lex-tables::      The @code{lex-tables} command.
* Usage2::          Building and using lexical analysers dynamically.
@end menu

@node Dynamic style, Lex-tables, Many analysers, Many analysers
@subsection Creating analysers dynamically

@cindex Dynamic creation of analysers
@cindex Input system

        It is quite easy to create new analysers at run-time.  Suppose
there is an input that you want to analyse.  There are just two steps to
make.

@itemize @bullet
@item
Create an @dfn{input system} from the input.  An input system provides
the buffering, the line counting and similar low level services.

@item
Create one or more analysers from the input system and the analyser
tables.  The tables are generated by the function @code{lex-tables} from
a specification file.  A table contains all the necessary information to
build up an analyser.  Normally, you have to use more than one analyser
per input when you expect the syntax to vary greatly in the input.
@end itemize

        The following example shows a typical organization for a
multi-analyser lexical analysis.  Note that one table may have been used
to produce many instances of analysers.  Those analysers would simply be
connected to different input systems@footnote{It would make no sense to
create two instances coming from the same table and being connected to
the same input system.  They would both have exactly the same
behavior.}.

@example
           Input1            Input2        Input3
             |                 |             |
             |                 |             |
            IS1               IS2           IS3
             |                 |             |
     +-------+-------+         |          +--+---+
     |       |       |         |          |      |
   An1.1   An1.2   An1.3     An2.1      An3.1  An3.2
@end example

        There is no @i{a priori} limit on the number of input systems
and analysers that you can create dynamically.

@ignore
Input systems & dynamic lexical analysers
@end ignore

@node Lex-tables, Usage2, Dynamic style, Many analysers
@subsection The @code{lex-tables} command

@findex lex-tables

        The function @code{lex-tables} produces a table describing an
analyser from a specification file.  A call to @code{lex-tables} looks
like:

@noindent
@code{(lex-tables @var{input-file} @var{table-name} @var{output-file} [@var{options} @r{@dots{}}])}

@noindent
@var{input-file} must be a string containing the name of the
specification file.  @var{output-file} is a string containing the name
in which the result is printed.  A definition is written in the output
file.  @var{table-name} must be a string and it is the name appearing in
the definition.  The options are defined in @ref{Options}.

        This is an example of a call to @code{lex-tables}:

@example
(lex-tables "c.l" "c-table" "c.l.scm")
@end example

@ignore
Invocation of lex-tables
@end ignore

@node Usage2, , Lex-tables, Many analysers
@subsection Building and using lexical analysers dynamically

@cindex Building an analyser dynamically
@pindex multilex.scm
@cindex Name convention
@findex lexer-make-IS
@findex lexer-get-func-line
@findex lexer-get-func-column
@findex lexer-get-func-offset
@findex lexer-get-func-getc
@findex lexer-get-func-ungetc
@findex lexer-make-lexer

        In order to be able to create dynamically the analysers the
program needs, the files containing the tables and the file
@file{multilex.scm} must be loaded as part of the program.  The name
convention is the following: all definitions in @file{multilex.scm}
introduce names beginning with @samp{lexer@r{@dots{}}} and the
definitions in the other files introduce names that are specified by the
programmer.  This way, it is easy to avoid name conflicts.

        Input systems are created with the function
@code{lexer-make-IS}.  A call to this function looks like:

@noindent
@code{(lexer-make-IS @var{input-type} @var{input} [@var{counters}])}

@noindent
The values @var{input-type} and @var{input} are described in
@ref{Input}.  The value of @var{counters} determines which counters the
input system should maintain.  This is discussed in @ref{Input}.  Input
systems are associative lists that cannot be used directly.

        Useful functions can be extracted from an input system.  The
following calls return functions that allows the program to interact
with the input system:

@example
(lexer-get-func-line @var{input-system})
(lexer-get-func-column @var{input-system})
(lexer-get-func-offset @var{input-system})
(lexer-get-func-getc @var{input-system})
(lexer-get-func-ungetc @var{input-system})
@end example

        Lexical analysers are created with the function
@code{lexer-make-lexer}.  The template of a call to this function is:

@noindent
@code{(lexer-make-lexer @var{table} @var{input-system})}

@noindent
@var{table} is a table generated by SILex.  @var{input-system} is the
input system from which the analyser will take its input.  The result of
the call is the analysis function.  The analysis function takes no
argument and returns tokens.

        This example summarizes all the step in the creation of an
analyser:

@example
(let* ((my-port       (open-input-file "my-file"))
       (my-IS         (lexer-make-IS 'port my-port))
       (my-get-line   (lexer-get-func-line IS))
       (my-get-column (lexer-get-func-column IS))
       (my-get-offset (lexer-get-func-offset IS))
       (my-getc       (lexer-get-func-getc IS))
       (my-ungetc     (lexer-get-func-ungetc IS))
       (my-analyser   (lexer-make-lexer my-table IS)))
  (let loop ((tok (my-analyser)))
    (cond ((eq? tok 'eof)
           @r{@dots{}}
@end example

@ignore
File lex-rt.scm
How to use it: lex-rt.scm, lexer-make-IS, lexer-make-lexer & cie
Can be called many times at end-of-file
Name convention (lexer...)
@end ignore

@node Options, Input, Many analysers, Generating
@section Options at generation time

@cindex Options

        We describe the options that can be passed to @code{lex} and
@code{lex-tables}.  They indicate which counters (line, column and
offset) the actions need; which table encoding should be used; and
whether the tables should be pretty-printed.

@menu
* Counters::          Keeping the position in the input.
* Tables encoding::   Encodings of the tables of an analyser.
* Pretty print::      Pretty printing the tables.
@end menu

@node Counters, Tables encoding, Options, Options
@subsection Line, column and offset counters

@cindex Counters
@vindex none
@vindex line
@vindex all

        There are three different counting modes: no counter, line
counter and all counters.  The more counters the input system maintains,
the more it is slowed down.  The default is the line counting.

        This option is specified when the program calls the functions
@code{lex}, @code{lex-tables} and @code{lexer-make-IS}.  The three modes
are represented by the symbols @samp{none}, @samp{line} and @samp{all}.
When one of the first two functions is called the mode must be preceded
by the symbol @samp{counters}.  These examples illustrate the use of the
option:

@example
(lex "html.l" "html.l.scm" 'counters 'none)

(lex-tables "cobol.l" "cobol-table" "cobol.l.scm" 'counters 'line)

(lexer-make-IS 'port my-port 'all)
@end example

        You should be careful when you build analysers dynamically.  The
mode specified at the input system creation must be consistent with the
mode specified at the tables creation.

@ignore
counters
@end ignore

@node Tables encoding, Pretty print, Counters, Options
@subsection Encoding of the table of an analyser

@cindex Encoding of the table
@vindex portable
@vindex code
@cindex Portability
@cindex Fast analyser

        SILex provides three different encodings of the tables: the
default encoding, the portable encoding and the ``compilation'' to
Scheme code.

        With the default encoding, the finite automaton of the analyser
is represented with data structures that contain the @emph{numbers} of
the characters (in the sense of @code{char->integer}).  Since the
numbers associated with the characters may depend on the Scheme
implementation, an analyser generated with an implementation can be
safely used only with the same implementation.  An analyser encoded in
the default style is not portable.  But this representation is the most
compact.

        With the portable encoding, the data structures describing the
automaton contain characters directly.  If the automaton, as generated,
contains a transition from state @var{s} to state @var{t} on character
@var{c}, then somewhere in the table there is the Scheme character
@samp{#\@var{c}}.  When the file containing the analyser is loaded in
any implementation, the character is read as is, and not as the number
@samp{(char->integer #\@var{c})} as evaluated by the original
implementation.  As long as the implementation using the analyser
recognizes the characters mentionned in it, there is no problem.

        So this encoding is portable.  However, it is less compact.
This is because something like @samp{(65 90)} is more compact than
something like @samp{(#\A #\B @r{@dots{}} #\Y #\Z)} to represent
@samp{[A-Z]}.  The construction of an analyser from a portable table
takes more time than the construction from a default table.  But, once
built, the performance of the analyser is the same in both cases.

        It is important to note that in some character sets, the letters
or the digits are not contiguous.  So, in those cases, the regular
expression @samp{[A-Z]} does not necessarily accept only the uppercase
letters.

        The last encoding is the compilation to Scheme code.  This
produces a fast lexical analyser.  Instead of containing data structures
representing the behavior of the automaton, the table contains Scheme
code that ``hard-codes'' the automaton.  This encoding often generates
big tables.  Such an analyser is not portable.

        The encoding of the tables can be specified as an option when
@code{lex} and @code{lex-tables} are called.  The symbols
@samp{portable} and @samp{code} are used to specify that the table must
be portable and that the table must be compiled, respectively.  For
example, these calls illustrate the use of the options:

@example
(lex "c.l" "c.l.scm")             ; @r{Default encoding}

(lex "c.l" "c.l.scm" 'portable)   ; @r{Portable encoding}

(lex "c.l" "c.l.scm" 'code)       ; @r{Compilation of the automaton}
@end example

@ignore
portable / code
@end ignore

@node Pretty print, , Tables encoding, Options
@subsection Pretty printing the tables

@cindex Pretty-printing the tables

        The pretty-print option (specified with the symbol @samp{pp})
tells SILex to pretty-print the contents of the table.  Normally, the
table is displayed as a compact mass of characters fitting in about 75
columns.  The option is useful only for a developer of SILex.  The
Scheme code generated with the @samp{code} option is always
pretty-printed.

@ignore
pp
@end ignore

@node Input, , Options, Generating
@section Input methods

@cindex Input
@cindex Input port, input from an
@cindex String, input from a
@cindex Function, input from a

        An analyser can take its input from three different objects: an
input port, a string or a function.  The type of input and the input
itself must be passed when an analyser is initialized and when an input
system is created.  The input type is specified using one of the three
symbols: @samp{port}, @samp{string} or @samp{procedure}.  For example:

@example
(lexer-init 'port (current-input-port))

(lexer-make-IS 'string "Input string.")
@end example

        When an input port is used by an analyser, the program should
avoid reading characters directly from the port.  This is because the
analyser may have needed a look-ahead to do the analysis of the
preceding token.  The program would not find what it expects on the
port.  The analyser provides safe functions to get characters from the
input.  The analyser never closes itself the port it has received, this
task is left to the program.

        When the analyser is initialized with a string, it takes a copy
of it.  This way, eventual mutations of the string do not affect the
analysis.

        The use of a function as character source allows the analyser to
parse any character stream, no matter how it is obtained.  For example,
the characters may come from the decompression or decryption of a huge
file, the task being done lazily in order to save space.  The function
must take no argument and return a character each time it is called.
When the end of file (or its logical equivalent) is reached, the
function must return an object that is not a character (for example, the
symbol @samp{eof}).  After the function has returned an end of file
indicator, it is not called again.

@ignore
port / string / function
Copy of the string
The function at end-of-file is not called again
@end ignore


@c ---------- Interfacing with lalr.scm ----------
@node Interface, Acknowledgements, Generating, Top
@c    Node,      Next,             Prev,       Up
@appendix Interfacing with an @sc{lalr}(1) parser

@cindex Dominique Boucher
@cindex @sc{lalr}(1) parser generator

        A nice @sc{lalr}(1) parser generator for Scheme has been written
by Dominique Boucher.  The generator is accessible at the Scheme
Repository at @code{ftp://ftp.cs.indiana.edu} in the file
@file{/pub/scheme-repository/code/lang/lalr-scm.tar.gz}.

        The parsers that are generated need two functions to operate: a
lexical analysis function and an error function.  The analysis function
must take no argument and return a token each time it is called.  This
is exactly the behavior of the lexical analysis functions created by
SILex.

        The @sc{lalr}(1) parsers expect that the tokens are pairs with a
number in the @sc{car}, the token number, and any value in the @sc{cdr},
the token attribute.  It is easy to respect this convention with a SILex
lexical analyser since the actions can be any Scheme expressions.
Furthermore, the file created by the @sc{lalr}(1) parser generator
contains definitions that give names to the number of the tokens.  A
lexical analyser can use those names in its actions in order to simplify
the coordination between the two analysers.


@c ---------- Acknowledgements ----------
@node Acknowledgements, Index, Interface, Top
@c    Node,             Next,  Prev,      Up
@chapheading Acknowledgements

        I would like to thank my comrades of the laboratory for their
support in this project.  Especially Martin Larose and Marc Feeley for
their numerous suggestions.

        I hope SILex will be useful for many Scheme programmers.

        If you find a bug, please let me know at
@code{mailto:dube@@iro.umontreal.ca}.


@c ---------- Index & tables of contents ----------
@node Index,     , Acknowledgements, Top
@c    Node,  Next, Prev,             Up
@unnumbered Index

@printindex cp

@contents
@bye

Memos:
Verifier si des trucs comme #\^L sont portables