From 7efd05778cddec0293e0d48199f3aeee2aad6178 Mon Sep 17 00:00:00 2001
From: Ludovic Courtès
Date: Fri, 18 Jan 2008 12:36:59 +0100
Subject: Add SILex, for simplicity.

---
 src/guile/silex/silex.texi | 1303 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 1303 insertions(+)
 create mode 100644 src/guile/silex/silex.texi

(limited to 'src/guile/silex/silex.texi')

diff --git a/src/guile/silex/silex.texi b/src/guile/silex/silex.texi
new file mode 100644
index 0000000..6770134
--- /dev/null
+++ b/src/guile/silex/silex.texi
@@ -0,0 +1,1303 @@
+\input texinfo.tex   @c -*-texinfo-*-
+@setfilename silex.info
+@settitle SILex
+@setchapternewpage on
+@footnotestyle end
+@paragraphindent 3
+
+@syncodeindex fn cp
+@syncodeindex vr cp
+@syncodeindex ky cp
+@syncodeindex pg cp
+@syncodeindex tp cp
+
+
+@c ---------- Info copyright ----------
+@ifinfo
+        This file documents the version 1.0 of SILex, a Scheme
+Implementation of Lex.
+
+Copyright @copyright{} 2001  Danny Dub@'e
+
+This program is free software; you can redistribute it and/or
+modify it under the terms of the GNU General Public License
+as published by the Free Software Foundation; either version 2
+of the License, or (at your option) any later version.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with this program; if not, write to the Free Software
+Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
+@end ifinfo
+
+
+@c ---------- Title & copyright pages (printed) ----------
+@titlepage
+@title SILex
+@subtitle A Scheme Implementation of Lex
+@subtitle Documentation for SILex version 1.0
+@author Danny Dub@'e
+
+@page
+@vskip 0pt plus 1filll
+Copyright @copyright{} 2001  Danny Dub@'e.
+
+        This is the first edition of the SILex documentation.  It
+documents the version 1.0 of SILex.
+
+This program is free software; you can redistribute it and/or
+modify it under the terms of the GNU General Public License
+as published by the Free Software Foundation; either version 2
+of the License, or (at your option) any later version.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with this program; if not, write to the Free Software
+Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
+@end titlepage
+@headings double
+
+
+@c ---------- Top node ----------
+@ifinfo
+@node Top,  Overview, (dir), (dir)
+@c    Node, Next,     Prev,  Up
+@top
+
+        This document describes SILex.  ``SILex'' stands for ``Scheme
+Implementation of Lex''.  It generates a Scheme lexical analyser from a
+Lex-like specification file.
+
+        This document is the first edition and describes SILex version
+1.0.
+
+@menu
+* Overview::     A general description of SILex.
+* Syntax::       The look of a specification file.
+* Semantics::    The meaning of a specification file.
+* Generating::   How to generate and use lexical analysers.
+* Interface::    With a Scheme LALR(1) parser generator.
+* Index::        Concepts and commands relating to SILex.
+* Acknowledgements:: 
+@end menu
+@end ifinfo
+
+
+@c ---------- 1st: overview ----------
+@node Overview, Syntax, Top,  Top
+@c    Node,     Next,   Prev, Up
+@chapter Overview
+
+        SILex is a lexical analyser generator similar to the Lex and
+Flex programs, but for Scheme.  ``SILex'' stands for ``Scheme
+Implementation of Lex''.
+
+        SILex has many similarities with the C programs, but has many
+differences, too.  The syntax of the specification files for SILex is
+close to that of Lex and Flex.  Of course, the actions must be written
+in Scheme and not in C.  The set of regular expressions is mostly the
+same.  An important difference is relative to the multiple start states
+in the C analysers.  SILex replaces them by allowing multiple analysers
+to take their input from the same source.  Different inputs can be
+analysed at the same time, possibly with different instances of one or
+more lexical analysers.  The analysers are created dynamically.
+
+        SILex provides many other features.  The designer of a lexical
+analyser can specify the actions to be taken when the end of file is
+reached or when an error occurs.  The analyser can keep track of the
+position in the input in terms of the number of the line, column and
+offset.  An analyser can take its input from an input port, a string or
+a function.  SILex is portable; it does not depend on a particular
+character set.  It can generate analysers that are portable, too.
+Finally, the table encoding the behavior of the analyser can be compiled
+to Scheme code.  The fastest lexical analysers can be produced this way.
+
+@ignore
+Lex-like
+Scheme program
+Driver for Scheme
+One or more Scheme actions
+Facilities for error handling
+Line & column counting
+Multiple inputs
+Multiple scanners per input
+Different input methods
+Different table codings
+Portable
+@end ignore
+
+
+@c ---------- 2nd: syntax of a specification file ----------
+@node Syntax, Semantics, Overview, Top
+@c    Node,   Next,      Prev,     Up
+@chapter Syntax of the specification file
+
+@cindex Syntax of the specification file
+@cindex Specification file
+@cindex Comment
+@cindex White space
+
+        A specification file for a lexical analyser contains two parts:
+the @dfn{macro definitions part} and the @dfn{rules part}.  The two
+parts are separated by the mark @code{%%}.  The first part is used to
+define @dfn{macros}; that is, to give names to some regular expressions.
+The second part is used to indicate the regular expressions with which
+the input will have to match and the @dfn{actions} associated with each
+expression.
+
+        Comments can be inserted any place where white space is allowed
+and is considered as white space itself.  The syntax of the comments is
+the same as in Scheme.  That is, it begins with a semicolon @samp{;} and
+extends up to the end of a line.  The semicolon is a valid token in many
+languages, so you should take care not to comment out an entire line
+when you write a regular expression matching a semicolon.
+
+        The syntax of each part is presented, except for the regular
+expressions, which are described apart.  A small example follows.
+
+@ignore
+Macros part %% rules part
+Comments & whitespace
+@end ignore
+
+@menu
+* Macros part::           Syntax of the macro definitions.
+* Rules part::            Syntax of the rule-action pairs.
+* Regular expressions::   How to build regular expressions.
+* Sample file::           Shows some frequent mistakes.
+@end menu
+
+@node Macros part, Rules part, Syntax, Syntax
+@section Macro definitions part
+
+@cindex Macro
+@cindex Macro definitions part
+@cindex Scope of a macro definition
+
+        The first part of a specification file contains zero or more
+macro definitions.  A definition consists of a name and a regular
+expression, separated by white space.  It looks better when each
+definition is written on a separate line.
+
+        The syntax for a macro name is that of a Scheme symbol.  The
+case of the letters is not significant.  For example, @code{abcd},
+@code{+}, @code{...}, @code{Digit} and @code{digit} are all valid macro
+names; the last two being the same.  You cannot write two macro
+definitions with the same name.
+
+        The defined macro can be referenced in regular expressions using
+the syntax @code{@{@var{name}@}} (@pxref{Regular expressions}).  The
+scope of a macro definition includes the remaining definitions and the
+rules part of the file.  It is analogous to the @code{let*} is Scheme,
+where the macro definitions correspond to the bindings and the rules
+part correspond to the body.
+
+        End the macro definitions part with @code{%%}.
+
+@ignore
+Names = Scheme symbols
+Case insensitive
+End with %%
+Order of macros
+@end ignore
+
+@node Rules part, Regular expressions, Macros part, Syntax
+@section Rules part
+
+@cindex Rules part
+@cindex Pattern
+@cindex Action
+@cindex Indentation in actions
+
+        The rules part contains the rules up to the end of the
+specification file.  Each rule is a @dfn{pattern} optionally followed by
+an @dfn{action}.  The pattern is a regular expression.  The action, if
+there is one, is formed of one or more Scheme expressions.
+
+        The actions can span over several lines.  To distinguish between
+the remaining of the current action and the start of a new rule, SILex
+checks the indentation.  A new rule must start at the beginning of the
+line.  That is, the action starts right after the pattern and contains
+all the following lines that start with white space.
+
+        SILex does not parse the actions.  It simply captures the text
+up to the start of the next rule.  So a syntax error in an action is not
+detected by SILex.
+
+        Nevertheless, SILex is able to detect that an action has been
+omitted.  In that case, a default action is supplied.
+
+@ignore
+Action = one or more Scheme expressions
+Action are taken verbatim
+Indentation is significant
+Default actions
+End of file
+@end ignore
+
+@node Regular expressions, Sample file, Rules part, Syntax
+@section Regular expressions
+
+@cindex Regular expression
+@cindex Atomic regular expression
+@cindex Ordinary character
+@cindex Dot
+@cindex Wild card
+@findex .
+@cindex Backslash
+@cindex Protecting a character
+@findex \n
+@findex \@var{integer}
+@findex \@var{c}
+@cindex Macro reference
+@findex @{@var{name}@}
+@cindex String
+@findex "@var{some text}"
+@cindex Character class
+@findex [@var{list of characters}]
+
+        We first describe the atomic regular expressions.  Then, we show
+how to build more complex regular expressions from simpler ones.
+Finally, the markers are introduced.
+
+        The following constructs are regular expressions:
+
+@table @asis
+@item @code{@var{c}}
+@dfn{Ordinary character}.  It is a regular expression that matches the
+character @var{c} itself.  @var{c} cannot be one of @samp{.}, @samp{\},
+@samp{@{}, @samp{"}, @samp{[}, @samp{|}, @samp{?}, @samp{+}, @samp{*},
+@samp{(}, @samp{)}, @samp{^}, @samp{$}, @samp{;} or any white space.
+
+@item @code{.}
+@dfn{Wild card}.  It matches any character except the newline character.
+
+@item @code{\n}
+@itemx @code{\@var{integer}}
+@itemx @code{\@var{c}}
+@dfn{Backslash}.  The backslash is used for two things: protect a
+character from special meaning; generating non-printable characters.
+The expression @code{\n} matches the newline character.  The expression
+@code{\@var{integer}} matches the character that has number
+@var{integer} (in the sense of @code{char->integer}).  @var{integer}
+must be a valid character number on the implementation that you use.  It
+may be more than 3 digits long and even negative@footnote{The Scheme
+standards do not impose a particular character set, such as @sc{ascii}.
+The only requirement is that the function @code{char->integer} returns
+an integer.}.  The expression @code{\@var{c}} matches the character
+@var{c} if @var{c} is not @samp{n}, @samp{-} nor a digit.
+
+@item @code{@{@var{name}@}}
+@dfn{Macro reference}.  This expression matches the same lexemes as
+those matched by the regular expression named @var{name}.  You can
+imagine that the reference is replaced by the text of the named
+expression.  However, it works as if parentheses had been added to
+protect the substituting expression.
+
+@item @code{"@var{some text}"}
+@dfn{String}.  A string matches a lexeme identical to its contents.  In
+a string, the only special characters are @samp{"}, which closes the
+string, and @samp{\} which keeps the effect mentioned above.
+
+@item @code{[@var{list of characters}]}
+@itemx @code{[]@var{list of characters}]}
+@itemx @code{[-@var{list of characters}]}
+@itemx @code{[^@var{list of characters}]}
+@dfn{Character class}.  The expression matches one of the enumerated
+characters.  For example, the expression @samp{[abc]} matches one of
+@samp{a}, @samp{b} and @samp{c}.  You can list a range of characters by
+writing the first character, the @samp{-} and the last character.  For
+example, @samp{[A-Za-z]} matches one letter (if the letters are ordered
+and contiguous in the character set used by your implementation).  The
+special characters in a class are @samp{]}, which closes the class,
+@samp{-}, which denotes a range of character, and @samp{\}, which keeps
+its usual meaning.  There is an exception with the first character in a
+class.  If the first character is @samp{]} or @samp{-}, it loses its
+special meaning.  If the first character is @samp{^}, the expression
+matches one character if it is @emph{not} enumerated in @var{list of
+characters}.
+
+@ignore
+Ordinary character
+Dot
+Backslash: with n, with an integer (finir les chiffres), otherwise
+Macro reference
+String
+Character class
+@end ignore
+@end table
+
+@cindex Union of regular expressions
+@cindex Alternatives
+@findex |
+@cindex Concatenation of regular expressions
+@cindex Optional regular expression
+@findex ?
+@cindex Closure of a regular expression
+@cindex Positive closure
+@findex +
+@cindex Kleene closure
+@findex *
+@cindex Repetition of a regular expression
+@findex @{@var{i},@var{j}@}
+@cindex Overriding the precedence
+@cindex Grouping regular expressions
+@cindex Precedence
+@findex ( )
+
+        Suppose @var{r} and @var{s} are regular expressions.  Then the
+following expressions can be built:
+
+@table @asis
+@item @code{@var{r}|@var{s}}
+@dfn{Union}.  This regular expression matches a lexeme if the lexeme is
+matched by @var{r} or by @var{s}.
+
+@item @code{@var{r}@var{s}}
+@dfn{Concatenation}.  This expression matches a lexeme if the lexeme can
+be written as the concatenation of a lexeme matched by @var{r} and a
+lexeme matched by @var{s}.
+
+@item @code{@var{r}?}
+@dfn{Optional expression}.  A lexeme matches this expression if it is
+the empty lexeme or if it matches @var{r}.
+
+@item @code{@var{r}+}
+@dfn{Positive closure}.  This expression matches a lexeme that can be
+written as the concatenation of one or more lexemes, where each of those
+matches @var{r}.
+
+@item @code{@var{r}*}
+@dfn{Kleene closure}.  A lexeme is matched by this expression if it can
+be written as the concatenation of zero or more lexemes, where each of
+those matches @var{r}.
+
+@item @code{@var{r}@{@var{i}@}}
+@itemx @code{@var{r}@{@var{i},@}}
+@itemx @code{@var{r}@{@var{i},@var{j}@}}
+@dfn{Power or repetition of an expression}.  These expressions allow the
+``repetition'' of a regular expression a certain number of times.
+@var{i} and @var{j} must be positive integers and @var{j} must be
+greater or equal to @var{i}.  The first form repeats the expression
+@var{r} exactly @var{i} times.  The second form repeats @var{r} at least
+@var{i} times.  The last form repeats @var{r} at least @var{i} times and
+at most @var{j} times.  You should avoid using large numbers (more than
+10), because the finite automaton for @var{r} is copied once for each
+repetition.  The tables of the analyser may quickly become very large.
+You should note that the syntax of these expressions does not conflict
+with the syntax of the macro reference.
+
+@item @code{(@var{r})}
+@dfn{Parentheses}.  This expression matches the same lexemes as @var{r}.
+It is used to override the precedence of the operators.
+@end table
+
+        The building operators are listed in order of increasing
+precedence.  The @code{?}, @code{+}, @code{*} and repetition operators
+have the same precedence.
+
+@ignore
+Or
+Conc
+Question
+Plus
+Star
+Power
+Parentheses
+@end ignore
+
+@cindex Marker
+@cindex Beginning of line marker
+@findex ^
+@cindex End of line marker
+@findex $
+@cindex End of file marker
+@findex <<EOF>>
+@cindex Error marker
+@findex <<ERROR>>
+
+        The remaining ``expressions'' would better be called
+@dfn{markers}.  They all match the empty lexeme but require certain
+conditions to be respected in the input.  They cannot be used in all
+regular expressions.  Suppose that @var{r} is a regular expression
+without markers.
+
+@table @asis
+@item @code{^@var{r}}
+@itemx @code{@var{r}$}
+@dfn{Beginning and end of line}.  These markers require that the lexeme
+is found at the beginning and at the end of the line, respectively.  The
+markers lose their special meaning if they are not placed at their end
+of the regular expression or if they are used in the first part of the
+specification file.  In those cases, they are treated as regular
+characters.
+
+@item @code{<<EOF>>}
+@dfn{End of file}.  This marker is matched only when the input is at the
+end of file.  The marker must be used alone in its pattern, and only in
+the second part of the file.  There can be at most one rule with this
+particular pattern.
+
+@item @code{<<ERROR>>}
+@dfn{Error}.  This marker is matched only when there is a parsing error.
+It can be used under the same conditions as @code{<<EOF>>}.
+
+@ignore
+Caret
+Dollar
+End of file
+Error
+@end ignore
+@end table
+
+@cindex White space in regular expressions
+
+        White space ends the regular expressions.  In order to include
+white space in a regular expression, it must be protected by a backslash
+or placed in a string.
+
+@ignore
+Ended with white spaces
+Examples
+@end ignore
+
+@node Sample file, , Regular expressions, Syntax
+@section An example of a specification file
+
+        Here is an example of a SILex specification file.  The file is
+syntactically correct from the SILex point of view.  However, many
+common mistakes are shown.  The file is not a useful one.
+
+@example
+; This is a syntactically correct but silly file.
+
+partial     hel
+complete    @{partial@}lo            ; @r{Backward macro ref. only}
+digit       [0-9]
+letter      [a-zA-Z]
+
+%%
+
+-?@{digit@}+    (cons 'integer yytext)   ; @r{@code{yytext} contains}
+                                       ; @r{the lexeme}
+-?@{digit@}+\.@{digit@}+[eE][-+]?@{digit@}+
+              (cons                ; @r{A long action}
+               'float
+               yytext)
+
+;             (list 'semicolon)    ; @r{Probably a mistake}
+
+begin         )list 'begin(        ; @r{No error detected here}
+end                                ; @r{The action is optional}
+
+\73           (list 'bell-3)       ; @r{It does not match the}
+                                   ; @r{char. # 7 followed by @samp{3}}
+\0073         (list 'bell-3)       ; @r{Neither does it}
+(\7)3         (list 'bell-3)       ; @r{This does it}
+
+"*()+|@{@}[].? are ordinary but \" and \\ are special"
+
+[^\n]         (list 'char)         ; @r{Same thing as @samp{.}}
+(@{letter@}|_)(@{letter@}|_|@{digit@})*  ; @r{A C identifier}
+[][]                               ; @r{One of the square brackets}
+
+Repe(ti)@{2@}on   (list 'repetition)
+
+^@{letter@}+:   (cons 'label yytext) ; @r{A label placed at the}
+                                   ; @r{beginning of the line}
+$^                                 ; @r{No special meaning}
+<<EOF>>       (list 'eof)          ; @r{Detection of the end of file}
+<<ERROR>>     (my-error)           ; @r{Error handling}
+@end example
+
+@ignore
+Subset of Scheme(?)
+Example of \73, \0073, (\7)3
+@end ignore
+
+
+@c ---------- 3rd: semantics of the specification file ----------
+@node Semantics, Generating, Syntax, Top
+@c    Node,      Next,       Prev,   Up
+@chapter Semantics of the specification file
+
+@cindex Semantics of the specification file
+
+        An important part of the semantics of a specification file is
+described with the syntax of the regular expressions.  The remainder is
+presented here.  We begin with the role of the actions.  Information on
+the matching method follows.
+
+@menu
+* Action::           What does an action.
+* Matching rules::   When does a regular expression matches the input.
+@end menu
+
+@node Action, Matching rules, Semantics, Semantics
+@section Evaluation of the actions
+
+@findex yycontinue
+@findex yygetc
+@findex yyungetc
+@findex yytext
+@findex yyline
+@findex yycolumn
+@findex yyoffset
+@cindex Skipping a lexeme
+@cindex Getting characters
+@cindex Ungetting characters
+@cindex Lexeme
+@cindex Line number
+@cindex Column number
+@cindex Offset
+@cindex Default action
+
+        The action of a rule is evaluated when the corresponding pattern
+is matched.  The result of its evaluation is the result that the lexical
+analyser returns to its caller.
+
+        There are a few local variables that are accessible by the
+action when it is evaluated.  Those are @code{yycontinue},
+@code{yygetc}, @code{yyungetc}, @code{yytext}, @code{yyline},
+@code{yycolumn} and @code{yyoffset}.  Each one is described here:
+
+@table @code
+@item yycontinue
+This variable contains the lexical analysis function itself.  Use
+@code{(yycontinue)} to ask for the next token.  Typically, the action
+associated with a pattern that matches white space is a call to
+@code{yycontinue}; it has the effect of skipping the white space.
+
+@item yygetc
+@itemx yyungetc
+These variables contain functions to get and unget characters from the
+input of the analyser.  They take no argument.  @code{yygetc} returns a
+character or the symbol @samp{eof} if the end of file is reached.  They
+should be used to read characters instead of accessing directly the
+input port because the analyser may have read more characters in order
+to have a look-ahead.  It is incorrect to try to unget more characters
+than has been gotten since @emph{the parsing of the last token}.  If
+such an attempt is made, @code{yyungetc} silently refuses.
+
+@item yytext
+This variable is bound to a string containing the lexeme.  This string
+is guaranteed not to be mutated.  The string is created only if the
+action `seems' to need it.  The action is considered to need the lexeme
+when @samp{yytext} appears somewhere in the text of the action.
+
+@item yyline
+@itemx yycolumn
+@itemx yyoffset
+These variables indicate the position in the input at the beginning of
+the lexeme.  @code{yyline} is the number of the line; the first line is
+the line 1.  @code{yycolumn} is the number of the column; the first
+column is the column 1.  It is important to mention that characters such
+as the tabulation generate a variable length output when they are
+printed.  So it would be more accurate to say that @code{yycolumn} is
+the number of the first character of the lexeme, starting at the
+beginning of the line.  @code{yyoffset} indicates the distance from the
+beginning of the input; the first lexeme has offset 0.  The three
+variables may not all be existant depending on the kind of counting you
+want the analyser to do for you (@pxref{Counters}).
+@end table
+
+        There is a default action that is provided for a rule when its
+action is omitted.  If the pattern is @samp{<<EOF>>}, the default action
+returns the object @samp{(0)}.  If the pattern is @samp{<<ERROR>>}, the
+default action displays an error message and returns the symbol
+@samp{error}@footnote{Note that there is no portable way for the
+analyser to end the execution of the program when an error occurs.}.
+The default action for the other patterns is to call the analyser again.
+It is clearer (and normally more useful) to specify explicitly the
+action associated with each rule.
+
+@ignore
+An action is executed when its corresp. regexp is matched
+Environment of the actions
+yycontinue, yygetc, yyungetc, yytext, yyline, yycolumn, yyoffset
+yycolumn = number of the character (cause: tabs)
+Default actions
+@end ignore
+
+@node Matching rules, , Action, Semantics
+@section Matching the rules
+
+@cindex Matching method
+@cindex Matching conflict
+@cindex Conflict between patterns
+@cindex Interactive analyser
+
+        Each time the analyser is asked to return a token, it tries to
+match a prefix of the input with a pattern.  There may be more than one
+possible match; when it is the case, we say there is a conflict.  For
+example, suppose we have those regular expressions:
+
+@example
+begin
+[a-z]*
+@end example
+
+@noindent
+and the input is @samp{beginning1 @r{@dots{}}}.  We have a match with
+the first expression and we have many different matches with the second.
+To resolve such a conflict, the longest match is chosen.  So the chosen
+match is the one between the lexeme @samp{beginning} and the second
+expression.
+
+        Suppose we have the same regular expressions but the input is
+@samp{begin+ @r{@dots{}}}.  We have @emph{two} longest match.  This
+conflict is resolved by choosing the first pattern that allows a longest
+match.  So the chosen match is between the lexeme @samp{begin} and the
+first pattern.
+
+        The analyser generated by SILex allows the empty lexeme to be
+matched if there is no longer match.  However, you should take care not
+to call the analyser again without consuming at least one character of
+the input.  It would cause an infinite loop.
+
+        The pattern @samp{<<EOF>>} is matched when the analyser is
+called and the input is at end of file.  In this situation, the marker
+is matched even if there is a pattern that matches the empty lexeme.
+The analyser can be called again and again and the @samp{<<EOF>>}
+pattern will be matched each time, causing its corresponding action to
+be evaluated each time, too.
+
+        The pattern @samp{<<ERROR>>} is matched when the input is not at
+end of file and no other match is possible.  Depending on the action
+associated with this pattern, your program may choose to stop or choose
+to try to recover from the error.  To recover from the error, your
+program has to read some characters from the input before it can call
+the analyser again.
+
+        All lexical analysers generated by SILex are interactive.  That
+is, they read as few characters as possible to get the longest match.
+This is a useful property when the input is coming from a terminal.  A
+lexical analyser is normally based on a finite automaton; it is the case
+for the analysers generated by SILex.  A non-interactive analyser always
+needs an extra character to provoke an invalid transition in the
+automaton.  The longest match is detected this way.  With an interactive
+analyser, an extra character is not required when it is impossible to
+obtain a longer match.
+
+        A lexical analyser generated by SILex does not impose any @i{a
+priori} limit on the size of the lexemes.  The internal buffer is
+extended each time it is necessary.
+
+@ignore
+Longest prefix of the input, first matching rule
+Warning for matching empty string
+^ & $ anchors
+End of file anchor
+Error anchor
+Interactive matching
+The lexeme is not limited to a certain length
+@end ignore
+
+
+@c ---------- 4th: generating a lexical analyser ----------
+@node Generating, Interface, Semantics, Top
+@c    Node,       Next,      Prev,      Up
+@chapter Generating and using a lexical analyser
+
+@cindex Generating a lexical analyser
+@cindex Using a lexical analyser
+
+        The most common use of SILex is to generate a single complete
+lexical analyser.  In some situations however, it is preferable to only
+generate the tables describing the analysers and leaving to the program
+to build complete analysers at run time.  It is the case when the
+program has to parse many files simultaneously with the same analyser;
+and when a file is to be parsed using many different analysers.  After
+the description of the two modes, we describe the SILex options and the
+different input methods.
+
+@ignore
+One or many analysers
+Options
+Different input methods
+@end ignore
+
+@menu
+* One analyser::     Generating and using one complete analyser.
+* Many analysers::   Dynamic creation of analysers.
+* Options::          Line counting, table encoding.
+* Input::            Input from a port, a string or a function.
+@end menu
+
+@node One analyser, Many analysers, Generating, Generating
+@section One complete analyser
+
+        The function @code{lex} generates a complete lexical analyser.
+We first describe its parameters.  Then the interface with the generated
+analyser is presented.
+
+@menu
+* Lex::         The @code{lex} command.
+* Functions::   The functions in the lexical analyser.
+* Usage::       Using the lexical analyser.
+@end menu
+
+@node Lex, Functions, One analyser, One analyser
+@subsection The @code{lex} command
+
+@findex lex
+
+        Here is the template of a call to @code{lex}:
+
+@noindent
+@code{(lex @var{input-file} @var{output-file} [@var{options} @r{@dots{}}])}
+
+@noindent
+@var{input-file} is a string containing the name of the specification
+file.  @var{output-file} is a string containing the name of the file in
+which the lexical analyser is written.  For a description of the
+options, see @ref{Options}.
+
+        This is an example of a call to @code{lex}:
+
+@example
+(lex "pascal.l" "pascal.l.scm")
+@end example
+
+@ignore
+Invocation of lex
+@end ignore
+
+@node Functions, Usage, Lex, One analyser
+@subsection The functions in the lexical analyser
+
+@findex lexer
+@findex lexer-get-line
+@findex lexer-get-column
+@findex lexer-get-offset
+@findex lexer-getc
+@findex lexer-ungetc
+@findex lexer-init
+@cindex Name convention
+
+        The file generated by @code{lex} contains a few global
+definitions.  A program using the analyser needs only the following
+functions: @code{lexer}, @code{lexer-get-line}, @code{lexer-get-column},
+@code{lexer-get-offset}, @code{lexer-getc}, @code{lexer-ungetc} and
+@code{lexer-init}.
+
+@table @code
+@item lexer
+The lexical analysis function.
+
+@item lexer-get-line
+@itemx lexer-get-column
+@itemx lexer-get-offset
+Functions to obtain the current position in the input.
+
+@item lexer-getc
+@itemx lexer-ungetc
+Reading and returning characters.  These functions have the advantage of
+being accessible from outside the actions.
+
+@item lexer-init
+Initializing the analyser with the input source.
+@end table
+
+        To avoid name conflicts, these variables and others that we did
+not mention all begin with @samp{lexer@r{@dots{}}}.
+
+@ignore
+List of variables
+Name convention (lexer...)
+@end ignore
+
+@node Usage, , Functions, One analyser
+@subsection Using the lexical analyser
+
+@cindex Initialization of the analyser
+@cindex Token
+
+        The first function that must be called is the initialization
+function.  It is necessary to give to the analyser its source of
+characters.  Here is the template of a call to this function:
+
+@noindent
+@code{(lexer-init @var{input-type} @var{input})}
+
+@noindent
+The values @var{input-type} and @var{input} are described in
+@ref{Input}.
+
+        Once the initialization is done, the program can get
+@dfn{tokens} from the analyser by calling the lexical analysing
+function:
+@example
+(lexer)
+@end example
+@noindent
+The token is the result of the evaluation of the action corresponding to
+the matched pattern.  The current position can be obtained with:
+@example
+(lexer-get-line)
+(lexer-get-column)
+(lexer-get-offset)
+@end example
+@noindent
+As is described in @ref{Options}, some or all of these functions may not
+be available.  Characters can be gotten and ungotten from the input this
+way:
+@example
+(lexer-getc)
+(lexer-ungetc)
+@end example
+@noindent
+It is important to note that the analyser remembers the characters
+previously gotten.  Your program does not have to keep those itself.
+
+        Even after the end of file has been reached or an error has
+occured, the @code{lexer} function can be called again.  Its behavior
+depends on the remaining characters in the input.
+
+        The analyser can be reinitialized in any time with a new input.
+
+@ignore
+How to use it
+Can be called many times at end-of-file
+@end ignore
+
+@node Many analysers, Options, One analyser, Generating
+@section Many analysers
+
+        There are applications where it is necessary to have more than
+one lexical analyser parsing more than one file at a time.  For example:
+
+@itemize @minus
+@item
+The parsing of a C file (with cpp) may cause the parsing of other files
+recursively because of the @code{#include} commands.
+
+@item
+An interactive compiler has to be able to compile a file without closing
+the communication with the standard input.
+
+@item
+SILex itself parses the macro names, the regular expressions, the
+interior of a string, @dots{}, with different sets of patterns.
+@end itemize
+
+        We first begin with an overview on how SILex allows the
+programmer to create multiple lexical analysers.  We continue with a
+description of the function @code{lex-tables}.  We end the explanations
+with the functions used to creat analysers dynamically.
+
+@menu
+* Dynamic style::   It is possible to parse many files
+                    with many analysers.
+* Lex-tables::      The @code{lex-tables} command.
+* Usage2::          Building and using lexical analysers dynamically.
+@end menu
+
+@node Dynamic style, Lex-tables, Many analysers, Many analysers
+@subsection Creating analysers dynamically
+
+@cindex Dynamic creation of analysers
+@cindex Input system
+
+        It is quite easy to create new analysers at run-time.  Suppose
+there is an input that you want to analyse.  There are just two steps to
+make.
+
+@itemize @bullet
+@item
+Create an @dfn{input system} from the input.  An input system provides
+the buffering, the line counting and similar low level services.
+
+@item
+Create one or more analysers from the input system and the analyser
+tables.  The tables are generated by the function @code{lex-tables} from
+a specification file.  A table contains all the necessary information to
+build up an analyser.  Normally, you have to use more than one analyser
+per input when you expect the syntax to vary greatly in the input.
+@end itemize
+
+        The following example shows a typical organization for a
+multi-analyser lexical analysis.  Note that one table may have been used
+to produce many instances of analysers.  Those analysers would simply be
+connected to different input systems@footnote{It would make no sense to
+create two instances coming from the same table and being connected to
+the same input system.  They would both have exactly the same
+behavior.}.
+
+@example
+           Input1            Input2        Input3
+             |                 |             |
+             |                 |             |
+            IS1               IS2           IS3
+             |                 |             |
+     +-------+-------+         |          +--+---+
+     |       |       |         |          |      |
+   An1.1   An1.2   An1.3     An2.1      An3.1  An3.2
+@end example
+
+        There is no @i{a priori} limit on the number of input systems
+and analysers that you can create dynamically.
+
+@ignore
+Input systems & dynamic lexical analysers
+@end ignore
+
+@node Lex-tables, Usage2, Dynamic style, Many analysers
+@subsection The @code{lex-tables} command
+
+@findex lex-tables
+
+        The function @code{lex-tables} produces a table describing an
+analyser from a specification file.  A call to @code{lex-tables} looks
+like:
+
+@noindent
+@code{(lex-tables @var{input-file} @var{table-name} @var{output-file} [@var{options} @r{@dots{}}])}
+
+@noindent
+@var{input-file} must be a string containing the name of the
+specification file.  @var{output-file} is a string containing the name
+in which the result is printed.  A definition is written in the output
+file.  @var{table-name} must be a string and it is the name appearing in
+the definition.  The options are defined in @ref{Options}.
+
+        This is an example of a call to @code{lex-tables}:
+
+@example
+(lex-tables "c.l" "c-table" "c.l.scm")
+@end example
+
+@ignore
+Invocation of lex-tables
+@end ignore
+
+@node Usage2, , Lex-tables, Many analysers
+@subsection Building and using lexical analysers dynamically
+
+@cindex Building an analyser dynamically
+@pindex multilex.scm
+@cindex Name convention
+@findex lexer-make-IS
+@findex lexer-get-func-line
+@findex lexer-get-func-column
+@findex lexer-get-func-offset
+@findex lexer-get-func-getc
+@findex lexer-get-func-ungetc
+@findex lexer-make-lexer
+
+        In order to be able to create dynamically the analysers the
+program needs, the files containing the tables and the file
+@file{multilex.scm} must be loaded as part of the program.  The name
+convention is the following: all definitions in @file{multilex.scm}
+introduce names beginning with @samp{lexer@r{@dots{}}} and the
+definitions in the other files introduce names that are specified by the
+programmer.  This way, it is easy to avoid name conflicts.
+
+        Input systems are created with the function
+@code{lexer-make-IS}.  A call to this function looks like:
+
+@noindent
+@code{(lexer-make-IS @var{input-type} @var{input} [@var{counters}])}
+
+@noindent
+The values @var{input-type} and @var{input} are described in
+@ref{Input}.  The value of @var{counters} determines which counters the
+input system should maintain.  This is discussed in @ref{Input}.  Input
+systems are associative lists that cannot be used directly.
+
+        Useful functions can be extracted from an input system.  The
+following calls return functions that allows the program to interact
+with the input system:
+
+@example
+(lexer-get-func-line @var{input-system})
+(lexer-get-func-column @var{input-system})
+(lexer-get-func-offset @var{input-system})
+(lexer-get-func-getc @var{input-system})
+(lexer-get-func-ungetc @var{input-system})
+@end example
+
+        Lexical analysers are created with the function
+@code{lexer-make-lexer}.  The template of a call to this function is:
+
+@noindent
+@code{(lexer-make-lexer @var{table} @var{input-system})}
+
+@noindent
+@var{table} is a table generated by SILex.  @var{input-system} is the
+input system from which the analyser will take its input.  The result of
+the call is the analysis function.  The analysis function takes no
+argument and returns tokens.
+
+        This example summarizes all the step in the creation of an
+analyser:
+
+@example
+(let* ((my-port       (open-input-file "my-file"))
+       (my-IS         (lexer-make-IS 'port my-port))
+       (my-get-line   (lexer-get-func-line IS))
+       (my-get-column (lexer-get-func-column IS))
+       (my-get-offset (lexer-get-func-offset IS))
+       (my-getc       (lexer-get-func-getc IS))
+       (my-ungetc     (lexer-get-func-ungetc IS))
+       (my-analyser   (lexer-make-lexer my-table IS)))
+  (let loop ((tok (my-analyser)))
+    (cond ((eq? tok 'eof)
+           @r{@dots{}}
+@end example
+
+@ignore
+File lex-rt.scm
+How to use it: lex-rt.scm, lexer-make-IS, lexer-make-lexer & cie
+Can be called many times at end-of-file
+Name convention (lexer...)
+@end ignore
+
+@node Options, Input, Many analysers, Generating
+@section Options at generation time
+
+@cindex Options
+
+        We describe the options that can be passed to @code{lex} and
+@code{lex-tables}.  They indicate which counters (line, column and
+offset) the actions need; which table encoding should be used; and
+whether the tables should be pretty-printed.
+
+@menu
+* Counters::          Keeping the position in the input.
+* Tables encoding::   Encodings of the tables of an analyser.
+* Pretty print::      Pretty printing the tables.
+@end menu
+
+@node Counters, Tables encoding, Options, Options
+@subsection Line, column and offset counters
+
+@cindex Counters
+@vindex none
+@vindex line
+@vindex all
+
+        There are three different counting modes: no counter, line
+counter and all counters.  The more counters the input system maintains,
+the more it is slowed down.  The default is the line counting.
+
+        This option is specified when the program calls the functions
+@code{lex}, @code{lex-tables} and @code{lexer-make-IS}.  The three modes
+are represented by the symbols @samp{none}, @samp{line} and @samp{all}.
+When one of the first two functions is called the mode must be preceded
+by the symbol @samp{counters}.  These examples illustrate the use of the
+option:
+
+@example
+(lex "html.l" "html.l.scm" 'counters 'none)
+
+(lex-tables "cobol.l" "cobol-table" "cobol.l.scm" 'counters 'line)
+
+(lexer-make-IS 'port my-port 'all)
+@end example
+
+        You should be careful when you build analysers dynamically.  The
+mode specified at the input system creation must be consistent with the
+mode specified at the tables creation.
+
+@ignore
+counters
+@end ignore
+
+@node Tables encoding, Pretty print, Counters, Options
+@subsection Encoding of the table of an analyser
+
+@cindex Encoding of the table
+@vindex portable
+@vindex code
+@cindex Portability
+@cindex Fast analyser
+
+        SILex provides three different encodings of the tables: the
+default encoding, the portable encoding and the ``compilation'' to
+Scheme code.
+
+        With the default encoding, the finite automaton of the analyser
+is represented with data structures that contain the @emph{numbers} of
+the characters (in the sense of @code{char->integer}).  Since the
+numbers associated with the characters may depend on the Scheme
+implementation, an analyser generated with an implementation can be
+safely used only with the same implementation.  An analyser encoded in
+the default style is not portable.  But this representation is the most
+compact.
+
+        With the portable encoding, the data structures describing the
+automaton contain characters directly.  If the automaton, as generated,
+contains a transition from state @var{s} to state @var{t} on character
+@var{c}, then somewhere in the table there is the Scheme character
+@samp{#\@var{c}}.  When the file containing the analyser is loaded in
+any implementation, the character is read as is, and not as the number
+@samp{(char->integer #\@var{c})} as evaluated by the original
+implementation.  As long as the implementation using the analyser
+recognizes the characters mentionned in it, there is no problem.
+
+        So this encoding is portable.  However, it is less compact.
+This is because something like @samp{(65 90)} is more compact than
+something like @samp{(#\A #\B @r{@dots{}} #\Y #\Z)} to represent
+@samp{[A-Z]}.  The construction of an analyser from a portable table
+takes more time than the construction from a default table.  But, once
+built, the performance of the analyser is the same in both cases.
+
+        It is important to note that in some character sets, the letters
+or the digits are not contiguous.  So, in those cases, the regular
+expression @samp{[A-Z]} does not necessarily accept only the uppercase
+letters.
+
+        The last encoding is the compilation to Scheme code.  This
+produces a fast lexical analyser.  Instead of containing data structures
+representing the behavior of the automaton, the table contains Scheme
+code that ``hard-codes'' the automaton.  This encoding often generates
+big tables.  Such an analyser is not portable.
+
+        The encoding of the tables can be specified as an option when
+@code{lex} and @code{lex-tables} are called.  The symbols
+@samp{portable} and @samp{code} are used to specify that the table must
+be portable and that the table must be compiled, respectively.  For
+example, these calls illustrate the use of the options:
+
+@example
+(lex "c.l" "c.l.scm")             ; @r{Default encoding}
+
+(lex "c.l" "c.l.scm" 'portable)   ; @r{Portable encoding}
+
+(lex "c.l" "c.l.scm" 'code)       ; @r{Compilation of the automaton}
+@end example
+
+@ignore
+portable / code
+@end ignore
+
+@node Pretty print, , Tables encoding, Options
+@subsection Pretty printing the tables
+
+@cindex Pretty-printing the tables
+
+        The pretty-print option (specified with the symbol @samp{pp})
+tells SILex to pretty-print the contents of the table.  Normally, the
+table is displayed as a compact mass of characters fitting in about 75
+columns.  The option is useful only for a developer of SILex.  The
+Scheme code generated with the @samp{code} option is always
+pretty-printed.
+
+@ignore
+pp
+@end ignore
+
+@node Input, , Options, Generating
+@section Input methods
+
+@cindex Input
+@cindex Input port, input from an
+@cindex String, input from a
+@cindex Function, input from a
+
+        An analyser can take its input from three different objects: an
+input port, a string or a function.  The type of input and the input
+itself must be passed when an analyser is initialized and when an input
+system is created.  The input type is specified using one of the three
+symbols: @samp{port}, @samp{string} or @samp{procedure}.  For example:
+
+@example
+(lexer-init 'port (current-input-port))
+
+(lexer-make-IS 'string "Input string.")
+@end example
+
+        When an input port is used by an analyser, the program should
+avoid reading characters directly from the port.  This is because the
+analyser may have needed a look-ahead to do the analysis of the
+preceding token.  The program would not find what it expects on the
+port.  The analyser provides safe functions to get characters from the
+input.  The analyser never closes itself the port it has received, this
+task is left to the program.
+
+        When the analyser is initialized with a string, it takes a copy
+of it.  This way, eventual mutations of the string do not affect the
+analysis.
+
+        The use of a function as character source allows the analyser to
+parse any character stream, no matter how it is obtained.  For example,
+the characters may come from the decompression or decryption of a huge
+file, the task being done lazily in order to save space.  The function
+must take no argument and return a character each time it is called.
+When the end of file (or its logical equivalent) is reached, the
+function must return an object that is not a character (for example, the
+symbol @samp{eof}).  After the function has returned an end of file
+indicator, it is not called again.
+
+@ignore
+port / string / function
+Copy of the string
+The function at end-of-file is not called again
+@end ignore
+
+
+@c ---------- Interfacing with lalr.scm ----------
+@node Interface, Acknowledgements, Generating, Top
+@c    Node,      Next,             Prev,       Up
+@appendix Interfacing with an @sc{lalr}(1) parser
+
+@cindex Dominique Boucher
+@cindex @sc{lalr}(1) parser generator
+
+        A nice @sc{lalr}(1) parser generator for Scheme has been written
+by Dominique Boucher.  The generator is accessible at the Scheme
+Repository at @code{ftp://ftp.cs.indiana.edu} in the file
+@file{/pub/scheme-repository/code/lang/lalr-scm.tar.gz}.
+
+        The parsers that are generated need two functions to operate: a
+lexical analysis function and an error function.  The analysis function
+must take no argument and return a token each time it is called.  This
+is exactly the behavior of the lexical analysis functions created by
+SILex.
+
+        The @sc{lalr}(1) parsers expect that the tokens are pairs with a
+number in the @sc{car}, the token number, and any value in the @sc{cdr},
+the token attribute.  It is easy to respect this convention with a SILex
+lexical analyser since the actions can be any Scheme expressions.
+Furthermore, the file created by the @sc{lalr}(1) parser generator
+contains definitions that give names to the number of the tokens.  A
+lexical analyser can use those names in its actions in order to simplify
+the coordination between the two analysers.
+
+
+@c ---------- Acknowledgements ----------
+@node Acknowledgements, Index, Interface, Top
+@c    Node,             Next,  Prev,      Up
+@chapheading Acknowledgements
+
+        I would like to thank my comrades of the laboratory for their
+support in this project.  Especially Martin Larose and Marc Feeley for
+their numerous suggestions.
+
+        I hope SILex will be useful for many Scheme programmers.
+
+        If you find a bug, please let me know at
+@code{mailto:dube@@iro.umontreal.ca}.
+
+
+@c ---------- Index & tables of contents ----------
+@node Index,     , Acknowledgements, Top
+@c    Node,  Next, Prev,             Up
+@unnumbered Index
+
+@printindex cp
+
+@contents
+@bye
+
+Memos:
+Verifier si des trucs comme #\^L sont portables
-- 
cgit v1.2.3