                            OpenToken User Guide

User Manual

Index

   * Introduction
   * Step 1: Creating an enumeration of tokens
   * Step 2: Instantiate a token analyzer class
   * Step 3: Create a token recognizer for each token
        o Step 3.5: Creating a custom token recognizer type
   * Step 4: Map the recognizers to their tokens
   * Step 5: Create a Token Analyzer object
        o Advanced: Using your own Text Feeder
   * Use

Introduction

The OpenToken lexical analyzer generator packages consist of two parts:

  1. The lexical analysis engine itself (Token.Analyzer)
  2. A set of token recognizer packages (Token.Line_Comment, etc.)

There are five steps to creating your own lexical analyzer using OpenToken.

  1. Define an enumeration containing all your tokens.
  2. Instantiate a token analyzer class for your tokens.
  3. Create a token recognizer for each token.
  4. Map the recognizers to their tokens to create a syntax.
  5. Create a token analyzer object initialized with your syntax.

The following sections will walk you through each of these steps in detail,
using the examples from chapter 3 of the "dragon book".

Step 1: Creating an enumeration of tokens

This step is fairly simple. Just create an enumerated type containing one
entry for each token you want to be recognized in the input. For our
example, we will assume the grammar in Example 3.6 of Compilers,
Principles, Techniques, & Tools.*


     type Example_Token_ID is
       (If_ID, Then_ID, Else_ID, ID_ID, Num, Relop, Whitespace);

Again, this is a very simple step once you know the list of tokens you
need. But of course figuring that out is not always so simple!

Step 2: Instantiate a token analyzer class

This step is trivial. Simply instantiate the generic Token.Analyzer package
with your token enumerated type.

     package Tokenizer is new Token.Analyzer (Example_Token_ID);

Step 3: Create a token recognizer for each token

Each token needs a recognizer object. Recognizer objects can be created in
one of two ways. The easy way is to use one of the recognizer classes in
the Token.* hierarchy of packages.

     If_Recognizer      : constant Token.Handle :=
       new Token.Keyword.Instance'(Token.Keyword.Get ("if"));
     Then_Recognizer    : constant Token.Handle :=
       new Token.Keyword.Instance'(Token.Keyword.Get ("then"));
     Else_Recognizer    : constant Token.Handle :=
       new Token.Keyword.Instance'(Token.Keyword.Get ("else"));
     ID_Recognizer      : constant Token.Handle :=
       new Token.Identifier.Instance'(Token.Identifier.Get);
     Num_Recognizer     : constant Token.Handle :=
       new Token.Real.Instance'(Token.Real.Get);
     Whitesp_Recognizer : constant Token.Handle :=
       new Token.Character_Set.Instance'
         (Token.Character_Set.Get (Token.Character_Set.Standard_Whitespace));

Step 3.5: Creating a custom token recognizer type

If you have a token that cannot be recognized by any of the default
recognizers, there is an extra step: you have to create your own recognizer
routine. That may sound like a lot of work, but really it is not
significantly more complicated than creating a regular expression in lex
would be.

A recognizer is a tagged type that is derived from the type Token.Instance.
You should extend the type to provide yourself state information and to
keep track of any settings that your recognizer type may allow. Other
routines and information about this specific type of token may be placed in
there too. In our example the token Relop cannot be recognized by any of
the provided token recognizers, so we declare it as follows. Most of the
declaration is boilerplate that can be cut and pasted; only the package
name and the state declarations are specific to this recognizer.


     with Token;
     package Relop_Example_Token is

        type Instance is new Token.Instance with private;


     ---------------------------------------------------------------------------

        -- This function will be called to create a Relop token
        -- recognizer. Note that this is a simple recognizer, so Get
        -- doesn't need any parameters.

     ---------------------------------------------------------------------------

        function Get return Instance;

     private

        type State_ID is (First_Char, Equal_or_Greater, Equal, Done);

        type Instance is new Token.Instance with record
           State : State_ID := First_Char;
        end record;


     ---------------------------------------------------------------------------

        -- This procedure will be called when analysis on a new candidate
        -- string is started. The Token needs to clear its state (if any).

     ---------------------------------------------------------------------------

        procedure Clear (The_Token : in out Instance);



     ---------------------------------------------------------------------------

        -- This procedure will be called to perform further analysis on a
        -- token based on the given next character.

     ---------------------------------------------------------------------------

        procedure Analyze (The_Token : in out Instance;
                           Next_Char : in     Character;
                           Verdict   : out    Token.Analysis_Verdict);

     end Relop_Example_Token;


Note that very little of the code is specific to this recognizer; just the
name of the package and the states between the first and last state. Of
course, more routines and fields in Instance may be added at your
discretion, depending on the needs of the recognizer.

When naming states, I have found it easiest to stick to the following
standard:

   * The first and last states are named First_Char and Done respectively.
   * The intervening states are named for the current part of the token we
     are expecting to recognize, not for the item that was just recognized.

The package body requires a bit more thought. You will have to implement a
state machine for recognizing your token. At the end of any state you will
need to set the new state for the recognizer (if it changed) and return the
match result for the given character.

The result will be one of the enumeration values in Token.Analysis_Verdict.
Matches indicates that the string you have been fed so far (since the last
Clear call) does fully qualify as a token. So_Far_So_Good indicates that
the string in its current state does not match a token, but it could
possibly match in the future, depending on the next characters that are fed
in. Note that it is quite possible for the verdict to be Matches on one
call, and So_Far_So_Good on a later call, depending on the definition of the
token. The final verdict, Failed, is different. You return it to indicate
that the string is not a legal token of your type, and can never be one no
matter how many more characters are fed in. Whenever you return this, you
should set the recognizer's state to Done as well.

     package body Relop_Example_Token is


     ---------------------------------------------------------------------------

        -- This procedure will be called when analysis on a new candidate
        -- string is started. The Token needs to clear its state (if any).

     ---------------------------------------------------------------------------

        procedure Clear (The_Token : in out Instance) is
        begin
           The_Token.State := First_Char;
        end Clear;


     ---------------------------------------------------------------------------

        -- This function will be called to create a Relop token
        -- recognizer.

     ---------------------------------------------------------------------------

        function Get return Instance is
        begin
           return (Report => True,
                   State  => First_Char);
        end Get;


     --------------------------------------------------------------------------

        -- This procedure will be called to perform further analysis on a
        -- token based on the given next character.

     ---------------------------------------------------------------------------

        procedure Analyze (The_Token : in out Instance;
                           Next_Char : in Character;
                           Verdict   : out Token.Analysis_Verdict) is
        begin

           case The_Token.State is

           when First_Char =>
              -- If the first char is a <, =, or >, it's a match
              case Next_Char is
                 when '<' =>
                    Verdict         := Token.Matches;
                    The_Token.State := Equal_Or_Greater;

                 when '>' =>
                    Verdict         := Token.Matches;
                    The_Token.State := Equal;


                 when '=' =>
                    Verdict         := Token.Matches;
                    The_Token.State := Done;

                 when others =>
                    Verdict         := Token.Failed;
                    The_Token.State := Done;
              end case;

           when Equal_Or_Greater =>

              -- If the next char is a > or =, it's a match
              case Next_Char is
                 when '>' | '=' =>
                    Verdict         := Token.Matches;
                    The_Token.State := Done;

                 when others =>
                    Verdict         := Token.Failed;
                    The_Token.State := Done;
               end case;

           when Equal =>

              -- If the next char is a =, it's a match
              if Next_Char = '=' then
                 Verdict         := Token.Matches;
                 The_Token.State := Done;
              else
                 Verdict         := Token.Failed;
                 The_Token.State := Done;
              end if;

           when Done =>
              Verdict := Token.Failed;
           end case;
        end Analyze;

     end Relop_Example_Token;

Now the only thing that remains is to create a token recognizer object of
your new recognizer type, just like you did for the predefined recognizer
types.

     Relop_Recognizer : constant Token.Handle :=
       new Relop_Example_Token.Instance'(Relop_Example_Token.Get);

Step 4: Map the recognizers to their tokens

This step is quite simple. Just declare an object of type Tokenizer.Syntax
(assuming your instantiation of the analyzer package in step 2 was named
Tokenizer). Initialize the array with the proper token recognizers for each
token index. For our example it would look like this:


     Syntax : constant Tokenizer.Syntax :=
        (If_ID      => If_Recognizer,
         Then_ID    => Then_Recognizer,
         Else_ID    => Else_Recognizer,
         ID_ID      => ID_Recognizer,
         Num        => Num_Recognizer,
         Relop      => Relop_Recognizer,
         Whitespace => Whitesp_Recognizer
        );

Note that steps 3 and 4 could easily be combined into one step, e.g.:

     Syntax : constant Tokenizer.Syntax :=
        (If_ID      => new Token.Keyword.Instance'(Token.Keyword.Get ("if")),
         Then_ID    => new Token.Keyword.Instance'(Token.Keyword.Get ("then")),
         Else_ID    => new Token.Keyword.Instance'(Token.Keyword.Get ("else")),
         ID_ID      => new Token.Identifier.Instance'(Token.Identifier.Get),
         Num        => new Token.Real.Instance'(Token.Real.Get),
         Relop      => new Relop_Example_Token.Instance'
                         (Relop_Example_Token.Get),
         Whitespace => new Token.Character_Set.Instance'
                         (Token.Character_Set.Get
                            (Token.Character_Set.Standard_Whitespace))
        );

Step 5: Create a Token Analyzer object

Now we are ready to create our token analyzer. All we have to do is declare
an object of type Tokenizer.Instance (again, assuming that Tokenizer is the
name of the analyzer instantiated back in step 2), and initialize it via
the Tokenizer.Initialize call. For this call we supply the syntax object
from step 4.

     Analyzer : Tokenizer.Instance := Tokenizer.Initialize (Syntax);

This creates an analyzer that will read input from
Ada.Text_IO.Current_Input, and attempt to match it to the given syntax. By
default this will be standard input, but that can be redirected to the file
of your choice using Ada.Text_IO.Set_Input.
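As a sketch, redirecting the analyzer's input to a file looks like this
(the file name "example.input" is just for illustration):

     Input_File : Ada.Text_IO.File_Type;
     ...
     -- Open the file and make it the analyzer's input source
     Ada.Text_IO.Open (Input_File, Ada.Text_IO.In_File, "example.input");
     Ada.Text_IO.Set_Input (Input_File);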

Advanced: Using your own Text Feeder

In the majority of cases this will be sufficient. However, if you want to
read the analyzed text from some other source while preserving the ability
to read user input from standard input, you can instead create your own
text feeder function and pass a pointer to it when you create the Analyzer:


     function My_Text_Feeder return String;

     Analyzer : Tokenizer.Instance := Tokenizer.Initialize
        (Language_Syntax => Syntax,
         Feeder          => My_Text_Feeder'access);

The text feeder is simply a function that will be called to retrieve a
string of data to be analyzed. Whenever the analyzer
runs out of characters to process, it will request more from the feeder
function. If you do not supply one, a default is used which reads input
from the standard input stream. If you want to change the feeder function
during analysis, use the function Set_Text_Feeder:

        Tokenizer.Set_Text_Feeder (Analyzer => Analyzer,
                                   Feeder   => My_New_Text_Feeder
                                  );
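As a sketch, a feeder that hands the analyzer one fixed string and empty
strings thereafter might be written as follows (My_Text_Feeder and Sent are
names invented for this example, not part of OpenToken):

     Sent : Boolean := False;

     function My_Text_Feeder return String is
     begin
        -- After the first call, report that no more text is available
        if Sent then
           return "";
        end if;
        Sent := True;
        return "if A < B then else";
     end My_Text_Feeder;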

Use

Now we have our own token analyzer. To use it, all we have to do is call
Tokenizer.Find_Next once for each token we want to find. Tokenizer.Token
will return the ID of the token that was found. Tokenizer.Lexeme returns
the actual string that was matched.
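For instance, a loop that reports each token found might look something
like this, assuming Find_Next, Token, and Lexeme each take the analyzer as
a parameter (check the Tokenizer package spec for the exact signatures):

     for Count in 1 .. 10 loop
        -- Advance to the next token and report its ID and matched text
        Tokenizer.Find_Next (Analyzer);
        Ada.Text_IO.Put_Line
          ("Found " & Example_Token_ID'Image (Tokenizer.Token (Analyzer)) &
           " (lexeme: " & Tokenizer.Lexeme (Analyzer) & ")");
     end loop;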

The full source that was used for this tutorial is available in the
Examples/ASU_Example_3_6 directory, along with a sample input file. To run
it, issue the "make" command in that directory. When the command completes,
type in "asu_example_3_6" to run it. You should see the following list of
tokens recognized:

     Found IF_ID
     Found ID_ID
     Found RELOP
     Found ID_ID
     Found THEN_ID
     Found ELSE_ID
     Found RELOP
     Found REAL
     Found RELOP
     Found INT

  ------------------------------------------------------------------------
*  This is the classic text on compiler theory. Note that for this example
we have some minor modifications to the syntax to keep things simple. For
instance, the "num" terminal has been split into the following two terminals:

     integer -> (+ | -)? digit+
     real    -> (+ | -)? (digit | _)* digit . (digit | _)*
                ( (e | E) (- | +)? (digit)+ )?

This change has been made simply because it matches the definition used for
the Integer and Real tokens provided with the OpenToken package. A joint
"num" token could have been created to exactly match the num specified in
ASD, but we will leave that as an excersize for the reader.
  ------------------------------------------------------------------------

Revisions

$Log: UsersGuide.txt,v $
Revision 1.2  1999/08/17 03:21:42  Ted
Add log line

