OpenToken User Guide

Introduction

The OpenToken lexical analyzer generator consists of two parts:
  1. The lexical analysis engine itself (Token.Analyzer)
  2. A set of token recognizer packages (Token.Line_Comment, etc.)
There are five steps to creating your own lexical analyzer using OpenToken:
  1. Define an enumeration containing all your tokens.
  2. Instantiate a token analyzer class for your tokens.
  3. Create a token recognizer for each token.
  4. Map the recognizers to their tokens to create a syntax.
  5. Create a token analyzer object initialized with your syntax.
The following sections will walk you through each of these steps in detail, using the examples from chapter 3 of the "dragon book".

Step 1: Creating an enumeration of tokens

This step is fairly simple. Just create an enumerated type containing one entry for each token you want to be recognized in the input. For our example, we will assume the grammar in Example 3.6 of Compilers, Principles, Techniques, & Tools.*
 
type Example_Token_ID is (If_ID, Then_ID, Else_ID, ID_ID, Num, Relop, Whitespace);
Again, this is a very simple step once you know the list of tokens you need. But of course figuring that out is not always so simple!

Step 2: Instantiate a token analyzer class

This step is trivial. Simply instantiate the generic Token.Analyzer package with your token enumerated type.
package Tokenizer is new Token.Analyzer (Example_Token_ID);

Step 3: Create a token recognizer for each token

Each token needs a recognizer object. Recognizer objects can be created in one of two ways. The easy way is to use one of the recognizer classes in the Token.* hierarchy of packages.
If_Recognizer   : constant Token.Handle := new Token.Keyword.Instance'
                                           (Token.Keyword.Get ("if"));
Then_Recognizer : constant Token.Handle := new Token.Keyword.Instance'
                                           (Token.Keyword.Get ("then"));
Else_Recognizer : constant Token.Handle := new Token.Keyword.Instance'
                                           (Token.Keyword.Get ("else"));
ID_Recognizer   : constant Token.Handle := new Token.Identifier.Instance'
                                           (Token.Identifier.Get);
Num_Recognizer  : constant Token.Handle := new Token.Real.Instance'
                                           (Token.Real.Get);
Whitesp_Recognizer : constant Token.Handle :=
  new Token.Character_Set.Instance'
    (Token.Character_Set.Get (Token.Character_Set.Standard_Whitespace));

Step 3.5: Creating a custom token recognizer type

If you have a token that cannot be recognized by any of the default recognizers, there is an extra step: you have to create your own recognizer type. That may sound like a lot of work, but really it is not significantly more complicated than writing a regular expression in lex would be.

A recognizer is a tagged type derived from the type Token.Instance. You should extend the type to hold state information and to keep track of any settings your recognizer type may allow. Other routines and information about this specific kind of token may be placed in it too. In our example the token Relop cannot be recognized by any of the provided token recognizers, so we declare our own as follows. Most of the declaration is boilerplate that can be cut and pasted; only the package name and the intermediate states are specific to this recognizer.
 

with Token;
package Relop_Example_Token is

   type Instance is new Token.Instance with private;

   ---------------------------------------------------------------------------
   -- This function will be called to create a Relop token recognizer. Note
   -- that this is a simple recognizer, so Get doesn't need any parameters.
   ---------------------------------------------------------------------------
   function Get return Instance;

private

   type State_ID is (First_Char, Equal_or_Greater, Equal, Done);

   type Instance is new Token.Instance with record
      State : State_ID := First_Char;
   end record;

   ---------------------------------------------------------------------------
   -- This procedure will be called when analysis on a new candidate string
   -- is started. The Token needs to clear its state (if any).
   ---------------------------------------------------------------------------
   procedure Clear (The_Token : in out Instance);
 

   ---------------------------------------------------------------------------
   -- This procedure will be called to perform further analysis on a token
   -- based on the given next character.
   ---------------------------------------------------------------------------
   procedure Analyze (The_Token : in out Instance;
                      Next_Char : in     Character;
                      Verdict   : out    Token.Analysis_Verdict);

end Relop_Example_Token;
 

Note that very little of this is specific to the Relop recognizer: just the name of the package and the states between the first and last one. Of course, more routines and fields in Instance may be added at your discretion, depending on the needs of the recognizer.

When naming states, I have found it easiest to name each intermediate state after the characters it is still prepared to accept, as in Equal_or_Greater above.

The package body requires a bit more thought. You will have to implement a state machine for recognizing your token. At the end of any state you will need to set the new state for the recognizer (if it changed) and return the match result for the given character.

The result will be one of the enumeration values in Token.Analysis_Verdict. Matches indicates that the string you have been fed so far (since the last Clear call) fully qualifies as a token. So_Far_So_Good indicates that the string in its current state does not match a token, but could in the future, depending on the characters fed in next. Note that it is quite possible for the verdict to be Matches on one call and So_Far_So_Good on a later call, depending on the definition of the token. The final verdict, Failed, is different: you return it to indicate that the string is not a legal token of your type and can never become one, no matter how many more characters are fed in. Whenever you return Failed, you should set the recognizer's state to Done as well.

package body Relop_Example_Token is

   ---------------------------------------------------------------------------
   -- This procedure will be called when analysis on a new candidate string
   -- is started. The Token needs to clear its state (if any).
   ---------------------------------------------------------------------------
   procedure Clear (The_Token : in out Instance) is
   begin
      The_Token.State := First_Char;
   end Clear;

   ---------------------------------------------------------------------------
   -- This procedure will be called to create a Relop token recognizer
   ---------------------------------------------------------------------------
   function Get return Instance is
   begin
      return (Report => True,
              State  => First_Char);
   end Get;

   --------------------------------------------------------------------------
   -- This procedure will be called to perform further analysis on a token
   -- based on the given next character.
   ---------------------------------------------------------------------------
   procedure Analyze (The_Token : in out Instance;
                      Next_Char : in Character;
                      Verdict   : out Token.Analysis_Verdict) is
   begin

      case The_Token.State is

      when First_Char =>
         -- If the first char is a <, =, or >, it's a match
         case Next_Char is
            when '<' =>
               Verdict         := Token.Matches;
               The_Token.State := Equal_Or_Greater;

            when '>' =>
               Verdict         := Token.Matches;
               The_Token.State := Equal;

            when '=' =>
               Verdict         := Token.Matches;
               The_Token.State := Done;

            when others =>
               Verdict         := Token.Failed;
               The_Token.State := Done;
         end case;

      when Equal_Or_Greater =>

         -- If the next char is a > or =, it's a match
         case Next_Char is
            when '>' | '=' =>
               Verdict         := Token.Matches;
               The_Token.State := Done;

            when others =>
               Verdict         := Token.Failed;
               The_Token.State := Done;
         end case;

      when Equal =>

         -- If the next char is a =, it's a match
         if Next_Char = '=' then
            Verdict         := Token.Matches;
            The_Token.State := Done;
         else
            Verdict         := Token.Failed;
            The_Token.State := Done;
         end if;

      when Done =>
         Verdict := Token.Failed;

      end case;
   end Analyze;

end Relop_Example_Token;

Now the only thing that remains is to create a token recognizer object of your new recognizer type, just like you did for the predefined recognizer types.
Relop_Recognizer  : constant Token.Handle := new Relop_Example_Token.Instance'
                                             (Relop_Example_Token.Get);

Step 4: Map the recognizers to their tokens

This step is quite simple. Just declare an object of type Tokenizer.Syntax (assuming your instantiation of the analyzer package in step 2 was named Tokenizer). Initialize the array with the proper token recognizers for each token index. For our example it would look like this:
 
Syntax : constant Tokenizer.Syntax :=
   (If_ID      => If_Recognizer,
    Then_ID    => Then_Recognizer,
    Else_ID    => Else_Recognizer,
    ID_ID      => ID_Recognizer,
    Num        => Num_Recognizer,
    Relop      => Relop_Recognizer,
    Whitespace => Whitesp_Recognizer
   );
Note that steps 3 and 4 could easily be combined into one step, e.g.:
Syntax : constant Tokenizer.Syntax :=
   (If_ID   => new Token.Keyword.Instance'(Token.Keyword.Get ("if")),
    Then_ID => new Token.Keyword.Instance'(Token.Keyword.Get ("then")),
    Else_ID => new Token.Keyword.Instance'(Token.Keyword.Get ("else")),
    ID_ID   => new Token.Identifier.Instance'(Token.Identifier.Get),
    Num     => new Token.Real.Instance'(Token.Real.Get),
    Relop   => new Relop_Example_Token.Instance'(Relop_Example_Token.Get),
    Whitespace => new Token.Character_Set.Instance'(Token.Character_Set.Get
                        (Token.Character_Set.Standard_Whitespace))
   );

Step 5: Create a Token Analyzer object

Now we are ready to create our token analyzer. All we have to do is declare an object of type Tokenizer.Instance (again, assuming that Tokenizer is the name of the analyzer instantiated back in step 2), and initialize it via the Tokenizer.Initialize call. For this call we supply the syntax object from step 4.
Analyzer : Tokenizer.Instance := Tokenizer.Initialize (Syntax);
This creates an analyzer that will read input from Ada.Text_IO.Current_Input and attempt to match it against the given syntax. By default this is standard input, but it can be redirected to the file of your choice using Ada.Text_IO.Set_Input.
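For example, to make the analyzer read from a file instead, redirect the current input before analysis begins. This sketch uses only standard Ada.Text_IO; the file name is made up for illustration:

```ada
with Ada.Text_IO;

procedure Redirect_Input is
   Input_File : Ada.Text_IO.File_Type;
begin
   --  Open the file and make it the current input; the analyzer will
   --  now draw its characters from here instead of standard input.
   Ada.Text_IO.Open (Input_File, Ada.Text_IO.In_File, "example.txt");
   Ada.Text_IO.Set_Input (Input_File);
end Redirect_Input;
```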

Advanced: Using your own Text Feeder

In the majority of cases this will be sufficient. However, if you want to read text from some other source, or preserve standard input for interactive user input, you can instead create your own text feeder function and pass an access to it when you create the Analyzer:
 
function My_Text_Feeder return String;

Analyzer : Tokenizer.Instance := Tokenizer.Initialize
   (Language_Syntax => Syntax,
    Feeder          => My_Text_Feeder'Access);

The text feeder is simply a function that is called to retrieve a string of data to be analyzed. Whenever the analyzer runs out of characters to process, it requests more from the feeder function. If you do not supply one, a default is used that reads from the standard input stream. If you want to change the feeder function during analysis, use Set_Text_Feeder:
   Tokenizer.Set_Text_Feeder (Analyzer => Analyzer,
                              Feeder   => My_New_Text_Feeder'Access);
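As a sketch of what such a feeder might look like, here is one that reads from a file one line at a time. My_File, the buffer size, and the appended line feed are assumptions for illustration, not part of the OpenToken API:

```ada
--  Context clauses: with Ada.Text_IO; with Ada.Characters.Latin_1;

My_File : Ada.Text_IO.File_Type;  --  assumed to be opened elsewhere

function My_Text_Feeder return String is
   Line : String (1 .. 1_024);
   Last : Natural;
begin
   --  Hand the analyzer one line at a time, restoring the line
   --  terminator so tokens cannot run together across lines.
   Ada.Text_IO.Get_Line (My_File, Line, Last);
   return Line (1 .. Last) & Ada.Characters.Latin_1.LF;
end My_Text_Feeder;
```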

Use

Now we have our own token analyzer. To use it, all we have to do is call Tokenizer.Find_Next once for each token we want to find. Tokenizer.Token returns the ID of the token that was found, and Tokenizer.Lexeme returns the actual string that was matched.
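Putting the pieces together, a minimal driver loop might look like the following sketch. The exact subprogram profiles, and the use of End_Error to detect the end of input from the default feeder, are assumptions here; check the analyzer package spec for the real profiles:

```ada
--  Context: the Tokenizer instantiation and Analyzer object from above.
procedure Print_Tokens is
begin
   loop
      --  Advance to the next token in the input stream.
      Tokenizer.Find_Next (Analyzer);
      Ada.Text_IO.Put_Line
        ("Found " & Example_Token_ID'Image (Tokenizer.Token (Analyzer)) &
         " """ & Tokenizer.Lexeme (Analyzer) & """");
   end loop;
exception
   when Ada.Text_IO.End_Error =>
      null;  --  out of input; we are done
end Print_Tokens;
```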

The full source that was used for this tutorial is available in the Examples/ASU_Example_3_6 directory, along with a sample input file. To run it, issue the "make" command in that directory. When the command completes, type in "asu_example_3_6" to run it. You should see the following list of tokens recognized:

Found IF_ID
Found ID_ID
Found RELOP
Found ID_ID
Found THEN_ID
Found ELSE_ID
Found RELOP
Found REAL
Found RELOP
Found INT

*  This is the classic text on compiler theory. Note that for this example we have made some minor modifications to the grammar to keep things simple. For instance, the "num" terminal has been split into the following two terminals:
integer -> (+ | -)? digit+
real -> (+ | -)? (digit | _)* digit . (digit | _)* ( (e | E) (- | +)? (digit)+ )?
This change was made simply because it matches the definitions used for the Integer and Real tokens provided with the OpenToken package. A joint "num" token could have been created to exactly match the num specified in the text, but we will leave that as an exercise for the reader.

Revisions

$Log: UsersGuide.html,v $
Revision 1.2  1999/08/17 03:21:41  Ted
Add log line