Introduction to Flex and Bison


    Before reading this article, you should be familiar with the following:

    • Regular expressions
    • Basic usage of makefiles
    • Basic knowledge of compilers

    Generally speaking, a compiler has 3 parts:

    • Front-end: Lexer/Scanner, Parser, Semantic Analyzer
      • The front-end produces an IR (Intermediate Representation).
    • Middle-end: Optimizer
    • Back-end: Code Generator
      • The back-end generates binary code for a specific machine.

    Flex and Bison help us build the front-end of a compiler. This article introduces their basic usage by implementing a calculator.

    flex and bison are command-line tools available on Unix-like operating systems; see man flex or man bison.

    Intro

    Lexer

    A lexer is called "词法分析器" in Chinese. Lexing (lexical analysis) is the process of converting a sequence of characters into a sequence of tokens. For example, given the expression var = 1 - 3 ** 2 in our source code, the lexer should output:

    +--------+----------------------+
    | Lexeme |    Token Type        |
    +--------+----------------------+
    |  var   | Variable             |
    |  =     | Assignment Operator  |
    |  1     | Number               |
    |  -     | Subtraction Operator |
    |  3     | Number               |
    |  **    | Power Operator       |
    |  2     | Number               |
    +--------+----------------------+
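
    To make the table concrete, here is a minimal hand-written sketch of such a lexer in C. It is not what flex generates; the names tok_kind_t, token_t and next_token are hypothetical and exist only for this illustration. Each call scans one lexeme and reports its token type.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds for this sketch (not names produced by flex). */
typedef enum {
    TOK_END, TOK_VARIABLE, TOK_NUMBER, TOK_ASSIGN,
    TOK_SUB, TOK_MUL, TOK_POWER, TOK_UNKNOWN
} tok_kind_t;

typedef struct {
    tok_kind_t kind;
    const char *start; /* first character of the lexeme */
    size_t len;        /* length of the lexeme */
} token_t;

/* Scan one token starting at *cursor and advance the cursor past it. */
token_t next_token(const char **cursor)
{
    assert(cursor != NULL && *cursor != NULL);
    const char *p = *cursor;
    while (*p == ' ' || *p == '\t')
        p++; /* blanks separate tokens but are not tokens themselves */

    token_t tok = { TOK_END, p, 0 };
    if (*p == '\0') {
        *cursor = p;
        return tok;
    }

    if (isalpha((unsigned char)*p)) {
        while (isalnum((unsigned char)*p))
            p++;
        tok.kind = TOK_VARIABLE;
    } else if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p))
            p++;
        tok.kind = TOK_NUMBER;
    } else if (p[0] == '*' && p[1] == '*') {
        p += 2; /* prefer the longest match: "**" before "*" */
        tok.kind = TOK_POWER;
    } else if (*p == '*') {
        p++;
        tok.kind = TOK_MUL;
    } else if (*p == '=') {
        p++;
        tok.kind = TOK_ASSIGN;
    } else if (*p == '-') {
        p++;
        tok.kind = TOK_SUB;
    } else {
        p++;
        tok.kind = TOK_UNKNOWN;
    }

    tok.len = (size_t)(p - tok.start);
    *cursor = p;
    return tok;
}
```

    Calling next_token repeatedly on "var = 1 - 3 ** 2" yields the seven rows of the table, followed by TOK_END.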
    

    Parser

    A parser is called "语法分析器" in Chinese. Parsing is the process of analyzing a sequence of tokens to determine its grammatical structure, usually represented as an Abstract Syntax Tree (AST). Syntax errors are detected during this stage.

    The parser of a compiler converts the <lexeme, token type> pairs into an AST. For the expression var = 1 - 3 ** 2:

         =
       /   \
     var    -
          /   \
         1     **
              /   \
             3     2
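
    One possible way to represent such a tree in C is sketched below. The node_t layout and ast_eval are assumptions made for this illustration, not a structure that Bison prescribes. The sketch covers only the right-hand side of the assignment (1 - 3 ** 2) and evaluates it by walking the tree.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node kinds for this sketch. */
typedef enum { NODE_NUMBER, NODE_SUB, NODE_POWER } node_kind_t;

typedef struct node {
    node_kind_t kind;
    int value;                /* used by NODE_NUMBER only */
    const struct node *left;  /* operands, NULL for numbers */
    const struct node *right;
} node_t;

/* Evaluate a tree bottom-up: children first, then the operator. */
int ast_eval(const node_t *n)
{
    assert(n != NULL);
    switch (n->kind) {
    case NODE_NUMBER:
        return n->value;
    case NODE_SUB:
        return ast_eval(n->left) - ast_eval(n->right);
    case NODE_POWER: {
        int base = ast_eval(n->left);
        int exponent = ast_eval(n->right); /* assumed non-negative */
        int result = 1;
        for (int i = 0; i < exponent; i++)
            result *= base;
        return result;
    }
    }
    return 0; /* unreachable for valid node kinds */
}
```

    Because ** sits below - in the tree, it is evaluated first: the tree for 1 - 3 ** 2 evaluates to 1 - 9 = -8.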
    

    Semantic Analyzer

    A semantic analyzer is called "语义分析器" in Chinese. Semantic analysis is the process of performing semantic checks on the program.

    What are "semantic checks"? In a strongly typed language, for example, type checking and object binding are both semantic checks. Given the declaration float f = "hello", the semantic analyzer should report an error such as error: incompatible type.
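
    A type check of this kind can be sketched in a few lines of C. The type_t enum, the check_initializer function and the rule that an int may initialize a float are all assumptions of this sketch; a real compiler has a far richer type system.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical set of types for this sketch. */
typedef enum { TYPE_INT, TYPE_FLOAT, TYPE_STRING } type_t;

/*
 * Check "declared_type name = initializer".
 * Returns NULL when the initializer is acceptable,
 * otherwise a diagnostic message.
 */
const char *check_initializer(type_t declared, type_t initializer)
{
    assert(declared <= TYPE_STRING && initializer <= TYPE_STRING);
    if (declared == initializer)
        return NULL;
    /* assumed rule: an int may be widened to a float */
    if (declared == TYPE_FLOAT && initializer == TYPE_INT)
        return NULL;
    return "error: incompatible type";
}
```

    With these rules, check_initializer(TYPE_FLOAT, TYPE_STRING) returns the diagnostic for float f = "hello", while float f = 1 passes.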


    Code Generator

    Code generation is the process by which a compiler's code generator converts an intermediate representation of the source code into a form (e.g. machine code) that can be readily executed by a machine.

    For example:

    int hello() { return -1; }
    

    The code generator should output assembly code along these lines (simplified for illustration):

    hello:
        add esp, -16
        mov eax, -1
    

    Flex

    Flex stands for "fast lexical analyzer generator". It generates the code of a lexer for us.

    Basic Usage

    Let's look at a simple example, a word counter.

    /* just like Unix command `wc` */
    %{
    int chars = 0;
    int words = 0;
    int lines = 0;
    %}
    
    %%
    
    [a-zA-Z]+  { words++; chars += strlen(yytext); }
    \n         { chars++; lines++; }
    .          { chars++; }
    
    %%
    
    int main(int argc, char **argv)
    {
      yylex();
      printf("%8d%8d%8d\n", lines, words, chars);
      return 0;
    }
    

    The file flex.l above contains three sections, separated by %% lines:

    • The first section holds declarations. C source code placed between %{ and %} is copied into the generated lexer; this section usually contains declarations and option settings. We can also include headers here, such as #include <string.h>.
    • The second section holds the rules, a list of patterns and actions. A pattern is a regular expression, and an action is C source code that runs when the pattern matches. The variable yytext holds the text matched by the pattern.
    • The third section is C source code, here including a main function.
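
    To see what the three rules do, here is a hand-written C sketch that mimics them on an in-memory string. It is only an approximation of the generated lexer (which reads from standard input and is table-driven); counts_t and count_text are hypothetical names, and isalpha is assumed to match [a-zA-Z], as it does in the default "C" locale.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef struct {
    int lines;
    int words;
    int chars;
} counts_t;

/* Apply the three word-count rules to a NUL-terminated string. */
counts_t count_text(const char *text)
{
    assert(text != NULL);
    counts_t c = { 0, 0, 0 };
    const char *p = text;

    while (*p != '\0') {
        if (isalpha((unsigned char)*p)) {
            /* rule [a-zA-Z]+ : consume the longest run of letters */
            const char *start = p;
            while (isalpha((unsigned char)*p))
                p++;
            c.words++;
            c.chars += (int)(p - start); /* same as strlen(yytext) */
        } else if (*p == '\n') {
            /* rule \n */
            c.chars++;
            c.lines++;
            p++;
        } else {
            /* rule . : any other single character */
            c.chars++;
            p++;
        }
    }
    return c;
}
```

    For the input "Hello world, I am skb.\n", this returns 1 line, 5 words and 23 characters.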

    We can build flex.l as follows:

    # This generates lex.yy.c
    > flex flex.l
    # On macOS, -ll links the flex library (like -lm or -lpthread).
    # On Ubuntu, use -lfl instead.
    > gcc lex.yy.c -ll
    > ./a.out
    Hello world, I am skb.  # Ctrl + D
           1       5      23
    

    To count the lines, words and characters of a file, redirect it to standard input:

    > ./a.out < lex.yy.c
        1749    6570   44115
    > ./a.out < flex.l
          20      37     295
    

    This example shows that flex generates the code of a lexer from rules that we define as regular expressions.

    bison is similar to flex, but it generates the code of a parser.

    Tokenizer

    Let's look at a more advanced usage: in this section we will tokenize an expression.

    %{
    #include <stdio.h>
    // #include "bison.tab.h"
    typedef enum
    {
      NUMBER = 258,
      ADD,
      SUB,
      MUL,
      DIV,
      EOL,
    } token_type_t;
    int yyval;
    %}
    
    %%
    
    "+"    { return ADD; }
    "-"    { return SUB; }
    "*"    { return MUL; }
    "/"    { return DIV; }
    [0-9]+ { yyval = atoi(yytext); return NUMBER; }
    \n     { return EOL; }
    [ \t]  {}
    .      { printf("Unknown string: %s\n", yytext); }
    
    %%
    
    int main(int argc, char *argv[])
    {
      int ret;
      while ((ret = yylex()) != 0)
      {
        if (ret == NUMBER)
          printf("number = %d, type = %d\n", yyval, ret);
        else if (ret == EOL)
          printf("token = \'\\n\', type = %d\n", ret);
        else
          printf("token = \'%s\', type = %d\n", yytext, ret);
      }
    }
    
    

    Build it:

    > flex flex.l; gcc lex.yy.c -ll
    > cat input.txt
    111 + 222 - 333 * 456 - 789 / 1000
    > ./a.out < input.txt
    number = 111, type = 258
    token = '+', type = 259
    number = 222, type = 258
    token = '-', type = 260
    number = 333, type = 258
    token = '*', type = 261
    number = 456, type = 258
    token = '-', type = 260
    number = 789, type = 258
    token = '/', type = 262
    number = 1000, type = 258
    

    Bison

    Suppose we have all the <token, type> pairs and want to convert them into an AST. This is what Bison helps us do:

    1 * 2 + 3 * 4 + 5
    
            +
          /   \
         +     5
       /   \
      *     *
     / \   /  \
    1   2 3    4
    

    BNF

    Backus-Naur Form (BNF) is called "BNF 范式" in some Chinese textbooks. Our BNF example here is deliberately simple (and naive):

    <exp>    ::= <factor>
             |   <exp> + <factor>
    <factor> ::= NUMBER
             |   <factor> * NUMBER
    

    In BNF, ::= can be read as "is a", and | can be read as "or". For example, the first rule above can be read as "<exp> is a <factor>, or an <exp> + <factor>".
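
    The two rules can be turned into a small hand-written evaluator, sketched below in C. This is not how Bison works internally (Bison generates a table-driven LR parser); it is a recursive-descent sketch in which each left-recursive rule becomes a loop. The function names are hypothetical, and no error handling is attempted.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

static void skip_blanks(const char **p)
{
    while (**p == ' ')
        (*p)++;
}

/* NUMBER: a run of decimal digits, with optional blanks around it. */
static int parse_number(const char **p)
{
    int value = 0;
    skip_blanks(p);
    while (isdigit((unsigned char)**p)) {
        value = value * 10 + (**p - '0');
        (*p)++;
    }
    skip_blanks(p);
    return value;
}

/* <factor> ::= NUMBER | <factor> * NUMBER */
static int parse_factor(const char **p)
{
    int value = parse_number(p);
    while (**p == '*') {
        (*p)++;
        value *= parse_number(p);
    }
    return value;
}

/* <exp> ::= <factor> | <exp> + <factor> */
int parse_exp(const char *text)
{
    assert(text != NULL);
    const char *p = text;
    int value = parse_factor(&p);
    while (*p == '+') {
        p++;
        value += parse_factor(&p);
    }
    return value;
}
```

    Because <factor> is nested inside <exp>, multiplication binds tighter than addition: parse_exp("1 * 2 + 3 * 4 + 5") computes (1 * 2) + (3 * 4) + 5 = 19.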

    Let's Build a Calculator

    The source code for this part can be found in sql-parser/calcdemo of the GitHub repo tinydb.

    In this part, we show the basic usage of Bison by building a calculator that supports +, -, *, and /.

    Bison programs have (not by coincidence) the same three-part structure as flex programs, with declarations, rules, and C code. In bison, we replace ::= in BNF with :, and we end each rule with a semicolon.

    For the BNF rules in bison:

    • Each symbol in a bison rule has a value. The value of the target symbol (the one to the left of the colon) is called $$ in the action code, and the values of the symbols on the right are numbered $1, $2, and so forth, up to the number of symbols in the rule.
    • The value of a token (declared with a %token line) is whatever was in yylval when the scanner returned the token; the values of other symbols are set in the parser's rules. In this parser, the values of the factor, term, and exp symbols are the values of the expressions they represent.

    /* file: calc.y */
    %{
    #include <stdio.h>
    #include <stdlib.h>
    extern int yylex();
    extern int yyparse();
    void yyerror(const char *msg)
    {
        fprintf(stderr, "error: %s\n", msg);
    }
    
    int yywrap() { return 1; }
    
    void prompt() { printf("expr > "); }
    
    int main(int argc, char *argv[])
    {
        prompt();
        yyparse();
    }
    %}
    
    %token NUMBER
    %token ADD SUB MUL DIV
    %token EOL SPACE EXIT
    
    %%
    calculation:
    | calculation line { prompt(); }
    ;
    
    line: EOL
    | exp EOL  { printf("%d\n", $1); }
    | EXIT EOL { printf("bye!"); exit(0); }
    ;
    
    exp: factor        { $$ = $1; }
    | exp ADD factor   { $$ = $1 + $3; }
    | exp SUB factor   { $$ = $1 - $3; }
    ;
    
    factor: term       { $$ = $1; }
    | factor MUL term  { $$ = $1 * $3; }
    | factor DIV term  { $$ = $1 / $3; }
    ;
    
    
    term: NUMBER { $$ = $1; }
    ;
    
    %%
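
    To make $$, $1 and $3 more concrete, the following C sketch models a stack of semantic values, with one slot per grammar symbol, and shows what reducing the rule exp: exp ADD factor does to it. This is a simplified model written for this article, not the code that bison generates; value_stack_t, push_value and reduce_add are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define STACK_MAX 64

/* Simplified model: one semantic value per symbol on the parse stack. */
typedef struct {
    int values[STACK_MAX];
    size_t top; /* number of values currently on the stack */
} value_stack_t;

void push_value(value_stack_t *s, int v)
{
    assert(s != NULL && s->top < STACK_MAX);
    s->values[s->top++] = v;
}

/*
 * Reduce "exp: exp ADD factor { $$ = $1 + $3; }".
 * The three topmost slots hold $1, $2 and $3; they are popped
 * and replaced by a single slot holding $$.
 */
void reduce_add(value_stack_t *s)
{
    assert(s != NULL && s->top >= 3);
    int v1 = s->values[s->top - 3]; /* $1: value of exp */
    int v3 = s->values[s->top - 1]; /* $3: value of factor */
    /* $2 belongs to the ADD token; its value is unused */
    s->top -= 3;
    push_value(s, v1 + v3);         /* $$ */
}
```

    For the input 1 + 2, the stack holds the values of exp, ADD and factor just before the reduction; afterwards it holds the single value 3.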
    

    Next, we make some simple modifications to the flex source file above:

    /* file: calc.l */
    %{
    #include <stdio.h>
    #include "calc.tab.h"
    %}
    
    %%
    
    "+"    { return ADD; }
    "-"    { return SUB; }
    "*"    { return MUL; }
    "/"    { return DIV; }
    [0-9]+ { yylval = atoi(yytext); return NUMBER; }
    \n     { return EOL; }
    [ \t]  { }
    "exit" { return EXIT; }
    .      { printf("Unknown string: %s\n", yytext); }
    
    %%
    

    We no longer need to declare the token enum (ADD, SUB, ...) by hand. Bison generates it when run with the -d option; see the generated header calc.tab.h, which calc.l includes.

    We build calc.l and calc.y with a makefile:

    run:
    	flex calc.l
    	bison -d calc.y
    	# use '-lfl' instead of '-ll' if you build on Linux
    	gcc calc.tab.c lex.yy.c -ll
    	./a.out
    

    Type make run, and you will enter the calculator program:

    terminal > ./a.out
    expr > 1 + 2 - 3 * 4 + 4 / 4 - 1
    -9
    expr > 1 + 2
    3
    expr > exit
    bye!
    

    Summary

    In this article, we introduced the basic usage of Flex and Bison. The next article will show how to implement a SQL parser with Flex and Bison; see the sql-parser branch of the tinydb project.

    Original article: https://www.cnblogs.com/sinkinben/p/15864745.html