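Let's start with the lexical analyzer (lexer). Take the statement 1024+ 78*pi, a simple expression. Lexical analysis has to work out that this statement is made up of the tokens 1024, +, 78, * and pi. A token can be an operator, a number, or a string.

When processing the original string, the lexer meets the first character '1' and puts it into a buffer buf. It then meets '0', which is still a valid digit, and appends it to buf as well. This continues until it meets '+', at which point buf can be returned, giving the token 1024. Scanning resumes from that position: '+' is returned directly as a token, which can be classified simply as an operator or, if operators are subdivided, as the '+' kind. The space that follows is skipped, and the rest of the input is handled the same way.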
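The buffering steps above can be sketched in a few lines of C++. This is only an illustration, not the lexer developed in this post: the function name tokenize is made up, it returns raw strings rather than Token objects, and it makes the simplifying assumption that letters and digits can share one buffer and that every other non-space character is a one-character operator.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper (not part of the original lexer): splits an arithmetic
// expression into raw token strings using the buffer approach described above.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string buf;                        // characters of the token being built
    auto flush = [&] {                      // emit the buffered token, if any
        if (!buf.empty()) {
            tokens.push_back(buf);
            buf.clear();
        }
    };
    for (char ch : text) {
        unsigned char u = static_cast<unsigned char>(ch);
        if (std::isalnum(u)) {
            buf.push_back(ch);              // still inside a number or a word
        } else {
            flush();                        // any other character ends the token
            if (!std::isspace(u))
                tokens.push_back(std::string(1, ch));  // one-character operator
        }
    }
    flush();                                // the last token has no terminator
    return tokens;
}
```

For the input 1024+ 78*pi this yields the five strings 1024, +, 78, * and pi. Because of the simplifying assumption, an input such as 78pi would come out as a single token, which a real lexer would want to split or reject.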
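First comes Token. It relies on TokenType, where each type is represented by an integer. I originally wanted to use enum class TokenType, but ran into a problem with inheritance: although an enum class definition looks a lot like a class, it does not support inheritance, so adding new types later would be inconvenient. That led to the class below, which defines TokenType.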
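#include <iostream>
#include <string>
#include <vector>
using namespace std;

// Token types are plain ints held in a class, so a derived class can add more.
// Note: the single global instance is also named TokenType, which hides the
// class name in the code that follows.
class TokenType {
public:
    TokenType() : _EOF(0), TEXT(1), NUMBER(2), name({"EOF", "TEXT", "NUMBER"}) {}

    // Printable name of a type; at() throws std::out_of_range for an unknown type.
    string Name(int t) const { return name.at(t); }

public:
    const int _EOF;
    const int TEXT;
    const int NUMBER;
    vector<string> name;  // name[t] is the printable name of type t
} TokenType;

class Token {
public:
    Token() : type(0) {}  // defaults to the _EOF type
    Token(int tp, string tx) : type(tp), text(tx) {}

    int Type() const { return type; }
    string Text() const { return text; }

    // Prints the token as <TYPE,text>.
    friend ostream& operator<<(ostream& out, const Token& token) {
        out << "<" << TokenType.Name(token.Type()) << "," << token.Text() << ">" << endl;
        return out;  // must return the stream so that << can be chained
    }

protected:
    int type;
    string text;
};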
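To illustrate why a class is used instead of enum class, here is a rough sketch of how a derived class could add one more type. The names BaseTokenType, ExprTokenType, END and OPERATOR are made up for this example. BaseTokenType is a cut-down stand-in for the TokenType class above, so that the snippet compiles on its own.

```cpp
#include <string>
#include <vector>

// Stand-in for the TokenType class above, renamed so the two do not clash.
class BaseTokenType {
public:
    BaseTokenType() : END(0), TEXT(1), NUMBER(2), name({"EOF", "TEXT", "NUMBER"}) {}
    std::string Name(int t) const { return name.at(t); }

    const int END;      // plays the role of _EOF
    const int TEXT;
    const int NUMBER;

protected:
    std::vector<std::string> name;  // name[t] is the printable name of type t
};

// Hypothetical extension: appends an OPERATOR type after the inherited ones.
class ExprTokenType : public BaseTokenType {
public:
    // The base class is constructed first, so name.size() is already 3 here.
    ExprTokenType() : OPERATOR(static_cast<int>(name.size())) {
        name.push_back("OPERATOR");
    }

    const int OPERATOR;
};
```

With this layout the new id is simply the next free index, so Name(OPERATOR) returns "OPERATOR" while the inherited ids keep their values. An enum class cannot be extended this way because one enumeration cannot derive from another.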
// Member functions of the lexer class. They use its members: text (the input
// string), p (index of the current character), c (the current character) and
// _eof (the end-of-input sentinel). LexerError and toString are defined elsewhere.

/*! \brief consume the current char, and get the next char
 *  if it is the end of input, set c as EOF
 */
void consume() {
    p++;
    if (p >= text.length()) c = _eof;
    else c = text[p];
}

/*! \brief match the current char to target char
 *
 * @param x the expected char
 */
void match(char x) {
    if (c == x) consume();
    else throw LexerError(x, c, p);  // throw, so the error is not silently ignored
}

/*! \brief get token from the text,
 *  here only support text word and digital number,
 *  integer, float or scientific number
 * @return the next token
 */
Token NextToken() {
    while (c != _eof) {
        if (isWS()) WS();
        else if (isLETTER()) return TEXT();
        else if (isDIGIT()) return NUMBER();
        else throw LexerError(c, p);  // without the throw this loop would never end
    }
    return _EOF();
}

bool isWS() { return (c == ' ' || c == '\t' || c == '\r'); }
bool isLETTER() { return ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')); }
bool isDIGIT() { return (c >= '0' && c <= '9'); }

/*! \brief Append the current char, then all the digits that follow it
 *
 * @return return the collected chars in the string format
 */
string Digit() {
    string buf;
    do {
        buf.push_back(c);
        consume();
    } while (isDIGIT());
    return buf;
}

/*! \brief Skip all whitespace
 */
void WS() {
    while (isWS()) consume();
}

Token _EOF() { return Token(TokenType._EOF, toString(_eof)); }

/*! \brief Get all the following letter into a text word
 *
 * @return return the word as a TEXT token
 */
Token TEXT() {
    string buf;
    do {
        buf.push_back(c);
        consume();
    } while (isLETTER());
    return Token(TokenType.TEXT, buf);
}

/*! \brief Get the number, whether it is a integer, or a float,
 *  or even in scientific format
 *
 * @return return the number as a NUMBER token
 */
Token NUMBER() {
    string buf = Digit();
    // Digit() appends the current char first, so this adds '.' and then the
    // fraction digits. Note that "1." with no fraction digits is also accepted.
    if (c == '.') buf += Digit();
    if (c == 'e' || c == 'E') {
        buf.push_back(c);  // keep the exponent marker before moving past it
        consume();
        if (c == '+' || c == '-') {  // optional sign
            buf.push_back(c);
            consume();
        }
        if (!isDIGIT()) throw LexerError(c, p);  // exponent needs at least one digit
        buf += Digit();
    }
    return Token(TokenType.NUMBER, buf);
}
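Take another look at how NUMBER is collected. Written as a regular expression, the corresponding pattern is \d+(\.\d+)?([eE][-+]?\d+)?. A regular-expression engine probably does much the same thing internally: it turns the pattern into a state machine that walks the input one character at a time, which is essentially what the hand-written code above does.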
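As a quick way to experiment with the number pattern, the sketch below checks whole strings against it with std::regex. The helper name isNumberLiteral is made up for this example. The pattern is written with the plain character classes [eE] and [-+], because inside a character class the | character is matched literally.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the whole string is an integer, a decimal,
// or a number in scientific notation.
bool isNumberLiteral(const std::string& s) {
    // digits, optional ".digits", optional exponent with optional sign
    static const std::regex pattern(R"(\d+(\.\d+)?([eE][-+]?\d+)?)");
    return std::regex_match(s, pattern);  // regex_match requires a full match
}
```

Strings such as 1024, 3.14, 1e-5 and 6.02E+23 match, while 1. and 1e do not, because the pattern requires at least one digit after the decimal point and after the exponent marker.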