Learning Boost: Regular Expressions (regex)

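The boost::regex class gives C++ complete regular expression support and has been accepted into the C++0x standard library. It also plays a very important role within Boost itself: quite a few Boost sub-libraries depend on it, and some people download Boost just to use it.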

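Note: Boost.Regex must be compiled before use

For a full build, see this site's article on compiling Boost.
If you only need to build the Regex library, there are two ways (see the reference link):
1. In the Boost root directory, run bjam --toolset=<compiler name> --with-regex plus any other options
2. Go to <boost>\libs\regex\build, find the makefile for your compiler, and run make -f xxxx.mak
Usage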

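Boost.Regex wields seven weapons and two treasures.
The seven weapons are:
```
regex_match function
regex_search function
regex_replace function
regex_format function
regex_grep function
regex_split function
RegEx class
```
Each weapon comes in several variants (every function is overloaded for C strings, std::string, and iterators), but the last four have fallen into disrepair and are no longer recommended.
The two treasures are:
```
regex_iterator iterator
regex_token_iterator iterator
```
These two treasures are the soul of Boost.Regex. Once you have mastered them, even a plucked flower or a drifting leaf becomes a weapon in your hands.

Back to the topic: let's learn by writing code.

Required header: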
```
#include <boost/regex.hpp>
```
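Sample code:

First, prepare some test data for later use. If you are in the mood, see this site's other article "Google Testing" and run these experiments with the Google Testing framework: two things learned in the time of one.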
#include <cassert>   // assert, used by the snippets below
#include <cstring>   // strlen, used by the iterator snippets below
#include <iostream>
#include <string>
#include <boost/regex.hpp>
using namespace std;

int main(int argc, char* argv[])
{   //( 1 )   ((  3  )  2 )((  5 )4)(    6    )
    //(\w+)://((\w+\.)*\w+)((/\w*)*)(/\w+\.\w+)?
    // ^protocol://host(x.x...x)/path(n "/word" segments)/page file(xxx.xxx)
    const char *szReg = "(\\w+)://((\\w+\\.)*\\w+)((/\\w*)*)(/\\w+\\.\\w+)?";
    const char *szStr = "http://www.cppprog.com/2009/0112/48.html";

    // practice code goes here...

    cin.get(); // pause
}
1. String matching

To determine whether a string matches a given regular expression, use regex_match.
The code below checks whether szStr (defined above) matches szReg.
{    // string matching
    boost::regex reg( szReg );
    bool r = boost::regex_match( szStr, reg );
    assert(r); // does the whole string match?
}
The boost::regex constructor also accepts flag arguments that control its behavior, for example:
// Use Perl syntax (the default) and ignore case.
boost::regex reg1( szReg, boost::regex::perl|boost::regex::icase );
// Use POSIX extended syntax (in practice much the same).
boost::regex reg2( szReg, boost::regex::extended );
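To see what the icase flag changes, here is a minimal sketch. It rests on one assumption that is not in the article: it uses std::regex (the standardized descendant of Boost.Regex) so that it builds without a compiled Boost. std::regex defaults to the ECMAScript grammar rather than perl, and the helper names is_http_url and is_http_url_exact are made up for illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Assumption: std::regex stands in for boost::regex here.
// Hypothetical helper: case-insensitive check that the whole text is an http:// URL.
bool is_http_url(const std::string& text) {
    // std::regex::icase plays the role of boost::regex::icase.
    static const std::regex re("http://.+",
                               std::regex::ECMAScript | std::regex::icase);
    return std::regex_match(text, re);
}

// Same pattern without icase, for contrast: "HTTP://" is rejected.
bool is_http_url_exact(const std::string& text) {
    static const std::regex re("http://.+");
    return std::regex_match(text, re);
}
```

Calling is_http_url("HTTP://www.cppprog.com") succeeds while is_http_url_exact on the same input fails. With Boost the flag expression would be boost::regex::perl|boost::regex::icase, as shown above.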
The following code not only checks for a match but also extracts the substrings captured by the parentheses in the regular expression.
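{    // extract sub-matches
    boost::cmatch mat;
    boost::regex reg( szReg );    // compile the pattern (szReg), not the subject string
    bool r = boost::regex_match( szStr, mat, reg );
    if(r) // the match succeeded
    {
        // print every sub-match
        for(boost::cmatch::iterator itr=mat.begin(); itr!=mat.end(); ++itr)
        {
            //      offset of sub-match start    offset of sub-match end      sub-match text
            cout << itr->first-szStr << ' ' << itr->second-szStr << ' ' << *itr << endl;
        }
        // a specific sub-match can also be read directly by index
        if(mat[4].matched) cout << "Path is " << mat[4] << endl;
    }
}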
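Here boost::cmatch is the specialization for C strings; it has three siblings, as follows:
```
typedef match_results<const char*> cmatch;
typedef match_results<std::string::const_iterator> smatch;
typedef match_results<const wchar_t*> wcmatch;
typedef match_results<std::wstring::const_iterator> wsmatch;
```
You can think of match_results as a container of sub_match objects; it also provides a format method that replaces the regex_format function.
A sub_match is one captured substring. It derives from std::pair<BidiIterator, BidiIterator>; the first and second members of this iterator pair point to the start and one past the end of the substring. sub_match also provides str() and length(), which return the substring itself and its length.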
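The relationship between match_results and sub_match can be sketched with the smatch sibling. This is only an illustration under stated assumptions: it uses std::regex in place of boost::regex so that it builds without Boost, the pattern is a shortened variant of szReg, and UrlParts and split_url are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical result type for this sketch.
struct UrlParts {
    bool ok = false;
    std::string protocol;
    std::string host;
    std::ptrdiff_t host_offset = 0;  // where the host starts in the input
    std::ptrdiff_t host_length = 0;  // number of characters in the host
};

// Assumption: std::regex stands in for boost::regex.
UrlParts split_url(const std::string& url) {
    // Group 1 = protocol, group 2 = host; ".*" swallows the rest of the URL.
    static const std::regex re("(\\w+)://((\\w+\\.)*\\w+).*");
    std::smatch m;                      // a container of sub_match objects
    UrlParts parts;
    if (!std::regex_match(url, m, re)) return parts;
    parts.ok = true;
    parts.protocol = m[1].str();        // sub_match::str() copies the substring
    parts.host = m[2].str();
    parts.host_offset = m[2].first - url.begin();  // 'first' points at the start
    parts.host_length = m[2].length();  // same as m[2].second - m[2].first
    return parts;
}
```

For the article's test URL, split_url yields protocol "http" and host "www.cppprog.com", which starts at offset 7 and is 15 characters long.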

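2. Searching a string

regex_match only checks whether the entire string matches. To find a small matching piece inside a larger string (for example, hyperlinks in a web page), use regex_search.
The following code finds a number in szStr: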
{ // search
    boost::cmatch mat;
    boost::regex reg( "\\d+" );    // find digits in the string
    if(boost::regex_search(szStr, mat, reg))
    {
        cout << "searched:" << mat[0] << endl;
    }
}
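The difference between the two functions is easiest to see side by side. A small sketch, assuming std::regex as a stand-in for boost::regex (the function names regex_match and regex_search are the same in both); is_all_digits and first_number are hypothetical helpers.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Assumption: std::regex stands in for boost::regex.
// regex_match: true only if the WHOLE string is digits.
bool is_all_digits(const std::string& s) {
    static const std::regex digits("\\d+");
    return std::regex_match(s, digits);
}

// regex_search: returns the first run of digits found anywhere, or "" if none.
std::string first_number(const std::string& s) {
    static const std::regex digits("\\d+");
    std::smatch m;
    return std::regex_search(s, m, digits) ? m[0].str() : std::string();
}
```

On the article's URL, first_number returns "2009", while is_all_digits is false for the same input because the digits are only part of the string.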
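3. Replacing

regex_replace offers a simple way to replace parts of the source string.
In the format (replacement) string, $1 to $9 (or \1 to \9) refer to the numbered sub-matches, $& to the whole match, $` to the text before the match, and $' to the text after the match that has not yet been processed.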
{ // replace 1: turn the HTTP URL above into an FTP one
    boost::regex reg( szReg );
    string s = boost::regex_replace( string(szStr), reg, "ftp://$2$5");
    cout << "ftp site:"<< s << endl;
}
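A smaller example may make the $N and $& placeholders easier to follow. This sketch assumes std::regex in place of boost::regex (its default format syntax understands the same $1 and $& placeholders); the date pattern and both helper names are invented for illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Assumption: std::regex stands in for boost::regex.
// Reorders yyyy-mm-dd into dd/mm/yyyy using numbered sub-matches.
std::string to_day_first(const std::string& text) {
    static const std::regex iso_date("(\\d{4})-(\\d{2})-(\\d{2})");
    return std::regex_replace(text, iso_date, "$3/$2/$1");
}

// Wraps every run of digits in brackets; $& stands for the whole match.
std::string bracket_numbers(const std::string& text) {
    static const std::regex digits("\\d+");
    return std::regex_replace(text, digits, "[$&]");
}
```

Text that is not part of any match is copied through unchanged, so to_day_first("posted 2009-01-12") keeps the leading "posted ".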
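In the format string, (?Ntext), with N from 1 to 9, means "if sub-expression N matched, output text". This conditional syntax requires the format_all flag.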
{ // replace 2: use the format_all flag to convert <, > and & into HTML entities
    string s1 = "(<)|(>)|(&)";
    string s2 = "(?1&lt;)(?2&gt;)(?3&amp;)";
    boost::regex reg( s1 );
    string s = boost::regex_replace( string("cout << a&b << endl;"), reg, s2, boost::match_default | boost::format_all);
    cout << "HTML:"<< s << endl;
}
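What the conditional format does can be spelled out by hand. The sketch below is not the Boost mechanism: it swaps in a manual loop over matches that checks which alternative matched, and it assumes std::regex (which has no format_all conditional) as a stand-in. escape_html is a hypothetical name.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Assumption: std::regex stands in for boost::regex; the conditional
// "(?1...)(?2...)(?3...)" is emulated by testing sub_match::matched.
std::string escape_html(const std::string& text) {
    static const std::regex special("(<)|(>)|(&)");
    std::string out;
    auto last = text.begin();           // end of the previous match
    for (std::sregex_iterator it(text.begin(), text.end(), special), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        out.append(last, m[0].first);   // copy the unmatched text in between
        if (m[1].matched)      out += "&lt;";    // like (?1&lt;)
        else if (m[2].matched) out += "&gt;";    // like (?2&gt;)
        else                   out += "&amp;";   // like (?3&amp;)
        last = m[0].second;
    }
    out.append(last, text.end());       // copy the tail after the last match
    return out;
}
```

Exactly one of the three groups matches on each iteration, which is the property the conditional format string relies on as well.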
4. Searching with regex_iterator

Matching the C string, C++ string, and wide-character variants, regex_iterator likewise has four specializations:
```
typedef regex_iterator<const char*> cregex_iterator;
typedef regex_iterator<std::string::const_iterator> sregex_iterator;
typedef regex_iterator<const wchar_t*> wcregex_iterator;
typedef regex_iterator<std::wstring::const_iterator> wsregex_iterator;
```
The value_type of this iterator is a match_results.
{ // use the iterator to find every number
    boost::regex reg( "\\d+" );    // find digits in the string
    boost::cregex_iterator itrBegin(szStr, szStr+strlen(szStr), reg);
    boost::cregex_iterator itrEnd;    // a default-constructed iterator marks the end
    for(boost::cregex_iterator itr=itrBegin; itr!=itrEnd; ++itr)
    {
        //      offset of match start            offset of match end              matched text
        cout << (*itr)[0].first-szStr << ' ' << (*itr)[0].second-szStr << ' ' << *itr << endl;
    }
}
Boost.Regex also provides the make_regex_iterator function to simplify constructing a regex_iterator; for example, itrBegin above can be written as:
```
itrBegin = make_regex_iterator(szStr,reg);
```
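The same loop works with the sregex_iterator sibling when the input is a std::string. A minimal sketch, assuming std::regex as a stand-in for boost::regex (std has no make_regex_iterator, so the iterator is constructed directly); all_numbers is a hypothetical helper.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Assumption: std::regex stands in for boost::regex.
// Collects every run of digits in the text, in order of appearance.
std::vector<std::string> all_numbers(const std::string& text) {
    static const std::regex digits("\\d+");
    std::vector<std::string> found;
    // A default-constructed iterator ('end') is the end-of-sequence marker.
    for (std::sregex_iterator it(text.begin(), text.end(), digits), end;
         it != end; ++it) {
        found.push_back(it->str());   // *it is a match_results; str() is match 0
    }
    return found;
}
```

For the article's URL this collects "2009", "0112" and "48".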
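5. Splitting strings with regex_token_iterator

It also has four specializations, in the same form as above, so I won't pad the article by listing them again.
The value_type of this iterator is a sub_match.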
{ // use the iterator to split a string
    boost::regex reg("/");  // split on the / character
    boost::cregex_token_iterator itrBegin(szStr, szStr+strlen(szStr), reg,-1);
    boost::cregex_token_iterator itrEnd;
    for(boost::cregex_token_iterator itr=itrBegin; itr!=itrEnd; ++itr)
    {
        cout << *itr << endl;
    }
}
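Wrapped in a function, the split idiom looks like the sketch below. It assumes std::regex in place of boost::regex (sregex_token_iterator and the -1 convention exist in both); split is a hypothetical helper name.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Assumption: std::regex stands in for boost::regex.
// Splits 'text' wherever 'separator_pattern' matches.
std::vector<std::string> split(const std::string& text,
                               const std::string& separator_pattern) {
    const std::regex sep(separator_pattern);
    // -1 selects the pieces BETWEEN matches instead of the matches themselves.
    std::sregex_token_iterator it(text.begin(), text.end(), sep, -1), end;
    // Each sub_match converts to std::string when the vector is filled.
    return std::vector<std::string>(it, end);
}
```

Note that adjacent separators produce an empty token: splitting the article's URL on "/" yields an empty string between "http:" and "www.cppprog.com".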
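Boost.Regex also provides the make_regex_token_iterator function to simplify constructing a regex_token_iterator. The final argument -1 means split the string using reg as the separator; any other value selects which sub-match to return, and an array can be passed to return several sub-matches at once, for example: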
{ // use the iterator to split a string, take 2
    boost::regex reg("(.)/(.)");  // take the character before and after each / (this pattern looks a bit wicked -_-)
    int subs[] = {1,2};        // first and second sub-match
    boost::cregex_token_iterator itrBegin = make_regex_token_iterator(szStr,reg,subs); // -1 splits; any other number selects that sub-match; an array selects several
    boost::cregex_token_iterator itrEnd;
    for(boost::cregex_token_iterator itr=itrBegin; itr!=itrEnd; ++itr)
    {
        cout << *itr << endl;
    }
}
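Selecting several sub-matches with an array can be sketched on a simpler input. The example assumes std::regex in place of boost::regex (std has no make_regex_token_iterator, so the iterator is constructed directly); the key=value input format and the name keys_and_values are invented for illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Assumption: std::regex stands in for boost::regex.
// Returns key1, value1, key2, value2, ... for every "key=value" in the text.
std::vector<std::string> keys_and_values(const std::string& text) {
    static const std::regex pair_re("(\\w+)=(\\w+)");
    const int subs[] = {1, 2};   // emit sub-match 1, then sub-match 2, per match
    std::sregex_token_iterator it(text.begin(), text.end(), pair_re, subs), end;
    return std::vector<std::string>(it, end);
}
```

The iterator steps through the listed sub-matches of one match before moving on to the next match, so the output alternates keys and values.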
Complete test code:

#include <cassert>
#include <iostream>
#include <string>
#include <boost/regex.hpp>
using namespace std;

int main(int argc, char* argv[])
{   //( 1 )   ((  3  )  2 )((  5 )4)(    6    )
    //(\w+)://((\w+\.)*\w+)((/\w*)*)(/\w+\.\w+)?
    // ^protocol://host(x.x...x)/path(n "/word" segments)/page file(xxx.xxx)
    const char *szReg = "(\\w+)://((\\w+\\.)*\\w+)((/\\w*)*)(/\\w+\\.\\w+)?";
    const char *szStr = "http://www.cppprog.com/2009/0112/48.html";

    {    // string matching
        boost::regex reg( szReg );
        bool r = boost::regex_match( szStr, reg );
        assert(r);
    }

    {    // extract sub-matches
        boost::cmatch mat;
        boost::regex reg( szReg );
        bool r = boost::regex_match( szStr, mat, reg );
        if(r) // the match succeeded
        {
            // print every sub-match
            for(boost::cmatch::iterator itr=mat.begin(); itr!=mat.end(); ++itr)
            {
                //      offset of sub-match start    offset of sub-match end      sub-match text
                cout << itr->first-szStr << ' ' << itr->second-szStr << ' ' << *itr << endl;
            }
            // a specific sub-match can also be read directly by index
            if(mat[4].matched) cout << "Path is " << mat[4] << endl;
        }
    }

    { // search
        boost::cmatch mat;
        boost::regex reg( "\\d+" );    // find digits in the string
        if(boost::regex_search(szStr, mat, reg))
        {
            cout << "searched:" << mat[0] << endl;
        }
    }

    { // replace 1: turn the HTTP URL into an FTP one
        boost::regex reg( szReg );
        string s = boost::regex_replace( string(szStr), reg, "ftp://$2$5");
        cout << "ftp site:"<< s << endl;
    }

    { // replace 2: convert <, > and & into HTML entities
        string s1 = "(<)|(>)|(&)";
        string s2 = "(?1&lt;)(?2&gt;)(?3&amp;)";
        boost::regex reg( s1 );
        string s = boost::regex_replace( string("cout << a&b << endl;"), reg, s2, boost::match_default | boost::format_all);
        cout << "HTML:"<< s << endl;
    }

    { // use the iterator to find every number
        boost::regex reg( "\\d+" );    // find digits in the string
        boost::cregex_iterator itrBegin = make_regex_iterator(szStr,reg); // same as (szStr, szStr+strlen(szStr), reg)
        boost::cregex_iterator itrEnd;
        for(boost::cregex_iterator itr=itrBegin; itr!=itrEnd; ++itr)
        {
            //      offset of match start            offset of match end              matched text
            cout << (*itr)[0].first-szStr << ' ' << (*itr)[0].second-szStr << ' ' << *itr << endl;
        }
    }

    { // use the iterator to split a string
        boost::regex reg("/");  // split on the / character
        boost::cregex_token_iterator itrBegin = make_regex_token_iterator(szStr,reg,-1); // -1 splits; any other number selects that sub-match; an array selects several
        boost::cregex_token_iterator itrEnd;
        for(boost::cregex_token_iterator itr=itrBegin; itr!=itrEnd; ++itr)
        {
            cout << *itr << endl;
        }
    }

    { // use the iterator to split a string, take 2
        boost::regex reg("(.)/(.)");  // take the character before and after each /
        int subs[] = {1,2};        // first and second sub-match
        boost::cregex_token_iterator itrBegin = make_regex_token_iterator(szStr,reg,subs);
        boost::cregex_token_iterator itrEnd;
        for(boost::cregex_token_iterator itr=itrBegin; itr!=itrEnd; ++itr)
        {
            cout << *itr << endl;
        }
    }

    cin.get(); // pause
    return 0;
}
Original article: https://www.cnblogs.com/cy163/p/1689759.html
