Hadoop Hive概念学习系列之hive的正则表达式初步（六）

Hadoop Hive概念学习系列之hive的正则表达式初步（六）
说在前面的话

　　hive的正则表达式，是非常重要！作为大数据开发人员，用好hive，正则表达式，是必须品！

Hive中的正则表达式还是很强大的。数据工作者平时也离不开正则表达式。对此，特意做了个hive正则表达式的小结。所有代码都经过亲测，正常运行。

1.regexp

语法: A REGEXP B
操作类型: strings
描述: 功能与RLIKE相同
```
select count(*) from olap_b_dw_hotelorder_f where create_date_wid not regexp '\d{8}'
```
与下面查询的效果是等效的：
```
select count(*) from olap_b_dw_hotelorder_f where create_date_wid not rlike '\d{8}';
```
2.regexp_extract

语法: regexp_extract(string subject, string pattern, int index)
返回值: string
说明：将字符串subject按照pattern正则表达式的规则拆分，返回index指定的字符。

hive> select regexp_extract('IloveYou','I(.*?)(You)',1) from test1 limit 1;

Total jobs = 1

...

Total MapReduce CPU Time Spent: 7 seconds 340 msec

OK

love

Time taken: 28.067 seconds, Fetched: 1 row(s)

hive> select regexp_extract('IloveYou','I(.*?)(You)',2) from test1 limit 1;

Total jobs = 1

...

OK

You

Time taken: 26.067 seconds, Fetched: 1 row(s)

hive> select regexp_extract('IloveYou','(I)(.*?)(You)',1) from test1 limit 1;

Total jobs = 1

...

OK

I

Time taken: 26.057 seconds, Fetched: 1 row(s)

hive> select regexp_extract('IloveYou','(I)(.*?)(You)',0) from test1 limit 1;

Total jobs = 1

...

OK

IloveYou

Time taken: 28.06 seconds, Fetched: 1 row(s)

hive> select regexp_replace("IloveYou","You","") from test1 limit 1;

Total jobs = 1

...

OK

Ilove

Time taken: 26.063 seconds, Fetched: 1 row(s)

3.regexp_replace

语法: regexp_replace(string A, string B, string C)
返回值: string
说明：将字符串A中的符合Java正则表达式B的部分替换为C。注意，在有些情况下要使用转义字符,类似Oracle中的regexp_replace函数。

hive> select regexp_replace("IloveYou","You","") from test1 limit 1;

Total jobs = 1

...

OK

Ilove

Time taken: 26.063 seconds, Fetched: 1 row(s)

hive> select regexp_replace("IloveYou","You","lili") from test1 limit 1;

Total jobs = 1

...

OK

Ilovelili

Hive里的正则表达式
如，https://cwiki.apache.org/confluence/display/Hive/GettingStarted

输入regex可查到

CREATE TABLE apachelog (
host STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING,
size STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^]*) [()] ([^ "]*|"[^"]*") (-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|".*") ([^ "]*|".*"))?"
)
STORED AS TEXTFILE;

下面就是hive里的正则表达式，9个字段，对应定义那边也要9个
"input.regex" = "([^ ]*) ([^ ]*) ([^.]*) [(.*)] "(.*)" (-|[0-9]*) (-|[(0-9]*) "(.*)" "(.*)""

([^ ]*) ([^ ]*) ([^.]*) [(.*)] "(.*)" (-|[0-9]*) (-|[(0-9]*) "(.*)" "(.*)"
([^ ]*) ([^ ]*) ([^.]*) \[(.*)\] "(.*)" (-|[0-9]*) (-|[(0-9]*) "(.*)" "(.*)"

数据来源，
yarn-root-nodemanager-master.log
或
yarn-spark-nodemanager-master.log
yarn-hadoop-nodemanager-master.log

这里，有个正则表达式的好工具！
RegexBuddy.exe

　　很好用的这款软件！双击它即可。

　　如上图所示颜色，代表我们测试的正则表达式，是正确的！
相关阅读:
paip.提升性能并行多核编程哈的数据结构list,set,map
paip.网页右键复制菜单限制解除解决方案
 paip.java swt 乱码问题解决
 paip.哈米架构CAO.txt
paip.提升性能协程“微线程”的使用.
paip.最省内存的浏览器评测 cah
paip.云计算以及分布式计算的区别
 paip.提升性能string split
paip.提升分词准确度常用量词表
 paip.提升中文分词准确度新词识别
原文地址：https://www.cnblogs.com/zlslch/p/6102789.html

Hadoop Hive概念学习系列之hive的正则表达式初步（六）

1.regexp

2.regexp_extract

3.regexp_replace