• Introducing Regular Expressions: study notes


    Reading notes on Introducing Regular Expressions


    Tools:

    RegexBuddy: http://download.csdn.net/tag/regexbuddy%E7%A0%B4%E8%A7%A3

    Online testers:

    http://www.regexpal.com/

    http://gskinner.com/RegExr/

    Further reading:

    Mastering Regular Expressions

    Regular Expressions Cookbook


    References:

    Notepad++ Regex: http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Regular_Expressions


    In these notes, "item" means either a single character or a group.

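    chapter 1:

    The examples use the phone number 707-827-7019.

    [ ]: character class; matches any one of the characters listed inside

    \d: matches a digit 0-9, same as "[0-9]"

    \D: matches any non-digit, including letters, punctuation, and so on

    . : (dot) wildcard; matches any character except a newline, but inside [ ] it is a literal dot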
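    A minimal JavaScript sketch of the notation above, run against the sample number. The variable names are illustrative and not taken from the book.

```javascript
// Sample number from the notes.
const phone = "707-827-7019";

// \d matches one digit; [0-9] is the equivalent character class.
const digitsShorthand = phone.match(/\d/g);
const digitsClass = phone.match(/[0-9]/g);

// \D matches every non-digit (here, the two hyphens).
const nonDigits = phone.match(/\D/g);

// An unescaped dot matches any character except a newline,
// while a dot inside [ ] is a literal dot.
const anyChar = /7.7/.test("707");      // the dot matches "0"
const literalDot = /7[.]7/.test("707"); // "707" contains no literal dot

console.log(digitsShorthand.length, digitsClass.length, nonDigits, anyChar, literalDot);
```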


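    Backreferences

    ( ): grouping; creates a capturing group

    \1 or $1: matches exactly the same text that the first parenthesized group matched. The form depends on the implementation: some support only \1, some only $1, and some both.

    Example:

    (\d)\d\1: matches 202, 303, and 151, but not 234.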
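    The backreference example can be checked with a short JavaScript sketch. The ^ and $ anchors are an addition here so that each test string has to match as a whole.

```javascript
// (\d) captures one digit, \d matches any digit,
// and \1 must repeat exactly the digit captured by the group.
const sameEnds = /^(\d)\d\1$/;

const candidates = ["202", "303", "151", "234"];
const results = candidates.map((s) => sameEnds.test(s));

console.log(results); // [ true, true, true, false ]
```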


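    Quantifiers and other metacharacters

    { }: the braces give the number of times the preceding item must match: {3} (exactly 3 times), {3,5} (3 to 5 times, so 3, 4, or 5; no spaces around the comma), or {3,} (3 or more times).

    ?: matches the preceding item 0 or 1 times; greedy, so it takes the item whenever it can

    +: matches the preceding item 1 or more times; greedy. The repetitions need not be identical text, they only have to satisfy the same rule, e.g. "\d+" matches as many digits as possible.

    *: matches the preceding item 0 or more times (the repetitions need not be identical); greedy

    |: (the vertical bar) indicates alternation, that is, a given choice of alternatives; the pattern after the bar is tried when the pattern before it fails

    ^: start of line, when it appears at the beginning of the pattern or right after |

    $: end of line

    \: escape character; turns the metacharacters above into literals, e.g. \( matches a left parenthesis "("

    Examples:

    \d{3}-?\d{3}-?\d{4}

    (\d{3,4}[.-]?)+

    (\d{3}[.-]?){2}\d{4}

    ^(\(\d{3}\)|^\d{3}[.-]?)?\d{3}[.-]?\d{4}$: matches (201)971-1975, 201-971-1975, (201)971.1975, and 201.971.1975.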
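    As a quick check of the last pattern, the following JavaScript sketch runs it over the four listed numbers plus one deliberately short number (an extra test case, not from the book).

```javascript
// Optional area code, either "(201)" or "201" with an optional separator,
// followed by three digits, an optional separator, and four digits.
const phonePattern = /^(\(\d{3}\)|^\d{3}[.-]?)?\d{3}[.-]?\d{4}$/;

const samples = [
  "(201)971-1975",
  "201-971-1975",
  "(201)971.1975",
  "201.971.1975",
  "201-971-197", // one digit short, should be rejected
];

const matched = samples.filter((s) => phonePattern.test(s));
console.log(matched);
```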


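    chapter 2

    ^: as the first character inside [ ], negates the class (match all but these), e.g. [^0-9] matches any non-digit

    \w: matches a word character (letter, digit, or underscore), same as [0-9a-zA-Z_]

    \W: the opposite of \w, same as [^0-9a-zA-Z_]

    .*: matches zero or more of any character

    .+: matches one or more of any character


    dotall mode: when enabled, '.' matches every character, including newlines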
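    In JavaScript, dotall mode is the s flag (available since ES2018). The sketch below, with made-up sample strings, contrasts the dot with and without it and also shows the negated class and the word-character shorthand.

```javascript
const twoLines = "first\nsecond";

// Without the s flag the dot stops at the newline.
const withoutDotAll = twoLines.match(/.+/)[0];
// With the s (dotAll) flag the dot also matches "\n".
const withDotAll = twoLines.match(/.+/s)[0];

// [^0-9] matches every non-digit; removing those leaves only the digits.
const digitsOnly = "a1b2".replace(/[^0-9]/g, "");

// \w covers letters, digits, and the underscore, so "-" splits the runs.
const wordRuns = "a_1-b".match(/\w+/g);

console.log(withoutDotAll, JSON.stringify(withDotAll), digitsOnly, wordRuns);
```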


    A table of character shorthands:

    Table 2-1. Character shorthands

    Character Shorthand | Description
    \a                  | Alert
    \b                  | Word boundary
    [\b]                | Backspace character
    \B                  | Non-word boundary
    \cx                 | Control character
    \d                  | Digit character
    \D                  | Non-digit character
    \dxxx               | Decimal value for a character
    \f                  | Form feed character
    \r                  | Carriage return
    \n                  | Newline character
    \oxxx               | Octal value for a character
    \s                  | Space character (whitespace)
    \S                  | Non-space character
    \t                  | Horizontal tab character
    \v                  | Vertical tab character
    \w                  | Word character
    \W                  | Non-word character
    \0                  | Nul character
    \xxx                | Hexadecimal value for a character
    \uxxxx              | Unicode value for a character

    Table 2-2. Character shorthands for whitespace characters

    Character Shorthand | Description
    \f                  | Form feed
    \h                  | Horizontal whitespace
    \H                  | Not horizontal whitespace
    \n                  | Newline
    \r                  | Carriage return
    \t                  | Horizontal tab
    \v                  | Vertical tab (whitespace)
    \V                  | Not vertical whitespace

    chapter 3 Boundaries

    This chapter focuses on assertions. Assertions mark boundaries, but they don’t consume
    characters—that is, characters will not be returned in a result. They are also
    known as zero-width assertions. A zero-width assertion doesn’t match a character, per
    se, but rather a location in a string. Some of these, such as ^ and $, are also called anchors.
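    A small JavaScript sketch of what "zero-width" means in practice: the boundary \b and the anchors ^ and $ constrain where a match may occur but add no characters to it. The sample text is made up for illustration.

```javascript
const text = "cat concat cats";

// \b matches the position between a word character and a non-word character,
// so only the standalone word "cat" qualifies.
const wholeWords = text.match(/\bcat\b/g);

// The match is exactly as long as "cat": the boundaries consumed nothing.
const first = /\bcat\b/.exec(text);

// Anchors are positions too: the text starts with "cat" but ends with "cats".
const startsWithCat = /^cat/.test(text);
const endsWithCat = /cat$/.test(text);

console.log(wholeWords, first.index, first[0].length, startsWithCat, endsWithCat);
```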

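    Word boundaries:
    Some tools, such as vim and grep, use \< and \> as the start-of-word and end-of-word boundaries.

    \Q and \E: treat special characters (.^$*+?|(){}[]\-) as literals, which has the same effect as escaping each one with \.
    For example, \Q$\E is the same as \$. Every character between \Q and \E is matched literally, as in \Q.?*\E.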
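    JavaScript regular expressions do not support the \Q...\E syntax. A common workaround, sketched below, is to backslash-escape the metacharacters by hand before building the pattern. The helper name escapeRegExp is hypothetical, not a built-in.

```javascript
// Hypothetical helper: put a backslash in front of every regex metacharacter.
function escapeRegExp(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a pattern that matches the three characters ".?*" literally,
// which is what \Q.?*\E would do in engines that support quoting.
const quoted = new RegExp(escapeRegExp(".?*"));

console.log(quoted.source);        // \.\?\*
console.log(quoted.test("a.?*b")); // true
console.log(quoted.test("abc"));   // false
```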

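    multiline mode: matching across multiple lines

    Boundary symbols (^, $, \b, \B) are often enough to pick out the strings we want. When the input has several lines, add the m flag to turn on multiline matching, so that ^ and $ apply to every line. Example:

    var str = "first second\nthird fourth\nfifth sixth";

    var patt = /(\w+)$/gm;

    console.log(str.match(patt));

    Result:

    ["second", "fourth", "sixth"]

    Without m, only "sixth" is returned. With m, the line terminators \n and \r effectively act as boundaries for ^ and $.

    var str2 = "first second\nthird fourth\nfifth sixth";

    var patt2 = /^(\w+)/gm;

    console.log(str2.match(patt2));

    Result:

    ["first", "third", "fifth"]

    Without m, only "first" is returned.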

    chapter 4 Alternation, Groups, and Backreferences

    You have already seen groups in action. Groups surround text with parentheses to help
    perform some operation, such as the following:
    • Performing alternation, a choice between two or more optional patterns
    • Creating subpatterns
    • Capturing a group to later reference with a backreference
    • Applying an operation to a grouped pattern, such as a quantifier
    • Using non-capturing groups
    • Atomic grouping (advanced)

    Table 4-1. Options in regular expressions
    Option | Description                 | Supported by
    (?d)   | Unix lines                  | Java
    (?i)   | Case insensitive            | PCRE, Perl, Java
    (?J)   | Allow duplicate names       | PCRE*
    (?m)   | Multiline                   | PCRE, Perl, Java
    (?s)   | Single line (dotall)        | PCRE, Perl, Java
    (?u)   | Unicode case                | Java
    (?U)   | Default match lazy          | PCRE
    (?x)   | Ignore whitespace, comments | PCRE, Perl, Java
    (?-…)  | Unset or turn off options   | PCRE

    Alternation:
    (?i)the: matches "the" case-insensitively
    (the|The|THE): matches the, The, or THE
    grep -Ec "(the|The|THE)" rime.txt

    Subpatterns:
    (t|T)h(e|eir)
    [tT]h[ceinry]*

    Capturing Groups and Backreferences:
    \1 or $1 refers to the first group, \2 or $2 to the second.
    sed -En 's/(It is) (an ancyent Marinere)/\2 \1/p' rime.txt
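    The same swap can be sketched in JavaScript, where the replacement string uses the $1/$2 form. The input line is typed in directly here rather than read from rime.txt.

```javascript
const line = "It is an ancyent Marinere";

// $1 and $2 in the replacement string refer to the first and second
// capturing groups, so this swaps the two halves of the match.
const swapped = line.replace(/(It is) (an ancyent Marinere)/, "$2 $1");

console.log(swapped); // an ancyent Marinere It is
```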

    Named Groups:
    perl -ne 'print if s/(?<one>It is) (?<two>an ancyent Marinere)/\u$+{two} \l$+{one}/' rime.txt
    You can then use the group again like this:
    (?<z>0{3})\k<z>
    Or this:
    (?<z>0{3})\k'z'
    Or this:
    (?<z>0{3})\g{z}

    Table 4-3. Named group syntax
    Syntax      | Description
    (?<name>…)  | A named group
    (?'name'…)  | Another named group
    (?P<name>…) | A named group in Python
    \k<name>    | Reference by name in Perl
    \k'name'    | Reference by name in Perl
    \g{name}    | Reference by name in Perl
    \k{name}    | Reference by name in .NET
    (?P=name)   | Reference by name in Python
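    Of the forms in the table, JavaScript (ES2018 and later) supports (?<name>…) with \k<name>, and $<name> in replacement strings. A minimal sketch, with anchors added so the whole string must match; the case-changing \u and \l from the Perl example have no direct JavaScript equivalent and are left out.

```javascript
// (?<z>...) names the group; \k<z> must repeat exactly what it captured.
const sixZeros = /^(?<z>0{3})\k<z>$/;
const matchesSix = sixZeros.test("000000");
const matchesThree = sixZeros.test("000");

// Named captures are exposed on the groups property of the match result.
const pattern = /(?<one>It is) (?<two>an ancyent Marinere)/;
const found = "It is an ancyent Marinere".match(pattern);

// In a replacement string, $<name> inserts the text of a named group.
const swappedByName = "It is an ancyent Marinere".replace(pattern, "$<two> $<one>");

console.log(matchesSix, matchesThree, found.groups.two, swappedByName);
```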

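    Non-Capturing Groups:
    You don’t need to backreference anything, so you could write a non-capturing group
    this way:
    (?:the|The|THE)
    The benefit is that it can improve performance, because the matched value does not have to be stored in memory.
    Or insert the i option to make it case-insensitive:
    (?i)(?:the)
    (?:(?i)the)
    (?i:the)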
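    A short JavaScript sketch of the difference: both groups find the same text, but only the capturing one stores a value. The i flag is used here in place of the inline (?i) option, and the sample sentence is made up.

```javascript
const sentence = "The ancyent Marinere";

// Capturing group: the result holds the full match plus one captured value.
const capturing = sentence.match(/(the|The|THE)/);

// Non-capturing group: same match, but nothing is stored for the group.
const nonCapturing = sentence.match(/(?:the|The|THE)/);

// Case-insensitive variant using the i flag.
const caseInsensitive = sentence.match(/(?:the)/i);

console.log(capturing.length, nonCapturing.length, caseInsensitive[0]);
```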

    Atomic Groups:
    Another kind of non-capturing group is the atomic group. If you are using a regex engine
    that does backtracking, this group will turn backtracking off, not for the entire regular
    expression but just for that part enclosed in the atomic group. The syntax looks like this:
    (?>the)
    When would you want to use atomic groups? One of the things that can really slow
    regex processing is backtracking. The reason why is, as it tries all the possibilities, it
    takes time and computing resources. Sometimes it can gobble up a lot of time. When
    it gets really bad, it’s called catastrophic backtracking.
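    JavaScript has no (?>…) syntax, so the sketch below substitutes a different technique to show the effect: a lookahead that captures, followed by a backreference. JavaScript does not backtrack into a lookahead once it has succeeded, which makes the pair behave like an atomic group. This is an emulation for illustration, not the syntax described above.

```javascript
// Ordinary group: after choosing "bc" leaves no "c" to match,
// the engine backtracks into the group and tries "b" instead.
const backtracking = /a(?:bc|b)c/.test("abc");

// Emulated atomic group: the lookahead commits to "bc" and \1 consumes it,
// so the alternative "b" is never tried and the overall match fails.
const atomicLike = /a(?=(bc|b))\1c/.test("abc");

// When the committed choice still leaves a "c" behind, the match succeeds.
const atomicLikeOk = /a(?=(bc|b))\1c/.test("abcc");

console.log(backtracking, atomicLike, atomicLikeOk); // true false true
```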


    chapter 5 Character Classes

    [a-fA-F0-9]: one hexadecimal digit
    [\w\s]: roughly the same as [_a-zA-Z0-9 \t\n\r]
    [^aeiou]: any character that is not a lowercase vowel
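    The three classes above can be tried out with a few lines of JavaScript. The input strings are made up for illustration.

```javascript
// [a-fA-F0-9] picks out runs of hexadecimal digits.
const hexRuns = "#1aF9zz".match(/[a-fA-F0-9]+/g);

// [^aeiou] matches everything except lowercase vowels;
// deleting those matches leaves only the vowels.
const vowelsOnly = "regular".replace(/[^aeiou]/g, "");

// [\w\s] accepts word characters and whitespace, so it spans the space.
const wordsAndSpaces = "a_b c!".match(/[\w\s]+/)[0];

console.log(hexRuns, vowelsOnly, JSON.stringify(wordsAndSpaces));
```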

    Union and Difference:
    Not every implementation supports these; Java does.
    If you wanted a union of two character sets, you could do it like this:
    [0-3[6-9]]
    The regex would match 0 through 3 or 6 through 9.
    To match a difference (in essence, subtraction):
    [a-z&&[^m-r]]
    which matches all the letters from a to z, except m through r.

    POSIX Character Classes:
    [[:xxxx:]]
    [[:^xxxx:]]
    [[:alnum:]]

    Table 5-1. POSIX character classes
    Character Class | Description
    [[:alnum:]]     | Alphanumeric characters (letters and digits)
    [[:alpha:]]     | Alphabetic characters (letters)
    [[:ascii:]]     | ASCII characters (all 128)
    [[:blank:]]     | Blank characters
    [[:cntrl:]]     | Control characters
    [[:digit:]]     | Digits
    [[:graph:]]     | Graphic characters
    [[:lower:]]     | Lowercase letters
    [[:print:]]     | Printable characters
    [[:punct:]]     | Punctuation characters
    [[:space:]]     | Whitespace characters
    [[:upper:]]     | Uppercase letters
    [[:word:]]      | Word characters
    [[:xdigit:]]    | Hexadecimal digits

    chapter 6 Matching Unicode and Other Characters


    \u00e9: hexadecimal (é)
    \351: octal (é)

    vim: /\%u6c60

    Table 6-3. Matching Unicode and other characters

    Code     | Description
    \uxxxx   | Unicode (four places)
    \xxx     | Unicode (two places)
    \x{xxxx} | Unicode (four places)
    \x{xx}   | Unicode (two places)
    \000     | Octal (base 8)
    \cx      | Control character
    \0       | Null
    \a       | Bell
    \e       | Escape
    [\b]     | Backspace
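    A JavaScript sketch covering only the escapes from the table that JavaScript itself supports (\uxxxx, the two-digit \x form, \cx, and [\b]), plus its own \u{…} form, which needs the u flag. The test strings are written with escapes so the example does not depend on file encoding.

```javascript
const word = "caf\u00e9"; // "café" with a precomposed é (U+00E9)

// Four hex digits after \u, or two hex digits after \x, name the same character.
const byUnicode = /\u00e9/.test(word);
const byHex = /\xe9/.test(word);

// \u{...} takes a code point of any length but requires the u flag.
const byCodePoint = /\u{6c60}/u.test("\u6c60");

// \cI is Control-I, the horizontal tab.
const byControl = /\cI/.test("a\tb");

// Inside a character class, \b means backspace rather than word boundary.
const byBackspace = /[\b]/.test("\b");

console.log(byUnicode, byHex, byCodePoint, byControl, byBackspace);
```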


    chapter 7 Quantifiers


    Greedy, Lazy, and Possessive
    Quantifiers are, by themselves, greedy. A greedy quantifier first tries to match the whole
    string. It grabs as much as it can, the whole input, trying to make a match. If the first
    attempt to match the whole string goes awry, it backs up one character and tries again.
    This is called backtracking. It keeps backing up one character at a time until it finds a
    match or runs out of characters to try. It also keeps track of what it is doing, so it puts
    the most load on resources compared with the next two approaches. It takes a mouthful,
    then spits back a little at a time, chewing on what it just ate. You get the idea.

    A lazy (sometimes called reluctant) quantifier takes a different tack. It starts at the
    beginning of the target, trying to find a match. It looks at the string one character at a
    time, trying to find what it is looking for. At last, it will attempt to match the whole
    string. To get a quantifier to be lazy, you have to append a question mark (?) to the
    regular quantifier. It chews one nibble at a time.

    A possessive quantifier grabs the whole target and then tries to find a match, but it
    makes only one attempt. It does not do any backtracking. A possessive quantifier appends
    a plus sign (+) to the regular quantifier. It doesn’t chew; it just swallows, then
    wonders what it just ate. I’ll demonstrate each of these in the pages that follow.

    ?, +, *, and {} are all greedy by default.

    Lazy Quantifiers: appending a '?' to a regular quantifier (?, +, *, {}) makes it a lazy quantifier.
    Table 7-3. Lazy quantifiers
    Syntax | Description
    ??     | Lazy zero or one (optional)
    +?     | Lazy one or more
    *?     | Lazy zero or more
    {n}?   | Lazy n
    {n,}?  | Lazy n or more
    {m,n}? | Lazy m,n
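    A minimal JavaScript sketch of greedy versus lazy matching on made-up input. The only difference between each pair of patterns is the trailing question mark.

```javascript
const html = "<b>one</b><b>two</b>";

// Greedy: .* grabs as much as possible, so the match runs to the last </b>.
const greedy = html.match(/<b>.*<\/b>/)[0];

// Lazy: .*? takes as little as possible, so the match stops at the first </b>.
const lazy = html.match(/<b>.*?<\/b>/)[0];

// The same contrast with a counted quantifier.
const greedyDigits = "5555".match(/\d{2,}/)[0];
const lazyDigits = "5555".match(/\d{2,}?/)[0];

console.log(greedy, lazy, greedyDigits, lazyDigits);
```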

    Possessive Quantifiers: appending a '+' to a regular quantifier (?, +, *, {}) makes it a possessive quantifier.
    Table 7-4. Possessive quantifiers
    Syntax | Description
    ?+     | Possessive zero or one (optional)
    ++     | Possessive one or more
    *+     | Possessive zero or more
    {n}+   | Possessive n
    {n,}+  | Possessive n or more
    {m,n}+ | Possessive m,n


    chapter 8 Lookarounds

    Lookarounds are non-capturing groups that match patterns based on what they find
    either in front of or behind a pattern. Lookarounds are also considered zero-width
    assertions.
    Lookarounds include:
    • Positive lookaheads
    • Negative lookaheads
    • Positive lookbehinds
    • Negative lookbehinds

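    Perl and ack support lookarounds; grep with its basic or extended syntax does not.

    Positive Lookaheads
    (?i)ancyent (?=marinere): matches "ancyent " case-insensitively, but only when it is followed by "marinere".

    Negative Lookaheads
    (?i)ancyent (?!marinere): matches "ancyent " case-insensitively, but only when it is not followed by "marinere".

    Positive Lookbehinds
    (?i)(?<=ancyent) marinere: matches " marinere" only when it is preceded by "ancyent".

    Negative Lookbehinds
    (?i)(?<!ancyent) marinere: matches " marinere" only when it is not preceded by "ancyent".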
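
    All four lookarounds are available in JavaScript (lookbehind since ES2018). In the sketch below the i flag replaces the inline (?i) option, and the input lines are typed in rather than read from rime.txt.

```javascript
const line = "It is an ancyent Marinere";
const otherLine = "a bright Marinere";

// Positive lookahead: "ancyent " matches only when "marinere" follows,
// and the lookahead text is not part of the match.
const ahead = line.match(/ancyent (?=marinere)/i);

// Negative lookahead: fails here because "Marinere" does follow.
const notAhead = /ancyent (?!marinere)/i.test(line);

// Positive lookbehind: " Marinere" matches only when "ancyent" precedes it.
const behind = line.match(/(?<=ancyent) marinere/i);

// Negative lookbehind: fails after "ancyent", succeeds after "bright".
const notBehind = /(?<!ancyent) marinere/i.test(line);
const notBehindOther = /(?<!ancyent) marinere/i.test(otherLine);

console.log(ahead[0], notAhead, behind[0], notBehind, notBehindOther);
```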

  • Original post: https://www.cnblogs.com/riskyer/p/3246907.html