• pandas: String Processing


    import numpy as np 
    import pandas as pd
    

    Python has long been a popular language for raw data manipulation, in part because of its ease of use for string and text processing. Most text operations are made simple by the string object's built-in methods. For more complex pattern matching and text manipulation, regular expressions may be needed. pandas adds to the mix by letting you apply string methods and regular expressions concisely to whole arrays of data, while also handling the annoyance of missing data.

    Common String Object Methods

    In many string munging and scripting applications, the built-in string methods are sufficient. As an example, a comma-separated string can be broken into pieces with split:

    val = 'a,b,    guido'
    
    val.split(',')
    
    ['a', 'b', '    guido']
    

    split is often combined with strip to trim whitespace (including line breaks):

    pieces = [x.strip() for x in val.split(',')]
    
    pieces
    
    ['a', 'b', 'guido']
    

    These substrings could be concatenated with a two-colon delimiter using addition:

    first, second, third = pieces  # tuple unpacking
    
    first + "::" + second + "::" + third
    
    'a::b::guido'
    

    But this isn't a practical generic method. A faster and more Pythonic way is to pass a list or tuple to the join method on the string "::":

    '::'.join(pieces)
    
    'a::b::guido'
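
    As a small sketch of the point above: join accepts any iterable of strings, so it scales to any number of pieces, but it raises TypeError on non-string items. The mixed list below is a made-up input used only for illustration.

```python
# Sketch: join any iterable of strings; non-string items must be converted first.
pieces = ['a', 'b', 'guido']
joined = '::'.join(pieces)

# Hypothetical mixed list, used only for illustration
mixed = ['a', 1, 2.5]
joined_mixed = '::'.join(str(item) for item in mixed)

print(joined)
print(joined_mixed)
```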
    

    Other methods are concerned with locating substrings. Using Python's in keyword is the best way to detect a substring, though index and find can also be used:

    "guido" in val
    
    True
    
    val.index(',')  # index of the first occurrence
    
    1
    
    val.find(":")  # index of the first occurrence, or -1 if not found
    
    -1
    

    Note that the difference between find and index is that index raises an exception if the substring isn't found, whereas find returns -1:

    val.index(':')
    
    ---------------------------------------------------------------------------
    
    ValueError                                Traceback (most recent call last)
    
    <ipython-input-37-2c016e7367ac> in <module>
    ----> 1 val.index(':')
    
    
    ValueError: substring not found
    
    val.find(":")
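
    The two behaviours can be reconciled in a small wrapper. The sketch below uses a hypothetical helper, position_of, that turns the ValueError from index into None. It also shows why the -1 from find should be compared explicitly: -1 is truthy, so a bare "if val.find(...)" would be misleading.

```python
val = 'a,b,    guido'


def position_of(text, sub):
    """Return the index of sub in text, or None if absent (hypothetical helper)."""
    try:
        return text.index(sub)
    except ValueError:
        return None


comma_pos = position_of(val, ',')
colon_pos = position_of(val, ':')

# find signals "not found" with -1, which is truthy, so compare explicitly
colon_found = val.find(':') != -1

print(comma_pos, colon_pos, colon_found)
```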
    

    Relatedly, count returns the number of occurrences of a particular substring:

    val.count(',')
    
    2
    

    replace substitutes occurrences of one pattern with another. It is also commonly used to delete patterns by passing an empty string:

    val
    
    val.replace(',', ':')  # returns a new string; str objects are immutable
    
    'a:b:    guido'
    
    val  # the original is unchanged
    
    'a,b,    guido'
    
    val.replace(',', '')  # replace with an empty string to delete the commas
    
    'ab    guido'
    

    See Table 7-3 for a listing of some of Python's string methods.

    Regular expressions can also be used with many of these operations, as you'll see.

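    Method Description
    count Return the number of non-overlapping occurrences of a substring
    endswith Return True if the string ends with the suffix
    startswith Return True if the string starts with the prefix
    join Use the string as a delimiter for concatenating a sequence of other strings
    index Return the position of the first occurrence of a substring; raise ValueError if not found
    find Return the position of the first occurrence of a substring; return -1 if not found
    rfind Return the position of the last occurrence of a substring (searching from the right); return -1 if not found
    replace Replace occurrences of one substring with another
    strip, rstrip, lstrip Trim whitespace, including newlines, from both sides, the right side, or the left side, respectively
    split Break the string into a list of substrings using the passed delimiter
    lower Convert alphabetic characters to lowercase
    upper Convert alphabetic characters to uppercase
    casefold Convert characters to lowercase, and convert any region-specific variable character combinations to a common comparable form
    ljust, rjust Left justify or right justify, respectively; pad the opposite side with spaces (or another fill character) to reach a minimum width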
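
    A few of the methods in the table that the examples above do not exercise can be tried as follows. This is only a sketch; the file name and other inputs are made up for illustration.

```python
# Sketch: a few of the string methods from the table, on made-up inputs.
filename = 'report_2020.CSV'

is_csv = filename.lower().endswith('.csv')      # case-insensitive suffix check
is_report = filename.startswith('report')

letters = 'a,b,c'
last_comma = letters.rfind(',')                 # position of the last comma

# casefold is more aggressive than lower: the German sharp s becomes 'ss'
same_word = 'Straße'.casefold() == 'STRASSE'.casefold()

padded_left = 'ab'.ljust(5, '.')                # text on the left, fill on the right
padded_right = 'ab'.rjust(5)                    # text on the right, spaces on the left

print(is_csv, is_report, last_comma, same_word, padded_left, repr(padded_right))
```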

    Regular Expressions

    Regular expressions provide a flexible way to search for or match (often more complex) string patterns in text. A single expression, commonly called a regex, is a string formed according to the regular expression language. Python's built-in re module is responsible for applying regular expressions to strings; I'll give a number of examples of its use here.

    The art of writing regular expressions could be a chapter of its own and thus is outside the book's scope. There are many excellent tutorials and references available on the internet and in other books.

    The re module functions fall into three categories: pattern matching, substitution, and splitting. Naturally these are all related; a regex describes a pattern to locate in the text, which can then be used for many purposes. Let's look at a simple example:
    Suppose we want to split a string with a variable number of whitespace characters (tabs, spaces, and newlines). The regex describing one or more whitespace characters is \s+:

    import re 
    
    text = "foo    bar	  baz   	qux"
    
    re.split(r"\s+", text)  # split on runs of whitespace
    
    ['foo', 'bar', 'baz', 'qux']
    

    When you call re.split(r'\s+', text), the regular expression is first compiled, and then its split method is called on the passed text. You can compile the regex yourself with re.compile, forming a reusable regex object:

    regex = re.compile(r'\s+')  # a compiled pattern is handy when it is reused
    
    regex.split(text)
    
    ['foo', 'bar', 'baz', 'qux']
    

    If, instead, you want to get a list of all patterns matching the regex, you can use the findall method:

    regex.findall(text)  # return every match as a list
    
    ['    ', '	  ', '   	']
    

    To avoid unwanted escaping with \ in a regular expression, use raw string literals like r'C:\x' instead of the equivalent 'C:\\x'.
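
    The effect of the r prefix can be checked directly. The following sketch compares a raw pattern with its escaped spelling; the sample string is made up.

```python
import re

# A raw string keeps backslashes literally; a normal string needs them doubled.
raw_pattern = r'\s+'
escaped_pattern = '\\s+'
same_pattern = raw_pattern == escaped_pattern

# '\n' is one newline character, while r'\n' is a backslash followed by 'n'.
newline_len = len('\n')
raw_newline_len = len(r'\n')

# Either spelling yields the same regex, which splits on runs of whitespace.
parts = re.split(raw_pattern, 'foo    bar\t baz')
print(same_pattern, newline_len, raw_newline_len, parts)
```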

    Creating a regex object with re.compile is highly recommended if you intend to apply the same expression to many strings; doing so improves code reuse and saves CPU cycles.
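
    As an illustration of that reuse, the sketch below compiles the whitespace pattern once and applies it to several strings. The list of lines is invented for the example.

```python
import re

# Compile once, then reuse the same pattern object for every input line.
whitespace = re.compile(r'\s+')

lines = ['foo  bar', 'baz\t qux', 'single']  # made-up sample input
tokens = [whitespace.split(line) for line in lines]

print(tokens)
```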

    match and search are closely related to findall. While findall returns all matches in a string, search returns only the first match. More rigidly, match only matches at the beginning of the string. As a less trivial example, let's consider a block of text and a regular expression capable of identifying most email addresses:

    text = """Dave dave@google.com
    Steve steve@gmail.com
    Rob rob@gmail.com
    Ryan ryan@yahoo.com
    """
    
    # Match every email address in the text
    pattern = r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"
    
    # re.IGNORECASE makes the regex case-insensitive
    regex = re.compile(pattern, flags=re.IGNORECASE)
    

    Using findall on the text produces a list of the email addresses:

    regex.findall(text)
    
    ['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']
    

    search returns a special match object for the first email address in the text. For the preceding regex, the match object can only tell us the start and end position of the pattern in the string:

    m = regex.search(text)  # returns only the first match
    m  # a match object
    
    
    <_sre.SRE_Match object; span=(5, 20), match='dave@google.com'>
    
    text[m.start():m.end()]
    
    'dave@google.com'
    

    regex.match returns None, as it will match only if the pattern occurs at the start of the string:

    # match returns None when the pattern does not match at the start of the string
    print(regex.match(text)) 
    
    None
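
    The three lookup methods can be compared side by side. The sketch below reuses the email pattern on a shortened two-line version of the text; the string starting with an address is an added example to show a case where match succeeds.

```python
import re

regex = re.compile(r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}', flags=re.IGNORECASE)
text = "Dave dave@google.com\nSteve steve@gmail.com\n"

first = regex.search(text)        # first match anywhere in the string
at_start = regex.match(text)      # None: the text starts with 'Dave ', not an address
anchored = regex.match("dave@google.com and more")  # matches at position 0
everything = regex.findall(text)  # every match, as a list of strings

print(first.group(), first.span(), at_start, anchored.group(), everything)
```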
    

    Relatedly, sub will return a new string with occurrences of the pattern replaced by a new string:

    # arguments of a compiled pattern's sub: replacement, string, and optionally count
    print(regex.sub('REDACTED', text)) 
    
    Dave REDACTED
    Steve REDACTED
    Rob REDACTED
    Ryan REDACTED
    

    Suppose you wanted to find email addresses and simultaneously segment each address into its three components: username, domain name, and domain suffix. To do this, put parentheses around the parts of the pattern to segment:

    pattern = r'([a-z0-9+_.%-]+)@([a-z0-9+-._]+)\.([a-z0-9]{2,4})'  # parentheses define groups
    
    regex = re.compile(pattern, flags=re.IGNORECASE) 
    

    A match object produced by this modified regex returns a tuple of the pattern components with its groups method:

    m = regex.match("wesm@bring.net")
    
    m.groups()
    
    ('wesm', 'bring', 'net')
    

    findall returns a list of tuples when the pattern has groups:

    regex.findall(text)  # regexes are very handy for data cleaning
    
    [('dave', 'google', 'com'),
     ('steve', 'gmail', 'com'),
     ('rob', 'gmail', 'com'),
     ('ryan', 'yahoo', 'com')]
    

    sub also has access to groups in each match using special symbols like \1 and \2. The symbol \1 corresponds to the first matched group, \2 corresponds to the second, and so forth:

    # Backreferences make sub a powerful tool for data cleaning
    print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))
    
    
    Dave Username: dave, Domain: google, Suffix: com
    Steve Username: steve, Domain: gmail, Suffix: com
    Rob Username: rob, Domain: gmail, Suffix: com
    Ryan Username: ryan, Domain: yahoo, Suffix: com
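
    Numbered groups become hard to read as patterns grow. One alternative, sketched below, is to name the groups with (?P<name>...) and refer to them as \g<name> in the replacement. The group names and the slightly simplified character classes are choices made for this illustration, not part of the original example.

```python
import re

# Named groups; the names user/domain/suffix are illustrative choices
pattern = r'(?P<user>[a-z0-9+_.%-]+)@(?P<domain>[a-z0-9.-]+)\.(?P<suffix>[a-z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)

line = 'Rob rob@gmail.com'
m = regex.search(line)
parts = m.groupdict()  # mapping from group name to matched text

# \g<name> refers to a named group inside the replacement string
rewritten = regex.sub(r'\g<user> at \g<domain>', line)

print(parts, rewritten)
```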
    

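    There is much more to regular expressions in Python, most of which is outside the book's scope. Table 7-4 provides a brief summary.

    Method Description
    findall Return all non-overlapping matches in a string as a list
    finditer Like findall, but returns an iterator
    match Match the pattern at the start of the string; return a match object if it matches, otherwise None
    search Scan the string for the pattern at any position; return a match object for the first match, otherwise None
    split Break the string into pieces at each occurrence of the pattern
    sub, subn Replace matches with a replacement expression and return the new string (subn also returns the number of substitutions); use \1, \2, ... to refer to match groups in the replacement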
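
    Two entries in the table, finditer and subn, are not demonstrated above. A minimal sketch on a shortened two-line version of the text, also showing the count argument of sub:

```python
import re

regex = re.compile(r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}', flags=re.IGNORECASE)
text = "Dave dave@google.com\nSteve steve@gmail.com\n"

# finditer yields match objects lazily, so positions are available for each match
spans = [(m.group(), m.start(), m.end()) for m in regex.finditer(text)]

# subn returns a tuple: the new string and the number of substitutions made
redacted, n_subs = regex.subn('REDACTED', text)

# count limits how many matches sub replaces
first_only = regex.sub('REDACTED', text, count=1)

print(spans, n_subs)
```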

    Vectorized String Processing

    Cleaning up a messy dataset for analysis often requires a lot of string munging and regularization. To complicate matters, a column containing strings will sometimes have missing data:

    data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
     'Rob': 'rob@gmail.com', 'Wes': np.nan}
    
    data = pd.Series(data)
    
    data
    
    Dave     dave@google.com
    Steve    steve@gmail.com
    Rob        rob@gmail.com
    Wes                  NaN
    dtype: object
    
    data.isnull()
    
    Dave     False
    Steve    False
    Rob      False
    Wes       True
    dtype: bool
    

    You can apply string and regular expression methods (passing a lambda or other function) to each value using data.map, but it will fail on the NA values. To cope with this, Series has array-oriented methods for string operations that skip NA values. These are accessed through Series's str attribute; for example, we could check whether each email address has 'gmail' in it with str.contains:

    data.str.contains("gmail")  # like 'in'
    
    Dave     False
    Steve     True
    Rob       True
    Wes        NaN
    dtype: object
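
    The failure of data.map on missing values can be reproduced with a short sketch. It rebuilds the same Series, catches the TypeError raised when the lambda reaches the NaN entry, and then uses the na argument of str.contains to pick an explicit value for missing entries.

```python
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

try:
    data.map(lambda x: 'gmail' in x)
    map_failed = False
except TypeError:
    # the 'in' test is not defined for the float NaN stored under 'Wes'
    map_failed = True

# The str accessor skips the missing value; na=False reports it as False instead
has_gmail = data.str.contains('gmail', na=False)

print(map_failed)
print(has_gmail)
```

    Passing na=False is convenient when the result is used as a Boolean mask, since a mask containing NaN cannot be used for row selection.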
    

    Regular expressions can be used, too, along with any re option like IGNORECASE:

    pattern
    
    '([a-z0-9+_.%-]+)@([a-z0-9+-._]+)\.([a-z0-9]{2,4})'
    
    data.str.findall(pattern, flags=re.IGNORECASE)  # applied to each element
    
    Dave     [(dave, google, com)]
    Steve    [(steve, gmail, com)]
    Rob        [(rob, gmail, com)]
    Wes                        NaN
    dtype: object
    

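    There are a couple of ways to do vectorized element retrieval. Either use str.get or index into the str attribute. First, note that str.match returns a Boolean for each element indicating whether the pattern matches, not the matched groups: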

    matches = data.str.match(pattern, flags=re.IGNORECASE)
    
    matches
    
    Dave     True
    Steve    True
    Rob      True
    Wes       NaN
    dtype: object
    

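    To access elements in embedded lists, we can pass an index to either of these functions. Because matches holds Booleans rather than lists here, both calls in the output shown return NaN for every element; apply them to the result of str.findall to retrieve actual elements: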

    matches.str.get(1)
    
    Dave    NaN
    Steve   NaN
    Rob     NaN
    Wes     NaN
    dtype: float64
    
    matches.str[0]
    
    Dave    NaN
    Steve   NaN
    Rob     NaN
    Wes     NaN
    dtype: float64
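
    Element retrieval is more informative on the output of str.findall, which holds a list of tuples per row. The sketch below uses a slightly simplified version of the email pattern (an assumption for this example) and shows both str[...] and str.get.

```python
import re

import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

# Simplified variant of the email pattern, used only for this sketch
pattern = r'([a-z0-9._%+-]+)@([a-z0-9.-]+)\.([a-z]{2,4})'

found = data.str.findall(pattern, flags=re.IGNORECASE)  # list of tuples per row
first_match = found.str[0]          # first tuple in each list (NaN stays NaN)
domains = first_match.str.get(1)    # second group of that tuple

print(domains)
```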
    

    You can similarly slice strings using this syntax:

    data.str[:5]
    
    Dave     dave@
    Steve    steve
    Rob      rob@g
    Wes        NaN
    dtype: object
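
    Fixed-width slices cut the addresses at arbitrary points. As a sketch of an alternative, the username can be taken by splitting on '@' and indexing into the resulting lists; str.slice is the method form of the bracket syntax.

```python
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

prefix = data.str.slice(0, 5)        # same as data.str[:5]
users = data.str.split('@').str[0]   # text before the '@' in each address

print(users)
```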
    

    See Table 7-5 for more pandas string methods.

    • cat
    • contains
    • count
    • extract Extract captured groups using a regular expression
    • endswith
    • startswith
    • findall
    • get Index into each element
    • isalnum Check whether all characters are alphanumeric
    • isalpha
    • isdecimal
    • isdigit
    • islower
    • isupper
    • isnumeric
    • join
    • len
    • lower/ upper
    • match
    • pad Add whitespace to left, right or both sides of strings
    • repeat
    • replace
    • slice
    • split
    • strip
    • rstrip
    • lstrip
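
    Several of the methods listed can be combined on the email Series from earlier. In the sketch below, str.extract returns one column per capture group; the named groups and the simplified pattern are assumptions made for the illustration.

```python
import re

import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

# Named groups become column names in the result (the names are illustrative)
pattern = r'(?P<user>[a-z0-9._%+-]+)@(?P<domain>[a-z0-9.-]+)\.(?P<suffix>[a-z]{2,4})'
parts = data.str.extract(pattern, flags=re.IGNORECASE)

upper = data.str.upper()   # missing values stay missing
lengths = data.str.len()   # number of characters in each address

print(parts)
```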

    Summary

    Effective data preparation can significantly improve productivity by enabling you to spend more time analyzing data and less time getting it ready for analysis.
    We have explored a number of tools in this chapter, but the coverage here is by no means comprehensive. In the next chapter, we will explore pandas's joining and grouping functionality.

  • Original article: https://www.cnblogs.com/chenjieyouge/p/11920822.html