I recently wrote some R code for processing text, so here are some of the string-handling functions I have been using.
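1, String searching
grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)
value = FALSE (the default) makes grep() return the positions (indices) of the elements of x that match; value = TRUE returns the matching elements themselves. perl = TRUE means the pattern is interpreted as a Perl-compatible regular expression.
fixed = TRUE means the pattern is a string to be matched as is, with no regular-expression interpretation.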
> grep("[aeiou]",c("apple","banana","peak"))
[1] 1 2 3
> grep("[aeio]",c("apple","banana","peak","unit","bbddff"),value = TRUE)
[1] "apple" "banana" "peak" "unit"
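In a regular expression, characters such as * and . are metacharacters, so to search for them literally they must be escaped with a backslash. Inside an R string the backslash itself has to be doubled: "\\*" matches a literal *. Alternatively, pass fixed = TRUE so that no escaping is needed.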
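As a small sketch of the escaping rule above (the vector `x` and the variable names are made up for illustration, they are not from the original post), the following compares an escaped pattern with `fixed = TRUE`, and also shows `value` and `invert`:

```r
# Hypothetical sample data, chosen only for illustration
x <- c("a*b", "ab", "a.b", "aab")

# "\\*" in R source code is the regular expression \*, i.e. a literal asterisk
star_regex <- grep("\\*", x)

# fixed = TRUE treats the pattern as a plain string, so "*" needs no escaping
star_fixed <- grep("*", x, fixed = TRUE)

# The same idea for a literal dot, returning the matching elements themselves
dot_value <- grep("\\.", x, value = TRUE)

# invert = TRUE returns the elements that do NOT contain a literal dot
no_dot <- grep(".", x, fixed = TRUE, value = TRUE, invert = TRUE)

print(star_regex)
print(star_fixed)
print(dot_value)
print(no_dot)
```

An unescaped "." would match any character, so `grep(".", x)` would return every non-empty element, which is usually not what is wanted.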
2, String replacement functions
sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE) replaces only the first match in each string
gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE) replaces every match
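To make the difference between the two concrete, here is a minimal sketch. The input string and the variable names are assumptions picked for illustration only:

```r
# Hypothetical input string, used only for illustration
s <- "AtGCtttACC"

first_only <- sub("t", "U", s)     # only the first "t" is replaced
all_hits   <- gsub("t", "U", s)    # every "t" is replaced

# ignore.case = TRUE matches both cases; here all a/A and c/C are deleted
stripped <- gsub("[ac]", "", s, ignore.case = TRUE)

# "\\1" in the replacement refers back to the first parenthesised group
bracketed <- gsub("(t+)", "[\\1]", s)

print(first_only)
print(all_hits)
print(stripped)
print(bracketed)
```

Both functions are vectorised over x, so passing a character vector applies the replacement to each element separately.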
3, Counting and translating characters
nchar(x, type = c("bytes", "chars", "width"), allowNA = FALSE)
For example:
> nchar(c("apple","banana","peak","unit","NA"),type = "width",allowNA = TRUE)
[1] 5 6 4 4 2
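The type argument only makes a visible difference for non-ASCII text. The sketch below assumes a two-character Chinese string written with \u escapes (which R stores as UTF-8); the variable names are hypothetical:

```r
# "\u4e2d\u6587" is the two-character string "中文", stored as UTF-8
zh <- "\u4e2d\u6587"

n_chars <- nchar(zh, type = "chars")   # number of characters
n_bytes <- nchar(zh, type = "bytes")   # number of bytes in the UTF-8 encoding

# For plain ASCII text the two counts agree
ascii_same <- nchar("apple", type = "chars") == nchar("apple", type = "bytes")

print(n_chars)
print(n_bytes)
print(ascii_same)
```

Each of these characters takes three bytes in UTF-8, which is why the byte count is three times the character count.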
The other three functions, tolower(), toupper() and chartr(), are just as simple to use:
> DNA <- "AtGCtttACC"
> tolower(DNA)
[1] "atgctttacc"
> toupper(DNA)
[1] "ATGCTTTACC"
> chartr("Tt", "Uu", DNA)
[1] "AuGCuuuACC"
> chartr("Tt", "UU", DNA)
[1] "AUGCUUUACC"
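4, String concatenation
paste(x, y, sep = " ", collapse = NULL): sep sets the separator placed between corresponding elements, i.e. between x[1] and y[1]; collapse sets the separator used to join the resulting strings (x[1] y[1], x[2] y[2], and so on) into one single string. With collapse = NULL the result stays a vector.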
> paste("A", 1:6,collapse = ".")
[1] "A 1.A 2.A 3.A 4.A 5.A 6"
> paste("A", 1:6, sep = " ")
[1] "A 1" "A 2" "A 3" "A 4" "A 5" "A 6"
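The interplay of sep and collapse is easiest to see when both are set. This is a minimal sketch with made-up inputs `x` and `y`; it also uses paste0(), which is not mentioned above but is base R shorthand for sep = "":

```r
# Hypothetical inputs, used only for illustration
x <- c("a", "b", "c")
y <- 1:3

# sep joins corresponding elements; the result is still a vector of length 3
pairs <- paste(x, y, sep = "-")

# collapse then joins those three strings into a single string
joined <- paste(x, y, sep = "-", collapse = "+")

# paste0() is equivalent to paste(..., sep = "")
tight <- paste0(x, y)

print(pairs)
print(joined)
print(tight)
```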
5, String splitting
strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)
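A short sketch of how strsplit() behaves, using made-up inputs (the variable names are hypothetical). Note that it always returns a list with one element per input string, and that split is treated as a regular expression unless fixed = TRUE:

```r
# Hypothetical comma-separated records, used only for illustration
recs <- c("a,b,c", "d,e")

# The result is a list with one character vector per input string
parts <- strsplit(recs, ",")

# "." is a regex metacharacter, so split on it literally with fixed = TRUE
dotted <- strsplit("a.b.c", ".", fixed = TRUE)[[1]]

# An empty split breaks the string into single characters
chars <- strsplit("AtGC", "")[[1]]

print(parts)
print(dotted)
print(chars)
```

Use `[[1]]` (or unlist()) to get a plain character vector when only one string is being split.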