Details of R Regular Expression Syntax
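A regular expression is a way of pattern-matching text, and regular expressions in all languages share some common features. Run help(regex) to read R's help page on regular expressions. Below we go through metacharacters, quantifiers, sequences, character classes, and POSIX character classes in turn.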
1.Metacharacters
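The simplest regular expression matches a single ordinary character, such as a letter, a digit, or a punctuation mark. Punctuation and other characters with a special meaning are called metacharacters. To match a metacharacter literally it must be escaped with a backslash, and because the backslash itself has to be escaped inside an R string, the pattern is written with '\\'. The main metacharacters are: . $ * + ? | \ ^ [ ] { } ( )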
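# a word containing a metacharacter
money = "$money"
# wrong: an unescaped $ is the end-of-string anchor, so the "$" character is not removed
sub(pattern = "$", replacement = "", x = money)
## [1] "$money"
# right: escape the metacharacter
sub(pattern = "\\$", replacement = "", x = money)
## [1] "money"
# similar examples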
sub("\\$", "", "$Peace-Love")
## [1] "Peace-Love"
sub("\\.", "", "Peace.Love")
## [1] "PeaceLove"
sub("\\+", "", "Peace+Love")
## [1] "PeaceLove"
sub("\\^", "", "Peace^Love")
## [1] "PeaceLove"
sub("\\|", "", "Peace|Love")
## [1] "PeaceLove"
sub("\\(", "", "Peace(Love)")
## [1] "PeaceLove)"
sub("\\)", "", "Peace(Love)")
## [1] "Peace(Love"
sub("\\[", "", "Peace[Love]")
## [1] "PeaceLove]"
sub("\\]", "", "Peace[Love]")
## [1] "Peace[Love"
sub("\\{", "", "Peace{Love}")
## [1] "PeaceLove}"
sub("\\}", "", "Peace{Love}")
## [1] "Peace{Love"
sub("\\\\", "", "Peace\\Love")
## [1] "PeaceLove"
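Escaping with '\\' is not the only way to match a metacharacter literally. The following is a minimal sketch of two alternatives available in base R: the `fixed = TRUE` argument, which treats the whole pattern as a literal string, and a one-character bracket expression. The variable names are only illustrative.

```r
# fixed = TRUE disables regular expression syntax, so "$" and "." are literal
plain_dollar <- sub("$", "", "$money", fixed = TRUE)
plain_dot <- sub(".", "", "Peace.Love", fixed = TRUE)

# inside [ ] most metacharacters lose their special meaning
class_dot <- sub("[.]", "", "Peace.Love")
class_plus <- sub("[+]", "", "Peace+Love")

print(c(plain_dollar, plain_dot, class_dot, class_plus))
## [1] "money"     "PeaceLove" "PeaceLove" "PeaceLove"
```

`fixed = TRUE` is usually the simpler choice when the pattern contains no regular expression syntax at all.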
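2.Sequences
Sequences are backslash shorthands that match a whole category of characters (or a position). In an R string each of them is written with a doubled backslash, for example "\\d". The main sequences are:
\d matches a digit character
\D matches a non-digit character
\s matches a whitespace character
\S matches a non-whitespace character
\w matches a word character
\W matches a non-word character
\b matches a word boundary
\B matches a position that is not a word boundary
\h matches horizontal whitespace (PCRE syntax, used with perl = TRUE)
\H matches anything that is not horizontal whitespace (PCRE syntax)
\v matches vertical whitespace (PCRE syntax)
\V matches anything that is not vertical whitespace (PCRE syntax)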
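As a quick check of what these sequences match, here is a small sketch using `grepl()` on a made-up vector `x`. The last call assumes the PCRE engine (`perl = TRUE`), since `\h` is PCRE syntax.

```r
x <- c("abc", "a1", "two words", "tab\there")

# does each element contain a digit / a whitespace character?
has_digit <- grepl("\\d", x)
has_space <- grepl("\\s", x)
print(has_digit)
## [1] FALSE  TRUE FALSE FALSE
print(has_space)
## [1] FALSE FALSE  TRUE  TRUE

# \h covers space and tab but not the newline
h_replaced <- gsub("\\h", "_", "a b\tc\nd", perl = TRUE)
print(h_replaced)
## [1] "a_b_c\nd"
```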
2.1 Digits and non-digits
# replace digits with '_'
sub("\\d", "_", "the dandelion war 2010")
## [1] "the dandelion war _010"
gsub("\\d", "_", "the dandelion war 2010")
## [1] "the dandelion war ____"
# replace non-digit characters with '_'
sub("\\D", "_", "the dandelion war 2010")
## [1] "_he dandelion war 2010"
gsub("\\D", "_", "the dandelion war 2010")
## [1] "__________________2010"
2.2 Whitespace and non-whitespace
# replace whitespace with '_'
sub("\\s", "_", "the dandelion war 2010")
## [1] "the_dandelion war 2010"
gsub("\\s", "_", "the dandelion war 2010")
## [1] "the_dandelion_war_2010"
# replace non-whitespace characters with '_'
sub("\\S", "_", "the dandelion war 2010")
## [1] "_he dandelion war 2010"
gsub("\\S", "_", "the dandelion war 2010")
## [1] "___ _________ ___ ____"
2.3 Word and non-word characters
# replace word characters with '_'
sub("\\w", "_", "the dandelion war 2010")
## [1] "_he dandelion war 2010"
gsub("\\w", "_", "the dandelion war 2010")
## [1] "___ _________ ___ ____"
# replace non-word characters with '_'
sub("\\W", "_", "the dandelion war 2010")
## [1] "the_dandelion war 2010"
gsub("\\W", "_", "the dandelion war 2010")
## [1] "the_dandelion_war_2010"
2.4 Word boundaries and non-boundaries
# insert '_' at word boundaries
sub("\\b", "_", "the dandelion war 2010")
## [1] "_the dandelion war 2010"
gsub("\\b", "_", "the dandelion war 2010")
## [1] "_t_h_e_ _d_a_n_d_e_l_i_o_n_ _w_a_r_ _2_0_1_0_"
# insert '_' at positions that are not word boundaries
sub("\\B", "_", "the dandelion war 2010")
## [1] "t_he dandelion war 2010"
gsub("\\B", "_", "the dandelion war 2010")
## [1] "t_he d_an_de_li_on w_ar 2_01_0"
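The outputs above come from R's default regex engine, which handles the zero-length matches of \b and \B in its own way inside `gsub()`. As a hedged sketch, switching to the PCRE engine with `perl = TRUE` marks only the true word boundaries, and `\b` is most useful in practice for anchoring a whole word. The variable names are illustrative.

```r
x <- "the dandelion war 2010"

# with PCRE, '_' is inserted only at the start and end of each word
marked <- gsub("\\b", "_", x, perl = TRUE)
print(marked)
## [1] "_the_ _dandelion_ _war_ _2010_"

# \b on both sides matches "war" only as a whole word
whole_word <- gsub("\\bwar\\b", "peace", x, perl = TRUE)
print(whole_word)
## [1] "the dandelion peace 2010"

# "dande" is only part of a word, so nothing is replaced
partial <- gsub("\\bdande\\b", "X", x, perl = TRUE)
print(partial)
## [1] "the dandelion war 2010"
```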
3.Character Class
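A character class (or character set) is a set of characters enclosed in "[ ]"; it matches any single character from the set. For example, [aA] matches either a lowercase a or an uppercase A, and [0123456789] matches any single digit. Note the difference between a character class, which matches one character, and a plain sequence of characters, which must match in full. Some common character classes are:
[aeiou] matches any lowercase vowel
[AEIOU] matches any uppercase vowel
[0123456789] matches any single digit
[0-9] matches any digit (same as above)
[a-z] matches any ASCII lowercase letter
[A-Z] matches any ASCII uppercase letter
[a-zA-Z0-9] matches any character from the classes above
[^aeiou] matches any character except a lowercase vowel
[^0-9] matches any character except a digit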
transport = c("car", "bike", "plane", "boat")
# elements containing 'e' or 'i'
grep(pattern = "[ei]", transport, value = TRUE)
## [1] "bike" "plane"
numerics = c("123", "17-April", "I-II-III", "R 3.0.1")
# elements containing '0' or '1'
grep(pattern = "[01]", numerics, value = TRUE)
## [1] "123" "17-April" "R 3.0.1"
# elements containing any digit
grep(pattern = "[0-9]", numerics, value = TRUE)
## [1] "123" "17-April" "R 3.0.1"
# elements containing at least one non-digit character (not "elements without digits")
grep(pattern = "[^0-9]", numerics, value = TRUE)
## [1] "17-April" "I-II-III" "R 3.0.1"
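Because "[^0-9]" only requires one non-digit somewhere in the string, it does not select the elements that contain no digits at all. A minimal sketch of two related selections, using `grep()`'s `invert` argument and the anchors ^ and $ with the + quantifier (these go slightly beyond what this section has covered so far):

```r
numerics <- c("123", "17-April", "I-II-III", "R 3.0.1")

# elements that contain no digit at all: invert the "contains a digit" match
no_digits <- grep("[0-9]", numerics, value = TRUE, invert = TRUE)
print(no_digits)
## [1] "I-II-III"

# elements made up of digits only: anchor the class to the whole string
only_digits <- grep("^[0-9]+$", numerics, value = TRUE)
print(only_digits)
## [1] "123"
```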
4.POSIX Character Classes
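POSIX character classes are named classes of the form [:name:]. They are used inside a bracket expression, which gives the double-bracket form "[[ ]]". Common POSIX character classes are:
[[:lower:]] lowercase letters
[[:upper:]] uppercase letters
[[:alpha:]] all letters ([[:lower:]] and [[:upper:]])
[[:digit:]] digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
[[:alnum:]] letters and digits ([[:alpha:]] and [[:digit:]])
[[:blank:]] blank characters: space and tab
[[:cntrl:]] control characters
[[:punct:]] punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
[[:space:]] whitespace characters: tab, newline, vertical tab, form feed, carriage return, and space
[[:xdigit:]] hexadecimal digits: 0-9 A B C D E F a b c d e f
[[:print:]] printable characters ([[:alnum:]], [[:punct:]] and space)
[[:graph:]] graphical characters ([[:alnum:]] and [[:punct:]])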
# la vie (string)
la_vie = "La vie en #FFC0CB (rose);\nCes't la vie! \ttres jolie"
print(la_vie)
## [1] "La vie en #FFC0CB (rose);\nCes't la vie! \ttres jolie"
cat(la_vie)
## La vie en #FFC0CB (rose);
## Ces't la vie! tres jolie
# remove blank characters (spaces and tabs)
gsub(pattern = "[[:blank:]]", replacement = "", la_vie)
## [1] "Lavieen#FFC0CB(rose);\nCes'tlavie!tresjolie"
# remove punctuation
gsub(pattern = "[[:punct:]]", replacement = "", la_vie)
## [1] "La vie en FFC0CB rose\nCest la vie \ttres jolie"
# remove hexadecimal digits (0-9, A-F, a-f), which also removes the letters a to f
gsub(pattern = "[[:xdigit:]]", replacement = "", la_vie)
## [1] "L vi n # (ros);\ns't l vi! \ttrs joli"
# remove printable characters (the tab is removed too in this output; that can differ by platform)
gsub(pattern = "[[:print:]]", replacement = "", la_vie)
## [1] "\n"
# remove non-printable characters
gsub(pattern = "[^[:print:]]", replacement = "", la_vie)
## [1] "La vie en #FFC0CB (rose);Ces't la vie! \ttres jolie"
# remove graphical characters
gsub(pattern = "[[:graph:]]", replacement = "", la_vie)
## [1] " \n \t "
# remove non-graphical characters
gsub(pattern = "[^[:graph:]]", replacement = "", la_vie)
## [1] "Lavieen#FFC0CB(rose);Ces'tlavie!tresjolie"
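POSIX classes can also be combined inside one bracket expression and repeated with a quantifier. The sketch below reuses the la_vie string from above; the + quantifier and the variable names `collapsed` and `letters_only` are additions made for illustration.

```r
la_vie <- "La vie en #FFC0CB (rose);\nCes't la vie! \ttres jolie"

# collapse every run of whitespace (space, tab, newline) into one space
collapsed <- gsub("[[:space:]]+", " ", la_vie)
print(collapsed)
## [1] "La vie en #FFC0CB (rose); Ces't la vie! tres jolie"

# two classes in one negated bracket: drop everything that is
# neither alphanumeric nor whitespace
letters_only <- gsub("[^[:alnum:][:space:]]", "", la_vie)
print(letters_only)
## [1] "La vie en FFC0CB rose\nCest la vie \ttres jolie"
```

For this ASCII string the second call gives the same result as removing [[:punct:]].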