R Regular Expression Syntax in Detail

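A regular expression is a pattern used to match text, and regular expressions share a common set of features across languages. Run help(regex) to read R's documentation on regular expressions. The sections below cover metacharacters, quantifiers, sequences, character classes, and POSIX character classes in turn.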

1. Metacharacters

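The simplest regular expression matches a single plain character, such as a letter, a digit, or a punctuation mark. Some punctuation characters have a special meaning in a pattern and are called metacharacters. To match one literally, escape it with a backslash; because the backslash must itself be escaped inside an R string, this is written as "\\" in R code. The main metacharacters are: . $ * + ? | \ ^ [ ] { } ( )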

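# A word containing a metacharacter
money = "$money"
# Wrong: an unescaped "$" is the end-of-string anchor, so no literal "$" is removed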
sub(pattern = "$", replacement = "", x = money)

## [1] "$money"

# Right: escape the "$" so it is matched literally
sub(pattern = "\\$", replacement = "", x = money)

## [1] "money"


# Similar examples
sub("\\$", "", "$Peace-Love")

## [1] "Peace-Love"

sub("\\.", "", "Peace.Love")

## [1] "PeaceLove"

sub("\\+", "", "Peace+Love")

## [1] "PeaceLove"

sub("\\^", "", "Peace^Love")

## [1] "PeaceLove"

sub("\\|", "", "Peace|Love")

## [1] "PeaceLove"

sub("\\(", "", "Peace(Love)")

## [1] "PeaceLove)"

sub("\\)", "", "Peace(Love)")

## [1] "Peace(Love"

sub("\\[", "", "Peace[Love]")

## [1] "PeaceLove]"

sub("\\]", "", "Peace[Love]")

## [1] "Peace[Love"

sub("\\{", "", "Peace{Love}")

## [1] "PeaceLove}"

sub("\\}", "", "Peace{Love}")

## [1] "Peace{Love"

sub("\\\\", "", "Peace\\Love")

## [1] "PeaceLove"
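
Backslash escaping is not the only way to match a metacharacter literally. The sketch below compares three options on the same string: the escape used above, a one-character bracket expression, and the fixed = TRUE argument of sub(), which turns off regular expression interpretation altogether. The variable names are chosen here for illustration only.

```r
# Three ways to remove a literal "$" from a string
money = "$money"

escaped   = sub("\\$", "", money)              # backslash escape
bracketed = sub("[$]", "", money)              # "$" is literal inside [ ]
literal   = sub("$", "", money, fixed = TRUE)  # pattern treated as plain text

print(c(escaped, bracketed, literal))
## [1] "money" "money" "money"
```

fixed = TRUE is usually the simplest choice when the whole pattern is plain text; escaping is needed when a literal metacharacter is mixed with real regular expression syntax.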

2. Sequences

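Sequences are backslash shorthands that match a kind of character or a position. The main sequences are:

\d Matches a digit character

\D Matches a non-digit character

\s Matches a whitespace character

\S Matches a non-whitespace character

\w Matches a word character (letter, digit, or underscore)

\W Matches a non-word character

\b Matches a word boundary

\B Matches a non-word boundary

\h Matches horizontal whitespace (a Perl-style escape; use perl = TRUE)

\H Matches anything except horizontal whitespace (use perl = TRUE)

\v Matches vertical whitespace (a Perl-style escape; use perl = TRUE)

\V Matches anything except vertical whitespace (use perl = TRUE)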
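
The sections that follow do not exercise \h and \v, so here is a minimal sketch of them. It assumes the Perl-compatible engine is selected with perl = TRUE, since these two escapes come from Perl-style regular expressions; the test string x is made up for the example.

```r
# A string with a space, a tab, and a newline
x = "a b\tc\nd"

# \h: horizontal whitespace (space and tab)
horiz = gsub("\\h", "_", x, perl = TRUE)
print(horiz)
## [1] "a_b_c\nd"

# \v: vertical whitespace (the newline here)
vert = gsub("\\v", "_", x, perl = TRUE)
print(vert)
## [1] "a b\tc_d"
```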

2.1 Digits and non-digits

# Replace digits with '_' (sub: first match only; gsub: every match)
sub("\\d", "_", "the dandelion war 2010")

## [1] "the dandelion war _010"

gsub("\\d", "_", "the dandelion war 2010")

## [1] "the dandelion war ____"

# Replace non-digit characters with '_'
sub("\\D", "_", "the dandelion war 2010")

## [1] "_he dandelion war 2010"

gsub("\\D", "_", "the dandelion war 2010")

## [1] "__________________2010"
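
A common practical use of \D is to strip everything except the digits and then convert the result to a number. The following is a small sketch along those lines; the variable names are illustrative.

```r
title = "the dandelion war 2010"

# Drop every non-digit character, keeping only "2010"
year_chr = gsub("\\D", "", title)
year = as.integer(year_chr)
print(year)
## [1] 2010

# The complementary operation: drop the digits instead
no_digits = gsub("\\d", "", title)
print(no_digits)
## [1] "the dandelion war "
```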

2.2 Whitespace and non-whitespace

# Replace whitespace with '_'
sub("\\s", "_", "the dandelion war 2010")

## [1] "the_dandelion war 2010"

gsub("\\s", "_", "the dandelion war 2010")

## [1] "the_dandelion_war_2010"

# Replace non-whitespace characters with '_'
sub("\\S", "_", "the dandelion war 2010")

## [1] "_he dandelion war 2010"

gsub("\\S", "_", "the dandelion war 2010")

## [1] "___ _________ ___ ____"

2.3 Word and non-word characters

# Replace word characters with '_'
sub("\\w", "_", "the dandelion war 2010")

## [1] "_he dandelion war 2010"

gsub("\\w", "_", "the dandelion war 2010")

## [1] "___ _________ ___ ____"

# Replace non-word characters with '_'
sub("\\W", "_", "the dandelion war 2010")

## [1] "the_dandelion war 2010"

gsub("\\W", "_", "the dandelion war 2010")

## [1] "the_dandelion_war_2010"
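
It is easy to forget that \w covers digits and the underscore as well as letters. The short sketch below makes that visible on a made-up identifier string.

```r
id = "snake_case-name 42"

# Removing non-word characters keeps letters, digits, and "_"
only_word = gsub("\\W", "", id)
print(only_word)
## [1] "snake_casename42"

# Removing word characters leaves only the hyphen and the space
only_other = gsub("\\w", "", id)
print(only_other)
## [1] "- "
```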

2.4 Word boundaries and non-word boundaries

# Replace word boundaries with '_'
sub("\\b", "_", "the dandelion war 2010")

## [1] "_the dandelion war 2010"

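# Note: \b is a zero-length match, and gsub with the default engine gives the surprising result below
gsub("\\b", "_", "the dandelion war 2010")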

## [1] "_t_h_e_ _d_a_n_d_e_l_i_o_n_ _w_a_r_ _2_0_1_0_"

# Replace non-word boundaries with '_'
sub("\\B", "_", "the dandelion war 2010")

## [1] "t_he dandelion war 2010"

gsub("\\B", "_", "the dandelion war 2010")

## [1] "t_he d_an_de_li_on w_ar 2_01_0"
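
Replacing the boundary itself is mostly a demonstration; in practice \b is more often used to anchor a pattern to whole words. The sketch below, using made-up example strings, matches "war" only when it stands alone.

```r
x = c("war", "warts", "the war ended", "postwar")

# TRUE only where "war" appears as a whole word
whole_word = grepl("\\bwar\\b", x)
print(whole_word)
## [1]  TRUE FALSE  TRUE FALSE

# Replace the whole word but leave "warts" untouched
replaced = gsub("\\bwar\\b", "peace", "the war on warts")
print(replaced)
## [1] "the peace on warts"
```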

3. Character Class

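A character class (or character set) is a set of characters enclosed in "[ ]"; it matches any single character from the set. For example, [aA] matches either a lowercase a or an uppercase A, and [0123456789] matches any single digit. Keep the class distinct from the characters it lists: however many characters appear inside the brackets, the class matches exactly one character. Some common character classes are: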

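[aeiou] Matches any lowercase vowel

[AEIOU] Matches any uppercase vowel

[0123456789] Matches any single digit

[0-9] Matches any single digit (same as above)

[a-z] Matches any ASCII lowercase letter

[A-Z] Matches any ASCII uppercase letter

[a-zA-Z0-9] Matches any character from the classes above

[^aeiou] Matches any character other than a lowercase vowel

[^0-9] Matches any character other than a digit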

transport = c("car", "bike", "plane", "boat")
# Match strings containing 'e' or 'i'
grep(pattern = "[ei]", transport, value = TRUE)

## [1] "bike"  "plane"

numerics = c("123", "17-April", "I-II-III", "R 3.0.1")
# Match strings containing '0' or '1'
grep(pattern = "[01]", numerics, value = TRUE)

## [1] "123"      "17-April" "R 3.0.1"

# Match strings containing any digit
grep(pattern = "[0-9]", numerics, value = TRUE)

## [1] "123"      "17-April" "R 3.0.1"

# Match strings containing at least one non-digit character
grep(pattern = "[^0-9]", numerics, value = TRUE)

## [1] "17-April" "I-II-III" "R 3.0.1"
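
The pattern "[^0-9]" only requires one non-digit somewhere in the string, which is why "17-April" and "R 3.0.1" are returned. To select strings that contain no digits at all, one option is to anchor the negated class to the whole string with ^ and $ and repeat it with +. This is a sketch of that approach, not the only way to do it.

```r
numerics = c("123", "17-April", "I-II-III", "R 3.0.1")

# Every character from start (^) to end ($) must be a non-digit
no_digit = grep(pattern = "^[^0-9]+$", numerics, value = TRUE)
print(no_digit)
## [1] "I-II-III"

# The opposite: strings made up of digits only
all_digit = grep(pattern = "^[0-9]+$", numerics, value = TRUE)
print(all_digit)
## [1] "123"
```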

4. POSIX Character Classes

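POSIX character classes are named classes written as [:name:]. They are only valid inside a bracket expression, which is why they appear as "[[:name:]]". The common POSIX character classes are: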

[[:lower:]] Lowercase letters

[[:upper:]] Uppercase letters

[[:alpha:]] All letters ([[:lower:]] and [[:upper:]])

[[:digit:]] Digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9

[[:alnum:]] Letters and digits ([[:alpha:]] and [[:digit:]])

[[:blank:]] Blank characters: space and tab

[[:cntrl:]] Control characters

[[:punct:]] Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~

[[:space:]] Whitespace characters: tab, newline, vertical tab, form feed, carriage return, and space

[[:xdigit:]] Hexadecimal digits: 0-9 A B C D E F a b c d e f

[[:print:]] Printable characters ([[:alnum:]], [[:punct:]] and space)

[[:graph:]] Graphical characters ([[:alnum:]] and [[:punct:]])

# la vie (string)
la_vie = "La vie en #FFC0CB (rose);\nCes't la vie! \ttres jolie"
print(la_vie)

## [1] "La vie en #FFC0CB (rose);\nCes't la vie! \ttres jolie"

cat(la_vie)

## La vie en #FFC0CB (rose);
## Ces't la vie!    tres jolie


# Remove blank characters (spaces and tabs; the newline stays)
gsub(pattern = "[[:blank:]]", replacement = "", la_vie)

## [1] "Lavieen#FFC0CB(rose);\nCes'tlavie!tresjolie"


# Remove punctuation
gsub(pattern = "[[:punct:]]", replacement = "", la_vie)

## [1] "La vie en FFC0CB rose\nCest la vie \ttres jolie"


# Remove hexadecimal digits (0-9 and the letters a-f, A-F)
gsub(pattern = "[[:xdigit:]]", replacement = "", la_vie)

## [1] "L vi n # (ros);\ns't l vi! \ttrs joli"


# Remove printable characters (the tab was treated as printable in this run; this may vary by platform)
gsub(pattern = "[[:print:]]", replacement = "", la_vie)

## [1] "\n"


# Remove non-printable characters
gsub(pattern = "[^[:print:]]", replacement = "", la_vie)

## [1] "La vie en #FFC0CB (rose);Ces't la vie! \ttres jolie"


# Remove graphical characters
gsub(pattern = "[[:graph:]]", replacement = "", la_vie)

## [1] "    \n   \t "


# Remove non-graphical characters
gsub(pattern = "[^[:graph:]]", replacement = "", la_vie)

## [1] "Lavieen#FFC0CB(rose);Ces'tlavie!tresjolie"
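
POSIX classes can also be combined with other pattern elements rather than used alone. The sketch below shows two typical cleanup tasks on the same string: collapsing every run of whitespace into one space, and pulling out the hexadecimal color code. It uses the quantifiers + and {6} and the base functions regexpr() and regmatches(); the variable names are illustrative.

```r
la_vie = "La vie en #FFC0CB (rose);\nCes't la vie! \ttres jolie"

# Collapse each run of spaces, tabs, and newlines into a single space
one_line = gsub(pattern = "[[:space:]]+", replacement = " ", la_vie)
print(one_line)
## [1] "La vie en #FFC0CB (rose); Ces't la vie! tres jolie"

# Extract "#" followed by exactly six hexadecimal digits
hex_color = regmatches(la_vie, regexpr("#[[:xdigit:]]{6}", la_vie))
print(hex_color)
## [1] "#FFC0CB"
```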
