Cleaning HTML with Python's re module

Code

import re

def formatHtml(html):
    # Patterns are raw strings so that \b reaches the regex engine as a word boundary.
    flags = re.IGNORECASE
    # Strip attributes from <p> tags.
    html = re.sub(r'<p\b[^>]*>', '<p>', html, flags=flags)
    # Remove <span>, <o:p>, <font> and <b> tags, keeping their content.
    html = re.sub(r'</?span[^>]*>', '', html, flags=flags)
    html = re.sub(r'</?o:p>', '', html, flags=flags)
    html = re.sub(r'</?font[^>]*>', '', html, flags=flags)
    html = re.sub(r'</?b\b[^>]*>', '', html, flags=flags)
    # Remove processing instructions such as <?xml ...>. The ? must be escaped;
    # unescaped, this pattern would strip every tag.
    html = re.sub(r'<\?[^>]*>', '', html, flags=flags)
    # Remove Office-specific tags (st1:*, chsdate, chmetcnv).
    html = re.sub(r'</?st1:[^>]*>', '', html, flags=flags)
    html = re.sub(r'</?chsdate\b[^>]*>', '', html, flags=flags)
    # Strip attributes from <br> tags.
    html = re.sub(r'<br\b[^>]*>', '<br>', html, flags=flags)
    html = re.sub(r'</?chmetcnv\b[^>]*>', '', html, flags=flags)
    # Remove <script> blocks together with their content.
    html = re.sub(r'<script[^>]*?>.*?</script>', '', html, flags=flags | re.DOTALL)
    return html
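
A small check, not part of the original function, shows why patterns containing \b are safest written as raw strings: in an ordinary string literal Python turns \b into a backspace character before re ever sees it. The sample text and variable names below are made up for illustration.

```python
import re

text = '<P class="MsoNormal">hello</P>'

# In a normal string literal, '\b' is the backspace character (0x08).
plain = '<\bp\b[^>]*>'
# In a raw string, the backslash reaches the regex engine, so \b is a word boundary.
raw = r'<\bp\b[^>]*>'

plain_result = re.sub(plain, '<p>', text, flags=re.IGNORECASE)
raw_result = re.sub(raw, '<p>', text, flags=re.IGNORECASE)

print(repr('\b'))     # '\x08'
print(plain_result)   # unchanged: the pattern looks for literal backspace characters
print(raw_result)     # <p>hello</P>
```

Only the opening tag is rewritten, because the pattern has no `/?` and so never matches `</P>`.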

Things to watch for when using re:

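1. def sub(pattern, repl, string, count=0, flags=0):
The fourth positional parameter is count, so it is easy to pass flags there by mistake. Pass flags as a keyword argument (flags=...).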
2. In the replacement string passed to re.sub, \g<n> refers to a captured group: \g<0> is the entire match and \g<1> is the first group.
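3. (<…[^>]*>)(.*?)(</…>) is non-greedy: .*? matches as few characters as possible.
(<…[^>]*>)(.*)(</…>) is greedy: .* matches as many characters as possible.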
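
The three notes above can be tried out in a short sketch. The sample string and variable names are assumptions for illustration. The positional call is included only to show the pitfall; newer Python versions may emit a DeprecationWarning for it, which is silenced here.

```python
import re
import warnings

s = '<b>one</b> and <b>two</b>'

# Pitfall: the fourth positional argument of re.sub is count, not flags.
# re.IGNORECASE has the integer value 2, so this call means count=2
# and the match stays case-sensitive: nothing is removed.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    wrong = re.sub('</?B>', '', s, re.IGNORECASE)

# Passing flags by keyword gives the intended case-insensitive match.
right = re.sub('</?B>', '', s, flags=re.IGNORECASE)

# \g<1> is the first captured group, \g<0> is the whole match.
first_group = re.sub(r'<b>(.*?)</b>', r'[\g<1>]', s)
whole_match = re.sub(r'<b>(.*?)</b>', r'{\g<0>}', s)

# Greedy .* runs to the last </b>; non-greedy .*? stops at the first one.
greedy = re.findall(r'<b>(.*)</b>', s)
lazy = re.findall(r'<b>(.*?)</b>', s)

print(wrong)        # <b>one</b> and <b>two</b>
print(right)        # one and two
print(first_group)  # [one] and [two]
print(whole_match)  # {<b>one</b>} and {<b>two</b>}
print(greedy)       # ['one</b> and <b>two']
print(lazy)         # ['one', 'two']
```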

Common special characters in regular expressions

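^ matches the start of the string.
$ matches the end of the string.
\b matches a word boundary.
\d matches any digit.
\D matches any non-digit character.
x? matches an optional x (that is, x zero or one time).
x* matches x zero or more times.
x+ matches x one or more times.
x{n,m} matches x at least n times and at most m times.
(a|b|c) matches either a, b, or c.
(x) is normally a remembered (capturing) group. You can read its value with the groups() method of the match object returned by re.search.
[^>] matches any single character other than >.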
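
Each entry in the list above can be checked with a one-line call. This is only an illustrative sketch; the sample strings and variable names are made up.

```python
import re

starts = re.search(r'^abc', 'abcdef') is not None         # ^ anchors at the start
ends = re.search(r'def$', 'abcdef') is not None           # $ anchors at the end
words = re.findall(r'\bcat\b', 'cat concat cat.')         # \b: whole words only
digits = re.findall(r'\d', 'a1b22')                       # \d: digits
non_digits = re.findall(r'\D', 'a1b')                     # \D: everything else
optional = re.fullmatch(r'colou?r', 'color') is not None  # u? may be absent
star = re.findall(r'ab*', 'a ab abbb')                    # b zero or more times
plus = re.findall(r'ab+', 'a ab abbb')                    # b one or more times
ranged = re.findall(r'\bx{2,3}\b', 'x xx xxx xxxx')       # two or three x
either = re.findall(r'(a|b|c)', 'abcd')                   # alternation
groups = re.search(r'(\d+)-(\d+)', 'tel 555-1234').groups()  # remembered groups
tags = re.findall(r'<[^>]*>', '<p class="x">hi</p>')      # [^>]: anything but >

print(words)   # ['cat', 'cat']
print(ranged)  # ['xx', 'xxx']
print(groups)  # ('555', '1234')
print(tags)    # ['<p class="x">', '</p>']
```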
