Cleaning HTML with Python re
Code

import re

# Cleaning rules as (pattern, replacement) pairs, applied in order.
# Raw strings are required: in a normal string '\b' is a backspace
# character, not a regex word boundary, so those patterns never match.
RULES = [
    (r'<\bp\b[^>]*>', '<p>'),             # strip attributes from <p>
    (r'</?SPAN[^>]*>', ''),               # drop <span> tags
    (r'</?o:p>', ''),                     # drop Office <o:p> tags
    (r'</?FONT[^>]*>', ''),               # drop <font> tags
    (r'</?\bB\b[^>]*>', ''),              # drop <b>, but not <br> or <body>
    # Assumed intent: match <?...> instructions. The unescaped original
    # '<?[^>]*>' would also delete any plain text preceding a '>'.
    (r'<\?[^>]*>', ''),
    (r'</?st1:[^>]*>', ''),               # drop Office smart tags
    (r'</?\bchsdate\b[^>]*>', ''),
    (r'<\bbr\b[^>]*>', '<br>'),           # strip attributes from <br>
    (r'</?\bchmetcnv\b[^>]*>', ''),
    (r'<script[^>]*?>.*?</script>', ''),  # drop scripts with their content
]

# DOTALL only affects '.', which appears in the <script> rule alone.
COMPILED = [(re.compile(p, re.IGNORECASE | re.DOTALL), r) for p, r in RULES]

def formatHtml(html):
    for regular, repl in COMPILED:
        html = regular.sub(repl, html)
    return html

Notes on using re:
1. def sub(pattern, repl, string, count=0, flags=0): the fourth positional parameter is count, not flags, so it is easy to pass flags in the count position by mistake.
2. re.……
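The first note can be shown with a small sketch. The sample string and the names `text`, `wrong`, and `right` are made up for illustration. Passed positionally, `re.IGNORECASE` ends up in the `count` slot, where its integer value 2 is read as "replace at most two matches", and the match stays case-sensitive.

```python
import re

# Hypothetical sample input, for illustration only.
text = "<B>a</B> <b>b</b> <B>c</B>"

# Pitfall: the 4th positional argument is count, so re.IGNORECASE
# (integer value 2) limits the number of replacements to 2 and the
# pattern is still matched case-sensitively.
wrong = re.sub(r'</?b>', '', text, re.IGNORECASE)

# Fix: pass flags by keyword.
right = re.sub(r'</?b>', '', text, flags=re.IGNORECASE)

print(wrong)  # <B>a</B> b <B>c</B>
print(right)  # a b c
```

Recent Python versions appear to warn about positional `count` and `flags`, so passing both by keyword is the safer habit. A precompiled pattern, as in formatHtml above, avoids the problem because the flags go to `re.compile`.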