Thursday, March 22, 2012

Regular Expression for Tag Processing

Hi,

Last time I mentioned that regular expressions might be useful for breaking a tag into elements.  My regular expression is:
([^\s\"=]+\s*=\s*([^\s\"=]+|\"[^\s^\"]+\"))|([^\s\"=]+)
[Edit: Updating for single quotes and other improvements: ([^\s\"\'=]+\s*=\s*([^\s\"\'=]+|[\"\'][^\s\"\']+[\"\']))|([^\s\"\'=]+)]

Notice that it is awesome.  I tested it on the input data:
div class=hijkjh width=100 height = 6 src=   "hello.png" why=bec6=ause src="hello.png" src="hello.png?=yo?" src="hello.png" src="hello.png" src="hello.png" sup

. . . which is deliberately malformed in some cases.  You can test the regex at http://regexpal.com/.  Notice that the regex matches only the valid elements.  For example, in why=bec6=ause it matches why=bec6 and ause but leaves the second = unmatched.  So all I have to do to determine whether a tag is well formed is check whether anything other than whitespace is left unmatched.
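
Here is a minimal sketch of that check in Python, assuming the updated regex from the edit above.  The names TAG_ELEMENT and split_tag are made up for illustration, and the converter's own code may be organized differently.  It collects every match as an element, then removes the matches and looks at what is left.

```python
import re

# The updated regex from the post (single and double quotes allowed).
# TAG_ELEMENT is an illustrative name, not taken from the converter.
TAG_ELEMENT = re.compile(
    r"""([^\s"'=]+\s*=\s*([^\s"'=]+|["'][^\s"']+["']))|([^\s"'=]+)"""
)


def split_tag(tag_body):
    """Split the inside of a tag into elements and report well-formedness.

    Returns (elements, well_formed). The tag counts as well formed when
    nothing except whitespace is left over after removing every match.
    """
    elements = [m.group(0) for m in TAG_ELEMENT.finditer(tag_body)]
    leftover = TAG_ELEMENT.sub("", tag_body)
    return elements, leftover.strip() == ""


if __name__ == "__main__":
    print(split_tag('div class=foo width = 100 src="a.png"'))
    print(split_tag("div why=bec6=ause"))
```

The first call should report a well-formed tag with four elements, and the second should flag the stray = as unmatched.  One thing to keep in mind with this regex is that a quoted value may not contain whitespace, so something like alt="hello world" would be reported as malformed.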

I have merged these changes into the converter itself, so the new tag system should work much better.

Ian
