Thursday, March 22, 2012

Revised Parser

Hi,

I revised the parser, and it parses much better now.  I fixed a few problems, cleaned up the code, and in the process managed to make the necessary algorithms shorter.

I ended up not using regular expressions, because they were basically overkill.  All that was really necessary was to split the input stream into tags and content, instead of merely tags.  I kept most of the rest essentially the same (although I refactored the main algorithm into separate functions for cleanliness).
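As a rough illustration of the "tags and content" split described above, here is a minimal sketch in Python.  The language and the function name split_tags_and_content are assumptions made for this example, not the converter's actual code, and it naively treats the first > after a < as the end of the tag.

```python
def split_tags_and_content(html):
    """Split an HTML string into ("tag", text) and ("content", text) chunks.

    Hypothetical helper for illustration only; it does not handle a '>'
    that appears inside a quoted attribute value.
    """
    chunks = []
    pos = 0
    while pos < len(html):
        if html[pos] == "<":
            end = html.find(">", pos)
            if end == -1:
                # Unterminated tag: keep the remainder as plain content.
                chunks.append(("content", html[pos:]))
                break
            chunks.append(("tag", html[pos:end + 1]))
            pos = end + 1
        else:
            end = html.find("<", pos)
            if end == -1:
                end = len(html)
            chunks.append(("content", html[pos:end]))
            pos = end
    return chunks
```

Calling split_tags_and_content("<p>Hi</p>") yields a tag chunk, a content chunk, and a closing tag chunk, in that order.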

So, progress was made.  However, a new issue has arisen.  The parser can't properly parse something like <img src="imgs/srv?=cheese.png"> because of the equals sign in the quoted string.  In practice, the parser spits out a warning that the HTML is malformed and just outputs the tag as-is.  So, the parser is not fundamentally broken; the HTML is properly formed, and the parser just doesn't understand it.

When I was creating the parser, it never occurred to me that such situations could arise, so I deliberately had the parser look for equals signs.  In this case, the parser finds two equals signs inside what it considers one attribute.  For example, given <div class="hello=freedom">, it thinks it has gotten a <div> tag with a single attribute of the form A=B=C, which is malformed.  The parser doesn't take the quotes into account.

To fix this, I may use regular expressions.  Though, as trends are going now, I'll probably think I will and then end up not.
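One way this could be handled without regular expressions is to track quote state while splitting a tag, and then split each attribute on its first equals sign only.  The sketch below is in Python and is just one possible approach under those assumptions; split_outside_quotes and parse_tag are hypothetical names, and the code assumes a simple, non-empty opening tag.

```python
def split_outside_quotes(text):
    """Split text on whitespace, ignoring whitespace inside quotes."""
    tokens, current, quote = [], [], None
    for ch in text:
        if quote:
            # Inside a quoted value: keep everything until the closing quote.
            current.append(ch)
            if ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
            current.append(ch)
        elif ch.isspace():
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def parse_tag(tag):
    """Return (name, attributes) for a simple opening tag like <a b="c">."""
    name, *pairs = split_outside_quotes(tag.strip("<>"))
    attrs = {}
    for pair in pairs:
        # Split on the FIRST '=' only, so later ones stay in the value.
        key, sep, value = pair.partition("=")
        attrs[key] = value.strip("\"'") if sep else None
    return name, attrs
```

With this, parse_tag('<div class="hello=freedom">') gives a div tag with one class attribute whose value is hello=freedom, rather than something of the form A=B=C.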

Finally, from here on out, the parser is being called a converter.  The converter comprises a renderer, which converts GWT into HTML; a parser, which breaks that HTML up into lexical chunks; a formatter, which cleans up and formats the HTML; and a synthesizer, which tidies up and combines everything.

Ian
