Hi,
I revised the parser, and it parses much better now. I fixed a few problems, cleaned up the code some, and in the process was actually able to make the necessary algorithms shorter.
I ended up not using regular expressions, because they were basically overkill. All that was really necessary was to split the input stream into tags and content, instead of only tags. I kept most of the rest essentially the same (although I refactored the main algorithm into separate functions for cleanliness).
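To illustrate the idea of splitting input into tags and content without regular expressions, here is a minimal sketch in Python. It is not the actual parser code: the function name `split_tags_and_content` and the `("tag", ...)` / `("content", ...)` tuple shape are made up for this example, and it naively assumes no `>` appears inside a quoted attribute value.

```python
def split_tags_and_content(html):
    """Split an HTML string into ("tag", text) and ("content", text) chunks.

    Hypothetical helper for illustration: it scans for '<' and '>' with
    str.find instead of using regular expressions.
    """
    chunks = []
    pos = 0
    while pos < len(html):
        start = html.find("<", pos)
        if start == -1:
            # No more tags: everything left is content.
            chunks.append(("content", html[pos:]))
            break
        if start > pos:
            # Text between the previous tag and this one.
            chunks.append(("content", html[pos:start]))
        end = html.find(">", start)
        if end == -1:
            # Unterminated tag: pass the remainder through as content.
            chunks.append(("content", html[start:]))
            break
        chunks.append(("tag", html[start:end + 1]))
        pos = end + 1
    return chunks


chunks = split_tags_and_content("<p>Hello <b>world</b></p>")
```

Keeping the content chunks alongside the tags means the later stages can see the text between tags, rather than only the tags themselves.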
So, progress was made. However, a new issue has arisen. The parser can't properly parse something like <img src="imgs/srv?=cheese.png"> because of the equals sign inside the quoted string. In practice, the parser spits out a warning that the HTML is malformed and just outputs the tag as-is. So the parser is not fundamentally broken; the HTML is properly formed, and the parser just doesn't understand it.
When I was creating the parser, it never occurred to me that such situations could arise, so I deliberately had the parser look for equals signs. In this case, the parser finds two equals signs inside what it considers one attribute. For example, for <div class="hello=freedom">, it thinks it has gotten a <div> tag with a single attribute of the form A=B=C, which is malformed. The quotes make no difference to the parser.
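One possible way around this without regular expressions is to track quotes while splitting the tag, then split each attribute at the first equals sign only. The sketch below is an assumption about how that could look, not the fix actually used in the parser; `split_tag_parts` and `split_attribute` are hypothetical names, and self-closing tags and unquoted edge cases are ignored.

```python
def split_tag_parts(tag):
    """Split '<div class="a b">' into ['div', 'class="a b"'].

    Hypothetical helper: whitespace inside quotes does not split a part.
    """
    inner = tag.strip()[1:-1]  # drop the surrounding '<' and '>'
    parts, current, quote = [], "", None
    for ch in inner:
        if quote:
            # Inside a quoted value: keep everything until the closing quote.
            current += ch
            if ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
            current += ch
        elif ch.isspace():
            if current:
                parts.append(current)
                current = ""
        else:
            current += ch
    if current:
        parts.append(current)
    return parts


def split_attribute(attr):
    """Split 'name="value"' at the first '=' only, so later '=' stay in the value."""
    name, sep, value = attr.partition("=")
    if not sep:
        return name, None  # bare attribute with no value
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
        value = value[1:-1]  # strip the matching surrounding quotes
    return name, value


parts = split_tag_parts('<img src="imgs/srv?=cheese.png">')
name, value = split_attribute(parts[1])
```

Because `str.partition` only splits at the first separator, a value such as hello=freedom stays in one piece instead of looking like A=B=C.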
To fix this, I may use regular expressions. Though, the way things have been going, I'll probably plan to and then end up not.
Finally, from here on out, the parser is being called a converter. The converter comprises four parts: a renderer, which converts GWT into HTML; a parser, which breaks that HTML up into lexical chunks; a formatter, which cleans up the HTML and formats everything; and a synthesizer, which tidies up and combines everything.
Ian