I'm writing a SAX (Simple API for XML) implementation for AHK, modeled after SAX for Java
I have an input stream of bytes (in this case characters) that I would like to parse.
My question is basically, can anyone help me think of a good way to analyze the XML input character-by-character, turning it into meaningful tokens that I can send off to another function designed to handle them?
Let me start with an example.
An XML file:
Code:
<note>
<author name="Me" />
<title>Something</title>
<text>Some text</text>
</note>
Reading that one character at a time, I need to somehow generate the following events:
1. Start document
2. Start element (note)
3. Start element (author, including provided attribute)
4. End element (author)
5. Start element (title)
6. Character data (Something)
7. End element (title)
8. Start element (text)
9. Character Data (Some text)
10. End element (text)
11. End element (note)
12. End document
The XML language seems pretty difficult to parse in this way, because given a single character, it's very likely impossible to determine what type of element it is without reading ahead. But reading ahead would then require backtracking and re-reading if the element did not match the first type checked against.
One possibility I can think of is to create a sort of state table in AHK that, given a starting character, will determine all the possible states (element types) it could be, and then as you read each character, you narrow down the possible states until there is only one, at which point you can parse the input according to that state, then do the same thing for each following element.
Unfortunately I have not been able to come up with a good way to do this, so I'm looking for input from any of the AHK gurus here that may have some ideas.
I would like to keep performance heavily in mind, because this could potentially be used to parse very large documents (hence the reason why only a small portion of the document is kept in memory at any time, and previous elements are forgotten after their events have been generated).
Apologies if this is not at all clear--I've been buried in computer science and automaton theory all day trying to look for the best method of parsing this. I would be happy to further explain or provide better examples.
Thanks to anyone who reads this, and especially to anyone who can help me come up with a solution. Just explaining the solution will be enough; I'm happy to code it myself! Thanks...