 |
AutoHotkey Community Let's help each other out
|
| View previous topic :: View next topic |
| Author |
Message |
bmcclure
Joined: 24 Nov 2007 Posts: 774
|
Posted: Tue Jun 23, 2009 9:23 pm Post subject: Parsing XML from an input stream |
|
|
I'm writing a SAX (Simple API for XML) implementation for AHK, modeled after SAX for Java
I have an input stream of bytes (in this case characters) that I would like to parse.
My question is basically, can anyone help me think of a good way to analyze the XML input character-by-character, turning it into meaningful tokens that I can send off to another function designed to handle them?
Let me start with an example.
An XML file:
| Code: | <note>
<author name="Me" />
<title>Something</title>
<text>Some text</text>
</note> |
Reading that one character at a time, I need to somehow generate the following events:
1. Start document
2. Start element (note)
3. Start element (author, including provided attribute)
4. End element (author)
5. Start element (title)
6. Character data (Something)
7. End element (title)
8. Start element (text)
9. Character Data (Some text)
10. End element (text)
11. End element (note)
12. End document
The XML language seems pretty difficult to parse in this way, because given a single character, it's very likely impossible to determine what type of element it is without reading ahead. But reading ahead would then require backtracking and re-reading if the element did not match the first type checked against.
One possibility I can think of is to create a sort of state table in AHK that, given a starting character, will determine all the possible states (element types) it could be, and then as you read each character, you narrow down the possible states until there is only one, at which point you can parse the input according to that state, then do the same thing for each following element.
Unfortunately I have not been able to come up with a good way to do this, so I'm looking for input from any of the AHK gurus here that may have some ideas.
I would like to keep performance heavily in mind, because this could potentially be used to parse very large documents (hence the reason why only a small portion of the document is kept in memory at any time, and previous elements are forgotten after their events have been generated).
Apologies if this is not at all clear--I've been buried in computer science and automaton theory all day trying to look for the best method of parsing this. I would be happy to further explain or provide better examples.
Thanks to anyone who reads this, and especially to anyone who can help me come up with a solution. Just explaining the solution will be enough; I'm happy to code it myself! Thanks... _________________ Ben
My Trac projects
My Wiki
[Broken] - My music |
|
| Back to top |
|
 |
tank
Joined: 21 Dec 2007 Posts: 3700 Location: Louisville KY USA
|
Posted: Wed Jun 24, 2009 2:34 am Post subject: |
|
|
I see you posted on this second one.
I am curious how these implementations do not meet your needs or what your needing thats different from them before I dive in
Edit:
Reading some of your posts. have you considered taking your data stream and creating a document with it and then invoking DOM methods upon it?
Nuther Edit: Just started diving into your class library just noticed you are using COM
I know a little java as well I guess I need to understand is there some API or specific function your trying to mimic hear doesnt sound familiar to me _________________
We are troubled on every side‚ yet not distressed; we are perplexed‚
but not in despair; Persecuted‚ but not forsaken; cast down‚ but not destroyed; |
|
| Back to top |
|
 |
bmcclure
Joined: 24 Nov 2007 Posts: 774
|
Posted: Wed Jun 24, 2009 6:01 am Post subject: |
|
|
Yes, I'm implementing SAX, and am creating an XMLReader based on the XMLReader interface (to be provided by the XMLReaderFactory class). There is no official SAX API, but I'm using the SAX class reference and have created the appropriate interfaces and classes to match the Java implementation in AHK.
In answer:
1. xpath library won't do what I need, it loads the entire file and then every command essentially re-parses the document. I need something that does one pass over the document from an input stream, without loading the entire document into memory. This will allow documents of essentially unlimited (other than OS limits) size to be parsed efficiently.
2. I wrote an XMLDOM library with COM (that I mostly abandoned in favor of the DOM and SAX implementations I'm writing in pure AHK). Since I'm writing my own library, I don't want to utilize a redundant 3rd party library--I want it all to be native.
3. I plan to use this XMLReader as the back-end parser that constructs the DOM tree in my AHK DOM implementation, so loading the document into DOM and parsing it with SAX from there would create some sort of paradox
SAX and DOM are completely different for different purposes, which is why I'm implementing them both in AHK. SAX is fast, but if you need to access random parts of a document in memory, it's inefficient, and a DOM tree should be built instead. If, on the other hand, you just need to read an XML file and parse its contents, SAX is by far the better choice, as you don't need to load and maintain the entire document tree in memory.
You can use SAX to create a DOM tree (by creating a custom handler), and you can parse a DOM tree using SAX. Different approaches for different needs. For loading files, however, it makes more sense to go the SAX -> DOM route, since SAX can efficiently create the DOM tree that is fully compliant with DOM specifications.
Really, more-so than a question of creating a SAX XMLReader, is just figuring out an efficient way to parse the character stream and decode the elements without having to make multiple passes over any bytes.
Thanks very much for talking through some of this with me  _________________ Ben
My Trac projects
My Wiki
[Broken] - My music |
|
| Back to top |
|
 |
|
|
You can post new topics in this forum You can reply to topics in this forum
|
Powered by phpBB © 2001, 2005 phpBB Group
|