AutoHotkey Homepage AutoHotkey Community
Let's help each other out
 
 FAQFAQ   SearchSearch   MemberlistMemberlist   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

Parsing XML from an input stream

 
Reply to topic    AutoHotkey Community Forum Index -> Ask for Help
View previous topic :: View next topic  
Author Message
bmcclure



Joined: 24 Nov 2007
Posts: 774

PostPosted: Tue Jun 23, 2009 9:23 pm    Post subject: Parsing XML from an input stream Reply with quote

I'm writing a SAX (Simple API for XML) implementation for AHK, modeled after SAX for Java

I have an input stream of bytes (in this case characters) that I would like to parse.

My question is basically, can anyone help me think of a good way to analyze the XML input character-by-character, turning it into meaningful tokens that I can send off to another function designed to handle them?

Let me start with an example.

An XML file:
Code:
<note>
<author name="Me" />
<title>Something</title>
<text>Some text</text>
</note>


Reading that one character at a time, I need to somehow generate the following events:
1. Start document
2. Start element (note)
3. Start element (author, including provided attribute)
4. End element (author)
5. Start element (title)
6. Character data (Something)
7. End element (title)
8. Start element (text)
9. Character Data (Some text)
10. End element (text)
11. End element (note)
12. End document

The XML language seems pretty difficult to parse in this way, because given a single character, it's very likely impossible to determine what type of element it is without reading ahead. But reading ahead would then require backtracking and re-reading if the element did not match the first type checked against.

One possibility I can think of is to create a sort of state table in AHK that, given a starting character, will determine all the possible states (element types) it could be, and then as you read each character, you narrow down the possible states until there is only one, at which point you can parse the input according to that state, then do the same thing for each following element.

Unfortunately I have not been able to come up with a good way to do this, so I'm looking for input from any of the AHK gurus here that may have some ideas.

I would like to keep performance heavily in mind, because this could potentially be used to parse very large documents (hence the reason why only a small portion of the document is kept in memory at any time, and previous elements are forgotten after their events have been generated).

Apologies if this is not at all clear--I've been buried in computer science and automaton theory all day trying to look for the best method of parsing this. I would be happy to further explain or provide better examples.

Thanks to anyone who reads this, and especially to anyone who can help me come up with a solution. Just explaining the solution will be enough; I'm happy to code it myself! Thanks...
_________________
Ben

My Trac projects
My Wiki
[Broken] - My music
Back to top
View user's profile Send private message
tank



Joined: 21 Dec 2007
Posts: 3700
Location: Louisville KY USA

PostPosted: Wed Jun 24, 2009 2:34 am    Post subject: Reply with quote

Titan wrote:
xpath v3 - read and write XML documents with XPath syntax
majkinetor wrote:
Microsoft.XMLDOM
I see you posted on this second one.
I am curious how these implementations do not meet your needs or what your needing thats different from them before I dive in
Edit:
Reading some of your posts. have you considered taking your data stream and creating a document with it and then invoking DOM methods upon it?
Nuther Edit: Just started diving into your class library just noticed you are using COM
I know a little java as well I guess I need to understand is there some API or specific function your trying to mimic hear doesnt sound familiar to me
_________________

We are troubled on every side‚ yet not distressed; we are perplexed‚
but not in despair; Persecuted‚ but not forsaken; cast down‚ but not destroyed;
Back to top
View user's profile Send private message
bmcclure



Joined: 24 Nov 2007
Posts: 774

PostPosted: Wed Jun 24, 2009 6:01 am    Post subject: Reply with quote

Yes, I'm implementing SAX, and am creating an XMLReader based on the XMLReader interface (to be provided by the XMLReaderFactory class). There is no official SAX API, but I'm using the SAX class reference and have created the appropriate interfaces and classes to match the Java implementation in AHK.

In answer:
1. xpath library won't do what I need, it loads the entire file and then every command essentially re-parses the document. I need something that does one pass over the document from an input stream, without loading the entire document into memory. This will allow documents of essentially unlimited (other than OS limits) size to be parsed efficiently.

2. I wrote an XMLDOM library with COM (that I mostly abandoned in favor of the DOM and SAX implementations I'm writing in pure AHK). Since I'm writing my own library, I don't want to utilize a redundant 3rd party library--I want it all to be native.

3. I plan to use this XMLReader as the back-end parser that constructs the DOM tree in my AHK DOM implementation, so loading the document into DOM and parsing it with SAX from there would create some sort of paradox Smile

SAX and DOM are completely different for different purposes, which is why I'm implementing them both in AHK. SAX is fast, but if you need to access random parts of a document in memory, it's inefficient, and a DOM tree should be built instead. If, on the other hand, you just need to read an XML file and parse its contents, SAX is by far the better choice, as you don't need to load and maintain the entire document tree in memory.

You can use SAX to create a DOM tree (by creating a custom handler), and you can parse a DOM tree using SAX. Different approaches for different needs. For loading files, however, it makes more sense to go the SAX -> DOM route, since SAX can efficiently create the DOM tree that is fully compliant with DOM specifications.

Really, more-so than a question of creating a SAX XMLReader, is just figuring out an efficient way to parse the character stream and decode the elements without having to make multiple passes over any bytes.

Thanks very much for talking through some of this with me Smile
_________________
Ben

My Trac projects
My Wiki
[Broken] - My music
Back to top
View user's profile Send private message
Display posts from previous:   
Reply to topic    AutoHotkey Community Forum Index -> Ask for Help All times are GMT
Page 1 of 1

 
Jump to:  
You can post new topics in this forum
You can reply to topics in this forum


Powered by phpBB © 2001, 2005 phpBB Group