AutoHotkey Community

It is currently May 26th, 2012, 8:54 pm

All times are UTC [ DST ]




Post new topic Reply to topic  [ 3 posts ] 
Author Message
PostPosted: June 23rd, 2009, 10:23 pm 
Offline

Joined: November 24th, 2007, 9:07 pm
Posts: 774
I'm writing a SAX (Simple API for XML) implementation for AHK, modeled after SAX for Java

I have an input stream of bytes (in this case characters) that I would like to parse.

My question is basically, can anyone help me think of a good way to analyze the XML input character-by-character, turning it into meaningful tokens that I can send off to another function designed to handle them?

Let me start with an example.

An XML file:
Code:
<note>
<author name="Me" />
<title>Something</title>
<text>Some text</text>
</note>


Reading that one character at a time, I need to somehow generate the following events:
1. Start document
2. Start element (note)
3. Start element (author, including provided attribute)
4. End element (author)
5. Start element (title)
6. Character data (Something)
7. End element (title)
8. Start element (text)
9. Character Data (Some text)
10. End element (text)
11. End element (note)
12. End document

The XML language seems pretty difficult to parse in this way, because given a single character, it's very likely impossible to determine what type of element it is without reading ahead. But reading ahead would then require backtracking and re-reading if the element did not match the first type checked against.

One possibility I can think of is to create a sort of state table in AHK that, given a starting character, will determine all the possible states (element types) it could be, and then as you read each character, you narrow down the possible states until there is only one, at which point you can parse the input according to that state, then do the same thing for each following element.

Unfortunately I have not been able to come up with a good way to do this, so I'm looking for input from any of the AHK gurus here that may have some ideas.

I would like to keep performance heavily in mind, because this could potentially be used to parse very large documents (hence the reason why only a small portion of the document is kept in memory at any time, and previous elements are forgotten after their events have been generated).

Apologies if this is not at all clear--I've been buried in computer science and automaton theory all day trying to look for the best method of parsing this. I would be happy to further explain or provide better examples.

Thanks to anyone who reads this, and especially to anyone who can help me come up with a solution. Just explaining the solution will be enough; I'm happy to code it myself! Thanks...

_________________
Ben

My Trac projects
My Wiki
[Broken] - My music


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: June 24th, 2009, 3:34 am 
Offline
User avatar

Joined: December 21st, 2007, 3:14 pm
Posts: 3826
Location: Louisville KY USA
Titan wrote:
majkinetor wrote:
I see you posted on this second one.
I am curious how these implementations do not meet your needs or what your needing thats different from them before I dive in
Edit:
Reading some of your posts. have you considered taking your data stream and creating a document with it and then invoking DOM methods upon it?
Nuther Edit: Just started diving into your class library just noticed you are using COM
I know a little java as well I guess I need to understand is there some API or specific function your trying to mimic hear doesnt sound familiar to me

_________________
No matter what your oppinion Please join this discussion
Formal request to Polyethene
Image


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: June 24th, 2009, 7:01 am 
Offline

Joined: November 24th, 2007, 9:07 pm
Posts: 774
Yes, I'm implementing SAX, and am creating an XMLReader based on the XMLReader interface (to be provided by the XMLReaderFactory class). There is no official SAX API, but I'm using the SAX class reference and have created the appropriate interfaces and classes to match the Java implementation in AHK.

In answer:
1. xpath library won't do what I need, it loads the entire file and then every command essentially re-parses the document. I need something that does one pass over the document from an input stream, without loading the entire document into memory. This will allow documents of essentially unlimited (other than OS limits) size to be parsed efficiently.

2. I wrote an XMLDOM library with COM (that I mostly abandoned in favor of the DOM and SAX implementations I'm writing in pure AHK). Since I'm writing my own library, I don't want to utilize a redundant 3rd party library--I want it all to be native.

3. I plan to use this XMLReader as the back-end parser that constructs the DOM tree in my AHK DOM implementation, so loading the document into DOM and parsing it with SAX from there would create some sort of paradox :)

SAX and DOM are completely different for different purposes, which is why I'm implementing them both in AHK. SAX is fast, but if you need to access random parts of a document in memory, it's inefficient, and a DOM tree should be built instead. If, on the other hand, you just need to read an XML file and parse its contents, SAX is by far the better choice, as you don't need to load and maintain the entire document tree in memory.

You can use SAX to create a DOM tree (by creating a custom handler), and you can parse a DOM tree using SAX. Different approaches for different needs. For loading files, however, it makes more sense to go the SAX -> DOM route, since SAX can efficiently create the DOM tree that is fully compliant with DOM specifications.

Really, more-so than a question of creating a SAX XMLReader, is just figuring out an efficient way to parse the character stream and decode the elements without having to make multiple passes over any bytes.

Thanks very much for talking through some of this with me :)

_________________
Ben

My Trac projects
My Wiki
[Broken] - My music


Report this post
Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 3 posts ] 

All times are UTC [ DST ]


Who is online

Users browsing this forum: Bing [Bot], BrandonHotkey, coinman, oldbrother and 64 guests


You can post new topics in this forum
You can reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Powered by phpBB® Forum Software © phpBB Group