Over the years, I've done several web scraping projects which relied on AHK in some way. There are two "native" methods I know of, plus some third-party options:
For Simple Web Scraping:
- WinHttpRequest COM object, as in your attempted code above.
- InternetExplorer COM object. This is more robust than the method you tried: it entails manipulating an instance of the Internet Explorer browser via COM, which can be done silently in the background or visibly, as in:
Code:
; from AHK Help file
ie := ComObjCreate("InternetExplorer.Application")
ie.Visible := true ; This is known to work incorrectly on IE7.
ie.Navigate("https://autohotkey.com/")
- [Library] Chrome.ahk - Automate Google Chrome... is a "non-native" method which works much like the InternetExplorer COM approach.
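For reference, here is a minimal sketch of the WinHttpRequest method mentioned first (the URL is just an example):

```autohotkey
; Minimal synchronous GET via the WinHttpRequest COM object
whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
whr.Open("GET", "https://autohotkey.com/", false) ; false = synchronous
whr.Send()
MsgBox % whr.ResponseText ; raw HTML returned by the server
```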
In my experience, the above methods are suitable for simpler applications. Websites with lots of scripts, or which require authentication, link following, and keep-alive connections (I suspect Facebook is one of them), are tricky.
For Complex Web Scraping:
What worked best for me was the use of two third-party applications, one of which can be controlled directly from an AHK script through the command line:
- HTTrack command-line tool (the CLI bundled with WinHTTrack). Your AHK code might look something like this:
Code:
; assumes httrack.exe is on the system PATH
RunWait, %comspec% /c "httrack ""http://www.hidemyass.com/proxy-list/search-123456"" -O ""C:\MyScrapedWebsites"" -v -s0 -R2 -r0 -p1 -`%e0"
Loop, C:\MyScrapedWebsites\*.html, 0, 0
{
    FileRead, OutHtml, %A_LoopFileFullPath%
    MsgBox % OutHtml
}
- MetaProducts Offline Explorer, which is a full-fledged GUI app, great for login authentication.
These last two methods proved very useful to me, especially HTTrack. You can make it do anything from the command line, which means you can make it do anything from an AHK script. I used it once, for example, to mirror my friends' profiles on a dying social network. It has a GUI version available too.

Offline Explorer, on the other hand, could get the content I needed with the least amount of preparation: you just use its wizard and go. It could even get stuff I had trouble getting with HTTrack. But Offline Explorer isn't accessible to AHK in the same way HTTrack is.

With either method, I would use the third-party app to download the relevant content and, if needed, parse it with regex etc. to extract information. The key to these apps is that they "pretend" to be a browser when identifying themselves to a server; in fact, HTTrack can pretend to be several different browsers. Servers sometimes reply with different information (HTML etc.) to different types of browsers, and these apps have strategies to circumvent measures servers put in place to prevent access by scripts and bots.
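Since HTTrack can identify itself as different browsers, you can set the user-agent string it reports right from the command line. A hedged sketch (HTTrack's -F option sets the user-agent field; the URL and folder here are placeholders):

```autohotkey
; Sketch: mirror a site while identifying as Firefox.
; -F sets the user-agent string HTTrack sends; swap in whatever browser string you like.
RunWait, %comspec% /c "httrack ""https://example.com/"" -O ""C:\MyScrapedWebsites"" -F ""Mozilla/5.0 (Windows NT 10.0; rv:115.0) Gecko/20100101 Firefox/115.0"" -s0 -r2"
```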
If you consider using these tools, remember that you can easily misuse them. You could accidentally direct the app to download all of Facebook, for example, until your hard drive is full or until Facebook's servers realize what you're up to and ban your IP. Using tools like these is against the terms and conditions of many websites. You may notice a URL for a proxy list in my example code above - that's a snippet I used at the start of scripts in most projects to scrape for a proxy to use before mass-downloading some site, so it wouldn't be noticed that the same computer was downloading every day.
For Facebook Marketplace, I would try using Httrack. You will have to configure it to log in to Facebook.
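One way to handle the login is a hedged sketch based on HTTrack's support for reading a Netscape-format cookies.txt from the project folder; the cookie names and values below are placeholders you would export from a logged-in browser session:

```autohotkey
; Hedged sketch: HTTrack picks up a Netscape-format cookies.txt
; placed in the project folder (the -O path) before mirroring.
; Fields are TAB-separated: domain, include-subdomains, path,
; secure, expiry (unix time), name, value - all values here are placeholders.
FileDelete, C:\MyScrapedWebsites\cookies.txt
FileAppend,
(
# Netscape HTTP Cookie File
.facebook.com	TRUE	/	TRUE	1999999999	c_user	YOUR_VALUE_HERE
.facebook.com	TRUE	/	TRUE	1999999999	xs	YOUR_VALUE_HERE
), C:\MyScrapedWebsites\cookies.txt
```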