Over the years, I've done several web scraping projects which relied on AHK in some way. There are two "native" methods I know of, plus some third-party options:
For Simple Web Scraping:
- WinHttpRequest COM object, as in your attempted code above.
- InternetExplorer COM object. This is more robust than the method you tried: it entails manipulating an instance of the Internet Explorer browser via COM, which can be done silently in the background or visibly, as in:
Code:
; from AHK Help file
ie := ComObjCreate("InternetExplorer.Application")
ie.Visible := true ; This is known to work incorrectly on IE7.
ie.Navigate("https://autohotkey.com/")
- [Library] Chrome.ahk - Automate Google Chrome... is a "non-native" method which works much like the InternetExplorer COM approach.
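For reference, here is a minimal sketch of the WinHttpRequest method mentioned first (the URL is just an example):

```autohotkey
; Minimal synchronous GET via the WinHttpRequest COM object
whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
whr.Open("GET", "https://autohotkey.com/", false) ; false = synchronous
whr.Send()
MsgBox % whr.ResponseText ; raw HTML returned by the server
```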
In my experience, the above methods are suitable for simpler applications. Websites with lots of scripts, or which require authentication, link following, and keep-alive connections (I suspect Facebook is one of them), are tricky.
For Complex Web Scraping:
What worked best for me was the use of two third-party applications, one of which can be controlled directly from an AHK script through the command line:
- HTTrack command-line tool (the CLI bundled with WinHTTrack). Your AHK code might look something like this:
Code:
; assumes httrack.exe is on the system PATH
RunWait, %comspec% /c "httrack ""http://www.hidemyass.com/proxy-list/search-123456"" -O ""C:\MyScrapedWebsites"" -v -s0 -R2 -r0 -p1 -`%e0"
Loop, C:\MyScrapedWebsites\*.html, 0, 0
{
    FileRead, OutHtml, %A_LoopFileFullPath%
    MsgBox % OutHtml
}
- MetaProducts Offline Explorer, which is a full-fledged GUI app, great for login authentication.
These last two methods proved very useful to me, especially HTTrack. You can make it do anything from the command line, which means you can make it do anything from an AHK script. I used it once, for example, to mirror my friends' profiles on a dying social network. It has a GUI version available too.

Offline Explorer, on the other hand, could get the content I needed with the least amount of preparation: you just use its wizard and go. It could even get stuff I had trouble getting with HTTrack. But Offline Explorer isn't accessible to AHK in the same way HTTrack is.

With either method, I would use the third-party app to download the relevant content and, if needed, parse it with regex etc. to extract information. The key to these apps is that they "pretend" to be a browser when identifying themselves to a server; in fact, HTTrack can pretend to be several different browsers. Servers sometimes reply with different information (HTML etc.) to different types of browsers, and these apps have strategies to circumvent measures servers put in place to prevent access by scripts and bots.
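Since HTTrack can identify itself as different browsers, you can set the user-agent string it reports right from the command line. A hedged sketch (HTTrack's -F option sets the user-agent field; the URL and folder here are placeholders):

```autohotkey
; Sketch: mirror a site while identifying as Firefox.
; -F sets the user-agent string HTTrack sends; swap in whatever browser string you like.
RunWait, %comspec% /c "httrack ""https://example.com/"" -O ""C:\MyScrapedWebsites"" -F ""Mozilla/5.0 (Windows NT 10.0; rv:115.0) Gecko/20100101 Firefox/115.0"" -s0 -r2"
```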
If you consider using these tools, remember that you can easily misuse them. You could accidentally direct the app to download all of Facebook, for example, until your hard drive is full or until Facebook's servers realize what you're up to and ban your IP. Using tools like these is against the terms and conditions of many websites. You may notice a URL for a proxy list in my example code above - that's a snippet I used at the start of scripts in most projects to scrape for a proxy to use before mass-downloading some site, so it wouldn't be noticed that the same computer was downloading every day.
For Facebook Marketplace, I would try using Httrack. You will have to configure it to log in to Facebook.
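One way to handle the login is a hedged sketch based on HTTrack's support for reading a Netscape-format cookies.txt from the project folder; the cookie names and values below are placeholders you would export from a logged-in browser session:

```autohotkey
; Hedged sketch: HTTrack picks up a Netscape-format cookies.txt
; placed in the project folder (the -O path) before mirroring.
; Fields are TAB-separated: domain, include-subdomains, path,
; secure, expiry (unix time), name, value - all values here are placeholders.
FileDelete, C:\MyScrapedWebsites\cookies.txt
FileAppend,
(
# Netscape HTTP Cookie File
.facebook.com	TRUE	/	TRUE	1999999999	c_user	YOUR_VALUE_HERE
.facebook.com	TRUE	/	TRUE	1999999999	xs	YOUR_VALUE_HERE
), C:\MyScrapedWebsites\cookies.txt
```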