Lynda.com workaround

Portwolf
Posts: 161
Joined: 08 Oct 2018, 12:57

Lynda.com workaround

12 Feb 2019, 05:56

Hi there peeps :)

I've got a small question for those web-scraping guys out there.
I wanted to get a monthly overview of new Lynda.com courses. I asked them if they have any way to provide that info, and they refused to help, soooo, I'm going to scrape the courses page :)

I found that "https://www.lynda.com/allcourses" lists the courses by latest added, and that's great by me.
So, I'd just download the page to a file once per day at midnight and presto, every month I'd have the daily added courses.
Thing is, how can I compile all those HTML files into one single "report" of the courses added?
I know that each page has an H3 tag with the course name, so I could extract all the H3 names into a file and just weed out the dupes.

Do you know any easier way?
Portwolf
Posts: 161
Joined: 08 Oct 2018, 12:57

Re: Lynda.com workaround

12 Feb 2019, 06:26

OK, so another approach I think might be possible is to scrape the listing itself, by the "li" tags in the HTML, saving them to a file on a daily basis, then using Sort to remove dupes.

What do you think?
gregster
Posts: 3619
Joined: 30 Sep 2013, 06:48

Re: Lynda.com workaround

12 Feb 2019, 10:44

Sounds like the way to go. Focusing on H3 should give you the names of the courses (and urls, if you like).
If you are also interested in the description and course_id, it makes more sense to widen the scope to the <li>s and extract data from these.
Portwolf
Posts: 161
Joined: 08 Oct 2018, 12:57

Re: Lynda.com workaround

12 Feb 2019, 11:57

Well, this could probably be optimized, and I'm horrible at that....
But, it works.

Once an hour it gives me a file with the links to the trainings.
The proggie runs hourly because the website changes the order of the courses randomly, so, after removing all the dupes at the end of the month, it gives me the clean list :)

Code: Select all

#NoEnv  ; Recommended for performance and compatibility with future AutoHotkey releases.
; #Warn  ; Enable warnings to assist with detecting common errors.
SendMode Input  ; Recommended for new scripts due to its superior speed and reliability.
SetWorkingDir %A_ScriptDir%  ; Ensures a consistent starting directory.
#SingleInstance Force
;~ #NoTrayIcon
#Persistent

SetTimer, RunLynda, 3600000 ; 1 RUN PER HOUR
return

RunLynda:
{
FileCreateDir, logs
UrlDownloadToFile, https://www.lynda.com/allcourses , daily.html

	SourceFile := "daily.html"
	DestFile := "ExtractedLinks.txt"
	SecondClean := "second.txt"
	LastSort := "third.txt"
	ClearNames := "Cleared.txt"
	SavedLog = logs\LyndaLinks_%A_Now%.txt
	Clear := "www.linkedin.com/learning/"
	Trash := "www.linkedin.com/learning/subscription/"

	
		Loop, read, %SourceFile%, %DestFile%
		{
			URLSearchString = %A_LoopReadLine%
			Gosub, URLSearch
		}


	; Sort the extracted links and drop exact duplicates.
	FileRead, var, %DestFile%
	Sort, var, U

	; Keep only the lines that contain a course link.
	Loop, Parse, var, `n, `r
	{
		If InStr(A_LoopField, Clear)
			New_data .= A_LoopField "`n"
	}
	FileAppend, %New_data%, %SecondClean%
	New_data := ""
	var := ""

	; Drop the subscription links.
	FileRead, var, %SecondClean%
	Loop, Parse, var, `n, `r
	{
		If not InStr(A_LoopField, Trash)
			New_data .= A_LoopField "`n"
	}
	FileAppend, %New_data%, %LastSort%
	New_data := ""
	var := ""

	; Write the final sorted list to the timestamped log.
	FileRead, Contents, %LastSort%
	if not ErrorLevel
	{
		Sort, Contents
		FileAppend, %Contents%, %SavedLog%
		Contents := ""
	}

	; Remove the temporary files.
	FileDelete, %DestFile%
	FileDelete, %SecondClean%
	FileDelete, %LastSort%
	FileDelete, %SourceFile%
	return


URLSearch:
	StringGetPos, URLStart1, URLSearchString, www.

	URLStart = %URLStart1%
		Loop
		{
			ArrayElement := URLStart%A_Index%
			if ArrayElement =
				break
			if ArrayElement = -1
				continue
			if URLStart = -1
				URLStart = %ArrayElement%
			else
			{
				if ArrayElement <> -1
					if ArrayElement < %URLStart%
						URLStart = %ArrayElement%
			}
		}

	if URLStart = -1
		return

	StringTrimLeft, URL, URLSearchString, %URLStart%
		Loop, parse, URL, %A_Tab%%A_Space%<>
		{
			URL = %A_LoopField%
			break
		}
		
		StringReplace, URLCleansed, URL, ",, All
		FileAppend, %URLCleansed%`n
		LinkCount += 1

		StringLen, CharactersToOmit, URL
		CharactersToOmit += %URLStart%
		StringTrimLeft, URLSearchString, URLSearchString, %CharactersToOmit%
	
	Gosub, URLSearch
return
}
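For the month-end step mentioned above (merging the hourly files and removing the dupes), a minimal sketch could look like the following. It assumes the logs\LyndaLinks_*.txt naming used by the script; the output name MonthlyReport.txt and the helper DedupLines() are made up for illustration, and Loop, Files needs AutoHotkey v1.1.21 or later.

```autohotkey
#NoEnv
SetWorkingDir %A_ScriptDir%

; Collect the contents of every hourly log into one string.
merged := ""
Loop, Files, logs\LyndaLinks_*.txt
{
    FileRead, chunk, %A_LoopFileFullPath%
    if !ErrorLevel
        merged .= chunk . "`n"
}

; Write the de-duplicated list to a hypothetical report file.
report := DedupLines(merged)
FileDelete, MonthlyReport.txt
FileAppend, %report%, MonthlyReport.txt
return

; Returns the lines of text sorted, with duplicates and blank edge lines removed.
DedupLines(text) {
    text := StrReplace(text, "`r")   ; normalise CRLF line endings to LF
    Sort, text, U                    ; sort and drop duplicates (case-insensitive by default)
    return Trim(text, "`n")          ; strip leading/trailing empty lines
}
```

Since Sort with the U option compares case-insensitively unless C is added, two links that differ only in letter case would be treated as the same entry here.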
Portwolf
Posts: 161
Joined: 08 Oct 2018, 12:57

Re: Lynda.com workaround

19 Feb 2019, 04:12

Aaaaaand all of a sudden, it does not work anymore..
Does anyone know a way of reading the course list (links) from https://www.lynda.com/allcourses
to a file?
teadrinker
Posts: 1130
Joined: 29 Mar 2015, 09:41

Re: Lynda.com workaround

19 Feb 2019, 04:51

Try this:

Code: Select all

url := "https://www.lynda.com/allcourses/"
output := A_ScriptDir . "\cources.txt"

html := GetHtml(url) ; this may take time
cources := GetCources(html)
SaveToFile(output, cources)
Run, % output

GetHtml(url) {
   oWhr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
   oWhr.Option(6) ; note: this only reads option 6 (EnableRedirects); "oWhr.Option(6) := false" would actually disable redirects
   oWhr.Open("GET", url, false)
   oWhr.Send()
   Return html := oWhr.ResponseText
}

GetCources(html) {
   doc := ComObjCreate("htmlfile")
   doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=9"">")
   doc.write(html)
   courceContainer := doc.querySelector("ul.course-list")
   items := courceContainer.getElementsByTagName("h3")
   Loop % items.length {
      item := items[A_Index - 1]
      itemText := item.innerText
      itemLink := item.getElementsByTagName("a")[0].getAttribute("href")
      text .= (text ? "`r`n`r`n" : "") . "Title: " . itemText . "`r`nLink:  " . itemLink
   }
   Return text
}

SaveToFile(filePath, string) {
   oFile := FileOpen(filePath, "w")
   oFile.Write(string)
   oFile.Close()
}
Portwolf
Posts: 161
Joined: 08 Oct 2018, 12:57

Re: Lynda.com workaround

19 Feb 2019, 07:31

And.. You blew my mind.
Really well played! This is just perfect!
This even does what I was doing manually after the scraping, which was getting the course title from the link!

Boss move right there mate :D
Kudos!
teadrinker
Posts: 1130
Joined: 29 Mar 2015, 09:41

Re: Lynda.com workaround

19 Feb 2019, 11:08

It's better to replace

Code: Select all

GetHtml(url) {
   oWhr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
   oWhr.Option(6) ; no redirect
   oWhr.Open("GET", url, false)
   oWhr.Send()
   Return html := oWhr.ResponseText
}
with

Code: Select all

GetHtml(url) {
   oWhr := ComObjCreate("Msxml2.XMLHTTP")
   oWhr.Open("GET", url, false)
   oWhr.SetRequestHeader("Pragma", "no-cache")
   oWhr.SetRequestHeader("Cache-Control", "no-cache, no-store")
   oWhr.Send()
   Return html := oWhr.ResponseText
}
Portwolf
Posts: 161
Joined: 08 Oct 2018, 12:57

Re: Lynda.com workaround

20 Feb 2019, 02:41

Thanks mate, I will be testing that shortly.
I've noticed a curious thing... When I'm running the script on my PC, I get perfect results.
If I run it on another one, I get some errors.

I've made a couple of changes because I went for titles only, but with your vanilla script, the result is the same.
I'm getting:

Error: 0x80072EE2
WinHttp.WinHttpRequest
Operation Timed Out
Specifically: Send, on oWhr.send()

Code i'm using:

Code: Select all

#NoEnv  ; Recommended for performance and compatibility with future AutoHotkey releases.
; #Warn  ; Enable warnings to assist with detecting common errors.
SendMode Input  ; Recommended for new scripts due to its superior speed and reliability.
SetWorkingDir %A_ScriptDir%  ; Ensures a consistent starting directory.

FileCreateDir, logs

;»»»»»»» Drag GUI with Mouse
OnMessage(0x201, "WM_LBUTTONDOWN")
	WM_LBUTTONDOWN()
	{
		PostMessage, 0xA1, 2
	}

	Gui, B:New, +LastFound +OwnDialogs -SysMenu +toolwindow
	Gui, B:Add, button, x10 yp+10 w60 gGoTime, Start
	Gui, B:Add, button, x+20 yp+0 w60 gTerminate, Exit
	Gui, B:Add, Text, x10 yp+30, Lynda.com Scraping Util:
	Gui, B:Add, Text, x+10 yp+0 w50 vStatus
	Gui, B:Show, w200
return

Terminate:
	ExitApp
return

GoTime:
   GuiControl,, Status, ON ; no quotes here, they would be displayed literally
loop{
	url := "https://www.lynda.com/allcourses/"
	output = logs\%A_Now%.txt
	html := GetHtml(url) ; this may take time
	cources := GetCources(html)
	SaveToFile(output, cources)

	Sleep 10000

	url :=
	output := 
	html :=
	cources :=
}


GetHtml(url) {
   oWhr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
   oWhr.Option(6) ; note: this only reads option 6 (EnableRedirects); "oWhr.Option(6) := false" would actually disable redirects
   oWhr.Open("GET", url, false)
   oWhr.Send()
   Return html := oWhr.ResponseText
}

GetCources(html) {
   doc := ComObjCreate("htmlfile")
   doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=9"">")
   doc.write(html)
   courceContainer := doc.querySelector("ul.course-list")
   items := courceContainer.getElementsByTagName("h3")
   Loop % items.length {
      item := items[A_Index - 1]
      itemText := item.innerText
      itemLink := item.getElementsByTagName("a")[0].getAttribute("href")
      text .= (text ? "`r`n" : "") . itemText
	        ;~ text .= (text ? "`r`n`r`n" : "") . "Title: " . itemText . "`r`nLink:  " . itemLink
   }
   Return text
}

SaveToFile(filePath, string) {
   oFile := FileOpen(filePath, "w")
   oFile.Write(string)
   oFile.Close()
}
Am I messing something up?
If I run it on my PC, no problems.
If I run it on another PC, it gives that error.

Other PC has a VPN, and it's the only thing that is different.
But, I can access the website on it no problem as well...
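One thing that might be worth trying for the 0x80072EE2 timeout is giving the request more time and catching the failure instead of letting the script stop. The sketch below is only an assumption about what could help, not a confirmed fix: it uses the SetTimeouts method of WinHttp.WinHttpRequest.5.1 (resolve, connect, send and receive timeouts, in milliseconds), and the 120000 ms default for the hypothetical timeoutMs parameter is an arbitrary value.

```autohotkey
; Variant of GetHtml() with explicit timeouts and basic error handling.
; Returns the page text, or an empty string if the request fails.
GetHtml(url, timeoutMs := 120000) {
    oWhr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
    ; SetTimeouts(resolve, connect, send, receive), all in milliseconds
    oWhr.SetTimeouts(timeoutMs, timeoutMs, timeoutMs, timeoutMs)
    try {
        oWhr.Open("GET", url, false)
        oWhr.Send()
    } catch e {
        return ""   ; timed out, could not connect, or the URL was rejected
    }
    if (oWhr.Status != 200)
        return ""   ; anything other than a plain "200 OK" response
    return oWhr.ResponseText
}
```

With this version the calling loop can check for an empty result and simply skip that run, for example `if (html = "")` followed by `continue`, rather than stopping on the error dialog.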
teadrinker
Posts: 1130
Joined: 29 Mar 2015, 09:41

Re: Lynda.com workaround

20 Feb 2019, 03:17

Replacing GetHtml() should fix the problem.
Portwolf
Posts: 161
Joined: 08 Oct 2018, 12:57

Re: Lynda.com workaround

20 Feb 2019, 03:30

It is replaced in this latest version, but the issue still remains. That's what I was finding strange.
teadrinker
Posts: 1130
Joined: 29 Mar 2015, 09:41

Re: Lynda.com workaround

20 Feb 2019, 04:14

What does the error message look like after replacing GetHtml()?
Portwolf
Posts: 161
Joined: 08 Oct 2018, 12:57

Re: Lynda.com workaround

20 Feb 2019, 05:12

Same as above.
Using the code in the latest post gives me this result on other PCs.
But not on all of them: I've tested on my teammate's laptop and it works as intended.

Printscreen:
https://imgur.com/a/2Ce1oKj
teadrinker
Posts: 1130
Joined: 29 Mar 2015, 09:41

Re: Lynda.com workaround

20 Feb 2019, 05:12

Anyway, you could try this option:

Code: Select all

url := "https://www.lynda.com/allcourses/"
output := A_ScriptDir . "\cources.txt"

html := GetHtml(url) ; this may take time
cources := GetCources(html)
SaveToFile(output, cources)
Run, % output

GetHtml(url) {
   size := LoadDataFromUrl(url, buff)
   Return html := StrGet(&buff, size, "utf-8")
}

GetCources(html) {
   doc := ComObjCreate("htmlfile")
   doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=9"">")
   doc.write(html)
   courceContainer := doc.querySelector("ul.course-list")
   items := courceContainer.getElementsByTagName("h3")
   Loop % items.length {
      item := items[A_Index - 1]
      itemText := item.innerText
      itemLink := item.getElementsByTagName("a")[0].getAttribute("href")
      text .= (text ? "`r`n`r`n" : "") . "Title: " . itemText . "`r`nLink:  " . itemLink
   }
   Return text
}

SaveToFile(filePath, string) {
   oFile := FileOpen(filePath, "w")
   oFile.Write(string)
   oFile.Close()
}

LoadDataFromUrl(url, ByRef buff)  {
   static INTERNET_OPEN_TYPE_DIRECT := 1
        , flag1 := (INTERNET_FLAG_RELOAD := 0x80000000)
                 | (INTERNET_FLAG_IGNORE_CERT_DATE_INVALID := 0x2000)
                 | (INTERNET_FLAG_IGNORE_CERT_CN_INVALID := 0x1000)
                 | (INTERNET_FLAG_PRAGMA_NOCACHE := 0x100 )
                 | (INTERNET_FLAG_NO_CACHE_WRITE := 0x04000000)
        , flag2 := (HTTP_QUERY_FLAG_NUMBER := 0x20000000) | (HTTP_QUERY_CONTENT_LENGTH := 5)
        , userAgent := "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"
        
   if !hLib := DllCall("LoadLibrary", Str, "Wininet.dll")
      throw Exception("Can't load Wininet.dll")
   
   if !hInternet := DllCall("Wininet\InternetOpen", Str, userAgent, UInt, INTERNET_OPEN_TYPE_DIRECT, Ptr, 0, Ptr, 0, UInt, 0, Ptr) {
      DllCall("FreeLibrary", Ptr, hLib)
      throw Exception("InternetOpen failed")
   }
   Loop 1 {
      if !hUrl := DllCall("Wininet\InternetOpenUrl", Ptr, hInternet, Str, url, Ptr, 0, UInt, 0, UInt, flag1, Ptr, 0, Ptr) {
         error := "InternetOpenUrl failed"
         break
      }
      if !DllCall("Wininet\HttpQueryInfo", Ptr, hUrl, UInt, flag2, UIntP, fullSize, UIntP, l := 4, UIntP, idx := 0) {
         error := "HttpQueryInfo failed"
         break
      }
      VarSetCapacity(buff, fullSize, 0), bytesRead := 0
      while DllCall("Wininet\InternetQueryDataAvailable", Ptr, hUrl, UIntP, size, UInt, 0, Ptr, 0) && size > 0  {
         DllCall("Wininet\InternetReadFile", Ptr, hUrl, Ptr, &buff + bytesRead, UInt, size, UIntP, read)
         bytesRead += read
      }
      DllCall("Wininet.dll\InternetCloseHandle", Ptr, hUrl)
   }
   DllCall("Wininet\InternetCloseHandle", Ptr, hInternet)
   DllCall("FreeLibrary", Ptr, hLib)
   if error
      throw Exception(error)
   Return bytesRead
}
If this doesn't work, I'm not sure I can help.
Portwolf wrote: Other PC has a VPN, and it's the only thing that is different.
I don't know how to work with a VPN properly.
Portwolf
Posts: 161
Joined: 08 Oct 2018, 12:57

Re: Lynda.com workaround

20 Feb 2019, 05:20

It is not a VPN issue: I have tried another PC that is not on the VPN, and it throws the same timeout errors.
I will now try this solution :)
Let's see :D

Btw, major kudos buddy, you're a real life-saver!



Edit:
Line #
064: Throw, Exception(error)

Using your code, without any change.
Works perfectly on my PC, running W10. Does not run on the other machine, also running W10.
Don't know.. Maybe I'll work around this by running it on individual machines every shift (24h team) instead of keeping it scraping every 24h on a dedicated PC..
teadrinker
Posts: 1130
Joined: 29 Mar 2015, 09:41

Re: Lynda.com workaround

20 Feb 2019, 05:57

Portwolf wrote: Line #
064: Throw, Exception(error)
Update your AHK to the current version.
Two more options:

Code: Select all

url := "https://www.lynda.com/allcourses/"
output := A_ScriptDir . "\cources.txt"

html := GetHtml(url) ; this may take time
cources := GetCources(html)
SaveToFile(output, cources)
Run, % output

GetHtml(url) {
   oWhr := ComObjCreate("Msxml2.ServerXMLHTTP.6.0")
   oWhr.setTimeouts(120000, 120000, 120000, 120000)
   oWhr.Open("GET", url, false)
   oWhr.SetRequestHeader("Pragma", "no-cache")
   oWhr.SetRequestHeader("Cache-Control", "no-cache, no-store")
   oWhr.Send()
   Return html := oWhr.ResponseText
}

GetCources(html) {
   doc := ComObjCreate("htmlfile")
   doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=9"">")
   doc.write(html)
   courceContainer := doc.querySelector("ul.course-list")
   items := courceContainer.getElementsByTagName("h3")
   Loop % items.length {
      item := items[A_Index - 1]
      itemText := item.innerText
      itemLink := item.getElementsByTagName("a")[0].getAttribute("href")
      text .= (text ? "`r`n`r`n" : "") . "Title: " . itemText . "`r`nLink:  " . itemLink
   }
   Return text
}

SaveToFile(filePath, string) {
   oFile := FileOpen(filePath, "w")
   oFile.Write(string)
   oFile.Close()
}
...

Code: Select all

url := "https://www.lynda.com/allcourses/"
output := A_ScriptDir . "\cources.txt"

html := GetHtml(url) ; this may take time
cources := GetCources(html)
SaveToFile(output, cources)
Run, % output

GetHtml(url) {
   size := LoadDataFromUrl(url, buff)
   Return html := StrGet(&buff, size, "utf-8")
}

GetCources(html) {
   doc := ComObjCreate("htmlfile")
   doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=9"">")
   doc.write(html)
   courceContainer := doc.querySelector("ul.course-list")
   items := courceContainer.getElementsByTagName("h3")
   Loop % items.length {
      item := items[A_Index - 1]
      itemText := item.innerText
      itemLink := item.getElementsByTagName("a")[0].getAttribute("href")
      text .= (text ? "`r`n`r`n" : "") . "Title: " . itemText . "`r`nLink:  " . itemLink
   }
   Return text
}

SaveToFile(filePath, string) {
   oFile := FileOpen(filePath, "w")
   oFile.Write(string)
   oFile.Close()
}

LoadDataFromUrl(url, ByRef buff)  {
   static INTERNET_OPEN_TYPE_DIRECT := 1, INTERNET_OPTION_CONNECT_TIMEOUT := 2
        , flag1 := (INTERNET_FLAG_RELOAD := 0x80000000)
                 | (INTERNET_FLAG_IGNORE_CERT_DATE_INVALID := 0x2000)
                 | (INTERNET_FLAG_IGNORE_CERT_CN_INVALID := 0x1000)
                 | (INTERNET_FLAG_PRAGMA_NOCACHE := 0x100 )
                 | (INTERNET_FLAG_NO_CACHE_WRITE := 0x04000000)
        , flag2 := (HTTP_QUERY_FLAG_NUMBER := 0x20000000) | (HTTP_QUERY_CONTENT_LENGTH := 5)
        , userAgent := "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"
        
   if !hLib := DllCall("LoadLibrary", Str, "Wininet.dll")
      throw Exception("Can't load Wininet.dll")
   
   if !hInternet := DllCall("Wininet\InternetOpen", Str, userAgent, UInt, INTERNET_OPEN_TYPE_DIRECT, Ptr, 0, Ptr, 0, UInt, 0, Ptr) {
      DllCall("FreeLibrary", Ptr, hLib)
      throw Exception("InternetOpen failed")
   }
   DllCall("Wininet\InternetSetOption", Ptr, hInternet, UInt, INTERNET_OPTION_CONNECT_TIMEOUT, UIntP, t := 120000, UInt, 4)
   Loop 1 {
      if !hUrl := DllCall("Wininet\InternetOpenUrl", Ptr, hInternet, Str, url, Ptr, 0, UInt, 0, UInt, flag1, Ptr, 0, Ptr) {
         error := "InternetOpenUrl failed"
         break
      }
      if !DllCall("Wininet\HttpQueryInfo", Ptr, hUrl, UInt, flag2, UIntP, fullSize, UIntP, l := 4, UIntP, idx := 0) {
         error := "HttpQueryInfo failed"
         break
      }
      VarSetCapacity(buff, fullSize, 0), bytesRead := 0
      while DllCall("Wininet\InternetQueryDataAvailable", Ptr, hUrl, UIntP, size, UInt, 0, Ptr, 0) && size > 0  {
         DllCall("Wininet\InternetReadFile", Ptr, hUrl, Ptr, &buff + bytesRead, UInt, size, UIntP, read)
         bytesRead += read
      }
      DllCall("Wininet.dll\InternetCloseHandle", Ptr, hUrl)
   }
   DllCall("Wininet\InternetCloseHandle", Ptr, hInternet)
   DllCall("FreeLibrary", Ptr, hLib)
   if error
      throw Exception(error)
   Return bytesRead
}
Portwolf
Posts: 161
Joined: 08 Oct 2018, 12:57

Re: Lynda.com workaround

20 Feb 2019, 06:18

Seriously.... I am the guy that is always on top of software updates at my company, and I get rekt by it without even noticing....

:facepalm: :facepalm: :facepalm: :facepalm: :facepalm:

I shall update now and retry. This was soooo awkward.

:lol:
Portwolf
Posts: 161
Joined: 08 Oct 2018, 12:57

Re: Lynda.com workaround

20 Feb 2019, 08:26

Nope.... Still the same issue.
I'm thinking it might be a problem with the software on the computers, like missing libraries or .NET or something like that (I am not very knowledgeable on the terms).
All the versions of the code are throwing time-out errors, so....
Anyway, i thank you for your help, awesome solutions.
teadrinker
Posts: 1130
Joined: 29 Mar 2015, 09:41

Re: Lynda.com workaround

20 Feb 2019, 09:55

No idea what the problem could be. For me, all the versions work on Windows 10 and most of them on Windows 7, though the errors there are different. The variant with LoadDataFromUrl() works everywhere. Maybe someone else can give a clue.
