Hi there, peeps!
I've got a small question for the web-scraping folks out there.
I wanted a monthly overview of new Lynda.com courses. I asked them if they could provide that info and they refused to help, soooo, I'm going to scrape the courses page myself.
I found that "https://www.lynda.com/allcourses" lists the courses by latest added, and that's great by me.
So I'd just download the page to a file once per day at midnight, and presto, every month I'd have the daily added courses.
Thing is, how can I compile all those HTML files into one single "report" of the courses added?
I know that each page has an H3 tag with the course name, so I could extract all the H3 names into a file and just weed out the dupes.
Do you know of any easier way?
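Since each daily dump is just an HTML file with the course name in an H3, the month-end merge can be one short script: pull every H3 from every file and keep the first occurrence of each title. A minimal sketch in Python, purely to illustrate the idea — the markup shape assumed here is a guess, not Lynda's actual pages:

```python
# Merge a month of daily HTML dumps into one de-duplicated course list.
# Assumes each file contains course names inside <h3>...</h3> tags.
import re

def extract_h3_titles(html):
    # Crude but dependency-free: grab the text inside every <h3> pair,
    # then strip any nested tags (e.g. a wrapping <a>).
    return [re.sub(r"<[^>]+>", "", m).strip()
            for m in re.findall(r"<h3[^>]*>(.*?)</h3>", html, re.S)]

def merge_reports(htmls):
    # Weed out the dupes while preserving first-seen order.
    seen, report = set(), []
    for html in htmls:
        for title in extract_h3_titles(html):
            if title and title not in seen:
                seen.add(title)
                report.append(title)
    return report
```

First-seen order is preserved, so the merged report lists courses in the order they first appeared in the daily dumps.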
Lynda.com workaround
Re: Lynda.com workaround
OK, so another approach I think might be possible: scrape the listing itself via the <li> tags in the HTML, save them to a file on a daily basis, then use Sort to remove the dupes.
What do you think?
Re: Lynda.com workaround
Sounds like the way to go. Focusing on the H3s should give you the names of the courses (and URLs, if you like).
If you are also interested in the description and course_id, it makes more sense to widen the scope to the <li>s and extract the data from those.
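To illustrate the wider scope: each <li> can yield title, URL, and id together in one pass. A hypothetical Python sketch — the data-course-id attribute and the overall markup shape are assumptions, not Lynda's actual HTML:

```python
# Parse a course listing where each course sits in its own <li>.
# The attribute names below (e.g. data-course-id) are assumed, not real.
import re

def parse_course_items(html):
    items = []
    # Keep the whole <li ...>...</li> match so attributes on the tag
    # itself (like a hypothetical data-course-id) are searchable too.
    for li in re.findall(r"<li[^>]*>.*?</li>", html, re.S):
        title = re.search(r"<h3[^>]*>(.*?)</h3>", li, re.S)
        link = re.search(r'href="([^"]+)"', li)
        course_id = re.search(r'data-course-id="([^"]+)"', li)  # assumed attribute
        items.append({
            "title": re.sub(r"<[^>]+>", "", title.group(1)).strip() if title else None,
            "url": link.group(1) if link else None,
            "id": course_id.group(1) if course_id else None,
        })
    return items
```

A real page would need the selectors adjusted to whatever the listing actually contains; the point is just that one <li>-level pass collects all the fields together.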
Re: Lynda.com workaround
Well, this could probably be optimized, and I'm horrible at that...
But it works.
Once an hour it gives me a file with the links to the trainings.
The script runs hourly because the website randomly reshuffles the order of the courses, so after removing all the dupes at the end of the month, it gives me the clean list.
Code:
#NoEnv ; Recommended for performance and compatibility with future AutoHotkey releases.
; #Warn ; Enable warnings to assist with detecting common errors.
SendMode Input ; Recommended for new scripts due to its superior speed and reliability.
SetWorkingDir %A_ScriptDir% ; Ensures a consistent starting directory.
#SingleInstance Force
;~ #NoTrayIcon
#Persistent

SetTimer, RunLynda, 3600000 ; one run per hour
return

RunLynda:
    FileCreateDir, logs
    UrlDownloadToFile, https://www.lynda.com/allcourses, daily.html
    SourceFile := "daily.html"
    DestFile := "ExtractedLinks.txt"
    SecondClean := "second.txt"
    LastSort := "third.txt"
    ClearNames := "Cleared.txt" ; currently unused
    SavedLog = logs\LyndaLinks_%A_Now%.txt
    Clear := "www.linkedin.com/learning/"
    Trash := "www.linkedin.com/learning/subscription/"

    ; Extract every URL from the downloaded page into DestFile.
    Loop, read, %SourceFile%, %DestFile%
    {
        URLSearchString = %A_LoopReadLine%
        Gosub, URLSearch
    }

    ; Keep only the course links, dropping duplicates along the way.
    FileRead, var, %DestFile%
    Sort, var, U ; U = remove duplicate lines
    Loop, Parse, var, `n, `r
        If InStr(A_LoopField, Clear)
            New_data .= A_LoopField "`n"
    FileAppend, %New_data%, %SecondClean%
    New_data := ""
    var := ""

    ; Throw away the subscription links.
    FileRead, var, %SecondClean%
    Loop, Parse, var, `n, `r
        If not InStr(A_LoopField, Trash)
            New_data .= A_LoopField "`n"
    FileAppend, %New_data%, %LastSort%
    New_data := ""
    var := ""

    FileRead, Contents, %LastSort%
    if not ErrorLevel
    {
        Sort, Contents
        FileAppend, %Contents%, %SavedLog%
        Contents := ""
    }
    FileDelete, %DestFile%
    FileDelete, %SecondClean%
    FileDelete, %LastSort%
    FileDelete, %SourceFile%
return

URLSearch:
    StringGetPos, URLStart1, URLSearchString, www.
    URLStart = %URLStart1%
    Loop
    {
        ArrayElement := URLStart%A_Index%
        if ArrayElement =
            break
        if ArrayElement = -1
            continue
        if URLStart = -1
            URLStart = %ArrayElement%
        else
        {
            if ArrayElement <> -1
                if ArrayElement < %URLStart%
                    URLStart = %ArrayElement%
        }
    }
    if URLStart = -1
        return
    StringTrimLeft, URL, URLSearchString, %URLStart%
    ; The URL ends at the first whitespace or angle bracket.
    Loop, parse, URL, %A_Tab%%A_Space%<>
    {
        URL = %A_LoopField%
        break
    }
    StringReplace, URLCleansed, URL, ",, All ; strip stray quote characters
    FileAppend, %URLCleansed%`n ; appends to the output file named in Loop, read
    LinkCount += 1
    StringLen, CharactersToOmit, URL
    CharactersToOmit += %URLStart%
    StringTrimLeft, URLSearchString, URLSearchString, %CharactersToOmit%
    Gosub, URLSearch ; recurse to pick up any further URLs on this line
return
Re: Lynda.com workaround
Aaaaaand all of a sudden, it doesn't work anymore...
Does anyone know a way of reading the course list (links) from https://www.lynda.com/allcourses to a file?
Posts: 4347 | Joined: 29 Mar 2015, 09:41
Re: Lynda.com workaround
Try this:
Code:
url := "https://www.lynda.com/allcourses/"
output := A_ScriptDir . "\cources.txt"
html := GetHtml(url) ; this may take time
cources := GetCources(html)
SaveToFile(output, cources)
Run, % output
GetHtml(url) {
    oWhr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
    oWhr.Option(6) := False ; disable automatic redirects
    oWhr.Open("GET", url, false)
    oWhr.Send()
    Return html := oWhr.ResponseText
}

GetCources(html) {
    doc := ComObjCreate("htmlfile")
    doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=9"">")
    doc.write(html)
    courceContainer := doc.querySelector("ul.course-list")
    items := courceContainer.getElementsByTagName("h3")
    Loop % items.length {
        item := items[A_Index - 1]
        itemText := item.innerText
        itemLink := item.getElementsByTagName("a")[0].getAttribute("href")
        text .= (text ? "`r`n`r`n" : "") . "Title: " . itemText . "`r`nLink: " . itemLink
    }
    Return text
}

SaveToFile(filePath, string) {
    oFile := FileOpen(filePath, "w")
    oFile.Write(string)
    oFile.Close()
}
Re: Lynda.com workaround
And... you blew my mind.
Really well played! This is just perfect!
It even does what I was doing manually after the scraping, which was getting the course title from the link!
Boss move right there, mate.
Kudos!
Re: Lynda.com workaround
It's better to replace this:
Code:
GetHtml(url) {
    oWhr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
    oWhr.Option(6) := False ; no redirect
    oWhr.Open("GET", url, false)
    oWhr.Send()
    Return html := oWhr.ResponseText
}
with this:
Code:
GetHtml(url) {
    oWhr := ComObjCreate("Msxml2.XMLHTTP")
    oWhr.Open("GET", url, false)
    oWhr.SetRequestHeader("Pragma", "no-cache")
    oWhr.SetRequestHeader("Cache-Control", "no-cache, no-store")
    oWhr.Send()
    Return html := oWhr.ResponseText
}
Re: Lynda.com workaround
Thanks mate, I will be testing that shortly.
I've noticed a curious thing... when I'm running the script on my PC, I get perfect results.
If I run it on another one, I get some errors.
I've made a couple of changes because I went for titles only, but with your vanilla script the result is the same.
I'm getting:
Error: 0x80072EE2
WinHttp.WinHttpRequest
Operation Timed Out
Specifically: Send, on oWhr.Send()
Am I messing up something?
If I run it on my PC, no problems.
If I run it on another PC, it gives that error.
The other PC has a VPN, and that's the only thing that's different.
But I can access the website on it no problem as well...
Code I'm using:
Code:
#NoEnv ; Recommended for performance and compatibility with future AutoHotkey releases.
; #Warn ; Enable warnings to assist with detecting common errors.
SendMode Input ; Recommended for new scripts due to its superior speed and reliability.
SetWorkingDir %A_ScriptDir% ; Ensures a consistent starting directory.
FileCreateDir, logs
;»»»»»»» Drag GUI with Mouse
OnMessage(0x201, "WM_LBUTTONDOWN")
WM_LBUTTONDOWN()
{
PostMessage, 0xA1, 2
}
Gui, B:New, +LastFound +OwnDialogs -SysMenu +toolwindow
Gui, B:Add, button, x10 yp+10 w60 gGoTime, Start
Gui, B:Add, button, x+20 yp+0 w60 gTerminate, Exit
Gui, B:Add, Text, x10 yp+30, Lynda.com Scraping Util:
Gui, B:Add, Text, x+10 yp+0 w50 vStatus
Gui, B:Show, w200
return
Terminate:
ExitApp
return
GoTime:
GuiControl,, Status, ON
Loop {
    url := "https://www.lynda.com/allcourses/"
    output = logs\%A_Now%.txt
    html := GetHtml(url) ; this may take time
    cources := GetCources(html)
    SaveToFile(output, cources)
    Sleep 10000
    url := ""
    output := ""
    html := ""
    cources := ""
}
GetHtml(url) {
    oWhr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
    oWhr.Option(6) := False ; no redirect
    oWhr.Open("GET", url, false)
    oWhr.Send()
    Return html := oWhr.ResponseText
}

GetCources(html) {
    doc := ComObjCreate("htmlfile")
    doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=9"">")
    doc.write(html)
    courceContainer := doc.querySelector("ul.course-list")
    items := courceContainer.getElementsByTagName("h3")
    Loop % items.length {
        item := items[A_Index - 1]
        itemText := item.innerText
        itemLink := item.getElementsByTagName("a")[0].getAttribute("href")
        text .= (text ? "`r`n" : "") . itemText
        ;~ text .= (text ? "`r`n`r`n" : "") . "Title: " . itemText . "`r`nLink: " . itemLink
    }
    Return text
}

SaveToFile(filePath, string) {
    oFile := FileOpen(filePath, "w")
    oFile.Write(string)
    oFile.Close()
}
Re: Lynda.com workaround
Replacing GetHtml() should fix the problem.
Re: Lynda.com workaround
It is replaced in this last version, and still the issue remains... that's what I was finding strange.
Re: Lynda.com workaround
What does the error message look like after replacing GetHtml()?
Re: Lynda.com workaround
Same as above.
Using the code in the latest post gives me this result on other PCs...
But not on all of them; I've tested on my teammate's laptop and it works as intended.
Printscreen:
https://imgur.com/a/2Ce1oKj
Re: Lynda.com workaround
Anyway, you could try this option:
If this doesn't work, I'm not sure I can help.
Code:
url := "https://www.lynda.com/allcourses/"
output := A_ScriptDir . "\cources.txt"
html := GetHtml(url) ; this may take time
cources := GetCources(html)
SaveToFile(output, cources)
Run, % output
GetHtml(url) {
size := LoadDataFromUrl(url, buff)
Return html := StrGet(&buff, size, "utf-8")
}
GetCources(html) {
doc := ComObjCreate("htmlfile")
doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=9"">")
doc.write(html)
courceContainer := doc.querySelector("ul.course-list")
items := courceContainer.getElementsByTagName("h3")
Loop % items.length {
item := items[A_Index - 1]
itemText := item.innerText
itemLink := item.getElementsByTagName("a")[0].getAttribute("href")
text .= (text ? "`r`n`r`n" : "") . "Title: " . itemText . "`r`nLink: " . itemLink
}
Return text
}
SaveToFile(filePath, string) {
oFile := FileOpen(filePath, "w")
oFile.Write(string)
oFile.Close()
}
LoadDataFromUrl(url, ByRef buff) {
static INTERNET_OPEN_TYPE_DIRECT := 1
, flag1 := (INTERNET_FLAG_RELOAD := 0x80000000)
| (INTERNET_FLAG_IGNORE_CERT_DATE_INVALID := 0x2000)
| (INTERNET_FLAG_IGNORE_CERT_CN_INVALID := 0x1000)
| (INTERNET_FLAG_PRAGMA_NOCACHE := 0x100 )
| (INTERNET_FLAG_NO_CACHE_WRITE := 0x04000000)
, flag2 := (HTTP_QUERY_FLAG_NUMBER := 0x20000000) | (HTTP_QUERY_CONTENT_LENGTH := 5)
, userAgent := "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"
if !hLib := DllCall("LoadLibrary", Str, "Wininet.dll")
throw Exception("Can't load Wininet.dll")
if !hInternet := DllCall("Wininet\InternetOpen", Str, userAgent, UInt, INTERNET_OPEN_TYPE_DIRECT, Ptr, 0, Ptr, 0, UInt, 0, Ptr) {
DllCall("FreeLibrary", Ptr, hLib)
throw Exception("InternetOpen failed")
}
Loop 1 {
if !hUrl := DllCall("Wininet\InternetOpenUrl", Ptr, hInternet, Str, url, Ptr, 0, UInt, 0, UInt, flag1, Ptr, 0, Ptr) {
error := "InternetOpenUrl failed"
break
}
if !DllCall("Wininet\HttpQueryInfo", Ptr, hUrl, UInt, flag2, UIntP, fullSize, UIntP, l := 4, UIntP, idx := 0) {
error := "HttpQueryInfo failed"
break
}
VarSetCapacity(buff, fullSize, 0), bytesRead := 0
while DllCall("Wininet\InternetQueryDataAvailable", Ptr, hUrl, UIntP, size, UInt, 0, Ptr, 0) && size > 0 {
DllCall("Wininet\InternetReadFile", Ptr, hUrl, Ptr, &buff + bytesRead, UInt, size, UIntP, read)
bytesRead += read
}
DllCall("Wininet.dll\InternetCloseHandle", Ptr, hUrl)
}
DllCall("Wininet\InternetCloseHandle", Ptr, hInternet)
DllCall("FreeLibrary", Ptr, hLib)
if error
throw Exception(error)
Return bytesRead
}
I don't know how to work with VPNs properly.
Re: Lynda.com workaround
It is not a VPN issue; I have tried with another PC, not on the VPN, and it also gives the same timeout errors.
I will now try this solution.
Let's see.
Btw, major kudos buddy, you're a real life-saver!
Edit:
Line #064: Throw, Exception(error)
Using your code, without any change.
It works perfectly on my PC, running W10. It does not run on the other machine, also running W10.
Don't know... Maybe I'll work around this on individual machines on every shift (24h team) instead of keeping it scraping every 24h on a dedicated PC.
Re: Lynda.com workaround
Update your AHK to the current version.
Two more options:
Code:
url := "https://www.lynda.com/allcourses/"
output := A_ScriptDir . "\cources.txt"
html := GetHtml(url) ; this may take time
cources := GetCources(html)
SaveToFile(output, cources)
Run, % output
GetHtml(url) {
oWhr := ComObjCreate("Msxml2.ServerXMLHTTP.6.0")
oWhr.setTimeouts(120000, 120000, 120000, 120000)
oWhr.Open("GET", url, false)
oWhr.SetRequestHeader("Pragma", "no-cache")
oWhr.SetRequestHeader("Cache-Control", "no-cache, no-store")
oWhr.Send()
Return html := oWhr.ResponseText
}
GetCources(html) {
doc := ComObjCreate("htmlfile")
doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=9"">")
doc.write(html)
courceContainer := doc.querySelector("ul.course-list")
items := courceContainer.getElementsByTagName("h3")
Loop % items.length {
item := items[A_Index - 1]
itemText := item.innerText
itemLink := item.getElementsByTagName("a")[0].getAttribute("href")
text .= (text ? "`r`n`r`n" : "") . "Title: " . itemText . "`r`nLink: " . itemLink
}
Return text
}
SaveToFile(filePath, string) {
oFile := FileOpen(filePath, "w")
oFile.Write(string)
oFile.Close()
}
Code:
url := "https://www.lynda.com/allcourses/"
output := A_ScriptDir . "\cources.txt"
html := GetHtml(url) ; this may take time
cources := GetCources(html)
SaveToFile(output, cources)
Run, % output
GetHtml(url) {
size := LoadDataFromUrl(url, buff)
Return html := StrGet(&buff, size, "utf-8")
}
GetCources(html) {
doc := ComObjCreate("htmlfile")
doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=9"">")
doc.write(html)
courceContainer := doc.querySelector("ul.course-list")
items := courceContainer.getElementsByTagName("h3")
Loop % items.length {
item := items[A_Index - 1]
itemText := item.innerText
itemLink := item.getElementsByTagName("a")[0].getAttribute("href")
text .= (text ? "`r`n`r`n" : "") . "Title: " . itemText . "`r`nLink: " . itemLink
}
Return text
}
SaveToFile(filePath, string) {
oFile := FileOpen(filePath, "w")
oFile.Write(string)
oFile.Close()
}
LoadDataFromUrl(url, ByRef buff) {
static INTERNET_OPEN_TYPE_DIRECT := 1, INTERNET_OPTION_CONNECT_TIMEOUT := 2
, flag1 := (INTERNET_FLAG_RELOAD := 0x80000000)
| (INTERNET_FLAG_IGNORE_CERT_DATE_INVALID := 0x2000)
| (INTERNET_FLAG_IGNORE_CERT_CN_INVALID := 0x1000)
| (INTERNET_FLAG_PRAGMA_NOCACHE := 0x100 )
| (INTERNET_FLAG_NO_CACHE_WRITE := 0x04000000)
, flag2 := (HTTP_QUERY_FLAG_NUMBER := 0x20000000) | (HTTP_QUERY_CONTENT_LENGTH := 5)
, userAgent := "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"
if !hLib := DllCall("LoadLibrary", Str, "Wininet.dll")
throw Exception("Can't load Wininet.dll")
if !hInternet := DllCall("Wininet\InternetOpen", Str, userAgent, UInt, INTERNET_OPEN_TYPE_DIRECT, Ptr, 0, Ptr, 0, UInt, 0, Ptr) {
DllCall("FreeLibrary", Ptr, hLib)
throw Exception("InternetOpen failed")
}
DllCall("Wininet\InternetSetOption", Ptr, hInternet, UInt, INTERNET_OPTION_CONNECT_TIMEOUT, UIntP, t := 120000, UInt, 4)
Loop 1 {
if !hUrl := DllCall("Wininet\InternetOpenUrl", Ptr, hInternet, Str, url, Ptr, 0, UInt, 0, UInt, flag1, Ptr, 0, Ptr) {
error := "InternetOpenUrl failed"
break
}
if !DllCall("Wininet\HttpQueryInfo", Ptr, hUrl, UInt, flag2, UIntP, fullSize, UIntP, l := 4, UIntP, idx := 0) {
error := "HttpQueryInfo failed"
break
}
VarSetCapacity(buff, fullSize, 0), bytesRead := 0
while DllCall("Wininet\InternetQueryDataAvailable", Ptr, hUrl, UIntP, size, UInt, 0, Ptr, 0) && size > 0 {
DllCall("Wininet\InternetReadFile", Ptr, hUrl, Ptr, &buff + bytesRead, UInt, size, UIntP, read)
bytesRead += read
}
DllCall("Wininet.dll\InternetCloseHandle", Ptr, hUrl)
}
DllCall("Wininet\InternetCloseHandle", Ptr, hInternet)
DllCall("FreeLibrary", Ptr, hLib)
if error
throw Exception(error)
Return bytesRead
}
Re: Lynda.com workaround
Seriously... I am the guy who is always on top of software updates at my company, and I get rekt by it without even noticing...
I shall update now and retry. This was soooo awkward.
Re: Lynda.com workaround
Nope... still the same issue.
I'm thinking it might be a problem with the software on those computers, like missing libraries or .NET or something like that (I am not very knowledgeable on the terms).
All the codes are throwing timeout errors, so...
Anyway, I thank you for your help. Awesome solutions.
Re: Lynda.com workaround
No idea what the problem could be. For me, all the versions work on Windows 10; on Windows 7 most of them do, though the errors are different. The variant with LoadDataFromUrl() works everywhere. Maybe someone else can give a clue.