Need help web scraping
Hello all, I need a script for web scraping. I am quite competent at web scraping, but only in VB, and VB only gets me so far: it only lets me scrape from the page source (the text you see when you press Ctrl+U), which isn't enough for what I now want to do. From what I've seen of web scraping in AHK, it seems you can scrape from the text you see when you press Ctrl+Shift+I, which contains a lot more. My problem is that the foreign-language text I want to scrape never appears in the page source (Ctrl+U).
Here is my problem. I don't need anyone to write the whole script, just to get the ball rolling; I should be fine once I know roughly what to do:
I am learning a language and constantly use a website called Jisho.org. It lets you see a word used in sentences by adding "#sentences" to the word you search for, for example jisho.org/search/%E6%97%A5%20%23sentences
Extracting the full Japanese sentence from this page source isn't simple, because the sentence is stored on another website, https://tatoeba.org/. This means you have to extract the link to that website first and then extract the Japanese sentence from there.
Let me show you what I mean. If you go to https://jisho.org/search/%E6%97%A5%20%23sentences, you first have to follow this link (from the page source): https://tatoeba.org/eng/sentences/show/4940. The Japanese sentence is then stored in this element: <span ng-if="!vm.sentence.furigana">日本人ならそんなことはけっしてしないでしょう。</span>
Thanks. I hope I explained this well enough for someone to be kind enough to lend me a hand.
Last edited by BoBo on 31 May 2020, 02:36, edited 1 time in total.
Reason: Fixed broken links.
Re: Need help web scraping
Welcome to AHK!
I'm sorry, but I'm not sure what you want to do with AHK (and I have not heard of web scraping before).
Do you read something from somewhere and want to translate it in some way?
or...
Re: Need help web scraping
Hello!
Try this example:
Code: Select all
url := "https://jisho.org/search/%E6%97%A5%20%23sentences"
whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
whr.Open("GET", url, false)
whr.Send()
status := whr.status
if (status != 200)
    throw "Failed to download html. Status: " . status
html := whr.responseText
doc := ComObjCreate("htmlfile")
doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=9"">")
doc.write(html)
sentences := doc.querySelectorAll(".sentence_content")
Loop % sentences.length {
    sentence := sentences[A_Index - 1]
    japaneseElem := sentence.querySelector(".japanese_sentence")
    englishElem := sentence.querySelector(".english")
    englishText := englishElem.innerText
    coll := japaneseElem.querySelectorAll(".unlinked")
    japaneseText := ""
    Loop % coll.length
        japaneseText .= (A_Index = 1 ? "" : " ") . coll[A_Index - 1].innerText
    japaneseText .= RegExReplace(japaneseElem.innerText, "s).*(\S)", "$1")
    AllText .= "japanese: " . japaneseText . "`r`nenglish: " . englishText . "`r`n`r`n"
}
MsgBox, % Clipboard := AllText
Re: Need help web scraping
This script got on my clipboard at some point:
Code: Select all
^Space::
if (Clipboard != "") { ; if clipboard data is not an image or blank
    SearchUrl := "http://jisho.org/search/" . Clipboard ; search text gets appended to the end of the string
    Run, %SearchUrl%
} else {
    MsgBox, Your clipboard doesn't have recognised text.
}
Return
For more on web scraping with AHK I would study http://the-automator.com/web-scraping-with-autohotkey/ and/or https://www.autohotkey.com/boards/viewtopic.php?f=6&t=42890
Re: Need help web scraping
teadrinker wrote: ↑31 May 2020, 17:24
Hello!
Try this example: [...]
Thank you very much for taking the time to write this, I am grateful. Is there a way to make the window at the end let you select text, so that I can pick which sentence I want to use? Otherwise this is great!
Another thing: I know it likely took you a while to write this, but could you explain how it works, maybe with comments in the code? I am not used to AHK code and have no experience of web scraping with it either.
Thanks again.
One more thing: I tried changing it to my liking (putting both sentences on one line and taking the URL from the clipboard instead of hard-coding it in the url variable). When I run it, it saves my clipboard into the url variable and outputs the correct text, but if I run it again with another URL it still outputs the text from the first run (even after five runs it still shows the first text).
Code: Select all
^+3::
{
    send, ^c
    sleep, 100
    url := Clipboard
    sleep, 100
    whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
    whr.Open("GET", url, false)
    whr.Send()
    status := whr.status
    if (status != 200)
        throw "Failed to download html. Status: " . status
    html := whr.responseText
    doc := ComObjCreate("htmlfile")
    doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=9"">")
    doc.write(html)
    sentences := doc.querySelectorAll(".sentence_content")
    Loop % sentences.length {
        sentence := sentences[A_Index - 1]
        japaneseElem := sentence.querySelector(".japanese_sentence")
        englishElem := sentence.querySelector(".english")
        englishText := englishElem.innerText
        coll := japaneseElem.querySelectorAll(".unlinked")
        japaneseText := ""
        Loop % coll.length
            japaneseText .= (A_Index = 1 ? "" : " ") . coll[A_Index - 1].innerText
        japaneseText .= RegExReplace(japaneseElem.innerText, "s).*(\S)", "$1")
        StringReplace, japaneseText, japaneseText, %A_SPACE%,, All
        AllText .= japaneseText . englishText . "`r`n`r`n"
    }
    MsgBox, % Clipboard := AllText
    Return
}
Re: Need help web scraping
Add this at the top of the hotkey, so that AllText is cleared on every run:
Code: Select all
AllText := ""
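As a minimal sketch of why the reset matters: in AutoHotkey v1, variables used in a hotkey subroutine are global by default, so they keep their contents between presses, and `.=` appends to whatever is already there. The hotkey `^+9` and the variable name `Output` below are hypothetical, chosen only for illustration.

```autohotkey
; Hypothetical demo hotkey (Ctrl+Shift+9), not part of the script in this thread.
^+9::
    Output := ""                               ; reset on every press; remove this line to see old text pile up
    Loop 3
        Output .= "line " . A_Index . "`r`n"   ; .= appends to the current contents of Output
    MsgBox, % Output
Return
```

With the reset line in place, each press should show the same three lines; without it, every press adds three more to the text left over from earlier runs.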
Re: Need help web scraping
Wow, Thank you so much!
Re: Need help web scraping
I have another problem: the Japanese sentences in the output contain spaces (Japanese sentences don't use spaces), for example:
彼女 と 友達 に なれた 時 は それは うれしかった です よ。 I was overjoyed when I was able to make friends with her!
How do I get rid of these spaces properly?
I have tried:
japaneseText := StrReplace(japaneseText," ")
While this does get rid of the spaces, for some reason it also gets rid of some punctuation and some katakana characters.
I have also tried:
StringReplace, TotalVar, TotalVar, %A_SPACE%, _, All
(with the right variables substituted, of course)
This doesn't work either, as it also gets rid of punctuation.
I'm guessing this has something to do with the characters being in another language.
Do you have any other suggestions?
Re: Need help web scraping
Web scraping requires at least basic JavaScript knowledge; without it, comments in the code won't help. You should also be able to work with Chrome DevTools. When you open the page in Chrome DevTools and find the elements you are interested in, many questions will disappear.
Like this, maybe:
Code: Select all
url := "https://jisho.org/search/%E6%97%A5%20%23sentences"
whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
whr.Open("GET", url, false)
whr.Send()
status := whr.status
if (status != 200)
    throw "Failed to download html. Status: " . status
html := whr.responseText
doc := ComObjCreate("htmlfile")
doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=9"">")
doc.write(html)
sentences := doc.querySelectorAll(".sentence_content")
Loop % sentences.length {
    sentence := sentences[A_Index - 1]
    japaneseElem := sentence.querySelector(".japanese_sentence")
    englishElem := sentence.querySelector(".english")
    englishText := englishElem.innerText
    coll := japaneseElem.querySelectorAll(".unlinked")
    japaneseText := ""
    Loop % coll.length
        japaneseText .= coll[A_Index - 1].innerText
    japaneseText .= RegExReplace(japaneseElem.innerText, "s).*(\S)", "$1")
    AllText .= "japanese: " . japaneseText . "`r`nenglish: " . englishText . "`r`n`r`n"
}
MsgBox, % Clipboard := AllText
Re: Need help web scraping
teadrinker wrote: ↑01 Jun 2020, 08:19
Like this, maybe: [...]
I hope I am not being dumb, but I can't see what you have changed.
Here is my current code (the original, changed to my liking):
Code: Select all
^+4::
AllText := ""
send, ^c
sleep, 200
WordLookup := clipboard
url := "https://jisho.org/search/" . clipboard . "%20%23sentences"
sleep, 200
whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
whr.Open("GET", url, false)
whr.Send()
status := whr.status
if (status != 200)
    throw "Failed to download html. Status: " . status
html := whr.responseText
doc := ComObjCreate("htmlfile")
doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=9"">")
doc.write(html)
sentences := doc.querySelectorAll(".sentence_content")
Loop % sentences.length {
    sentence := sentences[A_Index - 1]
    japaneseElem := sentence.querySelector(".japanese_sentence")
    englishElem := sentence.querySelector(".english")
    englishText := englishElem.innerText
    coll := japaneseElem.querySelectorAll(".unlinked")
    japaneseText := ""
    Loop % coll.length
        japaneseText .= (A_Index = 1 ? "" : " ") . coll[A_Index - 1].innerText
    japaneseText .= RegExReplace(japaneseElem.innerText, "s).*(\S)", "$1")
    AllText .= japaneseText . englishText . "`r`n`r`n"
}
Clipboard := AllText
Run, C:\Windows\System32\notepad.exe
sleep, 200
send, Examples for %WordLookup%:
send, {Enter}
send, {Enter}
send, ^v
Return
Re: Need help web scraping
Hello again. When the code is run, do you know what is assigned to the variable sentences at the following line?
Code: Select all
sentences := doc.querySelectorAll(".sentence_content")
Re: Need help web scraping
Only text can be copied to the Clipboard, but this variable contains a JavaScript object called a NodeList:
Code: Select all
...
sentences := doc.querySelectorAll(".sentence_content")
MsgBox, % sentences.toString()
...
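To look at what such a list holds, one option is to loop over it and collect each node's markup as text. The sketch below does this on a small made-up HTML string rather than the live page, so the markup and the `.item` class are assumptions for illustration only.

```autohotkey
; Sample markup is made up for illustration; the real page is more complex.
html := "<ul><li class=""item"">first</li><li class=""item"">second</li></ul>"

doc := ComObjCreate("htmlfile")
doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=9"">")  ; same meta line as in the scripts in this thread
doc.write(html)

nodes := doc.querySelectorAll(".item")
report := "count: " . nodes.length . "`r`n"
Loop % nodes.length {
    node := nodes[A_Index - 1]          ; the list is indexed from 0, A_Index starts at 1
    report .= node.outerHTML . "`r`n"   ; outerHTML is plain text, so it can be shown or copied
}
MsgBox, % report
```

The same loop could be pointed at `sentences` in the scraping script to see exactly which elements were matched.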
Re: Need help web scraping
I have found the problem:
https://imgur.com/a/zKCm605
The punctuation isn't being picked up because it isn't in the same class as the Japanese text. Could you find a way to fix this? I have been trying for hours and haven't been able to do it.
Thanks.
Re: Need help web scraping
Haha, I just didn't notice the punctuation.
Code: Select all
url := "https://jisho.org/search/%E6%97%A5%20%23sentences"
whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
whr.Open("GET", url, false)
whr.Send()
status := whr.status
if (status != 200)
    throw "Failed to download html. Status: " . status
html := whr.responseText
doc := ComObjCreate("htmlfile")
doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=9"">")
doc.write(html)
sentences := doc.querySelectorAll(".sentence_content")
Loop % sentences.length {
    sentence := sentences[A_Index - 1]
    japaneseElem := sentence.querySelector(".japanese_sentence")
    coll := japaneseElem.getElementsByClassName("furigana")
    Loop % coll.length
        try coll[A_Index - 1].parentNode.removeChild(coll[A_Index - 1])
    japaneseText := RegExReplace(japaneseElem.innerText, "\R")
    englishElem := sentence.querySelector(".english")
    englishText := englishElem.innerText
    AllText .= (A_Index = 1 ? "" : "`r`n`r`n") . "japanese: " . japaneseText . "`r`nenglish: " . englishText
}
MsgBox, % Clipboard := AllText
Re: Need help web scraping
teadrinker wrote: ↑02 Jun 2020, 08:58
Haha, I just didn't notice the punctuation. [...]
While this does get all of the punctuation perfectly, it also gets the furigana helper text (the small text above some of the words). This ruins the sentence because it basically grabs the word twice, like grabbing an English sentence as "he missed his first meetingmeeting". Is it possible to make it not do this? (Also, the script only does this sometimes, maybe about once per sentence.)
If my explanation was bad, have a look at this: https://imgur.com/c2R1abh
Also, if making this script work on Jisho is too difficult, it would actually be more useful on https://cooljugator.com/ja/%E3%81%84%E3%82%8B, for example: it gives the word in a sentence and also conjugates it at the top.
Thanks again for all your help so far.
Re: Need help web scraping
You are right, I was inattentive.
Fixed:
Code: Select all
url := "https://jisho.org/search/%E6%97%A5%20%23sentences"
whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
whr.Open("GET", url, false)
whr.Send()
status := whr.status
if (status != 200)
    throw "Failed to download html. Status: " . status
html := whr.responseText
doc := ComObjCreate("htmlfile")
doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=9"">")
doc.write(html)
sentences := doc.querySelectorAll(".sentence_content")
Loop % sentences.length {
    sentence := sentences[A_Index - 1]
    japaneseElem := sentence.querySelector(".japanese_sentence")
    coll := japaneseElem.getElementsByClassName("furigana")
    Loop % coll.length
        coll[0].parentNode.removeChild(coll[0])
    japaneseText := RegExReplace(japaneseElem.innerText, "\R")
    englishElem := sentence.querySelector(".english")
    englishText := englishElem.innerText
    AllText .= (A_Index = 1 ? "" : "`r`n`r`n") . "japanese: " . japaneseText . "`r`nenglish: " . englishText
}
MsgBox, % Clipboard := AllText
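Why `coll[0]` works where `coll[A_Index - 1]` skipped elements can be sketched on a tiny example. This assumes, as in browsers, that getElementsByClassName returns a live collection that shrinks whenever a matching element is removed from the document. The markup and the id `s` are made up for illustration.

```autohotkey
; Made-up markup: two "furigana" spans mixed in with the base text A and B.
doc := ComObjCreate("htmlfile")
doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=9"">")
doc.write("<p id=""s""><span class=""furigana"">a</span>A<span class=""furigana"">b</span>B</p>")

elem := doc.getElementById("s")
coll := elem.getElementsByClassName("furigana")

; The loop count is evaluated once, before any removal.
; After each removal the next remaining match moves to index 0,
; so always removing coll[0] visits every element exactly once.
Loop % coll.length
    coll[0].parentNode.removeChild(coll[0])

MsgBox, % "remaining matches: " . coll.length . "`r`ntext: " . elem.innerText
```

With `coll[A_Index - 1]`, the index keeps advancing while the collection shrinks underneath it, so roughly every second element is skipped, which would match the "about once per sentence" leftovers described above.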
Re: Need help web scraping
teadrinker wrote: ↑03 Jun 2020, 07:28
You are right, I was inattentive.
Fixed: [...]
Oh, also, do you know where I can learn about using ComObjCreate("WinHttp.WinHttpRequest.5.1") and the like? I am struggling to do things on my own when it comes to AHK.
Re: Need help web scraping
teadrinker wrote: ↑03 Jun 2020, 07:54
WinHttpRequest object
COM Object Reference
Also look for examples on this forum.
I am confused about why this code isn't working:
Code: Select all
^+2::
AllText := ""
send, ^c
sleep, 200
WordLookup := clipboard
url := "https://jisho.org/search/" . clipboard . "%20%23sentences"
sleep, 200
whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
whr.Open("GET", url, false)
whr.Send()
status := whr.status
if (status != 200)
    throw "Failed to download html. Status: " . status
html := whr.responseText
doc := ComObjCreate("htmlfile")
doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=9"">")
doc.write(html)
sentences := doc.querySelectorAll(".sentence_content")
Loop % sentences.length {
    sentence := sentences[A_Index - 1]
    japaneseElem := sentence.querySelector(".japanese_sentence")
    coll := japaneseElem.getElementsByClassName("furigana")
    Loop % coll.length
        coll[0].parentNode.removeChild(coll[0])
    japaneseText := RegExReplace(japaneseElem.innerText, "\R")
    englishElem := sentence.querySelector(".english")
    englishText := englishElem.innerText
    if (japaneseText < 10000) {
        AllText .= (A_Index = 1 ? "" : "`r`n`r`n") . japaneseText . englishText
    }
}
clipboard := AllText
Return
Code: Select all
if (japaneseText < 10000) {
    AllText .= (A_Index = 1 ? "" : "`r`n`r`n") . japaneseText . englishText
}