Need help web scraping

Get help with using AutoHotkey (v1.1 and older) and its commands and hotkeys
JawGBoi
Posts: 15
Joined: 05 May 2020, 14:17

31 May 2020, 00:36

Hello all, I am in need of a script for web scraping. I am very competent at web scraping, but only using VB. VB only lets you go so far: it only lets you scrape things from the page source (the same text you see when you press Ctrl+U), and that just isn't cutting it for what I now want to do. From what I've seen of web scraping in AHK, it seems you are able to scrape from the same text you see when you press Ctrl+Shift+I, which exposes a lot more to scrape. The problem I've been having is that the foreign text never appears in the page source (Ctrl+U), but that is the thing I want to scrape.

Here is my problem. I don't need anyone to write a complete script, just to get the ball rolling, as I should be fine once I know roughly what to do:
I am learning a language and constantly use a website called Jisho.org. This website lets you see a word used in sentences by typing the word followed by "#sentences", for example jisho.org/search/%E6%97%A5%20%23sentences
Extracting the full Japanese sentence from this page source isn't simple, because the sentence is stored on another website, https://tatoeba.org/. This means you have to extract the link to that website first and then extract the Japanese sentence from there.

Let me show you what I mean. If you go to https://jisho.org/search/%E6%97%A5%20%23sentences, you first have to follow this link (from the page source), https://tatoeba.org/eng/sentences/show/4940, and then the Japanese sentence is stored in this element: <span ng-if="!vm.sentence.furigana">日本人ならそんなことはけっしてしないでしょう。</span>

Thanks, I hope I explained this well enough for someone to be kind and lend me a hand.
Last edited by BoBo on 31 May 2020, 02:36, edited 1 time in total.
Reason: Fixed broken links.
Albireo
Posts: 1756
Joined: 16 Oct 2013, 13:53

Re: Need help web scraping

31 May 2020, 16:54

Welcome to AHK!
I'm sorry, but I'm not sure what you want to do with AHK :( . (and I have not heard about web scraping earlier :roll: )
Do you read something from somewhere and want to translate it in some way?
or...
teadrinker
Posts: 4349
Joined: 29 Mar 2015, 09:41

Re: Need help web scraping

31 May 2020, 17:24

Hello!
Try this example:

Code: Select all

url := "https://jisho.org/search/%E6%97%A5%20%23sentences"
whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
whr.Open("GET", url, false)
whr.Send()
status := whr.status
if (status != 200)
   throw "Failed to download html. Status: " . status
html := whr.responseText

doc := ComObjCreate("htmlfile")
doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=9"">")
doc.write(html)

sentences := doc.querySelectorAll(".sentence_content")
Loop % sentences.length {
   sentence := sentences[A_Index - 1]
   japaneseElem := sentence.querySelector(".japanese_sentence")
   englishElem  := sentence.querySelector(".english")
   englishText := englishElem.innerText
   coll := japaneseElem.querySelectorAll(".unlinked")
   japaneseText := ""
   Loop % coll.length
      japaneseText .= (A_Index = 1 ? "" : " ") . coll[A_Index - 1].innerText
   japaneseText .= RegExReplace(japaneseElem.innerText, "s).*(\S)", "$1")
   AllText .= "japanese: " . japaneseText . "`r`nenglish: " . englishText . "`r`n`r`n"
}
MsgBox, % Clipboard := AllText
Chunjee
Posts: 1435
Joined: 18 Apr 2014, 19:05

Re: Need help web scraping

31 May 2020, 22:23

This script got on my clipboard at some point:

Code: Select all

^Space::
if (Clipboard != "") { ;if clipboard data is not an image or blank
	SearchUrl := "http://jisho.org/search/" . Clipboard   ; Search text gets appended to the end of the string
	Run, %SearchUrl%
} else {
	Msgbox, your clipboard doesn't have recognised text.
}
Return

For more on web scraping with AHK I would study http://the-automator.com/web-scraping-with-autohotkey/ and/or https://www.autohotkey.com/boards/viewtopic.php?f=6&t=42890
JawGBoi
Posts: 15
Joined: 05 May 2020, 14:17

Re: Need help web scraping

01 Jun 2020, 04:21

teadrinker wrote:
31 May 2020, 17:24
Thank you very much for taking the time to write this, I am grateful. Is there a way to make the window at the end let you select text, so that I can pick which sentence I want to use? Otherwise this is great!

Another thing: while I know it likely took you a while to write this, could you explain how it works? I am not used to AHK code and have no experience of web scraping with it either. Maybe by using comments in the code.

Thanks again.

Also, I tried changing it to my liking (having the two sentences on one line and not having the website hard-coded into the url variable). When I run it, it saves my clipboard into the url variable and outputs the correct text, but if I run it again with another website it still outputs the text from last time (even if I run it 5 times, it still outputs the first text).

Code: Select all

^+3::
{
send, ^c
sleep, 100

url := Clipboard
sleep, 100

whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
whr.Open("GET", url, false)
whr.Send()
status := whr.status
if (status != 200)
   throw "Failed to download html. Status: " . status
html := whr.responseText

doc := ComObjCreate("htmlfile")
doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=9"">")
doc.write(html)

sentences := doc.querySelectorAll(".sentence_content")
Loop % sentences.length {
   sentence := sentences[A_Index - 1]
   japaneseElem := sentence.querySelector(".japanese_sentence")
   englishElem  := sentence.querySelector(".english")
   englishText := englishElem.innerText
   coll := japaneseElem.querySelectorAll(".unlinked")
   japaneseText := ""
   Loop % coll.length
      japaneseText .= (A_Index = 1 ? "" : " ") . coll[A_Index - 1].innerText
   japaneseText .= RegExReplace(japaneseElem.innerText, "s).*(\S)", "$1")
   
   StringReplace, japaneseText, japaneseText, %A_SPACE%,, All

   AllText .= japaneseText . englishText . "`r`n`r`n"
}
MsgBox, % Clipboard := AllText

Return
}
AHKStudent
Posts: 1472
Joined: 05 May 2018, 12:23

Re: Need help web scraping

01 Jun 2020, 06:15

add this to the top of the script

Code: Select all

 AllText := 
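A note on why this helps, as a minimal sketch: in AutoHotkey v1 a hotkey subroutine works with global variables, so AllText keeps whatever the previous key press appended to it unless it is cleared first. The hotkey ^+9 and the sample text below are made up for the demonstration and are not part of the original script.

```autohotkey
; Minimal sketch (AutoHotkey v1.1). ^+9 is a hypothetical hotkey used only here.
^+9::
AllText := ""                                   ; clear what the previous press left behind
AllText .= "pressed at tick " . A_TickCount . "`r`n"
MsgBox, % AllText                               ; with the reset, only the current line is shown
Return
```

Commenting out the reset line and pressing the hotkey several times should show the lines piling up, which matches the "same text from last time" symptom described above.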
JawGBoi
Posts: 15
Joined: 05 May 2020, 14:17

Re: Need help web scraping

01 Jun 2020, 06:22

AHKStudent wrote:
01 Jun 2020, 06:15
add this to the top of the script

Code: Select all

 AllText := 
Wow, Thank you so much!
JawGBoi
Posts: 15
Joined: 05 May 2020, 14:17

Re: Need help web scraping

01 Jun 2020, 07:01

AHKStudent wrote:
01 Jun 2020, 06:15
add this to the top of the script

Code: Select all

 AllText := 
I have another problem: the output of the Japanese sentences contains spaces (Japanese sentences don't use spaces), for example:
彼女 と 友達 に なれた 時 は それは うれしかった です よ。 I was overjoyed when I was able to make friends with her!

How do I get rid of these spaces properly?

I have tried:
japaneseText := StrReplace(japaneseText," ")
While this does get rid of the spaces, for some reason it also gets rid of some punctuation and some characters from one of the Japanese alphabets (Katakana).

I have also tried:
StringReplace, TotalVar, TotalVar, %A_SPACE%, _, All
(substituting the right variables, of course)
This doesn't work either, as it gets rid of punctuation as well.

I'm guessing this has something to do with the characters being in another language.

Do you have any other suggestions?
teadrinker
Posts: 4349
Joined: 29 Mar 2015, 09:41

Re: Need help web scraping

01 Jun 2020, 08:19

JawGBoi wrote: could you explain how this works? I am not used to AHK code and have no experience in web scraping with it either.
For web scraping it's necessary to have at least basic JavaScript knowledge. Commenting is useless without this knowledge. In addition, you should be able to work with Chrome DevTools. When you open the page in Chrome DevTools and find the elements you are interested in, many questions will disappear.
Image
JawGBoi wrote: How do I get rid of these spaces properly?
Like this, maybe:

Code: Select all

url := "https://jisho.org/search/%E6%97%A5%20%23sentences"
whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
whr.Open("GET", url, false)
whr.Send()
status := whr.status
if (status != 200)
   throw "Failed to download html. Status: " . status
html := whr.responseText

doc := ComObjCreate("htmlfile")
doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=9"">")
doc.write(html)

sentences := doc.querySelectorAll(".sentence_content")
Loop % sentences.length {
   sentence := sentences[A_Index - 1]
   japaneseElem := sentence.querySelector(".japanese_sentence")
   englishElem  := sentence.querySelector(".english")
   englishText := englishElem.innerText
   coll := japaneseElem.querySelectorAll(".unlinked")
   japaneseText := ""
   Loop % coll.length
      japaneseText .= coll[A_Index - 1].innerText
   japaneseText .= RegExReplace(japaneseElem.innerText, "s).*(\S)", "$1")
   AllText .= "japanese: " . japaneseText . "`r`nenglish: " . englishText . "`r`n`r`n"
}
MsgBox, % Clipboard := AllText
JawGBoi
Posts: 15
Joined: 05 May 2020, 14:17

Re: Need help web scraping

01 Jun 2020, 09:38

teadrinker wrote:
01 Jun 2020, 08:19

I hope I am not being dumb, but I can't see what you have changed.

Here is my current code (the original, changed to my liking):

Code: Select all

^+4::
AllText := 

send, ^c
sleep, 200
WordLookup := clipboard

url := "https://jisho.org/search/" . clipboard . "%20%23sentences"
sleep, 200

whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
whr.Open("GET", url, false)
whr.Send()
status := whr.status
if (status != 200)
   throw "Failed to download html. Status: " . status
html := whr.responseText

doc := ComObjCreate("htmlfile")
doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=9"">")
doc.write(html)

sentences := doc.querySelectorAll(".sentence_content")
Loop % sentences.length {
   sentence := sentences[A_Index - 1]
   japaneseElem := sentence.querySelector(".japanese_sentence")
   englishElem  := sentence.querySelector(".english")
   englishText := englishElem.innerText
   coll := japaneseElem.querySelectorAll(".unlinked")
   japaneseText := ""
   Loop % coll.length
      japaneseText .= (A_Index = 1 ? "" : " ") . coll[A_Index - 1].innerText
   japaneseText .= RegExReplace(japaneseElem.innerText, "s).*(\S)", "$1")

   AllText .= japaneseText . englishText . "`r`n`r`n"
   }
Clipboard := AllText

Run, C:\Windows\System32\notepad.exe
sleep, 200

send, Examples for %WordLookup%:
send, {Enter}
send, {Enter}
send, ^v
teadrinker
Posts: 4349
Joined: 29 Mar 2015, 09:41

Re: Need help web scraping

01 Jun 2020, 10:17

JawGBoi wrote: I can't see what you have changed.
(A_Index = 1 ? "" : " ") . has been removed.
JawGBoi
Posts: 15
Joined: 05 May 2020, 14:17

Re: Need help web scraping

01 Jun 2020, 14:50

teadrinker wrote:
01 Jun 2020, 10:17
(A_Index = 1 ? "" : " ") . has been removed.
Hello again. When the code is run, do you know what would be assigned to the variable sentences at the following line?

Code: Select all

sentences := doc.querySelectorAll(".sentence_content")
I am trying to make a totally new script for a new purpose, based on the script that you kindly created. This line is bugging me because when I try to copy the variable to the clipboard to see what it stores, I get "Invalid value. Specifically: An object.", meaning that I can't check what it is (in my new code).
teadrinker
Posts: 4349
Joined: 29 Mar 2015, 09:41

Re: Need help web scraping

01 Jun 2020, 17:32

JawGBoi wrote: I try and copy it to the clipboard
You can copy to the Clipboard text only, but this variable contains a JavaScript object called NodeList:

Code: Select all

...
sentences := doc.querySelectorAll(".sentence_content")
MsgBox, % sentences.toString()
...
so it can't be copied to the Clipboard.
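To see what the NodeList holds, one option is to loop over it and collect a text property of each node, for example outerHTML, into an ordinary string. A minimal sketch follows; the sample HTML and the variable name dump are invented for illustration and are not Jisho's real markup.

```autohotkey
; Minimal sketch (AutoHotkey v1.1): turn a NodeList into text that can be shown or copied.
; The HTML below is a made-up sample, not the real page.
doc := ComObjCreate("htmlfile")
doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=9"">")
doc.write("<ul><li class=""sentence_content"">one</li><li class=""sentence_content"">two</li></ul>")

sentences := doc.querySelectorAll(".sentence_content")   ; an object, not a string
dump := "count: " . sentences.length . "`r`n"
Loop % sentences.length
   dump .= A_Index . ": " . sentences[A_Index - 1].outerHTML . "`r`n"   ; NodeList indexes start at 0
MsgBox, % Clipboard := dump   ; dump is plain text, so it can go on the clipboard
```

Swapping outerHTML for innerText gives only the visible text of each node, which may be easier to read on a large page.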
JawGBoi
Posts: 15
Joined: 05 May 2020, 14:17

Re: Need help web scraping

02 Jun 2020, 08:11

teadrinker wrote:
01 Jun 2020, 10:17
JawGBoi wrote: I can't see what you have changed.
(A_Index = 1 ? "" : " ") . has been removed.
I have found the problem:
https://imgur.com/a/zKCm605

The reason why the punctuation isn't being picked up is because it isn't in the same class as the Japanese text. Could you find a way to fix this? I have been trying for hours and haven't been able to do it.
Thanks.
teadrinker
Posts: 4349
Joined: 29 Mar 2015, 09:41

Re: Need help web scraping

02 Jun 2020, 08:58

JawGBoi wrote: The reason why the punctuation isn't being picked up
Haha, I just didn't notice the punctuation. :)

Code: Select all

url := "https://jisho.org/search/%E6%97%A5%20%23sentences"
whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
whr.Open("GET", url, false)
whr.Send()
status := whr.status
if (status != 200)
   throw "Failed to download html. Status: " . status
html := whr.responseText

doc := ComObjCreate("htmlfile")
doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=9"">")
doc.write(html)

sentences := doc.querySelectorAll(".sentence_content")
Loop % sentences.length {
   sentence := sentences[A_Index - 1]

   japaneseElem := sentence.querySelector(".japanese_sentence")
   coll := japaneseElem.getElementsByClassName("furigana")
   Loop % coll.length
      try coll[A_Index - 1].parentNode.removeChild(coll[A_Index - 1])
   japaneseText := RegExReplace(japaneseElem.innerText, "\R")
   
   englishElem  := sentence.querySelector(".english")
   englishText := englishElem.innerText
   
   AllText .= (A_Index = 1 ? "" : "`r`n`r`n") . "japanese: " . japaneseText . "`r`nenglish: " . englishText
}
MsgBox, % Clipboard := AllText
JawGBoi
Posts: 15
Joined: 05 May 2020, 14:17

Re: Need help web scraping

03 Jun 2020, 05:49

teadrinker wrote:
02 Jun 2020, 08:58
JawGBoi wrote: The reason why the punctuation isn't being picked up
Haha, I just didn't notice the punctuation. :)

While this does get all of the punctuation perfectly, it also gets the furigana helper text (the small text above some of the words). This kind of ruins the sentence because it is basically grabbing the word twice, similar to grabbing an English sentence like this: "he missed his first meetingmeeting". Is it possible to make it not do this? (Also, the script only does this sometimes, maybe about once every sentence.)

If my explanation was bad then have a look at this: https://imgur.com/c2R1abh

Also, if making this script work on Jisho is too difficult, it would actually be more useful to do it on https://cooljugator.com/ja/%E3%81%84%E3%82%8B, for example; it gives the word in a sentence and also conjugates it at the top.

Thanks again for all your help so far.
teadrinker
Posts: 4349
Joined: 29 Mar 2015, 09:41

Re: Need help web scraping

03 Jun 2020, 07:28

JawGBoi wrote: also get the furigana helper text
You are right, I was inattentive.
Fixed:

Code: Select all

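url := "https://jisho.org/search/%E6%97%A5%20%23sentences"
; Download the raw HTML of the page with a synchronous GET request
whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
whr.Open("GET", url, false)
whr.Send()
status := whr.status
if (status != 200)
   throw "Failed to download html. Status: " . status
html := whr.responseText

; Parse the HTML into a DOM document; the meta tag requests IE9 document mode,
; which the selector methods used below rely on
doc := ComObjCreate("htmlfile")
doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=9"">")
doc.write(html)

; Each example sentence sits in an element with the class "sentence_content"
sentences := doc.querySelectorAll(".sentence_content")
Loop % sentences.length {
   sentence := sentences[A_Index - 1]   ; DOM collections are zero-based

   japaneseElem := sentence.querySelector(".japanese_sentence")
   ; Remove the furigana (reading) elements so only the sentence itself remains
   coll := japaneseElem.getElementsByClassName("furigana")
   Loop % coll.length
      coll[0].parentNode.removeChild(coll[0])
   japaneseText := RegExReplace(japaneseElem.innerText, "\R")   ; strip line breaks
   
   englishElem  := sentence.querySelector(".english")
   englishText := englishElem.innerText
   
   ; Append the pair, separating entries with a blank line
   AllText .= (A_Index = 1 ? "" : "`r`n`r`n") . "japanese: " . japaneseText . "`r`nenglish: " . englishText
}
MsgBox, % Clipboard := AllText   ; show the result and copy it to the clipboard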
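The change from coll[A_Index - 1] to coll[0] appears to rely on getElementsByClassName returning a live collection: each removed element also drops out of coll, so the next element to remove is always at index 0. The sketch below uses invented HTML (not Jisho's markup) and a made-up variable name, report, to show the collection length after each removal.

```autohotkey
; Minimal sketch (AutoHotkey v1.1): a live collection shrinks as its elements are removed.
; The HTML is a made-up sample with two "furigana" spans.
doc := ComObjCreate("htmlfile")
doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=9"">")
doc.write("<div id=""s""><span class=""furigana"">a</span>A<span class=""furigana"">b</span>B</div>")

coll := doc.getElementsByClassName("furigana")
report := "length before: " . coll.length . "`r`n"
Loop % coll.length {                          ; the loop count is evaluated once, before any removal
   coll[0].parentNode.removeChild(coll[0])    ; always remove the current first element
   report .= "length after removal " . A_Index . ": " . coll.length . "`r`n"
}
report .= "remaining text: " . doc.getElementById("s").innerText
MsgBox, % report
```

With coll[A_Index - 1] instead, the index would move forward while the collection shrinks, so some elements would be skipped, which would fit the leftover furigana reported earlier in the thread.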
JawGBoi
Posts: 15
Joined: 05 May 2020, 14:17

Re: Need help web scraping

03 Jun 2020, 07:40

teadrinker wrote:
03 Jun 2020, 07:28
JawGBoi wrote: also get the furigana helper text
You are right, I was inattentive.
Oh, also, do you know where I can learn about using ComObjCreate("WinHttp.WinHttpRequest.5.1") and such? I am struggling to do things on my own when it comes to AHK.
JawGBoi
Posts: 15
Joined: 05 May 2020, 14:17

Re: Need help web scraping

07 Jun 2020, 10:29

teadrinker wrote:
03 Jun 2020, 07:54
WinHttpRequest object
COM Object Reference
Also look for examples on this forum.
I am confused about why this code isn't working

Code: Select all

^+2::
AllText := 

send, ^c
sleep, 200
WordLookup := clipboard

url := "https://jisho.org/search/" . clipboard . "%20%23sentences"
sleep, 200

whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
whr.Open("GET", url, false)
whr.Send()
status := whr.status
if (status != 200)
   throw "Failed to download html. Status: " . status
html := whr.responseText

doc := ComObjCreate("htmlfile")
doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=9"">")
doc.write(html)

sentences := doc.querySelectorAll(".sentence_content")
Loop % sentences.length {
   sentence := sentences[A_Index - 1]

   japaneseElem := sentence.querySelector(".japanese_sentence")
   coll := japaneseElem.getElementsByClassName("furigana")
   Loop % coll.length
      coll[0].parentNode.removeChild(coll[0])
	  
   japaneseText := RegExReplace(japaneseElem.innerText, "\R")
   
   englishElem  := sentence.querySelector(".english")
   englishText := englishElem.innerText
   
   if (japaneseText < 10000) {
    AllText .= (A_Index = 1 ? "" : "`r`n`r`n") . japaneseText . englishText
   }
	
   }
   
clipboard := AllText
Specifically this part:

Code: Select all

if (japaneseText < 10000) {
    AllText .= (A_Index = 1 ? "" : "`r`n`r`n") . japaneseText . englishText
   }
I made it "< 10000" because I wanted to see it work, but no matter what expression I use it won't output anything to my clipboard. I am trying to only output short sentences (maybe sentences with less than 30 characters). Did I use the wrong variable?
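One possible explanation, offered as a sketch rather than a confirmed diagnosis: japaneseText holds text, and AutoHotkey v1 compares two values as strings when either of them is not numeric, so japaneseText < 10000 does not measure length at all. StrLen() returns the number of characters and can be compared with a limit instead. The sample sentences, the samples array and the maxLen variable below are made up for the demonstration.

```autohotkey
; Minimal sketch (AutoHotkey v1.1 Unicode). Save the file as UTF-8 with BOM
; so the Japanese literals are read correctly.
maxLen := 30    ; hypothetical limit, from "less than 30 characters" in the question
AllText := ""
samples := ["短い文。", "彼女と友達になれた時はそれはうれしかったですよ。"]   ; made-up test data

for index, japaneseText in samples {
   if (StrLen(japaneseText) < maxLen)       ; compare the character count, not the text itself
      AllText .= (AllText = "" ? "" : "`r`n`r`n") . japaneseText
}
MsgBox, % Clipboard := AllText
```

The separator test uses AllText = "" rather than A_Index = 1 because the first sentence on the page may be one of those that gets skipped.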
