Trying to extract text from webpage, then parsing the text.

Barnaby Ray
Posts: 45
Joined: 09 Nov 2021, 07:48

Trying to extract text from webpage, then parsing the text.

10 May 2022, 10:42

Hi, please could someone point me in the right direction?

I have a web page in chrome with a bunch of audio tracks. Each one has several pieces of information I need to copy and store as variables to be pasted back into a separate page.

So far, I have the first bit working. You select one of the pieces of text for one of the audio tracks; in this case I'm using the ISRC code. When I run the script, it copies the ISRC code and stores it as a variable (%isrc%). Then it selects all the text on the page, copies it to the clipboard, and stores this as another variable (%page%).

So far so good..

Now I want to parse the variable (%page%) to find the ISRC code, then from there I want to get the line which is 8 lines above the ISRC code and store that as a variable (%TrackNumber%), and also the line which is 8 lines below the ISRC code and save that as another variable (%TimeStamp%). Whichever ISRC I select, the track number is always 8 lines above the ISRC, and the timestamp is always 8 lines after it.

I have it working so it searches the variable (%page%), and finds the correct ISRC code, but I'm not sure how to then get it to find the text 8 lines above and below it and to store them as variables? This is the code I have so far..

Code:

^+y::
	Clipboard := ""
	WinActivate, ahk_exe chrome.exe

	Send, ^c
	ClipWait, 2
	isrc := Clipboard	; the selected ISRC code

	MsgBox ISRC: %isrc%

	Clipboard := ""
	WinActivate, ahk_exe chrome.exe

	Sleep, 200
	Send, {CTRLDOWN}ac{CTRLUP}	; select all, then copy
	ClipWait, 2
	page := Clipboard	; the full page text

	Loop, Parse, page, `n, `r	; split on line feeds, dropping carriage returns
	{
		line := A_LoopField
		if line contains %isrc% ; now the right line has been found, parse that line
		{
			StringGetPos, OutputVar, line, %A_Tab%, R ; find the first tab from the right
			OutputVar += 1	; to get rid of the A_Tab
			StringTrimLeft, CODE, line, OutputVar ; trims everything but the isrc code
			break ; we don't need to parse anything else
		}
	}
	MsgBox Found: "%CODE%"
return

ESC::ExitApp

This is how the info for each track is laid out after I've copied all the text from the page:

01 ; this is the track number I want to save as variable %TrackNumber% (8 lines above the ISRC including blank lines)
Song Title ; I don't need this
Artist Name ; I don't need this

Spotify link ; I don't need this

Primary ; I don't need this

GX4R52290977 ; this is the ISRC code
Good



View Luna Audio Report
Not Explicit

00:03:28 ; This is the time stamp I want to save as variable %TimeStamp% (8 lines below the ISRC including blank lines)

Edit - Upon further tests I've found that the other bits I need are not always 8 lines above or below; it's sometimes 7 or 9, so I may need to rethink.


Any help is very much appreciated, thanks in advance!
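
For reference, the fixed-offset lookup described in this post could be sketched roughly as follows in AutoHotkey v1. LineNear is a hypothetical helper name (not a built-in and not from this thread), and the sample text and the offsets of 3 are made up for brevity; the post describes offsets of 8. As the edit above notes, the offsets vary between tracks, so this only helps when the layout is constant.

```autohotkey
; Hypothetical helper: returns the line that is `offset` lines away from the
; first line containing `needle`, or "" if there is no such line.
LineNear(text, needle, offset) {
    lines := StrSplit(text, "`n", "`r")    ; split on LF, strip CR from each line
    for index, line in lines {
        if (InStr(line, needle)) {
            target := index + offset
            if (target >= 1 && target <= lines.MaxIndex())
                return Trim(lines[target])
            return ""
        }
    }
    return ""
}

; Made-up sample: track number 3 lines above the ISRC, time 3 lines below.
page := "01`r`nSong Title`r`n`r`nGX4R52290977`r`nGood`r`n`r`n00:03:28"
TrackNumber := LineNear(page, "GX4R52290977", -3)
TimeStamp   := LineNear(page, "GX4R52290977", 3)
MsgBox % "Track: " TrackNumber "`nTime: " TimeStamp
```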
teadrinker
Posts: 4412
Joined: 29 Mar 2015, 09:41

Re: Trying to extract text from webpage, then parsing the text.

10 May 2022, 16:49

If you could provide a link to the page, someone might be able to help.
Barnaby Ray
Posts: 45
Joined: 09 Nov 2021, 07:48

Re: Trying to extract text from webpage, then parsing the text.

11 May 2022, 08:47

@teadrinker Thanks for your reply,

Unfortunately I'm unable to share the page as it's a work thing I'm trying to figure out, and the site can only be accessed on a private server.

I can try to explain a bit clearer what I'm trying to achieve.. I'm trying a slightly different approach which is getting close, but there are still some problems.

I load up a page which is a collection of audio tracks - it could be just a single with 1 track, or an album with multiple tracks.

Each track is contained within a box in the window which includes information like track title, artist name, spotify link (if there is one), spaces for featured artists etc and an ISRC code. There is also an audio player to listen to the track, and a display showing the current time you are listening to.

If I find a problem with the audio, I need to write a message which includes the track number, the ISRC code, and the current time in the audio where the problem is.

Unfortunately I can't get to all the information I need using the tab button or shift up and down etc., so I need to find a way to extract all the data for each track, then save the bits I need as variables which I can use when putting the message together.

I can select the track number, which is the first item in the box for each track, then hold shift and select the current time display, which is the last item in the box. Then I copy all that data to the clipboard and save the clipboard to a new variable (%page%).

The data in %page% looks like this:

Code:

03 ;Track Number - I want this stored as %TrackNum%
track title
artist name
✎
https://open.spotify.com ; Spotify link (not all tracks have this)
Primary  ;Artist Role
                         ; blank line
GX4R52290979          ;ISRC code, I want to store this as %code%
Good                  
              ;blank line
              ;blank line
              ;blank line
View Luna Audio Report
Not Explicit
              ;blank line
00:02:41
00:00:24                           ;This is the current time I want to store as %Ctime%
In most cases it comes out like above - track number is line 1, ISRC is line 9, and current time is line 18.
So in most cases this code works:

Code:

^+y::
	Clipboard := ""
	WinActivate, ahk_exe chrome.exe

	Send, ^c
	ClipWait, 2
	Data := Clipboard	; all the data for a single track, copied from the selection

	arr := StrSplit(Data, "`n", "`r")
	for each, line in arr
	{
		if (A_Index = 1)
			TrackNum := line
		else if (A_Index = 9)
			code := line
		else if (A_Index = 18)
			Ctime := line
	}

	MsgBox Track Number: %TrackNum%`nISRC Code: %code%`nCurrent Time: %Ctime%
return

ESC::ExitApp

The problem is that some tracks don't have a Spotify link, so in these cases the ISRC code becomes line 8, and the track time becomes line 17. For this I'm wondering if I need to try and remove all lines which have a Spotify link? Then the ISRC would always be on line 8?

Similarly, the current time is always on the last line in the variable, but it is not always line 18. If a track has multiple featured artists, the current time line gets pushed down, but it's always the last line. Is there a way to identify the last line of a variable and save that to a new variable?

I hope this all makes sense!


So in short - is there a way to go through my variable %Data%, save the first line to variable %TrackNum%, save the last line to variable %Ctime%, and remove lines which include the word spotify, so that the ISRC code is always on line 8?
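
What is being asked for here could be sketched roughly like this in AutoHotkey v1. ParseTrack is a hypothetical helper name, and the ISRC shape (5 word characters followed by 7 digits) is an assumption based on the sample codes shown in this thread. Rather than deleting the Spotify line to pin the ISRC to line 8, the sketch takes the first line that has that shape, and takes the last non-blank line as the current time.

```autohotkey
; Hypothetical helper: pulls the track number (first line), the ISRC (first
; line shaped like one) and the current time (last non-blank line) from the text.
ParseTrack(data) {
    lines := StrSplit(data, "`n", "`r")    ; split on LF, strip CR from each line
    result := {TrackNum: Trim(lines[1]), Code: "", Ctime: ""}
    for index, line in lines {
        line := Trim(line)
        if (line = "")
            continue                       ; skip blank lines, including a trailing one
        ; Assumed ISRC shape: 5 word characters followed by 7 digits.
        if (result.Code = "" && RegExMatch(line, "^\w{5}\d{7}$"))
            result.Code := line
        result.Ctime := line               ; ends up holding the last non-blank line
    }
    return result
}

; Made-up sample in the same layout as a track without a Spotify link.
sample := "01`r`nCREEP - Single`r`nAlivia Valdez`r`n`r`nPrimary`r`n`r`nGX4R52291105`r`nGood`r`n`r`n00:02:30`r`n00:00:14`r`n"
track := ParseTrack(sample)
MsgBox % "Track " track.TrackNum " (" track.Code ") at " track.Ctime
```

Because nothing depends on a fixed line number, extra artist lines or a missing Spotify link should not matter, as long as no other line happens to look like an ISRC.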
teadrinker
Posts: 4412
Joined: 29 Mar 2015, 09:41

Re: Trying to extract text from webpage, then parsing the text.

11 May 2022, 09:26

Try this:

Code:

data =
(
03 ;Track Number - I want this stored as %TrackNum%
track title
artist name
✎
https://open.spotify.com ; Spotify link (not all tracks have this)
Primary  ;Artist Role

GX4R52290979          ;ISRC code, I want to store this as %code%
Good                  
              ;blank line
              ;blank line
              ;blank line
View Luna Audio Report
Not Explicit
              ;blank line
00:02:41
00:00:24                           ;This is the current time I want to store as %Ctime%
)

RegExMatch(data, "^(\d+).+?\R\R(\S+).+\R([\d:]+)", m)
TrackNum := m1
code := m2
Ctime := m3

MsgBox, % TrackNum . "`n"
        . code     . "`n"
        . Ctime
Barnaby Ray
Posts: 45
Joined: 09 Nov 2021, 07:48

Re: Trying to extract text from webpage, then parsing the text.

11 May 2022, 09:53

Thanks, but unfortunately that's just showing an empty message box.. am I doing something wrong?

Code:

^+y::
	Clipboard := ""
	WinActivate, ahk_exe chrome.exe

	Send, ^c
	ClipWait, 2
	Data := Clipboard

	RegExMatch(Data, "^(\d+).+?\R\R(\S+).+\R([\d:]+)", m)
	TrackNum := m1
	code := m2
	Ctime := m3

	MsgBox, % TrackNum . "`n"
	        . code     . "`n"
	        . Ctime
return

ESC::ExitApp
teadrinker
Posts: 4412
Joined: 29 Mar 2015, 09:41

Re: Trying to extract text from webpage, then parsing the text.

11 May 2022, 10:01

Please provide the exact contents of the clipboard.
Barnaby Ray
Posts: 45
Joined: 09 Nov 2021, 07:48

Re: Trying to extract text from webpage, then parsing the text.

11 May 2022, 10:20

Thank you for taking the time to help!

The contents of the clipboard will be different depending on the track. Most come out like this for a normal track with a Spotify link:

Code:

01
Tú Sonrisa
Tinga Rodriguez
✎
https://open.spotify.com/artist/7gYJpQVWY5nnzgGFnPNBKa

Primary

GX4R52291142
Good



View Luna Audio Report
Not Explicit

00:02:34
00:00:23
For tracks without Spotify links they come out like this:

Code:

01
CREEP - Single
Alivia Valdez
✎

Primary

GX4R52291105
Good



View Luna Audio Report
Not Explicit

00:02:30
00:00:14
Or some tracks have more than one artist, so there are more lines, like this:

Code:

03
Krieger aus den Straßen
Nash
✎

Primary

GX4R52291124
Good



View Luna Audio Report
Explicit
3
Nash 
✎

Primary 

00:04:02
00:01:57

There could be many different versions depending on how many artists there are etc., but the track number is always the first line, the current time is always the last line, and if I could remove lines that contain a Spotify link, then the ISRC would always be line 8.
teadrinker
Posts: 4412
Joined: 29 Mar 2015, 09:41

Re: Trying to extract text from webpage, then parsing the text.

11 May 2022, 10:38

This expression

Code:

RegExMatch(data, "(\d+)(?:.+?\R\R){2}(\S+).+\R([\d:]+)", m)
works with all three options you provided:

Code:

data1 =
(
01
Tú Sonrisa
Tinga Rodriguez
✎
https://open.spotify.com/artist/7gYJpQVWY5nnzgGFnPNBKa

Primary

GX4R52291142
Good



View Luna Audio Report
Not Explicit

00:02:34
00:00:23
)

data2 =
(
01
CREEP - Single
Alivia Valdez
✎

Primary

GX4R52291105
Good



View Luna Audio Report
Not Explicit

00:02:30
00:00:14
)

data3 =
(
03
Krieger aus den Straßen
Nash
✎

Primary

GX4R52291124
Good



View Luna Audio Report
Explicit
3
Nash 
✎

Primary 

00:04:02
00:01:57
)

Loop 3 {
   RegExMatch(data%A_Index%, "(\d+)(?:.+?\R\R){2}(\S+).+\R([\d:]+)", m)
   TrackNum := m1
   code := m2
   Ctime := m3

   MsgBox, % TrackNum . "`n"
           . code     . "`n"
           . Ctime
}
Descolada
Posts: 1202
Joined: 23 Dec 2021, 02:30

Re: Trying to extract text from webpage, then parsing the text.

11 May 2022, 11:08

Hello,
In your example code, try replacing this part:

Code:

arr := StrSplit(Data, "`n", "`r")
for each, line in arr
{
	if (A_Index = 1)
		TrackNum := line
	else if (A_Index = 9)
		code := line
	else if (A_Index = 18)
		Ctime := line
}

MsgBox Track Number: %TrackNum%`nISRC Code: %code%`nCurrent Time: %Ctime%
with this:

Code:

arr := StrSplit(Data, "`n", "`r")
TrackNum := arr[1]
RegexMatch(Data, "\b\w{5}\d{7}\b", code)
Ctime := arr[arr.MaxIndex()]
		
MsgBox Track Number: %TrackNum%`nISRC Code: %code%`nCurrent Time: %Ctime%
Barnaby Ray
Posts: 45
Joined: 09 Nov 2021, 07:48

Re: Trying to extract text from webpage, then parsing the text.

11 May 2022, 11:16

Sorry, I'm sure it's me doing something stupid. Am I not adding your code into my script correctly? When I run it I just get an empty message box.

Code:

^+y::

	Clipboard := ""
	winactivate, ahk_exe chrome.exe
	
	Send, ^c
	ClipWait, 2	
	data := clipboard

	RegExMatch(data%A_Index%, "(\d+)(?:.+?\R\R){2}(\S+).+\R([\d:]+)", m)
	
   TrackNum := m1
   code := m2
   Ctime := m3

   MsgBox, % TrackNum . "`n"
           . code     . "`n"
           . Ctime

return
I got a little further with my previous attempt, but your version looks much better if I can get that to work. I was able to remove lines 2-7 and also remove blank lines, so that the ISRC is now always the 2nd line, but I still can't figure out how to store the last line as a variable. As I said, your version looks much better if I can figure out how to use it!

This is where I got to...

Code:

^+y::

	Clipboard := ""
	winactivate, ahk_exe chrome.exe
	
	Send, ^c
	ClipWait, 2	
	Data := clipboard
	
	Output := LineDelete(Data, 2, 7) ; Deletes lines 2 through 7.

Loop
{
    StringReplace, Output, Output, `r`n`r`n, `r`n, UseErrorLevel
    if (ErrorLevel = 0)  ; No more replacements needed.
        break
}

arr := StrSplit(Output, "`n", "`r")
for each, line in arr
{
	if (A_Index = 1)
		TrackNum := line
	else if (A_Index = 2)
		code := line
	else if (A_Index = 7)
		Ctime := line
}

MsgBox Track Number: %TrackNum%`nISRC Code: %code%`nCurrent Time: %Ctime%`n`nTrack %TrackNum% (%code%) contains a speech sample at %Ctime%.
return

LineDelete(Str, Pos, EndPos := "", Opts := "", ByRef Match := ""){
	If (Pos = "" || Pos = 0 || Pos * 1 = "")
		Return Str
	LF := (inStr(Str, "`r`n")) ? ("`r`n")
		: (inStr(Str, "`n")) ? ("`n") : ("")
	L := StrLen(LF)
	Str := LF . StrReplace(Str, LF, LF, NumLines) . LF
	NumLines := (SubStr(Str, 0) != LF) ? (++NumLines) : (++NumLines)
	Pos := (Pos < 0 && Pos >= -NumLines) ? (Pos + NumLines + 1)
		 : (Pos < -NumLines || Pos > NumLines) ? ("") : (Pos)
	EndPos := (Pos = "" || EndPos * 1 = "" || EndPos = 0
			   || EndPos < -NumLines || EndPos > NumLines) ? ("")
			: (EndPos < 0) ? (EndPos + NumLines + 1) : (EndPos)
	(EndPos != "" && EndPos < Pos)
			 ? (Tmp := Pos, Pos := EndPos, EndPos := Tmp, Tmp := "")
	Opts := ((Opts != "B" && Opts != "") || (Pos = EndPos && Opts = "B")
			 || (EndPos = "" && Opts = "B")) ? ("") : (Opts)
	StartMatch := (Opts = "") ? (InStr(Str, LF,,, Pos) + L)
				: (InStr(Str, LF,,, Pos + 1) + L)
	EndMatch := (EndPos = "" && Opts = "") ? (InStr(Str, LF,,, Pos + 1))
			  : (EndPos != "" && Opts != "") ? (InStr(Str, LF,,, EndPos))
			  : (InStr(Str, LF,,, EndPos + 1))
	Match := ((EndPos - Pos = 1 && Opts = "B")
			  || Pos = "") ? ("")
		   : (SubStr(Str, StartMatch, EndMatch - StartMatch))
	Str := SubStr(SubStr(Str, 1, StartMatch - 1)
	     . SubStr(Str, EndMatch + L), L + 1, -L)
	Return Str
}
Barnaby Ray
Posts: 45
Joined: 09 Nov 2021, 07:48

Re: Trying to extract text from webpage, then parsing the text.

11 May 2022, 11:21

Ah thank you @Descolada, your code works perfectly! :) Thank you both for your help!
teadrinker
Posts: 4412
Joined: 29 Mar 2015, 09:41

Re: Trying to extract text from webpage, then parsing the text.

11 May 2022, 11:31

Barnaby Ray wrote:

Code:

RegExMatch(data%A_Index%, "(\d+)(?:.+?\R\R){2}(\S+).+\R([\d:]+)", m)
Not data%A_Index% but just data. The data%A_Index% was used in the loop to iterate all of the data1, data2, data3.
Perhaps this expression

Code:

RegExMatch(data, "(\d+).*?(\b\w{5}\d{7}\b).+\R([\d:]+)", m)
would work more reliably (as @Descolada suggested), but only if the code structure is constant.
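
Put back into the hotkey, that could look roughly like the sketch below (AutoHotkey v1). Two things in it are assumptions and not part of the expression above: the s) (DotAll) option, added so that "." can span line breaks whatever line endings the clipboard text uses, and the ErrorLevel check after ClipWait.

```autohotkey
^+y::
    Clipboard := ""
    WinActivate, ahk_exe chrome.exe
    Send, ^c
    ClipWait, 2
    if (ErrorLevel)    ; ClipWait timed out, so nothing was copied
    {
        MsgBox, Nothing was copied.
        return
    }
    data := Clipboard  ; plain "data", not data%A_Index%

    ; "s)" is an addition (assumption): it lets "." match newline characters too.
    if (RegExMatch(data, "s)(\d+).*?(\b\w{5}\d{7}\b).+\R([\d:]+)", m))
        MsgBox, % "Track " m1 " (" m2 ") at " m3
    else
        MsgBox, No match. Check the copied text.
return
```

Checking the return value of RegExMatch makes a failed match visible instead of showing an empty message box.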
tank
Posts: 3130
Joined: 28 Sep 2013, 22:15
Location: CarrolltonTX

Re: Trying to extract text from webpage, then parsing the text.

11 May 2022, 14:15

(?:(?<TrackNum>\d{1,2})(?:[\D].*?\n)+?(?<Code>\w{12})(?:.*?\n)+?\d\d\:\d\d\:\d\d(?:.*?\n)+?(?<Ctime>\d\d\:\d\d\:\d\d).*)
$1|$2|$3\n
BoBo
Posts: 6564
Joined: 13 May 2014, 17:15

Re: Trying to extract text from webpage, then parsing the text.

11 May 2022, 14:36

@Barnaby Ray
"I have a web page in chrome with a bunch of audio tracks."
I've used @jeeswg's JEE_AccGetTextAll() function, which returns the content of each displayed webpage element in an active Chrome session/window.
I'd assume that the song list you mentioned can be extracted (easily) from there as well. JFTR.

Usage samples: here or here
