extract parts of internet pages source code

Get help with using AutoHotkey (v1.1 and older) and its commands and hotkeys
Dario91
Posts: 11
Joined: 18 Sep 2020, 02:19

extract parts of internet pages source code

18 Sep 2020, 13:52

Hello,
I'm trying to store in a variable the text between two strings in a web page's source code, as I've seen in another topic.
For example, suppose the page source contains the string "Roses are red"; I want MatchName to be " are ".

My code is:

Code: Select all

S::
Send ^u          ;opens source code page    
WinWait, view-source
Send ^a              ;selects all code in the source page
Send ^c              ;copies the selected code in the clipboard

Send ^w   ;closes source code page          



RegExMatch(clipboard, "Roses(?P<Name>.*?)red", Match) 

MsgBox % MatchName

clipboard:=MatchName

return
[Mod edit: [code][/code] tags added.]


The problem is that it doesn't always work in my case. Sometimes the MsgBox displays the correct text, but the clipboard content is not what the MsgBox displays. Very often I find the whole copied page source in the clipboard, and very often the MsgBox is empty. I don't care much about the MsgBox being correct, but I need the clipboard content to be correct. It also seems to work on some sites and not on others.

Any idea or other method to get a more reliable result? I'm pretty new to coding, so I'd be grateful if you could write the code.
Thanks
teadrinker
Posts: 4326
Joined: 29 Mar 2015, 09:41

Re: extract parts of internet pages source code

18 Sep 2020, 14:20

Hi
Try:

Code: Select all

...
Send ^a
Clipboard := ""
Send ^c
ClipWait, 1
if ErrorLevel {
   MsgBox, Clipboard is empty
   Return
}
Send ^w
...
garry
Posts: 3763
Joined: 22 Dec 2013, 12:50

Re: extract parts of internet pages source code

18 Sep 2020, 14:49

@teadrinker, thank you.
Here is an example: download the URL to a variable and then search between A and B (function from user SKAN).

Code: Select all

;- search text in URL between A and B , example this line :
;<img src="https://i.imgur.com/h3PQNCv.png" class="postimage" alt="Image"></div>
; and then show result :
;----------------------------------------
;https://i.imgur.com/jkT9UtX.jpg
;https://i.imgur.com/6s4mx9x.png
;https://i.imgur.com/fb7p21s.png
;https://i.imgur.com/VOhpOuf.jpg
;--------------------------------------------------------------------------

url:="https://www.autohotkey.com/boards/viewtopic.php?f=17&t=52&start=4240"
f1 :=a_scriptdir . "\FoundText.txt"
;---------
whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
whr.Open("Get",URL)
whr.Send()
whr.WaitForResponse()
H:= whr.ResponseText
A= <img src="
B= " class="postimage"
;---------
Loop,parse,H,`n,`r
{
 x:=a_loopfield
 if x=
   continue
 if x contains imgur
 {
 d:=xStr(x,,A,B)           ;- between A and B
 if d=
   continue
 e .=  d . "`r`n"
 }
}
;msgbox,%e%
;return
;---------
if e<>
{
ifnotexist,%f1%
  fileappend,%e%,%f1%
try
run,%f1%
}
e:=""
exitapp
esc::exitapp
;----------------------------------------------------------------
;- SKAN  xStr  for general text extraction and parsing XML  HTML 
;- https://www.autohotkey.com/boards/viewtopic.php?f=6&t=74050
xStr(ByRef H, C:=0, B:="", E:="",ByRef BO:=1, EO:="", BI:=1, EI:=1, BT:="", ET:="") {                           
Local L, LB, LE, P1, P2, Q, N:="", F:=0                 ; xStr v0.97 by SKAN on D1AL/D343 @ tiny.cc/xstr  
Return SubStr(H,!(ErrorLevel:=!((P1:=(L:=StrLen(H))?(LB:=StrLen(B))?(F:=InStr(H,B,C&1,BO,BI))?F+(BT=N?LB
:BT):0:(Q:=(BO=1&&BT>0?BT+1:BO>0?BO:L+BO))>1?Q:1:0)&&(P2:=P1?(LE:=StrLen(E))?(F:=InStr(H,E,C>>1,EO=N?(F
?F+LB:P1):EO,EI))?F+LE-(ET=N?LE:ET):0:EO=N?(ET>0?L-ET+1:L+1):P1+EO:0)>=P1))?P1:L+1,(BO:=Min(P2,L+1))-P1)  
}
;----------------------------------------------------------------
;==================================================
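For the original "Roses ... red" style of question, the same download step can be paired with the RegExMatch pattern from the first post, which avoids the clipboard entirely. This is only a minimal sketch: the URL and the title tags used as delimiters are placeholders, and a page that needs a login or is built by JavaScript may return different HTML than the browser shows.

```autohotkey
; Sketch (AutoHotkey v1): download a page and extract the text between two strings.
; The URL and the <title> delimiters are placeholders chosen for illustration.
url := "https://www.autohotkey.com/boards/"

try {
    whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
    whr.Open("GET", url, false)        ; false = synchronous request
    whr.Send()
    html := whr.ResponseText
} catch e {
    MsgBox % "Request failed: " . e.Message
    ExitApp
}

; "is)" = case-insensitive, and "." also matches newlines.
if RegExMatch(html, "is)<title>(?P<Name>.*?)</title>", Match)
    Clipboard := MatchName             ; store only the extracted part
else
    MsgBox, No match found
ExitApp
```

Replacing the two delimiters in the pattern with the strings that surround the wanted text should be enough to adapt it; characters that are special in regular expressions would need escaping.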
Dario91
Posts: 11
Joined: 18 Sep 2020, 02:19

Re: extract parts of internet pages source code

19 Sep 2020, 01:15

teadrinker wrote:
18 Sep 2020, 14:20
Hi
Try:

Code: Select all

...
Send ^a
Clipboard := ""
Send ^c
ClipWait, 1
if ErrorLevel {
   MsgBox, Clipboard is empty
   Return
}
Send ^w
...
I tried implementing this but it still has issues. It seems to work 50% of the time; one message out of two is "Clipboard is empty".
Dario91
Posts: 11
Joined: 18 Sep 2020, 02:19

Re: extract parts of internet pages source code

19 Sep 2020, 01:54

garry wrote:
18 Sep 2020, 14:49
@teadrinker , thank you
here an example , download URL to variable and then search between A and B ( function from user SKAN )

Sorry, but I don't understand this code... From what I could understand, H is the string that should contain the whole page source, and A and B are the two strings between which the text I'm looking for should be found. I tried setting H:="violets are blue", A="violets", B=blue, but got no results.
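One possible reason the test above returned nothing: in AutoHotkey v1 the legacy = operator stores everything to its right literally, so A="violets" would keep the quote marks as part of the value, while := evaluates an expression. The sketch below only illustrates that difference; it does not use xStr, and the variable names are for illustration.

```autohotkey
; AutoHotkey v1: legacy "=" versus expression ":=".

; Legacy assignment: the quote marks become part of the value.
A = "violets"

; Expression assignment: the quote marks only delimit the string.
B := "violets"

; Shows "9 vs 7": A has two extra characters (the quotes).
MsgBox % StrLen(A) . " vs " . StrLen(B)

H := "violets are blue"

; Shows 0: H contains no quote characters, so A is never found.
MsgBox % InStr(H, A)

; Shows 1: B is found at the first character of H.
MsgBox % InStr(H, B)
```

Writing either A=violets or A:="violets" gives the delimiter without quotes.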
garry
Posts: 3763
Joined: 22 Dec 2013, 12:50

Re: extract parts of internet pages source code

19 Sep 2020, 06:52

Try with a short example, and use some MsgBox calls to see what happens.

Code: Select all

h=
(Ltrim join`r`n
aaaaaaaaaaaaaa 111111111111 violets
sssssviolets are blue fffffffffffffff
cccccccccccccc 222222222222 blue
yyyyy violets are bluejjjjjjjjjjjj
eeeeeeeeeeeee xxxxxxxxxxxxxx
)
a=violets
b=blue

Loop,parse,H,`n,`r
 {
 x:=a_loopfield
 d:=xStr(x,,A,B)
 if d=
   continue
 e .=  d . "`r`n"
 }
msgbox,%e%
e=
exitapp
;----------------------------------------------------------------
;- SKAN  xStr  for general text extraction and parsing XML  HTML 
;- https://www.autohotkey.com/boards/viewtopic.php?f=6&t=74050
xStr(ByRef H, C:=0, B:="", E:="",ByRef BO:=1, EO:="", BI:=1, EI:=1, BT:="", ET:="") {                           
Local L, LB, LE, P1, P2, Q, N:="", F:=0                 ; xStr v0.97 by SKAN on D1AL/D343 @ tiny.cc/xstr  
Return SubStr(H,!(ErrorLevel:=!((P1:=(L:=StrLen(H))?(LB:=StrLen(B))?(F:=InStr(H,B,C&1,BO,BI))?F+(BT=N?LB
:BT):0:(Q:=(BO=1&&BT>0?BT+1:BO>0?BO:L+BO))>1?Q:1:0)&&(P2:=P1?(LE:=StrLen(E))?(F:=InStr(H,E,C>>1,EO=N?(F
?F+LB:P1):EO,EI))?F+LE-(ET=N?LE:ET):0:EO=N?(ET>0?L-ET+1:L+1):P1+EO:0)>=P1))?P1:L+1,(BO:=Min(P2,L+1))-P1)  
}
;----------------------------------------------------------------
;==================================================
teadrinker
Posts: 4326
Joined: 29 Mar 2015, 09:41

Re: extract parts of internet pages source code

19 Sep 2020, 07:59

Dario91 wrote: Seems to work 50 % of the times. 1 msg out of 2 is Clipboard is empty
Try adding sleep:

Code: Select all

...
Send ^a
Sleep, 100
Clipboard := ""
Sleep, 50
Send ^c
ClipWait, 1
if ErrorLevel {
   MsgBox, Clipboard is empty
   Return
}
Send ^w
... 
Dario91
Posts: 11
Joined: 18 Sep 2020, 02:19

Re: extract parts of internet pages source code

19 Sep 2020, 15:33

teadrinker wrote:
19 Sep 2020, 07:59
Dario91 wrote: Seems to work 50 % of the times. 1 msg out of 2 is Clipboard is empty
Try adding sleep:

Code: Select all

...
Send ^a
Sleep, 100
Clipboard := ""
Sleep, 50
Send ^c
ClipWait, 1
if ErrorLevel {
   MsgBox, Clipboard is empty
   Return
}
Send ^w
...
Thanks for the help; I just tested it about ten times. It doesn't work 100% of the time, but it seems to work more often than before. Does it always work for you?
mikeyww
Posts: 26883
Joined: 09 Sep 2014, 18:38

Re: extract parts of internet pages source code

19 Sep 2020, 16:31

I've run some other tests with ClipWait. I found that if I have a lot of clips in a row, it misses some of them without an intervening Sleep. I do use a separate clipboard manager, so I'm not sure whether the issue is related to that.
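Since a single copy attempt seems to fail intermittently in this thread, one option is to retry the copy a few times before giving up. The following is a minimal sketch under that assumption; the function name CopySelection, its parameters, and the F8 hotkey are hypothetical and not taken from any post here.

```autohotkey
; Sketch (AutoHotkey v1): retry Ctrl+C until the clipboard receives data.
; F8 is an arbitrary hotkey chosen for this example.
F8::
    text := CopySelection()
    MsgBox % (text = "") ? "Copy failed" : (StrLen(text) . " characters copied")
return

; Hypothetical helper: returns the copied text, or "" if every attempt timed out.
CopySelection(attempts := 3, timeout := 1) {
    Loop % attempts {
        Clipboard := ""            ; empty it so ClipWait can detect new data
        Send ^c
        ClipWait, %timeout%        ; waits up to <timeout> seconds
        if !ErrorLevel             ; ErrorLevel is 0 when data arrived
            return Clipboard
        Sleep, 100                 ; short pause before the next attempt
    }
    return ""
}
```

The pause between attempts is a guess; a clipboard manager running in the background may need a longer one.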
Dario91
Posts: 11
Joined: 18 Sep 2020, 02:19

Re: extract parts of internet pages source code

20 Sep 2020, 00:46

teadrinker wrote:
19 Sep 2020, 17:44
@Dario91, which browser do you use?
Google Chrome
Dario91
Posts: 11
Joined: 18 Sep 2020, 02:19

Re: extract parts of internet pages source code

20 Sep 2020, 00:49

mikeyww wrote:
19 Sep 2020, 16:31
I've run some other tests with ClipWait. I found that if I have a lot of clips in a row, it misses some of them without an intervening Sleep. I do use a separate clipboard manager, so I'm not sure whether the issue is related to that.
It could be...I don't really know. How do you make a separate clipboard manager?
Thanks
Dario91
Posts: 11
Joined: 18 Sep 2020, 02:19

Re: extract parts of internet pages source code

20 Sep 2020, 01:42

garry wrote:
19 Sep 2020, 06:52
try with a short example , use some msgbox to see what happens

So... I tried it and the result is "are are", so it works with strings written in the script. Then I tried to go further.
I removed the exitapp line.
Then I tried using it on a web page's source code.
So I put:

Code: Select all

P::
Send ^u              
WinWait, view-source
Send ^a
Sleep, 100              
Clipboard := ""
Sleep, 50
Send ^c
ClipWait, 3
if ErrorLevel {
   MsgBox, Clipboard is empty
  Return
}     
Send ^w
h=
(Ltrim join`r`n
clipboard)
...
[Mod edit: [code][/code] tags added.]

But it doesn't seem to work. At this point I think the real issue is the clipboard, as another user suggested... but I don't know.
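A likely cause in the attempt above: text inside a continuation section is taken literally, so h would receive the word clipboard rather than the copied source, and the closing parenthesis is expected at the start of its own line. Assigning the built-in Clipboard variable with := avoids the continuation section altogether. A minimal sketch, where the sample text merely stands in for a copied page source:

```autohotkey
; Sketch (AutoHotkey v1): read the clipboard into a variable, then search it.
; The sample text is a stand-in for the copied view-source content.
Clipboard := ""
Clipboard := "line one`r`nRoses are red`r`nline three"
ClipWait, 1

; Expression assignment copies the current clipboard text into h.
h := Clipboard

; Same named-subpattern regex as in the first post.
if RegExMatch(h, "Roses(?P<Name>.*?)red", Match)
    MsgBox % "[" . MatchName . "]"   ; shows [ are ]
else
    MsgBox, No match
```

The same h variable can be passed to Loop, Parse and xStr exactly as in the short example earlier in the thread.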
Dario91
Posts: 11
Joined: 18 Sep 2020, 02:19

Re: extract parts of internet pages source code

20 Sep 2020, 01:56

garry wrote:
19 Sep 2020, 06:52
try with a short example , use some msgbox to see what happens

Maybe I managed to do it: I put %clipboard% inside the h=( ... ) section. I still need to test it more because I have to go to work.
teadrinker
Posts: 4326
Joined: 29 Mar 2015, 09:41

Re: extract parts of internet pages source code

20 Sep 2020, 03:38

Dario91 wrote: google chrome
Try this:

Code: Select all

SetBatchLines, -1
Return

P::
   Clipboard := ""
   Sleep, 50
   RunJsFromChromeAddressBar(GetJS(), "chrome.exe")
   ClipWait, 2
   MsgBox, % ErrorLevel ? "Failed to get value" : Clipboard
   Return
   
RunJsFromChromeAddressBar(js, exe := "chrome.exe") {
   static WM_GETOBJECT := 0x3D
        , ROLE_SYSTEM_TEXT := 0x2A
        , STATE_SYSTEM_FOCUSABLE := 0x100000
        , SELFLAG_TAKEFOCUS := 0x1
        , AccAddrBar
   if !AccAddrBar {
      window := "ahk_class Chrome_WidgetWin_1 ahk_exe " . exe
      SendMessage, WM_GETOBJECT, 0, 1, Chrome_RenderWidgetHostHWND1, % window
      AccChrome := AccObjectFromWindow( WinExist(window) )
      AccAddrBar := SearchElement(AccChrome, {Role: ROLE_SYSTEM_TEXT, State: STATE_SYSTEM_FOCUSABLE})
   }
   AccAddrBar.accValue(0) := "javascript:" . js
   AccAddrBar.accSelect(SELFLAG_TAKEFOCUS, 0)
   ControlSend,, {Enter}, % window, Chrome Legacy Window
}

GetJS() {
   js =
   (LTrim
      javascript:
      (() => {
         if (window.location.protocol === 'https:') {
            document.documentElement.focus();
            const timer = setInterval(() => {
               if (document.hasFocus()) {
                  clearInterval(timer);
                  navigator.clipboard.writeText(document.documentElement.outerHTML);
               }
            }, 10);
         }
         else {
            const textArea = document.createElement('textarea');
            textArea.value = document.documentElement.outerHTML;
            textArea.wrap = 'off';
            textArea.rows = 100000;
            textArea.style.position = 'fixed';
            document.documentElement.appendChild(textArea);
            textArea.focus();
            textArea.select();
            document.execCommand('copy');
            textArea.parentNode.removeChild(textArea);
         }
      })();
   )
   Return js
}

SearchElement(parentElement, params)
{
   found := true
   for k, v in params {
      try {
         if (k = "ChildCount")
            (parentElement.accChildCount != v && found := false)
         else if (k = "State")
            (!(parentElement.accState(0) & v) && found := false)
         else
            (parentElement["acc" . k](0) != v && found := false)
      }
      catch 
         found := false
   } until !found
   if found
      Return parentElement
   
   for k, v in AccChildren(parentElement)
      if obj := SearchElement(v, params)
         Return obj
}

AccObjectFromWindow(hWnd, idObject = 0) {
   static IID_IDispatch   := "{00020400-0000-0000-C000-000000000046}"
        , IID_IAccessible := "{618736E0-3C3D-11CF-810C-00AA00389B71}"
        , OBJID_NATIVEOM  := 0xFFFFFFF0, VT_DISPATCH := 9, F_OWNVALUE := 1
        , h := DllCall("LoadLibrary", "Str", "oleacc", "Ptr")
        
   VarSetCapacity(IID, 16), idObject &= 0xFFFFFFFF
   DllCall("ole32\CLSIDFromString", "Str", idObject = OBJID_NATIVEOM ? IID_IDispatch : IID_IAccessible, "Ptr", &IID)
   if DllCall("oleacc\AccessibleObjectFromWindow", "Ptr", hWnd, "UInt", idObject, "Ptr", &IID, "PtrP", pAcc) = 0
      Return ComObject(VT_DISPATCH, pAcc, F_OWNVALUE)
}

AccChildren(Acc) {
   static VT_DISPATCH := 9
   Loop 1  {
      if ComObjType(Acc, "Name") != "IAccessible"  {
         error := "Invalid IAccessible Object"
         break
      }
      try cChildren := Acc.accChildCount
      catch
         Return ""
      Children := []
      VarSetCapacity(varChildren, cChildren*(8 + A_PtrSize*2), 0)
      res := DllCall("oleacc\AccessibleChildren", "Ptr", ComObjValue(Acc), "Int", 0
                                                , "Int", cChildren, "Ptr", &varChildren, "IntP", cChildren)
      if (res != 0) {
         error := "AccessibleChildren DllCall Failed"
         break
      }
      Loop % cChildren  {
         i := (A_Index - 1)*(A_PtrSize*2 + 8)
         child := NumGet(varChildren, i + 8)
         Children.Push( (b := NumGet(varChildren, i) = VT_DISPATCH) ? AccQuery(child) : child )
         ( b && ObjRelease(child) )
      }
   }
   if error
      ErrorLevel := error
   else
      Return Children.MaxIndex() ? Children : ""
}

AccQuery(Acc) {
   static IAccessible := "{618736e0-3c3d-11cf-810c-00aa00389b71}", VT_DISPATCH := 9, F_OWNVALUE := 1
   try Return ComObject(VT_DISPATCH, ComObjQuery(Acc, IAccessible), F_OWNVALUE)
}
Last edited by teadrinker on 20 Sep 2020, 11:32, edited 1 time in total.
mikeyww
Posts: 26883
Joined: 09 Sep 2014, 18:38

Re: extract parts of internet pages source code

20 Sep 2020, 08:45

I use Ditto as a separate clipboard manager. I found that if it is running, I need an extra wait for sequential clips. The following worked to add an extra Sleep.

Code: Select all

clipIt(text) {
 Clipboard := "", Clipboard := text
 ClipWait, 2
 Process, Exist, Ditto.exe
 Sleep, ErrorLevel ? 110 : 0
}
