Screen Scraping a web page by tank

tank
Posts: 3122
Joined: 28 Sep 2013, 22:15
Location: CarrolltonTX

07 Apr 2019, 12:17

I see this over and over in the world of RPA: data extraction from web pages is by far one of the most common tasks.
I am linking a YouTube video I created that reviews the code, and I am posting the code as well. Please note that iExcel is in its infancy and barely helpful.

Code:

#NoEnv  ; Recommended for performance and compatibility with future AutoHotkey releases.
; #Warn  ; Enable warnings to assist with detecting common errors.
SendMode Input  ; Recommended for new scripts due to its superior speed and reliability.
SetWorkingDir %A_ScriptDir%  ; Ensures a consistent starting directory.

;https://www.barchart.com/stocks/top-100-stocks
#Include, iWebBrowser2.ahk

oIE := new iWebBrowser2
pwb := oIE.oIE_new("https://www.barchart.com/stocks/top-100-stocks")

while pwb.busy
    sleep,100

pDoc := oIE.pDoc(pwb)

while !pTable := getTable(pDoc)
    sleep,100    

#Include, iExcel.ahk
oExcel := new iExcel
oWkbk := oExcel.new()

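; A_Index starts at 1, so rows[0] (normally the header row) is skipped.
; The loop ends when rows[A_Index] returns nothing, i.e. past the last row.
while pRow := pTable.rows[A_Index]
{
    ; column a: text of the first matching link in the row
    oExcel.set(oWkbk, "Sheet1", "a" A_Index, pRow.queryselector("td div:nth-child(1) a").innertext)
    ; column b: text of the third cell, column c: text of the fourth cell
    oExcel.set(oWkbk, "Sheet1", "b" A_Index, pRow.queryselector("td + td + td").innertext)
    oExcel.set(oWkbk, "Sheet1", "c" A_Index, pRow.queryselector("td + td + td + td").innertext)
}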
    
; returns the first table on the page, or false while the page has not rendered one yet
getTable(pDoc){
    try{
        return pDoc.queryselectorall("table")[0]
    }
    catch e
    {
        return false
    }
}
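One thing to watch in the loop above: if a selector does not match anything inside a row, calling .innertext on the empty result throws and the scrape stops at that row. Below is a minimal sketch of a guard. The helper name cellText is hypothetical and is not part of the original script or of either class.

```autohotkey
; Hypothetical helper (not part of the original script): returns the trimmed
; text of the first element under pRow that matches sSelector, or "" if
; nothing matches or the COM call fails.
cellText(pRow, sSelector)
{
    try
    {
        pCell := pRow.querySelector(sSelector)
        if !IsObject(pCell)     ; no element matched the selector
            return ""
        return Trim(pCell.innerText, " `t`r`n")
    }
    catch e
    {
        return ""               ; the COM call itself failed
    }
}
```

With that in place, a line of the loop could read oExcel.set(oWkbk, "Sheet1", "a" A_Index, cellText(pRow, "td div:nth-child(1) a")), so a row with an unexpected layout leaves an empty cell instead of ending the run.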
iWebBrowser2

Code:


class iWebBrowser2 {

    oIE_new(ByRef sURL = "about:blank", ByRef sTitle = "", ByRef iHWND = "", ByRef sHTML = "", bVisible = true)
        {
        this.isInstalled()
        oIE := ComObjCreate("internetexplorer.application")
        oIE.Visible := bVisible
        if sURL
            oIE.Navigate(sURL)
        return oIE
        }

    oIE_get( ByRef sTitle = "", ByRef iHWND = "", ByRef sURL = "", ByRef sHTML = "" )
        {
        this.isInstalled()
        ;~ this function is pointless if no instance of IE is open
        ;~ one edit you might make is to have this function open IE and maybe go to the home page
        if ( !winexist( "ahk_class IEFrame" ) )
            {
            MsgBox, 4112, NO IE Window Found, The Macro will end
            ExitApp
            }
        
        if sTitle
            this.clean_IE_Title( sTitle ) 
        ;; ok this function should look at all the existing IE instances and build a reference object
        ; List all open Explorer and Internet Explorer windows:
        oIE := Object()
        matches := 0
        
        for window,k in ComObjCreate("Shell.Application").Windows
            if ( "Internet Explorer" = window.Name)
                {
                possiblematch := true
                if !window.document
                    continue
                pdoc := this.pDoc(window)
                if ( possiblematch && sTitle && !instr( pdoc.title, sTitle ) )
                    possiblematch := false
                
                if ( possiblematch && sHTML && !instr( pdoc.documentelement.outerhtml, sHTML ) )
                    possiblematch := false
                
                if ( possiblematch && sURL && !instr( pdoc.url, sURL ) )
                    possiblematch := false
                
                if ( possiblematch && iHWND > 0 && window.HWND != iHWND )
                    possiblematch := false		
                    
                if ( possiblematch )
                    {
                    ;~ windowsList .= k " => " ( clipboard := window.FullName ) " :: " pdoc.title " :: " pdoc.url "`n"
                    matches++
                    sTitle := pdoc.title
                    sURL := pdoc.url
                    iHWND := window.HWND
                    sHTML := pdoc.documentelement.outerhtml
                    oIE := window
                    }
                pdoc := ""  ;; drop the COM wrapper (ObjRelease expects a raw pointer, not a wrapper object)
                }
                
        if ( matches > 1 )
            {
            MsgBox, 4112, Too many Matches ,  Please modify your criteria or close some tabs/windows and retry
            ExitApp
            }
            
        return oIE
        }

    isInstalled()
        {
        Static IE_path
        
        ;; find where windows believes IE is installed
        ;; certain corp installs may have this in other than expected folders
        if !IE_path
            {
            RegRead, IE_path, HKLM, SOFTWARE\Microsoft\Windows\CurrentVersion\App Paths\IEXPLORE.EXE
            ;~ MsgBox % IE_path
            ;; Perhaps policies prevent reading this key (ErrorLevel is only meaningful right after RegRead)
            if ( ErrorLevel || !IE_path )
                IE_path := "C:\Program Files\Internet Explorer\iexplore.exe"
            }
        
        ;; make sure it is installed
        if !FileExist( IE_path )
            {
            MsgBox, 4112, Internet Explorer Not Found, IE does not appear to be installed`nCannot continue `nClick OK to Exit!!!
            ExitApp
            }
        } 

    pDoc(oIE)
        {
        return this.IHTMLWindow2_from_IWebDOCUMENT( oIE.document ).document
        }

    clean_IE_Title( ByRef sTitle = "" ) 
        {
        return sTitle := RegExReplace( sTitle ? sTitle : this.active_IE_Title(), this.IE_Suffix() "$", "" )
        }

    IE_Suffix() 
        {
        static sIE_Suffix
        if !sIE_Suffix
            {
            ;; HKEY_CURRENT_USER\Software\Microsoft\Internet Explorer\Main
            RegRead, sIE_Suffix, HKCU, Software\Microsoft\Internet Explorer\Main, Window Title ;, Windows Internet Explorer,
            sIE_Suffix := " - " sIE_Suffix
            }
        return sIE_Suffix
        }

    active_IE_Title() ;; returns the title of the topmost browser if exists from the stack
        {
        sTitle := "NO IE Window Open"
        if winexist( "ahk_class IEFrame" )
            {
            titlematchMode := A_TitleMatchMode
            titlematchSpeed := A_TitleMatchModeSpeed
            SetTitleMatchMode, 2	
            SetTitleMatchMode, Slow
            WinGetTitle, sTitle, % this.IE_Suffix() " ahk_class IEFrame"  ;; sIE_Suffix is static inside IE_Suffix() and not visible here
            SetTitleMatchMode, %titlematchMode%	
            SetTitleMatchMode, %titlematchSpeed%
            }
        return RegExReplace( sTitle, this.IE_Suffix() "$", "" )
        }
        
        
        
    IHTMLWindow2_from_IWebDOCUMENT( IWebDOCUMENT )
        {
        static IID_IHTMLWindow2 := "{332C4427-26CB-11D0-B483-00C04FD90119}"  ; IID_IHTMLWindow2
        return ComObj(9,ComObjQuery( IWebDOCUMENT, IID_IHTMLWindow2, IID_IHTMLWindow2),1)
        }

    IWebDOCUMENT_from_IWebDOCUMENT( IWebDOCUMENT ) ;bypasses certain security issues
        {
        return this.IHTMLWindow2_from_IWebDOCUMENT( IWebDOCUMENT ).document
        }

    IWebBrowserApp_from_IWebDOCUMENT( IWebDOCUMENT )
        {
        static IID_IWebBrowserApp := "{0002DF05-0000-0000-C000-000000000046}"  ; IID_IWebBrowserApp
        return ComObj(9,ComObjQuery( this.IHTMLWindow2_from_IWebDOCUMENT( IWebDOCUMENT ), IID_IWebBrowserApp, IID_IWebBrowserApp),1)
        }

    IWebBrowserApp_from_Internet_Explorer_Server_HWND( hwnd, Svr#=1 ) 
        {               ;// based on ComObjQuery docs
        static msg := DllCall( "RegisterWindowMessage", "str", "WM_HTML_GETOBJECT" )
            , IID_IWebDOCUMENT := "{332C4425-26CB-11D0-B483-00C04FD90119}"
        
        SendMessage msg, 0, 0, Internet Explorer_Server%Svr#%, ahk_id %hwnd%
        
        if (ErrorLevel != "FAIL") 
            {
            lResult := ErrorLevel
            VarSetCapacity( GUID, 16, 0 )
            if DllCall( "ole32\CLSIDFromString", "wstr", IID_IWebDOCUMENT, "ptr", &GUID ) >= 0 
                {
                DllCall( "oleacc\ObjectFromLresult", "ptr", lResult, "ptr", &GUID, "ptr", 0, "ptr*", IWebDOCUMENT )
                return  this.IWebBrowserApp_from_IWebDOCUMENT( IWebDOCUMENT )
                }
            }
        }
}
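The main script only uses oIE_new. For attaching to a page that is already open, oIE_get can be called along these lines. This is a sketch based on reading the class above, not a tested recipe, and the "barchart.com" URL fragment is just an example value.

```autohotkey
#Include, iWebBrowser2.ahk

oIE := new iWebBrowser2

; Search criteria. All four are ByRef: oIE_get() overwrites them with the
; details of the window it matched. "barchart.com" is only an example value.
sTitle := ""
iHWND := ""
sURL := "barchart.com"
sHTML := ""

pwb := oIE.oIE_get(sTitle, iHWND, sURL, sHTML)

; With no match, oIE_get() returns an empty object and leaves iHWND untouched.
if !iHWND
{
    MsgBox, 48, No match, No open IE tab has a URL containing "%sURL%".
    ExitApp
}

pDoc := oIE.pDoc(pwb)
MsgBox % "Attached to: " sTitle "`nTables on page: " pDoc.querySelectorAll("table").length
```

Keep in mind that oIE_get itself calls ExitApp when no IE window exists at all or when more than one tab matches, so the criteria need to be narrow enough to single out one tab.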
iExcel

Code:

class iExcel {
    

    ;; returns workbook object
    New(saveas = false, visible = true){
        obj := ComObjCreate("excel.application")
        obj.visible := visible
        owrkbk := obj.workbooks.add
        if (saveas && this.__validateFilePath(saveas)) {
           obj.DisplayAlerts := False
           owrkbk.saveas(saveas)
           obj.DisplayAlerts := True
        }
        return owrkbk
    }

    ;; get an existing workbook. returns false if a valid file isnt referenced
    GetWorkBook(path = false){
        ControlGet, hwnd, hwnd, , Excel71, ahk_class XLMAIN
        oWrkbk := false
        if hwnd {
            window := this.__ObjectFromWindow(hwnd,-16)
            if (path && this.__validateFilePath(path)){
                if (path == window.parent.FullName){
                    try oWrkbk := window.parent
                }
                else{
                    oWrkbk := ComObjGet(path)
                }
            }
            else{
                oWrkbk := window.parent
            }
        }
        else{
            if (path && this.__validateFilePath(path)){
                oWrkbk := ComObjGet(path)
            } 
        }
        return oWrkbk
    }

    ;; void
    set(oWkbk, sheet = 1, range = "", value = ""){
        oWkbk.sheets(sheet).cells.range(range).value := value
    }

    ;; returns value of range. Could be an array
    get(oWkbk, sheet = 1, range = ""){
        return oWkbk.sheets(sheet).cells.range(range).value
    }

    ;;https://www.codeproject.com/Tips/216238/Regular-Expression-to-Validate-File-Path-and-Exten
    ;; returns true only if path is well formed, has an allowed extension and the file exists
    __validateFilePath(path){
        return RegExMatch(path, "i)^(?:[\w]\:|\\)(\\[a-z_\-\s0-9\.]+)+\.(txt|xls|xlsx|csv)$") && FileExist(path)
    }

    ;***borrowed & tweaked from Acc.ahk Standard Library*** by Sean  Updated by jethrow*****************
    __ObjectFromWindow(hWnd, idObject = -4){ 
        static h := DllCall("LoadLibrary","Str","oleacc","Ptr")  ;; load oleacc.dll once
        If DllCall("oleacc\AccessibleObjectFromWindow","Ptr",hWnd,"UInt",idObject&=0xFFFFFFFF,"Ptr",-VarSetCapacity(IID,16)+NumPut(idObject==0xFFFFFFF0?0x46000000000000C0:0x719B3800AA000C81,NumPut(idObject==0xFFFFFFF0?0x0000000000020400:0x11CF3C3D618736E0,IID,"Int64"),"Int64"), "Ptr*", pacc)=0
            Return	ComObjEnwrap(9,pacc,1)
    }
}
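The scraper only calls New and set. As a rough sketch of the other two methods, the snippet below attaches to a workbook that is already open and reads column A back with get. It assumes Excel is running with a sheet named "Sheet1", and it has not been run against every Excel version.

```autohotkey
#Include, iExcel.ahk

oExcel := new iExcel
oWkbk := oExcel.GetWorkBook()   ; no path given: use the workbook in the open Excel window

if !IsObject(oWkbk)
{
    MsgBox, 48, No workbook, Excel does not appear to be running.
    ExitApp
}

; Walk down column A until the first empty cell and collect the values.
list := ""
Loop
{
    value := oExcel.get(oWkbk, "Sheet1", "a" A_Index)
    if (value = "")
        break
    list .= value "`n"
}
MsgBox % list
```

Run right after the scrape, this should list whatever the loop wrote into column a.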
We are troubled on every side, yet not distressed; we are perplexed,
but not in despair; Persecuted, but not forsaken; cast down, but not destroyed;
Telegram is the best way to reach me
https://t.me/ttnnkkrr
If you have forum suggestions please submit a
Check Out WebWriter
SOTE
Posts: 1426
Joined: 15 Jun 2015, 06:21

Re: Screen Scraping a web page by tank

07 Apr 2019, 16:26

Excellent video and topic!

Also, might I suggest uploading videos to the YouTube channel in 1080p. When the text is small or the viewer has a large screen, 1080p will often look better. It's a trade-off, of course: the higher quality might take a bit longer to upload and cause some buffering or pauses for those with slower Internet connections (though they could select 720p in that case). If a content creator wants to stick with 720p, then they might need to zoom in more.
tank
Posts: 3122
Joined: 28 Sep 2013, 22:15
Location: CarrolltonTX

Re: Screen Scraping a web page by tank

07 Apr 2019, 17:09

Yeah, this was my first go at a video. I am unsatisfied with the video itself but lack the time to improve it much.
burque505
Posts: 1731
Joined: 22 Jan 2017, 19:37

Re: Screen Scraping a web page by tank

07 Apr 2019, 19:05

@tank, just tried the barchart example and it was very smooth. Thank you for sharing it and the two classes.
burque505
Posts: 1731
Joined: 22 Jan 2017, 19:37

Re: Screen Scraping a web page by tank

08 Apr 2019, 16:08

@tank, here's a version WITHOUT Excel, using your iWebBrowser2.ahk and EPPlus.dll. I'm attaching a spreadsheet to show the similarity in output. I will change it as soon as I dig back into my notes to remember how to create a spreadsheet with EPPlus instead of just opening one. It's been about six months, and I can barely remember my own name that long. The DLL is in the zip archive. The script needs a dummy file "EPPlusInFile.xlsx" in the script dir.
EDIT: Version without the need for a pre-existing spreadsheet. :headwall: Both versions create "EPPOut.xlsx"
Attachments: EPPlus.zip (460.19 KiB)

Portwolf
Posts: 161
Joined: 08 Oct 2018, 12:57

Re: Screen Scraping a web page by tank

10 Apr 2019, 08:43

Ohhh... This will be helpful in the long run!!!

GJ Tank! GJ!!
tank
Posts: 3122
Joined: 28 Sep 2013, 22:15
Location: CarrolltonTX

Re: Screen Scraping a web page by tank

14 Apr 2019, 00:37

Finally updated the video with the comparison with Automation Anywhere.
guest3456
Posts: 3454
Joined: 09 Oct 2013, 10:31

Re: Screen Scraping a web page by tank

17 Apr 2019, 07:52

is someone sleeping/snoring in the background of that video?? :D

tank
Posts: 3122
Joined: 28 Sep 2013, 22:15
Location: CarrolltonTX

Re: Screen Scraping a web page by tank

17 Apr 2019, 08:33

My wife :lol:
