Screen Scraping a web page by tank

tank
Posts: 3122
Joined: 28 Sep 2013, 22:15
Location: CarrolltonTX

07 Apr 2019, 12:17

So I see this over and over in the world of RPA: data extraction from web pages is one of the most common tasks by far.
I am linking a YouTube video I created that reviews the code, and I am posting the code as well. Please note that iExcel is in its infancy and is barely useful yet.

Code:

#NoEnv  ; Recommended for performance and compatibility with future AutoHotkey releases.
; #Warn  ; Enable warnings to assist with detecting common errors.
SendMode Input  ; Recommended for new scripts due to its superior speed and reliability.
SetWorkingDir %A_ScriptDir%  ; Ensures a consistent starting directory.

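; Target page: https://www.barchart.com/stocks/top-100-stocks
#Include, iWebBrowser2.ahk

; Launch a new Internet Explorer instance and navigate to the page.
oIE := new iWebBrowser2
pwb := oIE.oIE_new("https://www.barchart.com/stocks/top-100-stocks")

; Wait for navigation to finish.
while pwb.busy
    sleep,100

pDoc := oIE.pDoc(pwb)

; The table may not exist yet when the browser stops being busy, so poll until it does.
while !pTable := getTable(pDoc)
    sleep,100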

#Include, iExcel.ahk
; Start Excel and create a new workbook to receive the data.
oExcel := new iExcel
oWkbk := oExcel.new()

; A_Index starts at 1, so rows[0] (presumably the header row) is skipped.
; The loop ends when rows[A_Index] comes back empty.
while pRow := pTable.rows[A_Index]
{
    oExcel.set(oWkbk, "Sheet1", "a" A_Index, pRow.queryselector("td div:nth-child(1) a").innertext)
    oExcel.set(oWkbk, "Sheet1", "b" A_Index, pRow.queryselector("td + td + td").innertext)
    oExcel.set(oWkbk, "Sheet1", "c" A_Index, pRow.queryselector("td + td + td + td").innertext)
}
    
; Returns the first <table> in the document, or false if the query throws.
getTable(pDoc){
    try{
        return pDoc.queryselectorall("table")[0]
    }
    catch e
    {
        return false
    }
}
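
The two wait loops in the script above never give up if the page fails to load. A minimal sketch of a bounded wait follows, assuming `pwb` is the InternetExplorer.Application COM object returned by `oIE_new`, which exposes the standard `Busy` and `ReadyState` properties. The helper name `waitForLoad` and the 30-second default are hypothetical and not part of iWebBrowser2.ahk.

```autohotkey
#NoEnv
#Include, iWebBrowser2.ahk   ; tank's class from this thread

oIE := new iWebBrowser2
pwb := oIE.oIE_new("https://www.barchart.com/stocks/top-100-stocks")

; Give up with a message instead of looping forever.
if !waitForLoad(pwb)
{
    MsgBox, 4112, Timeout, The page did not finish loading in time
    ExitApp
}
MsgBox % "Loaded: " oIE.pDoc(pwb).title
ExitApp

; Hypothetical helper (not part of iWebBrowser2.ahk).
; Polls until the browser is idle and reports READYSTATE_COMPLETE (4),
; or returns false once timeoutMs milliseconds have elapsed.
waitForLoad(pwb, timeoutMs := 30000) {
    start := A_TickCount
    while (pwb.busy || pwb.readyState != 4) {
        if (A_TickCount - start > timeoutMs)
            return false
        Sleep, 100
    }
    return true
}
```

The same pattern could wrap the `getTable` polling loop, since a page that loads without a table would otherwise keep the script waiting indefinitely.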
iWebBrowser2

Code:


class iWebBrowser2 {

    oIE_new(ByRef sURL = "about:blank", ByRef sTitle = "", ByRef iHWND = "", ByRef sHTML = "", bVisible = true)
        {
        this.isInstalled()
        oIE := ComObjCreate("internetexplorer.application")
        oIE.Visible := bVisible
        if sURL
            oIE.Navigate(sURL)
        return oIE
        }

    oIE_get( ByRef sTitle = "", ByRef iHWND = "", ByRef sURL = "", ByRef sHTML = "" )
        {
        this.isInstalled()
        ;~ this function is pointless if no instance of IE is open
        ;~ one edit you might make is to have this function open IE and maybe go to the home page
        if ( !winexist( "ahk_class IEFrame" ) )
            {
            MsgBox, 4112, NO IE Window Found, The Macro will end
            ExitApp
            }
        
        if sTitle
            this.clean_IE_Title( sTitle ) 
        ;; ok this function should look at all the existing IE instances and build a reference object
        ; List all open Explorer and Internet Explorer windows:
        oIE := Object()
        matches := 0
        
        for window,k in ComObjCreate("Shell.Application").Windows
            if ( "Internet Explorer" = window.Name)
                {
                possiblematch := true
                if !window.document
                    continue
                pdoc := this.pDoc(window)
                if ( possiblematch && sTitle && !instr( pdoc.title, sTitle ) )
                    possiblematch := false
                
                if ( possiblematch && sHTML && !instr( pdoc.documentelement.outerhtml, sHTML ) )
                    possiblematch := false
                
                if ( possiblematch && sURL && !instr( pdoc.url, sURL ) )
                    possiblematch := false
                
                if ( possiblematch && iHWND > 0 && window.HWND != iHWND )
                    possiblematch := false		
                    
                if ( possiblematch )
                    {
                    ;~ windowsList .= k " => " ( clipboard := window.FullName ) " :: " pdoc.title " :: " pdoc.url "`n"
                    matches++
                    sTitle := pdoc.title
                    sURL := pdoc.url
                    iHWND := window.HWND
                    sHTML := pdoc.documentelement.outerhtml
                    oIE := window
                    }
                ObjRelease( pdoc )
                }
                
        if ( matches > 1 )
            {
            MsgBox, 4112, Too many Matches ,  Please modify your criteria or close some tabs/windows and retry
            ExitApp
            }
            
        return oIE
        }

    isInstalled()
        {
        Static IE_path
        
        ;; find where windows believes IE is installed
        ;; certain corp installs may have this in other than expected folders
        if !IE_path
            RegRead, IE_path, HKLM, SOFTWARE\Microsoft\Windows\CurrentVersion\App Paths\IEXPLORE.EXE
        ;~ MsgBox % IE_path
        ;; Perhaps policies prevent reading this key
        if ( ErrorLevel || !IE_path )
            IE_path := "C:\Program Files\Internet Explorer\iexplore.exe"
        
        ;; make sure it installed
        if !FileExist( IE_path )
            {
            MsgBox, 4112, Internet Explorer Not Found, IE does not appear to be installed`nCannot continue `nClick OK to Exit!!!
            ExitApp
            }
        } 

    pDoc(oIE)
        {
        return this.IHTMLWindow2_from_IWebDOCUMENT( oIE.document ).document
        }

    clean_IE_Title( ByRef sTitle = "" ) 
        {
        return sTitle := RegExReplace( sTitle ? sTitle : this.active_IE_Title(), this.IE_Suffix() "$", "" )
        }

    IE_Suffix() 
        {
        static sIE_Suffix
        if !sIE_Suffix
            {
            ;; HKEY_CURRENT_USER\Software\Microsoft\Internet Explorer\Main
            RegRead, sIE_Suffix, HKCU, Software\Microsoft\Internet Explorer\Main, Window Title ;, Windows Internet Explorer,
            sIE_Suffix := " - " sIE_Suffix
            }
        return sIE_Suffix
        }

    active_IE_Title() ;; returns the title of the topmost browser if exists from the stack
        {
        sTitle := "NO IE Window Open"
        if winexist( "ahk_class IEFrame" )
            {
            titlematchMode := A_TitleMatchMode
            titlematchSpeed := A_TitleMatchModeSpeed
            SetTitleMatchMode, 2	
            SetTitleMatchMode, Slow
            ;; sIE_Suffix is static to IE_Suffix() and empty here, so call the method instead
            WinGetTitle, sTitle, % this.IE_Suffix() " ahk_class IEFrame"
            SetTitleMatchMode, %titlematchMode%	
            SetTitleMatchMode, %titlematchSpeed%
            }
        return RegExReplace( sTitle, this.IE_Suffix() "$", "" )
        }
        
        
        
    IHTMLWindow2_from_IWebDOCUMENT( IWebDOCUMENT )
        {
        static IID_IHTMLWindow2 := "{332C4427-26CB-11D0-B483-00C04FD90119}"  ; IID_IHTMLWindow2
        return ComObj(9,ComObjQuery( IWebDOCUMENT, IID_IHTMLWindow2, IID_IHTMLWindow2),1)
        }

    IWebDOCUMENT_from_IWebDOCUMENT( IWebDOCUMENT ) ;bypasses certain security issues
        {
        return this.IHTMLWindow2_from_IWebDOCUMENT( IWebDOCUMENT ).document
        }

    IWebBrowserApp_from_IWebDOCUMENT( IWebDOCUMENT )
        {
        static IID_IWebBrowserApp := "{0002DF05-0000-0000-C000-000000000046}"  ; IID_IWebBrowserApp
        return ComObj(9,ComObjQuery( this.IHTMLWindow2_from_IWebDOCUMENT( IWebDOCUMENT ), IID_IWebBrowserApp, IID_IWebBrowserApp),1)
        }

    IWebBrowserApp_from_Internet_Explorer_Server_HWND( hwnd, Svr#=1 ) 
        {               ;// based on ComObjQuery docs
        static msg := DllCall( "RegisterWindowMessage", "str", "WM_HTML_GETOBJECT" )
            , IID_IWebDOCUMENT := "{332C4425-26CB-11D0-B483-00C04FD90119}"
        
        SendMessage msg, 0, 0, Internet Explorer_Server%Svr#%, ahk_id %hwnd%
        
        if (ErrorLevel != "FAIL") 
            {
            lResult := ErrorLevel
            VarSetCapacity( GUID, 16, 0 )
            if DllCall( "ole32\CLSIDFromString", "wstr", IID_IWebDOCUMENT, "ptr", &GUID ) >= 0 
                {
                DllCall( "oleacc\ObjectFromLresult", "ptr", lResult, "ptr", &GUID, "ptr", 0, "ptr*", IWebDOCUMENT )
                return  this.IWebBrowserApp_from_IWebDOCUMENT( IWebDOCUMENT )
                }
            }
        }
}
iExcel

Code:

class iExcel {
    
    New(visible = true){
        obj := ComObjCreate("excel.application")
        obj.visible := visible
        return obj.workbooks.add
    }

    set(oWkbk, sheet = 1, range = "", value = ""){
        oWkbk.sheets(sheet).cells.range(range).value := value
    }

    get(oWkbk, sheet = 1, range = ""){
        return oWkbk.sheets(sheet).range(range).value
    }
}
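
A minimal round-trip sketch of the class above, assuming Excel is installed and iExcel.ahk sits next to the script. It addresses the sheet by index rather than by the name "Sheet1", because the default sheet name depends on Excel's display language. The cell values are made up for illustration.

```autohotkey
#NoEnv
#Include, iExcel.ahk          ; tank's class from this thread

oExcel := new iExcel
oWkbk := oExcel.new()         ; starts Excel and returns a new workbook

; Sheet index 1 is the first sheet regardless of its localized name.
oExcel.set(oWkbk, 1, "a1", "Symbol")   ; illustrative values only
oExcel.set(oWkbk, 1, "b1", 123.45)

; Read both cells back to confirm the write.
MsgBox % oExcel.get(oWkbk, 1, "a1") " / " oExcel.get(oWkbk, 1, "b1")
```

Since `set` and `get` both default `sheet` to 1, passing the index explicitly here is only for clarity.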
We are troubled on every side, yet not distressed; we are perplexed,
but not in despair; Persecuted, but not forsaken; cast down, but not destroyed;
Telegram is the best way to reach me
https://t.me/ttnnkkrr
SOTE
Posts: 1426
Joined: 15 Jun 2015, 06:21

Re: Screen Scraping a web page by tank

07 Apr 2019, 16:26

Excellent video and topic!

Also, might I suggest uploading the videos on the YouTube channel in 1080p. When the text is small or the viewer has a large screen, 1080p will often look better. It's a trade-off, of course: the higher quality might take a bit more uploading time and cause some buffering or pauses for those with slower Internet connections (though they could select 720p in that case). If a content creator wants to stick with 720p, they might need to zoom in more.
tank

Re: Screen Scraping a web page by tank

07 Apr 2019, 17:09

Yeah, this was my first go at a video. I am unsatisfied with the video itself but lack the time to improve it much.
burque505
Posts: 1731
Joined: 22 Jan 2017, 19:37

Re: Screen Scraping a web page by tank

07 Apr 2019, 19:05

@tank, just tried the barchart example and it was very smooth. Thank you for sharing it and the two classes.
burque505

Re: Screen Scraping a web page by tank

08 Apr 2019, 16:08

@tank, here's a version WITHOUT Excel, using your iWebBrowser2.ahk and EPPlus.dll. I'm attaching a spreadsheet to show the similarity in output. I will change it as soon as I dig back into my notes to remember how to create a spreadsheet with EPPlus instead of just opening one. It's been about six months, and I can barely remember my own name that long. The DLL is in the zip archive. The script needs a dummy file "EPPlusInFile.xlsx" in the script dir.
Spoiler
EDIT: Version without the need for a pre-existing spreadsheet. :headwall: Both versions create "EPPOut.xlsx"
Spoiler
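
Another Excel-free option, separate from the EPPlus approach in the attached archive, is to write the scraped cells to a plain CSV file. This is a minimal sketch under assumptions: the selectors and `getTable` are copied from tank's script, while `csvField` and the output name `top100.csv` are hypothetical.

```autohotkey
#NoEnv
#Include, iWebBrowser2.ahk   ; tank's class from this thread

oIE := new iWebBrowser2
pwb := oIE.oIE_new("https://www.barchart.com/stocks/top-100-stocks")
while pwb.busy
    Sleep, 100
pDoc := oIE.pDoc(pwb)
while !pTable := getTable(pDoc)
    Sleep, 100

; Build the whole file in memory, one CSV line per table row.
csv := ""
while pRow := pTable.rows[A_Index]
{
    a := pRow.querySelector("td div:nth-child(1) a").innerText
    b := pRow.querySelector("td + td + td").innerText
    c := pRow.querySelector("td + td + td + td").innerText
    csv .= csvField(a) "," csvField(b) "," csvField(c) "`r`n"
}

csvPath := A_ScriptDir "\top100.csv"   ; assumed output file name
FileDelete, %csvPath%                  ; start fresh; harmless if the file is absent
FileAppend, %csv%, %csvPath%, UTF-8
ExitApp

; Hypothetical helper: wrap a field in quotes and double any embedded quotes.
csvField(s) {
    return """" StrReplace(s, """", """""") """"
}

; Copied from tank's script: first <table> in the document, or false on error.
getTable(pDoc) {
    try {
        return pDoc.querySelectorAll("table")[0]
    }
    catch e
    {
        return false
    }
}
```

Quoting every field keeps values that contain commas, such as company names or formatted numbers, in a single column when the file is opened in a spreadsheet program.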
Attachments
EPPlus.zip
(460.19 KiB)

Portwolf
Posts: 161
Joined: 08 Oct 2018, 12:57

Re: Screen Scraping a web page by tank

10 Apr 2019, 08:43

Ohhh... This will be helpful in the long run!!!

GJ Tank! GJ!!
tank

Re: Screen Scraping a web page by tank

14 Apr 2019, 00:37

Finally updated the video with the comparison with Automation Anywhere.