Quote:
I want to prune a text file and leave only the urls and links.
This has not been thoroughly tested, but it appears to work. There are probably better utilities out there for extracting links from a web page:
Code:
; Example #2: A working script that attempts to extract all FTP and HTTP
; URLs from a text or HTML file:
FileSelectFile, SourceFile, 3,, Pick a text or HTML file to analyze.
if SourceFile =
    return ; No file was selected, so exit.
SplitPath, SourceFile,, SourceFilePath,, SourceFileNoExt
DestFile = %SourceFilePath%\%SourceFileNoExt% Extracted Links.txt
IfExist, %DestFile%
{
    MsgBox, 4,, Overwrite the existing links file? Press No to append to it.`n`nFILE: %DestFile%
    IfMsgBox, Yes
        FileDelete, %DestFile%
}
LinkCount = 0
; Naming DestFile as the loop's output file lets FileAppend (below) omit its filename:
Loop, read, %SourceFile%, %DestFile%
{
    URLSearchString = %A_LoopReadLine%
    Gosub, URLSearch
}
MsgBox %LinkCount% links were found and written to "%DestFile%".
return
URLSearch:
; It's done this particular way because some URLs have other URLs embedded inside them:
StringGetPos, URLStart1, URLSearchString, http://
StringGetPos, URLStart2, URLSearchString, ftp://
StringGetPos, URLStart3, URLSearchString, www.
; Find the left-most starting position:
URLStart = %URLStart1% ; Set starting default.
Loop
{
    ; It helps performance (at least in a script with many variables) to resolve
    ; "URLStart%A_Index%" only once:
    StringTrimLeft, ArrayElement, URLStart%A_Index%, 0
    if ArrayElement = ; End of the array has been reached.
        break
    if ArrayElement = -1 ; This element is disqualified.
        continue
    if URLStart = -1
        URLStart = %ArrayElement%
    else ; URLStart has a valid position in it, so compare it with ArrayElement.
    {
        if ArrayElement <> -1
            if ArrayElement < %URLStart%
                URLStart = %ArrayElement%
    }
}
if URLStart = -1 ; No URLs exist in URLSearchString.
    return
; Otherwise, extract this URL, then find where it ends (space, tab, or angle bracket):
StringTrimLeft, URL, URLSearchString, %URLStart% ; Omit the beginning/irrelevant part.
Loop, parse, URL, %A_Tab%%A_Space%<> ; Find the first space, tab, or angle bracket (if any).
{
    URL = %A_LoopField%
    break ; i.e. perform only one loop iteration to fetch the first "field".
}
; If the above loop had zero iterations because URL is empty,
; leave the contents of the URL var untouched.
; Remove double quotes from the URL (e.g. the closing quote of an href attribute).
; For now, StringReplace removes ALL of them, but note that it seems that double quotes
; can legitimately exist inside URLs, so this might damage them:
StringReplace, URLCleansed, URL, ",, All
FileAppend, %URLCleansed%`n ; Filename omitted: writes to the file-reading loop's output file.
LinkCount += 1
; See if there are any other URLs in this line:
StringLen, CharactersToOmit, URL
CharactersToOmit += %URLStart%
StringTrimLeft, URLSearchString, URLSearchString, %CharactersToOmit%
Gosub, URLSearch ; Recursive call to self.
return
Edit: Made file selectable via FileSelectFile.
Edit2: Added detection of < and > characters as terminators of a URL. Possibly some other improvements.
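For comparison, here is a minimal sketch of the same per-line search done with a regular expression instead of StringGetPos and a parsing loop. It assumes AutoHotkey v1.0.48 or later (for RegExMatch and while), and the function name ExtractURLs is made up for this illustration. It is not a drop-in replacement: it treats a double quote as the end of a URL rather than stripping quotes afterward, and, like the script above, it has not been thoroughly tested.

```autohotkey
; Hypothetical helper (not part of the script above): returns every URL found
; in Text, one per line. Assumes AutoHotkey v1.0.48+.
ExtractURLs(Text)
{
    Found := ""
    Pos := 1
    ; A URL starts with http://, ftp://, or www. (case-insensitive) and runs
    ; until the first whitespace, angle bracket, or double quote.
    while (Pos := RegExMatch(Text, "i)(?:http://|ftp://|www\.)[^\s<>""]+", Match, Pos))
    {
        Found .= Match "`n"
        Pos += StrLen(Match) ; Resume searching just past this URL.
    }
    return Found
}

; Example use on a single line of HTML:
MsgBox % ExtractURLs("See <a href=""http://example.com/a"">x</a> and www.example.org for details.")
```

Inside the file-reading loop, the Gosub could in principle be replaced with a call such as FileAppend, % ExtractURLs(A_LoopReadLine), although LinkCount would then need to be counted separately.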