Parse, split and categorize string

Post by domingo » 11 May 2024, 14:03

Hello, I wrote this script to parse a string and split it into three types of elements:

- HTML tags
- spaces/non-breaking spaces
- regular text, each sentence in a different variable

The part that divides the string and stores the pieces in the array is working fine, but I also need to store the type in a variable, which I called "kind". I'm having trouble achieving this.

Please ignore the various MsgBox calls; they are only there for debugging.

TIA


Code:

; Parse string and split it into 3 categories: html tags, spaces/non-breaking spaces, regular text
#IfWinActive, Notepad++
^F12::
KeyWait, Control
KeyWait, F12


    Sleep 100

    ; Specify the file path
    filePath := "D:\test.txt"  ; AutoHotkey v1 does not treat "\" as an escape character
    
    ;test.txt has following stored:
    ;<p>      </p><p><span style="font-size:11pt">This is the first sentence. </span></p>Second sentence.</span></p>

    ; Initialize a variable to store file content
    fileContent := ""
    inputString := ""
    Clipboard := ""
    
    ; Read the content of the file into the variable
    FileRead, inputString, %filePath%

    ;inputString := sourceHtml

    previousIsText := false  ; Initialize the flag for text concatenation

    ; Initialize array to store elements
    elements := {}
    kinds := {}
    elementIndex := 1

    ; Regex to split input into parts (tags and text)
    pattern := "(\s+|<[^>]+>|[^<\s]+)"
    pos := 1
    while (pos <= StrLen(inputString)) && RegExMatch(inputString, pattern, match, pos)
    {
        if (SubStr(match, 1, 1) = "<")
        {
            ; It's an HTML tag
            elements["element" . elementIndex] := match
            kinds["kind" . elementIndex] := "tag"
            previousIsText := false
            elementIndex++
        }
        else
        {
            ; It's text
            if (previousIsText)
            {
                ; Concatenate with previous text
                elements["element" . (elementIndex - 1)] .= match
                
            }
            else
            {
                elements["element" . elementIndex] := match
                
                previousIsText := true
                elementIndex++
            }
        }
        
        ;sleep 300
        
        ; this is not working
        nowelement := elements["element" . elementIndex]
        Send, %nowelement%
        Send, {Enter}
        Sleep, 100
        
        
        ;this is not working
        
        /*
        corrente := elements["element" . elementIndex]
        ;MsgBox % currentElement
        If (RegExMatch(currentElement, "^\s+$") or RegExMatch(currentElement, "^\xA0+$"))
        {
        kinds["kind" . elementIndex] := "white"
        }
        else
        {
        kinds["kind" . elementIndex] := "txt"
        }
        */
        
        pos += StrLen(match)
        
    }

    ; Output the elements
    elementCount := elementIndex - 1  ; Get the total number of elements
    outputMsg := ""
    Loop, % elementCount
    {
        outputMsg .= elements["element" . A_Index] . "`r"
    }

    ; Sending output to active window
    Loop, % elementCount
    {
        Send, % kinds["kind" . A_Index]
        Send, % elements["element" . A_Index]
        Send {Enter}
    }

    Clipboard := outputMsg

    Sleep 1000

    Send {Enter}{Enter}{Enter}

    ; Rebuild the target with tags

    Loop, % elementCount
    {
        key := "kind" . A_Index
        ;currentKind := kinds["kind" . A_Index]
        currentKind := kinds[key]
        If (RegExMatch(currentKind, "^\s+$") or RegExMatch(currentKind, "^\xA0+$"))
        {
            Send Whitespace
        }
        
        Send, % kinds["kind" . A_Index]
        Send, % elements["element" . A_Index]
        Send {Enter}
    }


Return

[Mod action: Moved topic to the v1 section since this is v1 code. The main section is for v2.]

Post by andymbody » 11 May 2024, 15:54

domingo wrote (11 May 2024, 14:03):
    I also need to store the type in a variable, which I called "kind". I'm having trouble achieving this.

This statement does not match the comment in the code that says "; this is not working", so I may not be answering what you actually want. The kinds array seems to be working as far as I can tell.

I think nowelement stays empty because elementIndex has already been incremented inside your if/else branches by the time you read it. elements["element" . elementIndex] therefore points at an element that does not exist yet (one past the element just stored), so nowelement will always be an empty string. Either wait until the bottom of the loop to increment elementIndex, or read elementIndex - 1 for nowelement.
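
The off-by-one can be reproduced in isolation. The following is a minimal AutoHotkey v1 sketch, not part of the original script; the name lastStored is hypothetical and only used for illustration. It assumes the same "element" . index key scheme as the posted code.

```autohotkey
; AutoHotkey v1 sketch: after the increment, elementIndex names the NEXT free slot,
; so the element that was just stored lives at elementIndex - 1.
elements := {}
elementIndex := 1

elements["element" . elementIndex] := "<p>"
elementIndex++    ; elementIndex is now 2, and slot 2 has not been filled yet

; Reading a key that does not exist yields an empty string in v1.
nowelement := elements["element" . elementIndex]

; lastStored is a hypothetical name: it reads the slot that was just written.
lastStored := elements["element" . (elementIndex - 1)]

MsgBox % "nowelement = [" . nowelement . "]`nlastStored = [" . lastStored . "]"
```

Because the concatenation branch in the posted loop also writes to elementIndex - 1, reading that index after the if/else should return the most recent element in both branches.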

Post by domingo » 12 May 2024, 10:31

I cannot move the elementIndex increment outside the if/else branches, because the else branch checks whether the previous element was text and, if so, concatenates the new piece onto it to form a sentence. After these branches I therefore get only two categories:

- tag
- text (which includes the parts with whitespaces only)

In the next step I need to split "text" further and categorize each element as either "text" or "whitespace". The result should be:

tag: <p>
spaces: NBSP NBSP NBSPNBSP
tag: </p>
tag: <p>
tag: <span style="font-size:11pt">
text: This is the first sentence. 
tag: </span>
tag: </p>
text: Second sentence.
tag: </span>
tag: </p>

My goal is to sequence the elements correctly to reconstruct them in a translation.
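
One possible way to get this categorization is to assign the kind at the moment each element is stored, instead of in a later step. The following is a minimal AutoHotkey v1 sketch under stated assumptions, not the original script: it swaps the "element" . index keys for plain arrays with Push() (which needs v1.1.21 or later), it uses a hardcoded sample string instead of reading D:\test.txt, and it assumes the whitespace run in the sample contains non-breaking spaces (Chr(160)). The pattern matches either one tag or everything up to the next "<", which should give the same result as concatenating consecutive text pieces. The name foundPos is introduced here for illustration.

```autohotkey
; AutoHotkey v1 sketch: split into tags and text runs, and record a kind for each.
; Assumption: sample input built inline; Chr(160) is a non-breaking space.
inputString := "<p>" . Chr(160) . " " . Chr(160) . "</p><p><span style=""font-size:11pt"">"
inputString .= "This is the first sentence. </span></p>Second sentence.</span></p>"

elements := []    ; plain arrays instead of "element1", "element2", ... keys
kinds := []
pattern := "<[^>]+>|[^<]+"    ; one HTML tag, or one run of non-tag characters
pos := 1

while (pos <= StrLen(inputString)) && (foundPos := RegExMatch(inputString, pattern, match, pos))
{
    elements.Push(match)
    if (SubStr(match, 1, 1) = "<")
        kinds.Push("tag")
    else if (RegExMatch(match, "^[\s\xA0]+$"))    ; only spaces and/or non-breaking spaces
        kinds.Push("spaces")
    else
        kinds.Push("text")
    ; Advance from where the match was actually found, not from the old pos.
    pos := foundPos + StrLen(match)
}

; Build one "kind: element" line per entry, in the original order.
output := ""
Loop % elements.Length()
    output .= kinds[A_Index] . ": " . elements[A_Index] . "`n"
MsgBox % output
```

With this sample the sketch is expected to list the eleven entries in the same order as above. Two details in the posted script may also be worth checking: the commented-out block assigns to corrente but tests currentElement, and the final loop applies the whitespace regex to currentKind (the kind) rather than to the element itself.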