How to build this in AHK - It's been done in Python

Get help with using AutoHotkey (v1.1 and older) and its commands and hotkeys
AHKStudent
Posts: 1472
Joined: 05 May 2018, 12:23

17 Jul 2020, 11:01

The Python script turns

HelloHowAreYou into Hello How Are You

It has an English dictionary with about 125k entries (one per line), and in my tests it gives excellent results because it uses a form of probability to decide which words best fit a string (many strings can be split in more than one way).

The Python GitHub repository is here: https://github.com/keredson/wordninja

The Python code is:

Code: Select all

import gzip, os, re
from math import log


__version__ = '2.0.0'


# I did not author this code, only tweaked it from:
# http://stackoverflow.com/a/11642687/2449774
# Thanks Generic Human!


# Modifications by Scott Randal (Genesys)
#
# 1. Preserve original character case after splitting
# 2. Avoid splitting every post-digit character in a mixed string (e.g. 'win32intel')
# 3. Avoid splitting digit sequences
# 4. Handle input containing apostrophes (for possessives and contractions)
#
# Wordlist changes:
# Change 2 required adding single digits to the wordlist
# Change 4 required the following wordlist additions:
#   's
#   '
#   <list of contractions>


class LanguageModel(object):
  def __init__(self, word_file):
    # Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
    with gzip.open(word_file) as f:
      words = f.read().decode().split()
    self._wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
    self._maxword = max(len(x) for x in words)
   

  def split(self, s):
    """Uses dynamic programming to infer the location of spaces in a string without spaces."""
    l = [self._split(x) for x in _SPLIT_RE.split(s)]
    return [item for sublist in l for item in sublist]


  def _split(self, s):
    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
      candidates = enumerate(reversed(cost[max(0, i-self._maxword):i]))
      return min((c + self._wordcost.get(s[i-k-1:i].lower(), 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
      c,k = best_match(i)
      cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i>0:
      c,k = best_match(i)
      assert c == cost[i]
      # Apostrophe and digit handling (added by Genesys)
      newToken = True
      if not s[i-k:i] == "'": # ignore a lone apostrophe
        if len(out) > 0:
          # re-attach split 's and split digits
          if out[-1] == "'s" or (s[i-1].isdigit() and out[-1][0].isdigit()): # digit followed by digit
            out[-1] = s[i-k:i] + out[-1] # combine current token with previous token
            newToken = False
      # (End of Genesys addition)

      if newToken:
        out.append(s[i-k:i])

      i -= k

    return reversed(out)

DEFAULT_LANGUAGE_MODEL = LanguageModel(os.path.join(os.path.dirname(os.path.abspath(__file__)),'wordninja','wordninja_words.txt.gz'))
_SPLIT_RE = re.compile("[^a-zA-Z0-9']+")

def split(s):
  return DEFAULT_LANGUAGE_MODEL.split(s)
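
Read as a formula, the cost model and the dynamic-programming step in the code above seem to come down to the following. This is only a reading of the code, not something taken from the wordninja documentation, and the symbols are notation introduced here: $r(w)$ is the 1-based line number of word $w$ in the word list, $N$ is the number of words, $L$ is the length of the longest word, and $C(i)$ is the cheapest cost of splitting the first $i$ characters of the string $s$.

```latex
\begin{align*}
c(w) &=
  \begin{cases}
    \log\bigl(r(w)\,\log N\bigr) & \text{if } w \text{ (lower-cased) is in the word list} \\
    \infty & \text{otherwise}
  \end{cases} \\
C(0) &= 0 \\
C(i) &= \min_{1 \le k \le \min(i,\,L)} \bigl[\, C(i-k) + c(s_{i-k+1} \dots s_i) \,\bigr]
\end{align*}
```

The split is then recovered by starting at $i = \operatorname{len}(s)$ and repeatedly stepping back by the $k$ that achieved the minimum.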


Any hints on how to start making this in AHK would be great.
boiler
Posts: 17072
Joined: 21 Dec 2014, 02:44

Re: How to build this in AHK - It's been done in Python

17 Jul 2020, 13:23

If the words will always be capitalized like your example, it’s really easy and doesn’t need a dictionary:

Code: Select all

MsgBox, % Trim(RegExReplace("HelloHowAreYou", "([A-Z])", " $1"))
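
For comparison with the Python code earlier in the thread, here is a rough Python equivalent of that one-liner. It is only a sketch, and `split_on_capitals` is a name made up for the illustration. It also shows the limit of the approach: input with no capitals comes back unchanged.

```python
import re


def split_on_capitals(text):
    # Put a space before every ASCII capital letter, then trim the leading space.
    return re.sub(r"([A-Z])", r" \1", text).strip()


print(split_on_capitals("HelloHowAreYou"))  # Hello How Are You
print(split_on_capitals("hellohowareyou"))  # hellohowareyou (nothing to split on)
```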
AHKStudent
Posts: 1472
Joined: 05 May 2018, 12:23

Re: How to build this in AHK - It's been done in Python

17 Jul 2020, 15:03

boiler wrote:
17 Jul 2020, 13:23
If the words will always be capitalized like your example, it’s really easy and doesn’t need a dictionary:

Code: Select all

MsgBox, % Trim(RegExReplace("HelloHowAreYou", "([A-Z])", " $1"))
The data does not come in with caps; in fact, the reason I'm trying to break the words up is to then put them back together with caps. You helped me the other day with part of the code below.

Code: Select all

global matchlist
FileRead, list, dictwords.txt ; the word list from the Python script; the words are ordered by popularity
matchList := StrReplace(list, "`n", ",")
msgbox, % ezread("plannadvertisingguide.info")
msgbox, % ezread("gotheme.com") ; the Python script correctly gives "go theme"; this one gives GothEmE...
ExitApp

ezread(dnamechar)
{
	; Greedy longest match: at each position, take the longest substring found in the word list.
	startsearch := 1
	endsearch := StrLen(dnamechar)
	dext := StrSplit(dnamechar, ".")
	dnamechar := dext[1]
	while (StrLen(strword) != StrLen(dnamechar))
	{
		curword := SubStr(dnamechar, startsearch, endsearch)
		if curword in %matchList%
		{
			StringUpper, curword, curword, T ; title case
			strword .= curword
			startsearch := StrLen(strword) + 1
			endsearch := StrLen(dnamechar) - startsearch + 1
			continue
		}
		endsearch-- ; no match: try a shorter substring
	}
	return strword "." dext[2]
}
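
The GothEmE result looks like what a longest-match-first loop would produce: it takes "goth" and is then stuck with "em" and "e". The sketch below reproduces that with a small made-up word list (`WORDS` is a stand-in for dictwords.txt, not its real contents) and compares it with scoring every possible split using the same cost formula as the Python code. The function names are invented for the example.

```python
from math import log

# Hypothetical mini word list, most popular first (a stand-in for dictwords.txt).
WORDS = ["the", "go", "theme", "me", "e", "goth", "em"]

# Same cost formula as the wordninja code: words further down the list cost more.
COST = {w: log((i + 1) * log(len(WORDS))) for i, w in enumerate(WORDS)}


def greedy_split(s):
    """Longest-match-first, like the ezread() loop."""
    out, start = [], 0
    while start < len(s):
        for end in range(len(s), start, -1):
            if s[start:end] in COST:
                out.append(s[start:end])
                start = end
                break
        else:
            raise ValueError("no dictionary word at position %d" % start)
    return out


def all_splits(s):
    """Yield every way to cut s into dictionary words."""
    if not s:
        yield []
        return
    for end in range(1, len(s) + 1):
        if s[:end] in COST:
            for rest in all_splits(s[end:]):
                yield [s[:end]] + rest


def cheapest_split(s):
    """Pick the split with the lowest total cost."""
    return min(all_splits(s), key=lambda words: sum(COST[w] for w in words))


print(greedy_split("gotheme"))    # ['goth', 'em', 'e']
print(cheapest_split("gotheme"))  # ['go', 'theme']
```

Trying every split like this is only practical for short strings; the dynamic-programming loop in the Python code reaches the same minimum without enumerating them all.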
AHKStudent
Posts: 1472
Joined: 05 May 2018, 12:23

Re: How to build this in AHK - It's been done in Python

18 Jul 2020, 08:33

I am attaching the dictionary; hopefully someone can help me move things forward.
Attachments
dictwords.txt (dictionary, 1.05 MiB)
BoBo
Posts: 6564
Joined: 13 May 2014, 17:15

Re: How to build this in AHK - It's been done in Python

18 Jul 2020, 09:31

What about using a portable command-line Python? That way it should be possible to run that .py script and use its output for further processing with AHK ...
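
One way this could look on the Python side is a small wrapper that prints one split result per command-line argument, so that AHK can run it and read the output (for instance by redirecting it to a file and reading that with FileRead). This is a sketch under assumptions: the wordninja package is installed in whichever Python is used, and the helper name `run` is made up here.

```python
import sys


def run(args, splitter):
    """Split each argument and return one space-joined output line per argument."""
    return [" ".join(splitter(arg)) for arg in args]


def main(args):
    try:
        import wordninja  # third-party package from the GitHub link above
    except ImportError:
        print("wordninja is not installed", file=sys.stderr)
        return 1
    for line in run(args, wordninja.split):
        print(line)
    return 0


if __name__ == "__main__":
    main(sys.argv[1:])
```

The splitter is passed in as a parameter so the wrapper can be tried without the dictionary, e.g. `run(["ab"], list)` returns `["a b"]`.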
malcev
Posts: 1769
Joined: 12 Aug 2014, 12:37

Re: How to build this in AHK - It's been done in Python

18 Jul 2020, 09:32

I do not know much Python, but some time ago I was interested in converting some functions from an Instagram API Python wrapper to AHK.
https://www.autohotkey.com/boards/viewtopic.php?p=251268#p251268
I installed Python, read the manual for the functions the code used, tested what they do in Python, and transferred them to AHK.
guest3456
Posts: 3463
Joined: 09 Oct 2013, 10:31

Re: How to build this in AHK - It's been done in Python

18 Jul 2020, 11:21

looks like the algorithm is right there in the Python code... can't you just look up what each Python function does and then convert it to AHK?

at the Stack Overflow link someone even ported it to JavaScript, which could give you more help in porting to AHK:
https://stackoverflow.com/a/53183719/312601

seems like a fun project
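
As a possible starting point for that kind of port, here is the core of the Python code rewritten with plain loops and index arithmetic instead of `enumerate`, `reversed` and `min` over tuples, which should map more directly onto AHK loops and arrays. It is a simplified sketch, not the wordninja code itself: it leaves out the apostrophe and digit handling and the regex pre-split, it stores the best word length per position instead of recomputing it while backtracking, and the function names and the mini word list are made up for the example.

```python
from math import log


def build_costs(words):
    """Map each word to log(rank * log(word count)); words are most popular first."""
    costs = {}
    scale = log(len(words))
    rank = 1
    for word in words:
        costs[word] = log(rank * scale)
        rank += 1
    return costs


def split_words(s, costs, max_word_len):
    infinity = float("inf")
    n = len(s)
    best_cost = [0.0] * (n + 1)  # best_cost[i]: cheapest split of the first i characters
    best_len = [0] * (n + 1)     # best_len[i]: length of the last word in that split

    for i in range(1, n + 1):
        best_cost[i] = infinity
        best_len[i] = 1          # fall back to a single character if nothing matches
        k = 1
        while k <= i and k <= max_word_len:
            word = s[i - k:i].lower()  # look up in lower case, keep original case in s
            c = best_cost[i - k] + costs.get(word, infinity)
            if c < best_cost[i]:
                best_cost[i] = c
                best_len[i] = k
            k += 1

    # Walk backwards from the end, peeling off one word at a time.
    out = []
    i = n
    while i > 0:
        k = best_len[i]
        out.insert(0, s[i - k:i])
        i -= k
    return out


words = ["the", "go", "theme", "me", "e", "goth", "em"]  # hypothetical mini list
costs = build_costs(words)
print(split_words("gotheme", costs, max(len(w) for w in words)))  # ['go', 'theme']
```

When porting, keep in mind that AHK arrays are 1-based, so the `i - k` style indices need shifting.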

