#include <string.h>
#include <wchar.h>
static inline wchar_t * wcspbrk2 (const wchar_t *wcs, const wchar_t *accept);
static inline wchar_t * wcschr2 (const wchar_t *wcs, const wchar_t wc);
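int match(const wchar_t *haystack, const wchar_t *needle) {
    // wcspbrk2 returns NULL when nothing matches; doing pointer arithmetic on NULL is undefined
    const wchar_t *p = wcspbrk2(haystack, needle);
    return p ? (int)(p - haystack) + 1 : 0; // 1-based index of the first hit, 0 = not found
}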
//Find the first occurrence in WCS of any wide-character in ACCEPT.
static inline wchar_t * wcspbrk2 (const wchar_t *wcs, const wchar_t *accept){
while (*wcs != L'\0')
if (wcschr2 (accept, *wcs) == NULL)
++wcs;
else
return (wchar_t *) wcs;
return NULL;
}
static inline wchar_t * wcschr2 (const wchar_t *wcs, const wchar_t wc){
do
if (*wcs == wc)
return (wchar_t *) wcs;
while (*wcs++ != L'\0');
return NULL;
}
From what I see in the disassembler, the resulting code with the -Ofast option doesn't seem super-optimized. It's virtually the same as the one produced with TCC. I guess that's b/c we have 2 nested sub-functions here. 1/3 of the code is wasted on storing/restoring the stack frame. Just joining everything into 1 function and recompiling it would increase performance, especially for short strings.
I found it!!!
To enable C99: -std=c99. After that the function became super short in the disassembler. lol
U won't believe it. It's like 1/4 of the size now. Only registers, no stack at all. Now I do see THE optimization. I will have a hard time trying to optimize it even further. Anyway, I'll try to make it flat first, to be compatible with MCode.
So, my friend, make sure to inline if u need real perf optimization.
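A minimal sketch of what "making it flat" could look like, assuming the same semantics as match() above (1-based index of the first haystack character that occurs in needle, 0 if none). The name match_flat is made up for illustration; the two helper loops are simply written out by hand, so nothing depends on the compiler deciding to inline.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical flattened variant of match(): the wcspbrk2/wcschr2 loops are
   written out inline, so no helper calls remain in the generated code. */
int match_flat(const wchar_t *haystack, const wchar_t *needle) {
    /* Loop-scope declarations are a C99 feature, hence -std=c99 or later. */
    for (const wchar_t *h = haystack; *h != L'\0'; ++h) {
        for (const wchar_t *n = needle; *n != L'\0'; ++n) {
            if (*h == *n)
                return (int)(h - haystack) + 1; /* 1-based index of the hit */
        }
    }
    return 0; /* no character of needle occurs in haystack */
}
```

With the test strings from this thread, match_flat(L"The quick brown fox jumps over the lazy dog.", L"fly") returns 17, the position of the 'f' in "fox".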
Availability for 64-bit only is out of the question, since the most commonly used AHK version is AHK v1 32-bit.
Therefore using relative addressing is the wrong decision - unless you expect that to change in AHK v2.
Even if you only do 64-bit compatibility, it's probably not possible to load constant float numbers from memory without having a static pointer to them in the machine code.
Meaning you need relocations for that - unless you want to force everyone to manually manipulate the resulting MCode they get.
The code responsible for both compiling and parsing the result of a compilation is VSCompiler.ahk.
The class that's responsible for parsing the compile result from an assembly listing is AssemblyListingParser, at lines 78 to 210.
I do not have support for COFF yet - I simply parse the result of a debug assembly listing.
I eventually want to create a program which allows you to create a compile chain where several user-defined steps finally result in one piece of machine code.
One of the steps could be modifying the resulting code manually in assembly.
I also want to split the dynamic from the static data, and both of them from the executable code - for debugging purposes and to keep people from doing nonsense.
In my opinion there should not be much difference between a .dll and the MCode this creates.
Of course I cannot force you to participate in the project; it just would be great to have some help, since I do not believe I can finish this on my own.
haystack:="The quick brown fox jumps over the lazy dog."
needle :="fly"
p:=MCode("RA+3CWZFhcl0REmJyg8fAEiJ0OsKSIPAAmZFhcB0IUQPtwBmRTnIdexJKcq6AAAAAEnR+kSJ0IPAAQ9IwsNmkEmDwgJFD7cKZkWFyXXCRTHS69Y=")
s:=DllCall(p, "str",haystack, "str",needle)
msgbox(s)
MCode(_s){ ;1=Base64 w/o hdr, 0=calc len (n):
if(!DllCall(s:="crypt32\CryptStringToBinary", 'str',_s, 'uint',0, 'uint',1, 'ptr',0, 'uintp',n, 'ptr',0, 'ptr',0))
return ;0=FAIL
p:=DllCall("GlobalAlloc", 'uint',0, 'ptr',n, "PTR")
;Changes the protection on a region in the virtual addr space of the calling process:
DllCall("VirtualProtect", 'ptr',p, 'ptr',n, 'uint',0x40, 'uintp',o) ;0x40:PAGE_EXECUTE_READWRITE
if(DllCall(s, 'str',_s, 'uint',0, 'uint',1, 'ptr',p, 'uintp',n, 'ptr',0, 'ptr',0))
return p
DllCall("GlobalFree", 'ptr',p) ;cleanup on FAIL
}
it's probably not possible to load constant float numbers from memory without having a static pointer to them in the machine code.
I'm not sure what u mean here. We don't even touch floating point registers while messing with strings and other typical tasks.
I eventually want to create a program which allows you to create a compile chain where several user-defined steps finally result in one piece of machine code.
I also want to split the dynamic from the static data, and both of them from the executable code - for debugging purposes and to keep people from doing nonsense.
In my opinion there should not be much difference between a .dll and the MCode this creates.
Yeah, MCode requires the long-awaited revamp - it should start separating code from r/w data. Btw, have u thought of adding DLL support as well? A DLL has a relocation table and public names and is suitable for mcode importing.
I do not have support for COFF yet - I simply parse the result of a debug assembly listing.
A better approach would be parsing the resulting object files and eventually the DLLs themselves.
I'm back to optimizing json.Get, with a very disappointing discovery: InStr(" `t`n`r", c,1) (the haystack is a short string of 4 widechars and the needle is 1 widechar) turns out to be 1-2% faster than
a lightweight mcode containing just a few instructions, wsr:=MCode("SMfABwAAAMM="), called with 2 parameters: DllCall(wsr, "str"," `t`n`r", "str",c)
That means that optimizing InStr and SubStr for short strings (1-10 chars) is plain impossible - even the most ideal performance-wise mcode would be slower due to DllCall overhead...
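For reference, the stub string "SMfABwAAAMM=" decodes to 8 bytes, 48 C7 C0 07 00 00 00 C3, which on x64 is mov rax, 7 followed by ret. So the measurement really compares InStr against a call that does no work. A minimal decoder sketch to check this outside AHK (b64_decode is an illustrative helper, not part of any library mentioned in this thread):

```c
#include <stddef.h>
#include <string.h>

/* Minimal Base64 decoder (standard alphabet, '=' padding), illustration only.
   Returns the number of bytes written to out, or 0 on invalid input or if
   out is too small. */
size_t b64_decode(const char *in, unsigned char *out, size_t cap) {
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    unsigned int acc = 0; /* bit accumulator, only the low bits matter */
    int bits = 0;         /* number of not-yet-emitted bits in acc */
    size_t n = 0;
    for (; *in != '\0' && *in != '='; ++in) {
        const char *p = strchr(tbl, *in);
        if (p == NULL)
            return 0;     /* character outside the Base64 alphabet */
        acc = (acc << 6) | (unsigned int)(p - tbl);
        bits += 6;
        if (bits >= 8) {  /* a full byte is available */
            bits -= 8;
            if (n == cap)
                return 0;
            out[n++] = (unsigned char)((acc >> bits) & 0xFF);
        }
    }
    return n;
}
```

Decoding the stub into a 16-byte buffer yields 8 bytes; the last one is C3 (ret), preceded by the 4-byte immediate 07 00 00 00.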
Last edited by vvhitevvizard on 19 Dec 2018, 20:21, edited 2 times in total.
I found it!!!
to enable C99: -std=c99 After that the function became super short in disassembler. lol
...
so what do u think of Anchor function? Do u use complex GUI frequently? https://www.autohotkey.com/boards/viewt ... 72#p253072
That is good to know! I will keep that in mind when I need to compile C/C++ with GCC.
I tested out Anchor and it is quite nice, definitely worth porting to v2. I don't create GUI's frequently, and when I do I try to keep them bare-boned. I remember having to do resizing once and thought, "gee what a pain, let me just make everything non-resizable!"
That means that optimization of Instr, SubStr used with short strings is plain impossible - even the most ideal performance-wise mcode would be slower due to DllCall overhead...
I noticed InStr was really fast with those 2 letter haystacks, too. So now that we have identified the bottleneck as DllCall's overhead, perhaps it's worthwhile to explore AHK H's DynaCall (said to be faster), and its MCodeH (hex only), to see how they fare performance-wise.
Availability for 64-bit only is out of the question, since the most commonly used AHK version is AHK v1 32-bit.
Therefore using relative addressing is the wrong decision - unless you expect that to change in AHK v2.
Hey, I think a 64-bit only solution is quite reasonable for AHK v2. Moving forward, it's going to be increasingly hard to find 32-bit systems. My observation tells me the biggest obstacle to wider adoption of v2 is not the bit-ness, but the fact that most existing scripts are not compatible with v2 (most users prefer a canned, works-out-of-the-box solution). Also, it is still in alpha (which is fine with me, as I don't see a reason to rush things, and it's exciting to see new features from time to time).
While I appreciate your and many others' efforts in bringing us MCode, IMHO, the proper solution to the MCode problem has little to do with MCode. We need: 1) a way to ccall efficiently, and 2) the ability to load DLLs manually (already implemented in AHK H via MemoryModule). Also, to keep some happy, strip the DLLs of unnecessary parts that are not needed for manual loading.
After discovering AutoHotkey.dll I am convinced that sometimes the best solution is to do as little in AHK as possible. Perhaps that may change in the future.
That is good to know! I will keep that in mind when I need to compile C/C++ with GCC.
Correction: C99 enabling is required for C sources, C++ has it enabled by default. But C++ code is not suitable for creating portable mcodes in general.
2.
I optimized MCode to be super simple (just 2 DllCalls) and fast (esp. for small mcode snippets). It can even be 1-lined.
This one is for x64 only and takes the mcode binary as a string of base64 chars only. We really don't need to compute the mcode size here. We just request 1 byte, assuming the mcode's binary size is not larger than 4096 bytes,
and VirtualAlloc rounds it up to the size of a page.
Btw, b/c the memory is being reserved (the MEM_RESERVE option) here, the address is rounded down to the nearest multiple of the allocation granularity - great for performance.
The allocated memory is automatically initialized to zero. That's good for uninitialized values, which come out zeroed, and it means zero/uninitialized static values at the end of the data can be removed from the resulting mcode base64 string. n inside the function contains the resulting mcode BINARY size - this might be returned to the caller as well.
I haven't seen actual mcodes whose resulting binary is larger than 1k. Anyway, it's good practice for bigger snippets to be separated and aligned to a page boundary (every separate mcode chunk can be optimized this way, and further optimized with the Align directive in the assembler if the need arises).
MCode(_s){
;allocates the virtual addr space of the calling process and changes its protection
;If 1st arg=0, size (we use 1 byte for mcodes <4096) is rounded up to the next page boundary
;Memory allocated is auto-initialized to zero
;0x3000=MEM_COMMIT | MEM_RESERVE, 0x40:PAGE_EXECUTE_READWRITE
p:=DllCall("VirtualAlloc", 'ptr',0, 'uint',1, 'uint',0x3000, 'uint',0x40, "PTR")
;0=zero terminated, 1=Base64 w/o hdr, n (out) contains the mcode BINARY size
(DllCall("crypt32\CryptStringToBinary", 'str',_s, 'uint',0, 'uint',1, 'ptr',p, 'uintp',n:=4096, 'ptr',0, 'ptr',0))
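|| (DllCall("VirtualFree", 'ptr',p, 'ptr',0, 'uint',0x8000),p:=0) ;cleanup on decode FAIL. 0x8000=MEM_RELEASE (VirtualAlloc memory is released with VirtualFree, not GlobalFree)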
return p
}
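The "request 1 byte, get a whole page" trick relies on the allocation size being rounded up to a page multiple. The arithmetic can be sketched like this, assuming a 4096-byte page (the real value comes from the system, e.g. via GetSystemInfo; round_up_to_page is an illustrative helper name):

```c
#include <stddef.h>

/* Round size up to a whole number of pages.
   page must be a power of two (4096 is assumed in this thread). */
size_t round_up_to_page(size_t size, size_t page) {
    return (size + page - 1) & ~(page - 1);
}
```

So a request for 1 byte and a request for 75 bytes both end up as one 4096-byte page, while 4097 bytes would need two pages. That is why the decode call passes 4096 as the buffer capacity: anything that fits in one page is fine, and a bigger mcode needs a bigger size argument to VirtualAlloc.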
1-line inline version (the line is too long though - it needs to be split anyway):
3.
Somehow, for Unicode AHK v2, "str", string is faster than "wstr", string. Both variants should be equal (no string conversions), but the latter is significantly slower. There might be wrong DllCall logic behind that.
4.
I haven't figured out how to pass widechars. "str", string = "ptr", &string; "uchar", char for ANSI chars; "ushort", wchar for widechars? I just want to make sure there are no hidden conversions done inside DllCall.
Last edited by vvhitevvizard on 20 Dec 2018, 00:00, edited 5 times in total.
I slightly optimized the assembly code, replacing a few instructions: 83->75 bytes. But I kept the Align 16 alignment (padded with redundant nops) within the innermost loop intact, so the size could be smaller, but performance is of greater value here.
The mcode chunk could be named similarly to InStr(haystack,needle). It has the same basic functionality, but it makes sense only for AHK x64 Unicode; the search is case-sensitive and is faster for long strings, starting from 10 wide chars.
Resulting test file:
MCode(_s){
;allocates the virtual addr space of the calling process and changes its protection
;If 1st arg=0, size (we use 1 byte for mcodes <4096) is rounded up to the next page boundary
;Memory allocated is auto-initialized to zero
;0x3000=MEM_COMMIT | MEM_RESERVE, 0x40:PAGE_EXECUTE_READWRITE
p:=DllCall("VirtualAlloc", 'ptr',0, 'uint',1, 'uint',0x3000, 'uint',0x40, "PTR")
;0=zero terminated, 1=Base64 w/o hdr, n (out) contains the mcode binary size
(DllCall("crypt32\CryptStringToBinary", 'str',_s, 'uint',0, 'uint',1, 'ptr',p, 'uintp',n:=4096, 'ptr',0, 'ptr',0))
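|| (DllCall("VirtualFree", 'ptr',p, 'ptr',0, 'uint',0x8000),p:=0) ;cleanup on decode FAIL. 0x8000=MEM_RELEASE (VirtualAlloc memory is released with VirtualFree, not GlobalFree)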
return p
}
;test
needle:="fly", haystack:="The quick brown fox jumps over the lazy dog."
p:=MCode("RA+3CWZFhcl0PEmJykiJ0OsKSIPAAmZFhcB0HEQPtwBmRTnIdexJKcox0knR+kSJ0IPAAQ9IwsNJg8ICRQ+3CmZFhcl1x0Ux0uvb")
msgbox(DllCall(p, "str",haystack, "str",needle))
2.
Next quick question here is how to write "\r", "\t", etc. for widechars in plain C. I've had no practice here yet.
EDIT: ok answer is:
its MCodeH (hex only), to see how they fare performance-wise.
mcode as a hex string should be suboptimal simply b/c base64 takes less space.
perhaps it's worthwhile to explore AHK H's DynaCall (said to be faster),
I see there's lots of stuff in the AHK H v2 .dll! But a DllCall to make another DllCall (DynaCall) makes no sense for AHK v2 scripts.
, strip the DLLs of unnecessary parts that are not needed for manual loading.
Oh, I couldn't find a way to make a GCC-compiled DLL as small as possible. For 75 bytes of binary code it created a 116KB file with lots of exported functions.
That's not a critical issue at the moment though, so I didn't try to look into it.
Last edited by vvhitevvizard on 20 Dec 2018, 00:43, edited 5 times in total.
After discovering AutoHotkey.dll I am convinced that sometimes the best solution is to do as little in AHK as possible. Perhaps that may change in the future.
I, for one, do think that the majority of a script's basic logic should be left within AHK v2 script code, with only the critical parts optimized (perf-wise) with mcode/AutoHotkey.dll calls/etc. So I look forward to re-optimizing my .ini functionality, anchor, scrolling with the mouse, support for text and bg colors for individual cells of a ListView, et cetera. There are lots of old scripts that need to be converted to v2 and revisited. I do hope u'll stay with me and we complement each other.
Fascinating stuff there! I will need to reread these posts a few more times to get more out of them. I think many or all of the AHK_H features work from AutoHotkey.exe, so there's no need to wrap DynaCall inside another DllCall. I'm already using it and so far it feels no different from AHK v2 aside from the extra capabilities. I'm just curious about the speed of MCodeH, and of course we can always convert between base64 and hex ourselves. When I get a chance I will test it.
I also had no experience with wchar in C until just now. I remember reading a post yesterday on stackexchange about using \uHHHH for unicode characters, where the H's are hex digits. So I'm guessing tab = \u0009 and `r = \u000d (untested)?
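On the \uHHHH guess: as far as I know, C99 and C11 do not allow universal character names for code points below U+00A0 (apart from $, @ and `), so \u0009 would be rejected by a conforming C compiler. The ordinary escapes work with the L prefix, and \x is available for anything else. A small sketch (is_json_space is a made-up name for illustration):

```c
#include <wchar.h>

/* Returns 1 if wc is one of the four whitespace characters used in the
   InStr(" `t`n`r", c) test from this thread, 0 otherwise. */
int is_json_space(wchar_t wc) {
    return wc == L' '       /* space,           0x20 */
        || wc == L'\t'      /* tab,             0x09 */
        || wc == L'\n'      /* line feed,       0x0A */
        || wc == L'\x000D'; /* carriage return, same value as L'\r' */
}
```

The same set as a wide string literal would be L" \t\n\r", which can be handed to a wcspbrk-style search directly.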
DllCall by itself is actually not super slow, but each arg slows it down greatly.
The result for the MCode function is 0.14 ms per call (there are 2 system calls: allocating and zero-filling memory, decoding, etc - so it takes some tangible time ofc).
DllCall(f) of an actual working mcode (with no workload) but w/o args: 0.000094ms per DllCall.
DllCall(f, "str",haystack): 0.000187ms per call. 99% slower than a DllCall with 0 args, so 1 arg slows DllCall down by x2.
DllCall(f, "str",haystack, "str",needle): 0.000235ms per call. 26% slower than a call with 1 arg, so the 2nd arg slows it down by 1/4 more.
Raw output:
0.1406ms
0.000094ms
0.000187ms
0.000235ms
MCode(_s){
;allocates the virtual addr space of the calling process and changes its protection
;If 1st arg=0, 1=size (1 byte for mcodes <4096) is rounded up to the next page boundary
;Memory allocated is auto-initialized to zero
;0x3000=MEM_COMMIT | MEM_RESERVE, 0x40=PAGE_EXECUTE_READWRITE
p:=DllCall("VirtualAlloc", 'ptr',0, 'uint',1, 'uint',0x3000, 'uint',0x40, "PTR")
;0=zero terminated, 1=Base64 w/o hdr, n (out) contains the mcode binary size
(DllCall("crypt32\CryptStringToBinary", 'str',_s, 'uint',0, 'uint',1, 'ptr',p, 'uintp',n:=4096, 'ptr',0, 'ptr',0))
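|| (DllCall("VirtualFree", 'ptr',p, 'ptr',0, 'uint',0x8000),p:=0) ;cleanup on decode FAIL. 0x8000=MEM_RELEASE (VirtualAlloc memory is released with VirtualFree, not GlobalFree)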
return p
}
needle:="fly", haystack:="The quick brown fox jumps over the lazy dog."
s:=("RA+3CWZFhcl0PEmJykiJ0OsKSIPAAmZFhcB0HEQPtwBmRTnIdexJKcox0knR+kSJ0IPAAQ9IwsNJg8ICRQ+3CmZFhcl1x0Ux0uvb")
p:=MCode(s)
;msgbox(DllCall(p, "str",haystack, "str",needle))
n:=10*1000
t:=A_TickCount
loop(n)
p:=MCode(s)
a1:=(A_TickCount-t)/n
f:=MCode("SMfABwAAAMM=")
t:=A_TickCount
loop(n)
DllCall(f)
a2:=(A_TickCount-t)/n
t:=A_TickCount
loop(n)
DllCall(f, "str",haystack)
a3:=(A_TickCount-t)/n
t:=A_TickCount
loop(n)
DllCall(f, "str",haystack, "str",needle)
a4:=(A_TickCount-t)/n
msgbox(clipboard:=a1 "ms`n`n" a2 "ms`n`n" a3 "ms`n`n" a4 "ms")
2.
btw, in the x64 calling convention the first four args are passed in registers. That means that in order to increase performance we have to limit the arg count to 4 (this concerns C/C++ code and calls inside mcode).
And in order to avoid DllCall's overhead when we use it inside a loop, we've got to try to cram the passed data into 1 arg. It should be the address of a binary structure which we fill with NumPut.
3.
...always convert between base64 and hex ourselves. When I get a chance I will test it.
It makes no difference for small strings, and for big ones u would prefer compactness over the init procedure's speed. Normally these init functions are not an inner part of loop(100000) cycles, so size is more important here.
Go ahead with testing the hex vs base64 decode rate; I'm curious whether hex decoding is faster at all.
I'm playing with the code you posted right now...
Hey, just an idea, if the number of parameters causes significant slow down, can we cheat by using a pipe character to separate them so we can send them together? There are plenty of unused Unicode characters. Also, since we have full control of the memory it sits in, can we have the function operate on a relatively static address so we can pass zero parameters?
Bad idea for mcode. That would require the string to be concatenated by AHK and then pre-processed by the callee to "StrSplit" it. The better idea here is to pass a binary structure by its address. struct{
offset 0: haystack addr (8 bytes)
offset 8: needle addr (8 bytes)
}
We create it with VarSetCapacity outside the loop and fill it with 2 NumPuts.
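On the C side, the struct layout above could be consumed roughly like this. It is only a sketch under the assumptions stated in the post (x64, two 8-byte pointers); match_args and match_packed are illustrative names, not part of the existing mcode.

```c
#include <stddef.h>
#include <wchar.h>

/* The two-pointer argument block described above. */
typedef struct {
    const wchar_t *haystack; /* offset 0 */
    const wchar_t *needle;   /* offset 8 on x64 */
} match_args;

/* Same search as match(), but all inputs arrive through one pointer,
   so the caller passes a single argument. */
int match_packed(const match_args *a) {
    const wchar_t *h, *n;
    for (h = a->haystack; *h != L'\0'; ++h)
        for (n = a->needle; *n != L'\0'; ++n)
            if (*h == *n)
                return (int)(h - a->haystack) + 1; /* 1-based index */
    return 0; /* not found */
}
```

The caller fills the block once outside the loop and only rewrites the pointer that changes between calls, so each call carries one "ptr" argument instead of two "str" arguments.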
I was thinking along the lines of using StrPut... but maybe that is slow, too. And the callee has to read the string char by char right? We can put the needle first so the callee will obtain the needle before the haystack. But probably easier to just assume a max length for needle and give them fixed starting addresses.
Also, since we have full control of the memory it sits in, can we have the function operate on a relatively static address so we can pass zero parameters?
Great idea!!! My version of MCode (above) saves the size of the binary data in n - so we can actually wrap the mcode into a class and expose a public property with the size of the binary data, which at the same time can be used to compute the address of that binary structure. We just place the binary structure right after the mcode's end. And we allocated 4k with MCode - so there is enough room. This way we don't need to pass any arguments to DllCall at all!!! x2 performance boost. I hope NumPut is optimized.
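A rough model of that layout in plain C, with an ordinary byte buffer standing in for the 4k page. The helper names are made up, and rounding the offset up to 8 bytes is an extra assumption of this sketch (the post simply uses code address + n); it keeps the two pointers naturally aligned.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offset of the argument block: code size rounded up to a multiple of 8.
   The rounding is an assumption of this sketch, not something from the post. */
size_t args_offset(size_t code_size) {
    return (code_size + 7) & ~(size_t)7;
}

/* Store a pointer at base+off, like NumPut with the "ptr" type. */
void put_ptr(unsigned char *base, size_t off, const void *p) {
    uintptr_t v = (uintptr_t)p;
    memcpy(base + off, &v, sizeof v);
}

/* Read the pointer back, like NumGet with the "ptr" type. */
const void *get_ptr(const unsigned char *base, size_t off) {
    uintptr_t v;
    memcpy(&v, base + off, sizeof v);
    return (const void *)v;
}
```

For the 75-byte mcode the block would start at offset 80, leaving plenty of room in the page for the 16 bytes of arguments and a return value slot.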
No no. In our case (AHK Unicode), StrPut just converts between AHK's internal string format (widechar) and a format required by some ANSI functions (e.g. MessageBoxA). DllCall does the same automatically if we force "astr".
But probably easier to just assume a max length for needle and give them fixed starting addresses.
We don't put the strings themselves there, we put the uint64 addresses of those strings. The binary structure is 8+8 bytes in this case.
Btw, it requires that I change the compiled C code in assembly and add addressing of the binary structure with those 2 string addresses - I don't see a way of doing this in C source w/o resorting to assembly.
Actually u pointed me to an idea for a much more effective use of DllCall. Indeed, the function tries to convert data to be compatible with "the callee's arg declaration". We can take care of that ourselves and use DllCall just to call our mcode w/o any additional functionality and checks.
I see...I thought treating them as fixed size would be faster than being redirected via pointers, because I remember reading about something in another language where declaring numbers as an array of Real numbers is slower than array of Double or Int64 since Real numbers are stored as pointers in that language. Anyway, I suppose it makes little difference, and using a struct will make things much easier in terms of coding, as similar structs can be reused for other MCodes as well!
On an unrelated note, I managed to use AHK v2 (not H) to create a Julia thread with libjulia.dll, then started another thread of AHK H v2 using AutoHotkey.dll from Julia via ccall to create a GUI and retrieve values. It's quite nice, since SciTE displays stdout in its output pane, so once I got it to display stderr, I had a REPL (Read Evaluate Print Loop) where I can dynamically evaluate expressions and start any Julia or AHK scripts (via dll). For reasons that are beyond me, I couldn't do this with the same copy of AHK H v2. Nonetheless, I am happy with the results.
So, to the concept! That's how the mcode class looks. We instantiate it for every mcode. this.c=addr of code, this.a=addr of args structure (and the retval should be passed this way as well). this.a-this.c=code size
class mcode{
__New(_s){
;allocates the virtual addr space of the calling process and changes its protection
;If 1st arg=0, 1=size (1 byte for mcodes <4096) is rounded up to the next page boundary
;Memory allocated is auto-initialized to zero
;0x3000=MEM_COMMIT | MEM_RESERVE, 0x40=PAGE_EXECUTE_READWRITE
c:=this.c:=DllCall("VirtualAlloc", 'ptr',0, 'uint',1, 'uint',0x3000, 'uint',0x40, "PTR")
;0=zero terminated, 1=Base64 w/o hdr, n (out) contains the mcode binary size
DllCall("crypt32\CryptStringToBinary", 'str',_s, 'uint',0, 'uint',1, 'ptr',c
, 'uintp',n:=4096, 'ptr',0, 'ptr',0)
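|| (DllCall("VirtualFree", 'ptr',c, 'ptr',0, 'uint',0x8000),this.c:=0) ;cleanup on decode FAIL. 0x8000=MEM_RELEASE (VirtualAlloc memory is released with VirtualFree, not GlobalFree)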
this.a:=c+n ;addr of binary struct for passed args
}
}
needle:="fly", haystack:="The quick brown fox jumps over the lazy dog."
s:=("RA+3CWZFhcl0PEmJykiJ0OsKSIPAAmZFhcB0HEQPtwBmRTnIdexJKcox0knR+kSJ0IPAAQ9IwsNJg8ICRQ+3CmZFhcl1x0Ux0uvb")
strIn:=new mcode(s)
msgbox("c:" strIn.c "a:" strIn.a)
msgbox(DllCall(strIn.c, "str",haystack, "str",needle))
And that's how we put and get the addresses of the strings (NumPut's and NumGet's default type is "ptr" - that's what we have):
lol, I can't generate the full mcode cuz the optimizing compiler thinks the whole function does NOTHING w/o arguments, so it skips everything and keeps only the return value (=0).
I recall there was a volatile keyword, but I've had no luck with it so far. What I'm trying to do is make the compiler respect the 2 vars h and n and create space for them in the initialized data segment instead of removing them during optimization:
There are tons of optimization command line parameters. I'm lost. I need one that turns off unused function and dead code removal WHILE retaining all other optimizations. https://gcc.gnu.org/onlinedocs/gcc/Opti ... %20Options
static const wchar_t h[] __attribute__((used)) =L"The quick brown fox jumps over the lazy dog.";
static const wchar_t n[] __attribute__((used)) =L"fly";
this way compiler passes them to the output file.
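One way the volatile idea might be made to work, as a sketch only: keep two pointer slots in static storage, mark the pointers themselves volatile so the compiler has to load them on every call, and let the host script write them before calling. The names are hypothetical, __attribute__((used)) is GCC/Clang specific, and whether the slots end up at a usable place relative to the code depends on how the bytes are extracted from the object file.

```c
#include <stddef.h>
#include <wchar.h>

/* Argument slots the host is expected to fill before each call.
   The pointers are volatile, so the compiler may not assume they still
   hold their initial NULL value and cannot fold the search away. */
static const wchar_t * volatile g_haystack __attribute__((used));
static const wchar_t * volatile g_needle __attribute__((used));

/* Zero-argument variant of match(): inputs come from the slots above. */
int match_noargs(void) {
    const wchar_t *haystack = g_haystack; /* one volatile read each */
    const wchar_t *needle = g_needle;
    const wchar_t *h, *n;
    if (haystack == NULL || needle == NULL)
        return 0;
    for (h = haystack; *h != L'\0'; ++h)
        for (n = needle; *n != L'\0'; ++n)
            if (*h == *n)
                return (int)(h - haystack) + 1; /* 1-based index */
    return 0;
}
```

With both slots pointing at the test strings from this thread the function returns 17; with the slots left at NULL it returns 0.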
2.
TCC: I figured out that only _mingw.h is required for simple mcode snippets: #include <_mingw.h>
or skip all the include hell and do: #define wchar_t short
Both GCC and TCC accept such a file w/o any includes.
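A variation on the include-free approach, as a sketch: instead of redefining wchar_t, spell the 16-bit code unit out with a typedef. u16 and match16 are made-up names; unsigned short is assumed to be 16 bits wide, which holds on the x86/x64 targets discussed here, and it matches the unsigned wchar_t that Windows uses (plain short is signed, which is harmless for equality tests but would matter for range comparisons).

```c
/* No #include needed: a UTF-16 code unit as a plain 16-bit type. */
typedef unsigned short u16;

/* Same search as match(), written against u16 so the source compiles
   without any headers. */
int match16(const u16 *haystack, const u16 *needle) {
    const u16 *h, *n;
    for (h = haystack; *h != 0; ++h)
        for (n = needle; *n != 0; ++n)
            if (*h == *n)
                return (int)(h - haystack) + 1; /* 1-based index */
    return 0; /* not found */
}
```

Since AHK Unicode strings are UTF-16, a "str" argument from DllCall can be read through a const u16 pointer as is.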