Neural Network basics - Artificial Intelligence using AutoHotkey!

04 Jan 2018, 19:28

Hello there

In my efforts to learn some AI programming, I stumbled upon the subject of Neural Networks and set myself the goal of creating some nets using AutoHotkey. After digging into the subject by searching and reading tutorials here and there, I came across this post by Milo Spencer-Harper, which describes in detail the creation of an extremely simple neural net in Python. After successfully translating (or sort of) the code to AutoHotkey, I decided to write this tutorial based on what I understood, in order to better cement my knowledge of the basics.

Hopefully this will also help someone out in the AutoHotkey community.

Note: Section I of this tutorial was covered in an AutoHotkey webinar on March 20th 2018. You can check the webinar in this link (special thanks to Joe Glines and Jackie Sztuk for making it possible). The written tutorial below provides an expanded view on the subject, so don't forget to read it too. Section II further expands on the subject with a simple Multi-Layer Neural Network implementation. MLNNs are considered the vanilla form of Neural Networks, so be sure to check it too.

SECTION I

1. What is a Neural Network

Artificial Neural Networks (ANNs) are models that implement machine learning algorithms based on some aspects of our current understanding of the workings of the brain (the low-level electrical part only, not the biochemical one, of course). The most important thing about ANNs is that they allow us to implement algorithms that would be far too complex to program manually. This is done by having the machine program the code itself through learning sessions. With ANNs, complex tasks such as handwriting and voice recognition by machines are now doable. There are numerous other examples of successful implementations, and as the knowledge of ANNs spreads, we get almost daily news of tasks that were not programmable before but have now been successfully programmed using ANNs.

Here are a few AI video examples if you want to know more about AI implemented through ANNs (you can watch them later if you just want to follow this tutorial):
https://www.youtube.com/watch?v=qv6UVOQ0F44
https://www.youtube.com/watch?v=P7XHzqZjXQs (turn on the subtitles in this one for English)
https://www.youtube.com/watch?v=Ipi40cb_RsI

2. How do Neural Networks work?

As stated in section 1, ANNs are based on some ideas brought up by studies of biological neurons, their synapses, their electric impulses and so on. Thus, the most basic constituent of a Neural Network is a neuron-like component represented by a mathematical formula. This formula simulates the thought process of each neuron and its consequent behaviour, that is, which impulses it will send through its synapses to other neurons when many different stimuli arrive at it. A single-neuron Neural Network with 3 synapses as input and 1 as output is represented in the image below:

[Image: Individual Neuron.png]

As you can see in the image above, the "neuron" receives signals (stimuli) through its input synapses, processes these signals in a unique way and then fires a corresponding output through another synapse. The output of this neuron can then be used as an input to another neuron in a neural network such as the one below (or as a final output):

[Image: Neural Network.png]

It is important to understand that each neuron does a unique thing with the signals it receives (the formulas used to process the inputs differ from neuron to neuron). In ANNs, we say that each neuron attributes a unique "weight" to each of the inputs it receives and uses these weights to rework the actual inputs into an output. The weight is like a measure of the strength of a synapse in a biological neuron. If you check the center-bottommost neuron in the image above, what does it do with the inputs received? It weights a possible input from B as a "light" negative component (multiplying it by -1) and a possible input from C as a "heavy" positive component (multiplying it by +5). So if the stimulus value coming to it through B is 10 and the stimulus value coming from C is 6, what will be the output H of this neuron?

Total = WeightB * InputB + WeightC * InputC
Or
-1x10 + 5x6 = 20

So twenty will be the output. Thus, we can see that there are two basic components in the processing of a neuron: the "weight" of an input and the "value" of an input. This is what recreates an important aspect of biological neural networks in ANNs: the weight attributed to an input can be reworked through the training of the network, so that the same input value can be accounted for in infinitely many different ways. This is how we simulate the strengthening (or withering) of synapses in biological neurons. If, through training, the network creator code discovers that a part of what is being fed to it as input is irrelevant to the intended results, it will lower its weight substantially. If, however, it discovers that another input is very relevant to the intended result, it will raise this other input's weight. These are the basics of how training works in an ANN: finding the correct weights to treat the incoming inputs (for now, this is sufficient; we'll keep other concepts that could also apply for another occasion).
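The arithmetic of this example can be sketched in a few lines of AutoHotkey. This is only an illustration: the function name NeuronOutput and its two array parameters are hypothetical and are not part of the tutorial code presented later.

```autohotkey
; Inputs B = 10 and C = 6, weighted by -1 and +5, as in the example above.
MsgBox % NeuronOutput([10, 6], [-1, 5]) ; Shows 20
Return

; Hypothetical helper (an assumption for illustration only):
; returns the weighted sum of a neuron's inputs.
NeuronOutput(Inputs, Weights)
{
	Total := 0
	Loop % Inputs.MaxIndex()
		Total += Inputs[A_Index] * Weights[A_Index]
	Return Total
}
```

Changing the weights in the second array while keeping the same inputs is a quick way to see how the same stimuli can produce very different outputs.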

So, following up on our results above: in our 5-neuron network, the value of 20 calculated by the neuron we just examined would be fed as input to the left-uppermost neuron (and then be treated with a weight of 4 by it) and so on, in an intricate chain of inputs and outputs through different neurons, up to the point at which the many individual weights applied will have reworked all the inputs into a final output.

3. Ok, so how are these weights calculated (or how is the training)?

Through trial and error and approximation. During training cycles, the net creator code feeds a net with samples of inputs for which the expected output is known, and the net then processes these into a final output using its weights. Then, the net creator code compares the output values with the expected outputs and readjusts the weights of each neuron in the network to bring the final output closer to the expected output (thus recreating the net in each iteration). It is important to consider that this trial and error is highly oriented towards a goal: if the actual output is much lower than the expected one, the weights will be forced up considerably by the net creator. If, however, it is just a little bit lower, the weights are recalculated up just a bit. Likewise, if it is far above the expected result, the weights will be lowered considerably, and if it is just a bit above the expected result, the weights will go down just a bit.

The math to do all this training and the net recreation is simple. It follows the pseudo-code below:

Code: Select all

WEIGHTED_SUM := TRAINING_VALUES * CURRENT_WEIGHTS ; The current net is used to calculate a weighted sum of the inputs.
THIS_TRAINING_ITERATION_OUTPUT := SIGMOID(WEIGHTED_SUM) ; Then, we rework that sum into a representative value between 0 and 1 using a sigmoid* function.
ADJUST_CURRENT_WEIGHTS(AVAILABLE_KNOWLEDGE, THIS_TRAINING_ITERATION_OUTPUT, GRADIENT(THIS_TRAINING_ITERATION_OUTPUT)) ; And then we readjust our weights (recreate the net) using the available knowledge in the samples, this training iteration's output and the gradient* of the sigmoid.

Notes:

Sigmoid function: An activation function. The sigmoid function is used to rework any value (from -infinity to +infinity) to a point on an S-shaped curve between 0 and 1. Negative values end up between 0 and 0.5, while positive values end up between 0.5 and 1 (zero maps to exactly 0.5).

Gradient of the sigmoid function: Since the sigmoid function is an S-shaped curve, equal distances between two inputs A and B represent different distances in the 0 to 1 scale depending on where these values are located (in other words, there are distortions in the distances). For this reason, we use the gradient (the slope) of the sigmoid, which measures these distortions at each point of the curve: it is largest (0.25) in the middle of the curve, where the output is 0.5, and approaches 0 near both ends. Multiplying the error by the gradient therefore produces larger weight adjustments while the net is still undecided and smaller ones once it is confident (remember: during training we have to readjust the weights based on the distance between the current calculation and the expected value of the training sample!).
Sigmoid in Blue, Gradient of sigmoid in Green:

[Image: Sigmoid&Gradient.png]
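To see the two notes above in numbers, here is a minimal sketch that prints the sigmoid and its gradient for a few sample values. The function names Sigmoid and SigmoidGradient are assumptions made for this illustration; the tutorial code further below writes both formulas inline instead.

```autohotkey
; Build a small report of sigmoid and gradient values for sample inputs.
Report := ""
For Index, X in [-4, -1, 0, 1, 4]
	Report .= "x = " . X . "`tsigmoid = " . Round(Sigmoid(X), 4) . "`tgradient = " . Round(SigmoidGradient(Sigmoid(X)), 4) . "`n"
MsgBox % Report
Return

; Squashes any number into the range between 0 and 1.
Sigmoid(X)
{
	Return 1 / (1 + Exp(-X))
}

; Slope of the sigmoid, computed from an already "sigmoided" value S.
SigmoidGradient(S)
{
	Return S * (1 - S)
}
```

For x = 0 the sigmoid is 0.5 and the gradient is 0.25, its maximum; the further x moves from 0 in either direction, the smaller the gradient becomes.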

4. Enough theory! Let's get practical!

Suppose we have the following situation: given any combination of 3 binary values, an unknown underlying rule is being applied to find a fourth binary value (dependent exclusively on the first 3, of course). We do not know what rule this is, and neither does the network we are going to create, but we do have 4 samples of binary combinations and, for each of them, we know the fourth bit value (the correct answer based on the unknown underlying rule). The case table is presented below.

Can you figure out the underlying rule and the most probable value of the three question marks?

[Image: Case Table.png]

If you took a few seconds to analyze the samples in the table above, you probably figured it out already. The answer is always the same as the first input bit. This means that this input bit should hold a decisive weight in the final answer, while the others should not. If we were to check the 3 other possible validation cases one can provide ([0,0,0], [0,1,0] and [1,1,0]), it is rather obvious now that we would not even have to look at the values of input bits 2 and 3 to find the answers. So if we were to program a function to find the solution, it could just be something like this:

Code: Select all

FIND_SOLUTION(first_byte_value, second_byte_value, third_byte_value)
{
    return first_byte_value
}

But this function carries something that we are trying to avoid here: it relies on the human programmer being intelligent enough to find the underlying rule and then translate it into code. What we want instead is to have the machine somehow find the underlying rule itself and then present code that uses its conclusions to find the correct solutions for any new samples we present to it later. How do we go about it?

Let me present some commented code that does exactly that, and then we will draw our conclusions!

Note: The comments in the code are part of this tutorial! Don't skip reading the code!

Code: Select all

/*
1. PREPARATION STEPS
*/

Random, Weight1, -4.0, 4.0 ; We start by initializing random numbers into the weight variables (this simulates a first hypothesis of a solution and allows the beginning of the training).
Random, Weight2, -4.0, 4.0
Random, Weight3, -4.0, 4.0
WEIGHTS := Array([Weight1],[Weight2],[Weight3]) ; And then organize them into a matrix.

TRAINING_INPUTS := Array([0,0,1],[1,1,1],[1,0,1],[0,1,1]) ; We will also feed the net creator code with the values of the inputs in the training samples (all organized in a matrix too). 
EXPECTED_OUTPUTS := Array([0],[1],[1],[0]) ; And we will also provide the net creator with the expected answers to our training samples so that the net creator can properly train the net.


/*
2 . ACTUAL TRAINING
*/

Loop 10000 ; And now we do the net creator code (which is the training code). It will perform 10,000 training cycles.
{
	Loop 4 ; For each training cycle, this net creator code will train the network of weights using the four training samples.
	{
		ACQUIRED_OUTPUT := 1 / (1 + exp(-1 * MATRIX_ROW_TIMES_COLUMN_MULTIPLY(TRAINING_INPUTS, WEIGHTS, A_Index))) ; First, the net is set to calculate some possible results using the weights we currently have. (At the first iteration of the loop these weights are absolutely random, but don't forget they will be recalculated every time). We use a sigmoid function here to set any results (from -infinity to +infinity) to a value between 0 and 1.
		SIGMOID_GRADIENT := ACQUIRED_OUTPUT * (1 - ACQUIRED_OUTPUT) ; But since the sigmoid function has a curve-like shape, the distance between values is highly distorted depending on the position they occupy in the S-shaped curve, so we will also use the sigmoid's gradient function to correctly account for that (This is to find better pictures of the actual distances between values while still keeping the results between 0 and 1).
		WEIGHTS[1,1] += TRAINING_INPUTS[A_Index, 1] * ((EXPECTED_OUTPUTS[A_Index, 1] - ACQUIRED_OUTPUT) * SIGMOID_GRADIENT) ; Then, each weight is recalculated using the available knowledge in the samples and also the current calculated results.
		WEIGHTS[2,1] += TRAINING_INPUTS[A_Index, 2] * ((EXPECTED_OUTPUTS[A_Index, 1] - ACQUIRED_OUTPUT) * SIGMOID_GRADIENT)
		WEIGHTS[3,1] += TRAINING_INPUTS[A_Index, 3] * ((EXPECTED_OUTPUTS[A_Index, 1] - ACQUIRED_OUTPUT) * SIGMOID_GRADIENT)
		
		; Breaking the formula above: Weight is adjusted (we use +=, not :=) by getting the value of the input bit and multiplying it by the difference between the expected output and the calculated output (sigmoidally treated), after this difference is adjusted by the gradient of the sigmoid (accounting for the sigmoidal distortions). 
	}
}

/*
3. FINAL RESULTS
*/

; VALIDATION CASE [1,0,0]:
Input1 := 1, Input2 := 0, Input3 := 0

; After recalculating the weights in 10,000 iterations of training, we apply them by multiplying these weights by inputs that represent a new case 
; (this new case is a validation sample, not one of the training ones: [1, 0 ,0])
MSGBOX % "VALIDATION CASE: `n" . Input1 . "`, " . Input2 . "`, " . Input3 . "`n`nFINAL WEIGHTS: `nWEIGHT1: " . WEIGHTS[1,1] . "`nWEIGHT2: " . WEIGHTS[2,1] . "`nWEIGHT3: " . WEIGHTS[3,1] . "`n`nWEIGHTED SOLUTION: `n" . (Input1 * WEIGHTS[1,1] + Input2 * WEIGHTS[2,1] + Input3 * WEIGHTS[3,1]) . "`n`nFINAL SOLUTION: `n" . (1 / (1 + EXP(-1 * (Input1 * WEIGHTS[1,1] + Input2 * WEIGHTS[2,1] + Input3 * WEIGHTS[3,1])))) . "`n`nComments: `nA FINAL SOLUTION between 0.5 and 1.0 means the final network thinks the solution is 1. How close the value is to 1 means how certain the net is of that. `nA FINAL SOLUTION between 0 and 0.5 means the final network thinks the solution is 0. How close the value is to 0 means how certain the net is of that."

; Breaking the output numbers:
; WEIGHTED_SOLUTION: If this is positive, the net believes the answer is 1 (If zero or negative, it believes the answer is 0). The higher a positive value is, 
; the more certain the net is of its answer being 1. The lower a negative value is, the more certain the net is of its answer being 0.  
; FINAL SOLUTION: A sigmoidally treated weighted_solution. If this is above 0.50, the net believes the answer to be 1. The closer to 1, the more certain 
; the net is about that. If this is 0.50 or below it, the net believes the answer to be 0. The closer to 0, the more certain the net is about that.

Return
; The function below is just a single step in multiplying matrices (this is repeated many times to multiply an entire matrix). It is used because the input_data, weights and expected results were set into matrices for organization purposes.
MATRIX_ROW_TIMES_COLUMN_MULTIPLY(A,B,RowOfA)
{
	If (A[RowOfA].MaxIndex() != B.MaxIndex())
	{
		msgbox, 0x10, Error, Number of Columns in the first matrix must be equal to the number of rows in the second matrix.
		Return
	}
	Result := 0
	Loop % A[RowOfA].MaxIndex()
	{
		Result += A[RowOfA, A_index] * B[A_Index, 1]
	}
	Return Result
}

For an interactive GUI Version of the code above, check this post by SpeedMaster.

5. Conclusions of the test above

The code was indeed able to approximate the expected results for [1,0,0]: it presented it as ~0.999 (which is close enough to 1).

Furthermore, if we study the commented code above (and if we play with it, changing some values), we will notice some interesting facts about ANN creator codes and ANNs themselves.

1. First, a network is just something like this (if we stick to basics, of course):

Result := Weight1 * Input1 + Weight2 * Input2 + Weight3 * Input3

2. If the network is a function of weights that reworks inputs, then a network creator code is really just code that obtains the correct weights (and it does so by training the network, which just means adjusting the values of the weights until they account for any underlying rules that may be present in the samples).

3. The programmer DOES NOT provide the underlying rules in a network creator code and, unlike in our case study, most times he/she DOESN'T EVEN KNOW these underlying rules, as they are just too complex (e.g. how to tell a number handwritten in a 30x50 image based on the individual pixel values?). In this case, the programmer just provides a number of samples with correct labels and a means for the machine to recalculate a number of weights in order to absorb some underlying rules in the samples and imprint them into these weights.

4. The final weights are somewhat unusual if you actually look at them. The case we presented may at first have made us think that the final network would have weights like [+infinity, 0, 0], but the network presented them as being [+9.68, -0.20, -4.64] (or something along these values). These values may seem odd at first (almost as if totally random), but they are not: since we DID NOT provide any underlying rule for the output net, the program is free to find ANY VALUES that will accommodate the underlying rule. This means that the output weights just have to be ANY values that correctly implement the underlying rule (which is this: the first bit being 1 makes the weighted solution positive, while it being 0 makes it negative (or zero), and the second and third bits don't really change this).

5. If you study the results of neural networks, sometimes you can actually find some quite interesting ideas. The networks in the first and second video examples presented in section 1 of this tutorial actually surprised me: they discovered that the Google dinosaur was better off ducking all the time and that Mario can be played with more ease if you move around spin-jumping all the time (clever brats!).

6. Back to the case study we provided: did you also notice that no training samples had a value of 0 for the third bit? This resulted in a big difference between the weights of the second and third bits, but try adding a fifth training sample with such a case (like [0,1,0]) and see what the final weights become: surprisingly, they become something like [+12.80, -4.21, -4.21], which just means the network can be changed to treat the second input bit in a similar fashion to the third bit while still providing a valid answer for the underlying rule: the first bit is the only one that truly matters to make the weighted solution positive or negative.

7. This new value for the weights also implies something interesting: no matter what the values of the second and third bits are, the weighted output will never be positive if the first bit is not 1 and will always be positive if it is 1. This means that [+12.80, -4.21, -4.21] is actually also equivalent to [+infinity, 0, 0] when it comes to correctly implementing the underlying rule.

8. Another curious aspect of this ANN we created is that if we run the net on an input of [0,0,0] we get a very interesting result: 0.50. This is caused by the sigmoid function we are using to represent the final value: since it maps -infinity to 0 and +infinity to 1, a weighted output of 0 lands exactly in the middle, at 0.50. And the weighted output for [0,0,0] is always 0: there is no actual work being done by the net here, as anything (any weight) multiplied by 0 equals 0. Thus, any conclusions we derive from this case are just arbitrary at this moment. We cannot actually say the network concluded that the rule would yield a zero or a one in this case. There is, however, a way to have the net work even on a [0,0,0] case: we just add what we call a bias to the calculation. Biases will not be added to the code in this tutorial, but they are a regular addition to Neural Networks that serves to tackle things like noise in images. If you want to experiment with biases yourself, try adding a fourth input parameter whose value is always 1 and whose weight is also to be calculated by the net (or just set a number to be added (or subtracted) alongside the calculations for each synapse).
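For those who want to try the bias experiment suggested in item 8, here is a minimal sketch of one way to do it. It assumes the bias is modeled as a fourth input that is always 1, and it uses flat arrays and a hypothetical Predict function instead of the matrix helper from section 4, so treat it as a variation for experimenting rather than as the tutorial's code.

```autohotkey
; Same training samples as in section 4, with an extra input that is always 1 (the bias input).
TRAINING_INPUTS := [[0,0,1,1], [1,1,1,1], [1,0,1,1], [0,1,1,1]]
EXPECTED_OUTPUTS := [0, 1, 1, 0]

; Four random starting weights: three for the inputs and one for the bias.
WEIGHTS := []
Loop 4
{
	Random, InitialWeight, -4.0, 4.0
	WEIGHTS.Push(InitialWeight)
}

Loop 10000
{
	For Row, Sample in TRAINING_INPUTS
	{
		Output := Predict(Sample, WEIGHTS)
		; Error scaled by the sigmoid gradient, as in the original training loop.
		Delta := (EXPECTED_OUTPUTS[Row] - Output) * Output * (1 - Output)
		For Col, Value in Sample
			WEIGHTS[Col] += Value * Delta
	}
}

MsgBox % "Output for [0,0,0] with the bias input: " . Predict([0,0,0,1], WEIGHTS) . "`nBias weight: " . WEIGHTS[4]
Return

; Hypothetical helper: weighted sum of the inputs followed by the sigmoid.
Predict(Inputs, Weights)
{
	Sum := 0
	For Index, Value in Inputs
		Sum += Value * Weights[Index]
	Return 1 / (1 + Exp(-Sum))
}
```

With the bias input in place, the weighted sum for [0,0,0] is the bias weight itself rather than a forced 0, so the net is no longer stuck at exactly 0.50 for that case. Note that in these four samples the third input and the bias input are both always 1, so their two weights receive identical adjustments.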

6. A single network creator code can create nets to solve more than one problem.

In section 4 we created a network creator code that trained a network to learn an underlying rule from a specific pattern and then apply that same rule to solve new questions. The underlying rule in question was: from an input of 3 bits, return the value of the first bit. But what if it was a different one? Something like: from an input of 3 bits, return the inverse of the second? Would we have to change our network creator code to do this instead?

The answer is NO. All we need to do is to change our training samples. The network will absorb whichever rule it can find in the training samples.

(actually that is a huge overstatement, but we will consider it like so for now).

Let's see how this works. First, our new case table:

[Image: Case Table 2.png]

And now, to change the training samples to accommodate the new rule. The first change is in the EXPECTED_OUTPUTS line. This is line number 11.
Let's change it from this:

Code: Select all

TRAINING_INPUTS := Array([0,0,1],[1,1,1],[1,0,1],[0,1,1]) ; We will also feed the net creator code with the values of the inputs in the training samples (all organized in a matrix too). 
EXPECTED_OUTPUTS := Array([0],[1],[1],[0]) ; And we will also provide the net creator with the expected answers to our training samples so that the net creator can properly train the net.

To this:

Code: Select all

TRAINING_INPUTS := Array([0,0,1],[1,1,1],[1,0,1],[0,1,1]) ; We will also feed the net creator code with the values of the inputs in the training samples (all organized in a matrix too). 
EXPECTED_OUTPUTS := Array([1],[0],[1],[0]) ; And we will also provide the net creator with the expected answers to our training samples so that the net creator can properly train the net.

And let's also change the values we set for the validation case, since we changed the validation case to [1,1,0] now. This information is in line 37.

Let's change it from this:

Code: Select all

; VALIDATION CASE [1,0,0]:
Input1 := 1, Input2 := 0, Input3 := 0

To this:

Code: Select all

; VALIDATION CASE [1,1,0]:
Input1 := 1, Input2 := 1, Input3 := 0

Running the code now yields a sigmoidal result (FINAL SOLUTION) of ~0.000078, which is a big success! The trained network thinks the answer to the new validation sample is a 0 (the inverse of input2, which is 1) and it is strongly convinced of this! (Remember: the closer to 0 the sigmoidal solution is, the more convinced the network is that the solution is a zero.)

Checking the new weights attributed is also quite interesting:
Input1: ~+0.2086
Input2: ~-9.6704
Input3: ~+4.6278

However you look at it, these weights mean what we expected them to mean: an input2 of 1 will result in a negatively weighted number and an input2 of 0 will result in a positively weighted number (or zero, for [0,0,0]). Success!
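To check this claim by hand, the sketch below applies the approximate weights listed above to a few inputs and shows both the weighted sum and its sigmoid. The weights are simply the rounded values reported in this section; your own run will produce slightly different numbers because the starting weights are random.

```autohotkey
Weights := [0.2086, -9.6704, 4.6278] ; Approximate weights reported above.
Report := ""
For Index, Sample in [[1,1,0], [1,0,0], [0,1,1], [0,0,1]]
{
	; Weighted sum of the three inputs.
	Sum := 0
	For Col, Value in Sample
		Sum += Value * Weights[Col]
	Report .= "[" . Sample[1] . "," . Sample[2] . "," . Sample[3] . "]  weighted: " . Round(Sum, 4) . "  sigmoid: " . Round(1 / (1 + Exp(-Sum)), 6) . "`n"
}
MsgBox % Report
```

For the validation case [1,1,0] the weighted sum is 0.2086 - 9.6704 = -9.4618, which is where the FINAL SOLUTION of about 0.000078 comes from.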

7. But be sure to check what you are really feeding it!

Since the neural network will be trained to find any underlying rules in the samples (and not just a rule we think to be present in the samples), care must be taken when choosing what to feed it as training samples. For one such example, consider the table below. The underlying rule I am trying to teach the net here is this: if both the second and the third inputs are 1, then the result must be 1. If not, then the result must be 0.

[Image: Case Table 4.png]

We have two examples that yield 1 and two examples that yield 0 in the table above. This should be enough to train the net to successfully answer a validation sample of [1,1,0] as 0, right?

Well, NO! Look carefully: there is ANOTHER possible underlying rule in place here: if the second input is 0, the result is 0; if it is 1, the result is 1. If the network absorbs this rule instead, it will answer the validation sample as 1 (and since this rule is simpler than the first one, it will probably be the one absorbed!).

To have the network solve the riddle as we want it to, we will need to change our training samples and feed something more appropriate to the network creator code.

This should do:

[Image: Case Table 5.png]

Since we replaced the first training sample with a new training sample [0,1,0] = 0, we have now made absolutely sure that a rule of "the second bit is the answer" is not possible. Running the code now will properly yield our desired results.

8. Excellent! And what else?

As we have seen in section 6, a network creator code is very powerful: it can learn many different patterns and create many different networks to solve them. But there is unfortunately a limit to its power. Some underlying rules are just too complex for the simple model of ANN we are implementing.

Consider the following case table.

[Image: Case Table 3.png]

The underlying rule of this table is this: if (input1 is 1 and input2 is 0) OR (input1 is 0 and input2 is 1), the result is 1. Otherwise it is 0. This is what we call a XOR problem (exclusive-OR). It is a very possible situation, and it is also very possible to devise an ANN to solve it. But if you just adjust our code to the table above, you will see it does NOT work. Even if you try to add every possible input as a training sample, it will still not work: it will simply never work with our current code. The output you get if you try it (because you always get an output) will probably be inconclusive, so that if you run the code ten times or so, the trained net will actually shift between positive and negative at random. The net is simply unable to solve this problem.

Fortunately though, and as mentioned, this problem has been solved already and we CAN create ANNs that solve XOR problems. The way to do this is simple: we need a multi-layered ANN. This is necessary so that our net can represent multi-dimensional solutions. We can train a multi-layer ANN using a technique called backpropagation. So if we just change our network creator code to an implementation of ANNs that includes these concepts, we will succeed!

Multi-layered ANNs will be covered in section II of this tutorial, but for the moment, let's enjoy what we have achieved so far!

Artificial Neural Networks are a field of knowledge that has flourished in recent years and is in continuous development. New concepts, new models, new ideas: there is just so much to talk about that this basic tutorial cannot include it all. But if it was somehow successful, it may have ignited a spark of curiosity in you, and you may well be on your way to becoming an experienced ANN programmer. How about consulting what is available elsewhere and helping push the current boundaries of what tasks are considered programmable?

We are currently in a decade in which the making of ANNs is still considered a crafting of sorts, and those who craft them now hold a new power to change the world. If you can develop a network creator code that comes up with a net to solve a new problem, this can be quite valuable!

Thanks for reading all this, and feel free to post any questions you like.

04 Jan 2018, 19:31

SECTION II

If you have progressed this far in this tutorial, you should be able to devise simple ANNs that implement what we call a PERCEPTRON. The perceptron is an artificial model of a neuron. It receives inputs, processes them using some weights, and then presents an output according to this processing. This unit is the basic component of Artificial Neural Networks and it is what we have implemented thus far in this tutorial: a perceptron that receives 3 inputs, processes them with 3 weights and returns a corresponding output.

Let's proceed further now!

9. Introduction to multi-layered Artificial Neural Networks

Although powerful, the perceptron unit has a few limitations. If we look at its mathematical model, some of these limitations become clear:

Result := Input1 * Weight1 + Input2 * Weight2 + Input3 * Weight3 + Input4 * Weight4 ...

Transforming the formula above into a function, with A, B, C... as the weights, we have something like this:

f(x1, x2, x3...) = A*x1 + B*x2 + C*x3...

And if we look at a single input x with weight A and add a small bias B to the calculation, which can be represented by a number we add (or subtract), the function can then be represented like this:

f(x) = Ax + B

Doesn't that ring a bell?

It is just an implementation of a first-degree polynomial!

That's right, and if this is a first degree polynomial, it also means it will always graph as a straight line:

[Image: First Degree Poly.png]

So imagine we were trying to get a perceptron to tell us whether the answer for our inputs is 1 or 0 based on a rule of position. Imagine the following graph contains the underlying rule we are trying to implement, in the form of samples of green and red dots. What we want to do is separate these dots by color:

[Image: Problem1.png]

Can our perceptron model absorb the underlying rule into a function?

Sure it can! This is a solution it can present us:

[Image: Problem1Solution.png]

But now comes a more important question: Can you spot the range of possible solutions that our simple perceptron has to work with?

[Image: Problem1Solution2B.png]

The area painted in yellow in the picture above is the range of possible solutions that our perceptron can present. Any straight line we can fit in that area is a valid solution. There is absolutely no other way of fitting a straight line (the only kind of function our perceptron can output) without incurring a separation error, other than putting it inside that area.

Now, imagine we have a different problem at hand. Once again, we need to find an underlying rule of separation between red and green dots. The problem is that the samples are now arranged like this:

[Image: Problem2.png]

Can you see where we are going?

There is absolutely no range of possible solutions that our perceptron can imprint into a first-degree polynomial function, because all a first-degree polynomial can do is draw a straight line!

From this comes a rule of ANNs that has been known since at least the 1960s: single perceptron units can only solve linearly separable problems.

But is there a solution to our problem nonetheless? Some rule that can separate the dots presented above? Yes there is!
One such example is the function f(x) = x² (A second-degree polynomial). It readily solves our problem:

[Image: Problem2Solution.png]

But if a perceptron is unable to present a solution that is not a first-degree polynomial, how can an ANN ever hope to achieve a solution like that?

Simple! To achieve this we just add a second layer composed of a second perceptron unit. This second layer will simply operate a new processing on the results of the first layer.

Thus, where we once had this (BEFORE):
Result := Input1 * Weight1 + Input2 * Weight2 ...

We will now have something like this (weights 3 and 4 below represent our second layer processing) (NOW):
Result := (Input1 * Weight1) * Weight3 + (Input2 * Weight2) * Weight4 ...

So if, for example, weight1 = weight3 in the situation above, we will end up with:
Result := (Input1 * Weight1) * Weight1 + ...
Which is equivalent to:
Result := (Input1 * Weight1²)

And that is just what we wanted

Well, not exactly, but close. Since x is actually the input, and not the weight, the function above is still a first-degree polynomial (for a weight of 2 in both layers, it would be something like f(x) = 2²x). But a very important component already in place in our neural network changes this for us: the sigmoid activation function that we apply to the output of each neuron in each layer. Combined with the multi-layer architecture, it is what ultimately allows our network to present non-linear solutions. The full explanation is a little more complicated, but if you want to, you can have a look at the graph below. The black line is the output of two neurons in a first layer feeding one neuron in a second layer, all sigmoid-treated: see how the black line starts to form multiple curves.

[Image: Non-Linear Solution2.png]

Now THAT is what we wanted
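
To make the point about the sigmoid concrete, here is a minimal sketch. The weights 2 and 3 and the helper names are arbitrary choices for illustration, not values from the tutorial's network. Without an activation function the linear column is always proportional to x (6, 12, 24 for x = 1, 2, 4), while the sigmoid column is not.

```autohotkey
; Sketch: two chained layers, with and without a sigmoid after each layer.
; W1 and W2 are arbitrary illustration values (assumptions).
W1 := 2.0
W2 := 3.0

Report := ""
For Index, X in [1, 2, 4]
{
	Report .= "x = " . X
	Report .= "   linear: " . Two_Layers_Linear(X, W1, W2)
	Report .= "   with sigmoid: " . Two_Layers_Sigmoid(X, W1, W2) . "`n"
}
MsgBox % Report
Return

; Without an activation function the two weights collapse into one: (x * W1) * W2 = x * (W1 * W2).
Two_Layers_Linear(X, W1, W2)
{
	Return (X * W1) * W2
}

; With a sigmoid after each layer the result is no longer proportional to x.
Two_Layers_Sigmoid(X, W1, W2)
{
	Return Sigmoid(Sigmoid(X * W1) * W2)
}

Sigmoid(X)
{
	Return 1 / (1 + Exp(-1 * X))
}
```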

10. The power of multi-layered Artificial Neural Networks to solve problems

Multi-layered neural networks (MLNNs) are very powerful and, unlike single-layer networks (perceptrons), they can solve almost any problem we present to them (if the problem has a solution, of course). For this reason MLNNs are considered the vanilla form of Artificial Neural Networks. These ANNs can implement what we call deep learning by adjusting the weights of hidden layers through a method called backpropagation.

Given the statements above, I imagine we have to provide some proof to back up our claims of power, right? No problem!

Think about everything a computer can do. Do you know how a computer can be so powerful as to even be able to simulate 3D environments in games and so on? When we investigate the low-level basics of how a CPU operates in search of an answer, we discover that processors and RAM in current architectures are based on a huge amount of intricately connected basic units: transistors. These transistors are of great use to us because with them we can implement any of the basic logic operations. This is done by chaining transistors into what we call logic gates: there are NAND gates, AND gates, OR gates, XOR gates and so on. These gates are the very basic units of what constitutes processing and memory in modern computers.

But as we stated before, multi-layered ANNs can solve XOR problems in addition to those that perceptrons can already solve as individual units (AND, OR, NAND...). In this sense, multi-layered neural networks achieve what we call functional completeness and are thus able, in principle, to do anything a computer can. Their only limitation is, of course, the limited processing power we have, and this is the main reason why ANNs were theorized decades ago but have flourished mostly in the last decade or so: we finally have enough processing power to create ANNs that do "magical" things (like automating car driving, recognizing human faces in pictures or even recognizing cat videos). If we continue to increase our processing power (or find more efficient task-specific ANN models, such as Convolutional Neural Networks, which are great for image processing), the AI we can implement will continue to improve, and who knows what is going to come next?
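
As a rough illustration of the logic-gate argument, the sketch below builds a NAND gate from a single sigmoid neuron and then wires four of them into XOR. The weights (-20, -20) and the constant bias term (+30) are hand-picked assumptions, not learned values, and the bias is something the tutorial's own network does not use. Nand and Xor_From_Nands are hypothetical helper names.

```autohotkey
; Sketch: a NAND gate as one sigmoid neuron with hand-picked weights,
; then XOR built from four such neurons (two layers deep or more).
Report := ""
For Index, Pair in [[0, 0], [0, 1], [1, 0], [1, 1]]
{
	Report .= Pair[1] . " XOR " . Pair[2] . " = " . Xor_From_Nands(Pair[1], Pair[2]) . "`n"
}
MsgBox % Report
Return

; One neuron: weights -20 and -20, bias +30 (assumed values). Answers 0 only when both inputs are 1.
Nand(A, B)
{
	Activation := 1 / (1 + Exp(-1 * (-20 * A - 20 * B + 30)))
	Return (Activation > 0.5) ? 1 : 0
}

; The classic four-NAND construction of XOR.
Xor_From_Nands(A, B)
{
	N := Nand(A, B)
	Return Nand(Nand(A, N), Nand(B, N))
}
```

Since NAND alone is enough to build every other gate, a network of such neurons can in principle reproduce any logic circuit.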

11. All right, so how do we go about implementing these MLNNs?

If the sections above sparked some interest in you to implement a full-fledged MLNN, fear not: this is going to be explained next in this tutorial. The following sections are once again based on a post by Milo Spencer-Harper which we have successfully translated to AutoHotkey (or sort of).

First, let's talk about the main problem...

When we implemented the single-layer perceptron in sections 4 to 8 of this tutorial, we did so by writing code that recalculates the weights used to process the inputs. This recalculation had the purpose of approximating the final function to an underlying rule in the samples. This was achieved by tuning the weights of the perceptron with the difference between the actual output and the expected output in the training samples. If the output had to be bigger, we just had to increase the weights, and if the output had to be lower, we just had to decrease the weights. In an MLNN creator code we still have to tune the weights, but we have a new situation to account for: there will be some work going on between layers. If this statement was not clear enough, picture it like this: if a neuron of the first layer outputs a positive result, a neuron of the second layer may just change that to negative. So we cannot simply adjust the weights in the first layer up if we want the final output to go up (like we did before), because a bigger output from the first layer may be transformed into a bigger negative when it goes through the second layer (or rather: if we adjust a weight UP in the first layer, we may cause the final output to be pushed further DOWN by the second layer). This means that we cannot adjust the weights in the first layer without accounting for what is going on in the second layer.
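
A small sketch of this effect, using one neuron per layer and made-up numbers (the input, both weights and the Forward helper are assumptions for illustration only). Because the second-layer weight is negative, raising the first-layer weight makes the final output smaller.

```autohotkey
; Sketch: raising a first-layer weight can LOWER the final output.
; All numbers are illustrative assumptions.
X := 1.0
W2 := -3.0      ; the second-layer weight is negative

Out_Before := Forward(X, 0.5, W2)    ; first-layer weight = 0.5
Out_After := Forward(X, 1.5, W2)     ; first-layer weight raised to 1.5
MsgBox % "Before: " . Out_Before . "`nAfter raising the first-layer weight: " . Out_After
Return

Forward(X, W1, W2)
{
	Hidden := Sigmoid(X * W1)       ; output of the first layer
	Return Sigmoid(Hidden * W2)     ; output of the second layer
}

Sigmoid(X)
{
	Return 1 / (1 + Exp(-1 * X))
}
```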

Confusing huh

But don't worry, because this new problem is solvable nonetheless. All we have to do is find the ratio between the tuning of the weights of the first layer and the impact it will have on the output of the second layer. In other words: if we raise the first layer's weights by 10%, how much will the output change? Or, put differently: what is the rate of change of the final output with respect to the weights of layer 1?

Answering this is what will allow us to implement the method of backpropagation, and there is a field of mathematics that specializes in providing this type of answer: calculus.

The derivative of a function is a second function that represents the rate of change of the first function. It is this path that is going to lead us to successfully answering "how much does a change in the weights of a layer affect the final output".
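
The sketch below compares two ways of getting that rate of change for a tiny network with one neuron per layer. One uses the chain rule with the sigmoid derivative written in terms of the sigmoid's own output, s * (1 - s); the other simply nudges the weight and measures the change. The input, the weights, the nudge size and the function names are all assumptions made for illustration. The two results should agree to several decimal places.

```autohotkey
; Sketch: rate of change of the final output with respect to a first-layer weight.
; X, W1, W2 and NUDGE are illustrative assumptions.
X := 1.0
W1 := 0.5
W2 := -3.0
NUDGE := 0.0001

MsgBox % "Chain rule: " . Rate_By_Chain_Rule(X, W1, W2) . "`nMeasured: " . Rate_By_Nudging(X, W1, W2, NUDGE)
Return

Rate_By_Chain_Rule(X, W1, W2)
{
	Hidden := Sigmoid(X * W1)
	Out := Sigmoid(Hidden * W2)
	; (derivative of layer 2's sigmoid) * W2 * (derivative of layer 1's sigmoid) * X
	Return Out * (1 - Out) * W2 * Hidden * (1 - Hidden) * X
}

Rate_By_Nudging(X, W1, W2, Nudge)
{
	Out := Sigmoid(Sigmoid(X * W1) * W2)
	Out_Nudged := Sigmoid(Sigmoid(X * (W1 + Nudge)) * W2)
	Return (Out_Nudged - Out) / Nudge
}

Sigmoid(X)
{
	Return 1 / (1 + Exp(-1 * X))
}
```

Note how the second-layer weight W2 appears inside the chain-rule product: the adjustment of a first-layer weight cannot be computed without it.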

That being said...

12. Let's get practical again!

Suppose we have the following situation.

[Image: Case Table 6.png]

This is an instance of the XOR problem that our single-layer neural network (perceptron) could not solve. How do we go about creating a multi-layer neural network to solve it?

First, let's define our MLNN implementation. We will create an MLNN in which the first layer contains 4 perceptrons, each receiving 3 inputs, and the second layer contains a single perceptron receiving 4 inputs (1 from each of the perceptrons in the first layer).

[Image: MLNN_B.png]
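
Before reading the code, it may help to see where its 16 weights come from. The sketch below (Count_Weights and Describe_Weights are hypothetical helpers, not part of the tutorial's script) walks through the 3-4-1 layout and reports one weight matrix per layer: 3 x 4 for the first layer, 4 x 1 for the second, 16 weights in total.

```autohotkey
; Sketch: count the weights needed by the 3-4-1 layout described above.
LAYER_SIZES := [3, 4, 1]    ; inputs, first-layer neurons, second-layer neurons
MsgBox % Describe_Weights(LAYER_SIZES)
Return

; Each layer needs (number of inputs) x (number of neurons) weights.
Count_Weights(Sizes)
{
	Total := 0
	Loop % Sizes.MaxIndex() - 1
	{
		Total += Sizes[A_Index] * Sizes[A_Index + 1]
	}
	Return Total
}

Describe_Weights(Sizes)
{
	Report := ""
	Loop % Sizes.MaxIndex() - 1
	{
		Rows := Sizes[A_Index]
		Cols := Sizes[A_Index + 1]
		Report .= "Layer " . A_Index . " weight matrix: " . Rows . " x " . Cols . " (" . (Rows * Cols) . " weights)`n"
	}
	Return Report . "Total: " . Count_Weights(Sizes) . " weights"
}
```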

The code below does exactly this, and it has been commented to provide a step-by-step idea of how to achieve the correct results.

Warning: The comments in the code below are a valuable part of this tutorial! Don't skip studying the code.

Code:

 SetBatchLines, -1
 
 ; The code below does a lot of matrix calculations. This is important mostly as a means of organization. We would need far too many loose variables if we did not use matrices, so we are better off using them.
 
 ; We start by initializing random numbers into the weight variables (this simulates a first hypothesis of a solution and allows the training to begin).
 ; Since we are planning to have a first layer with 4 neurons that have 3 inputs each and a second layer with 1 neuron that has 4 inputs, we need a total of 16 initial hypotheses (random weights).
Loop 16 
{
	Random, Weight_%A_Index%, -1.0, 1.0
}

; And then organize them into a matrix for each layer.
WEIGHTS_1 := Array([Weight_1, Weight_2, Weight_3, Weight_4], [Weight_5, Weight_6, Weight_7, Weight_8], [Weight_9, Weight_10, Weight_11, Weight_12]) ; Initial 12 weights of layer 1. MATRIX 3 x 4 (3 inputs x 4 neurons). 
WEIGHTS_2 := Array([Weight_13], [Weight_14], [Weight_15], [Weight_16]) ; Initial 4 weights of layer 2. MATRIX 4 x 1 (4 inputs x 1 neuron). 

TRAINING_INPUTS := array([0, 0, 1], [0, 1, 1], [1, 0, 1], [0, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0]) ; We will also feed the net creator code with the values of the inputs in the training samples (all organized in a matrix too). MATRIX 7 x 3.
EXPECTED_OUTPUTS := Array([0],[1],[1],[1],[1],[0],[0]) ; And we will also provide the net creator with the expected answers to our training samples so that the net creator can properly train the net.

; Below we are declaring a number of objects that we will need to hold our matrices.
OUTPUT_LAYER_1 := Object(), OUTPUT_LAYER_2 := Object(), OUTPUT_LAYER_1_DERIVATIVE := Object(), OUTPUT_LAYER_2_DERIVATIVE := Object(), LAYER_1_DELTA := Object(), LAYER_2_DELTA := Object(), OLD_INDEX := 0


Loop 60000 ; This is the training loop (the network creator code). In this loop we recalculate the weights to approximate the desired results based on the samples. We will do 60,000 training cycles.
{
	; First, we calculate an output from layer 1. This is done by multiplying the inputs by the weights and applying the sigmoid function to the result.
	OUTPUT_LAYER_1 := SIGMOID_OF_MATRIX(MULTIPLY_MATRICES(TRAINING_INPUTS, WEIGHTS_1))
	
	; Then we calculate a derivative (rate of change) for the output of layer 1.
	OUTPUT_LAYER_1_DERIVATIVE := DERIVATIVE_OF_SIGMOID_OF_MATRIX(OUTPUT_LAYER_1)
	
	; Next, we calculate the outputs of the second layer.
	OUTPUT_LAYER_2 := SIGMOID_OF_MATRIX(MULTIPLY_MATRICES(OUTPUT_LAYER_1, WEIGHTS_2))
	
	; And then we also calculate a derivative (rate of change) for the outputs of layer 2.
	OUTPUT_LAYER_2_DERIVATIVE := DERIVATIVE_OF_SIGMOID_OF_MATRIX(OUTPUT_LAYER_2)
	
	; Next, we check the error of layer 2. Since layer 2 is the last, this is just the difference between the expected results and the calculated results.
	LAYER_2_ERROR := DEDUCT_MATRICES(EXPECTED_OUTPUTS, OUTPUT_LAYER_2)
	
	; Now we calculate a delta for layer 2: the error scaled by the rate of change (derivative) of the layer's output. It tells us how much, and in which direction, the output should move.
	LAYER_2_DELTA := MULTIPLY_MEMBER_BY_MEMBER(LAYER_2_ERROR, OUTPUT_LAYER_2_DERIVATIVE)
	
	; Then, we transpose the matrix of weights (this is just to allow the matrix multiplication below: transposing swaps the rows and columns of the matrix).
	WEIGHTS_2_TRANSPOSED := TRANSPOSE_MATRIX(WEIGHTS_2)
	
	; !! IMPORTANT !!
	; So, we multiply (matrix multiplication) the delta of layer 2 by the transposed matrix of weights of layer 2. 
	; This is what gives us a matrix that represents the error of layer 1 (REMEMBER: the error of layer 1 is measured through the delta of layer 2).
	; It may seem counter-intuitive at first that the error of layer 1 is calculated solely with arguments about layer 2, but you have to interpret this line alongside the line below (just read it).
	LAYER_1_ERROR := MULTIPLY_MATRICES(LAYER_2_DELTA, WEIGHTS_2_TRANSPOSED)
	
	; Thus, when we calculate the delta of layer 1, we are finally connecting the layer 2 arguments (by means of LAYER_1_ERROR) to the layer 1 arguments (by means of OUTPUT_LAYER_1_DERIVATIVE).
	; The deltas are the key to understanding multi-layer neural networks. Their calculation answers this: if I change the weights of layer 1 by X, how much will it change layer 2's output?
	; This delta defines the adjustment of the weights of layer 1 a few lines below...
	LAYER_1_DELTA := MULTIPLY_MEMBER_BY_MEMBER(LAYER_1_ERROR, OUTPUT_LAYER_1_DERIVATIVE)
	
	; Then, we transpose the matrix of training inputs (again, this is just to allow the matrix multiplication below).
	TRAINING_INPUTS_TRANSPOSED := TRANSPOSE_MATRIX(TRAINING_INPUTS)
	
	; Finally, we calculate how much we have to adjust the weights of layer 1. The delta of layer 1 versus the inputs we used this time is the key here.
	ADJUST_LAYER_1 := MULTIPLY_MATRICES(TRAINING_INPUTS_TRANSPOSED, LAYER_1_DELTA)

	; Another matrix transposition to allow the multiplication...
	OUTPUT_LAYER_1_TRANSPOSED := TRANSPOSE_MATRIX(OUTPUT_LAYER_1)
	
	; And finally, we also calculate how much we have to adjust the weights of layer 2. The delta of layer 2 versus the inputs of layer 2 (which are really the outputs of layer 1) is the key here.
	ADJUST_LAYER_2 := MULTIPLY_MATRICES(OUTPUT_LAYER_1_TRANSPOSED,LAYER_2_DELTA)
	
	; And then we adjust the weights to approximate the intended results.
	WEIGHTS_1 := ADD_MATRICES(WEIGHTS_1, ADJUST_LAYER_1)
	WEIGHTS_2 := ADD_MATRICES(WEIGHTS_2, ADJUST_LAYER_2)
	
	; The conditional below is just to display the current progress in the training loop.
	If (A_Index >= OLD_INDEX + 600)
	{
		TrayTip, Status:, % "TRAINING A NEW NETWORK: " . Round(A_Index / 600, 0) . "`%"
		OLD_INDEX := A_Index
	}	
}

; TESTING OUR OUTPUT NETWORK!

; First, we assign our validation case to variables (this case, [1, 1, 0], was not part of the training samples):
Input1 := 1
Input2 := 1
Input3 := 0

; Then, we evaluate the function of each of the first layer components!
Out_1 := Sigmoid(Input1 * WEIGHTS_1[1,1] + Input2 * WEIGHTS_1[2,1] + Input3 * WEIGHTS_1[3,1])
Out_2 := Sigmoid(Input1 * WEIGHTS_1[1,2] + Input2 * WEIGHTS_1[2,2] + Input3 * WEIGHTS_1[3,2])
Out_3 := Sigmoid(Input1 * WEIGHTS_1[1,3] + Input2 * WEIGHTS_1[2,3] + Input3 * WEIGHTS_1[3,3])
Out_4 := Sigmoid(Input1 * WEIGHTS_1[1,4] + Input2 * WEIGHTS_1[2,4] + Input3 * WEIGHTS_1[3,4])

; Which are fed into the function of the second layer to form the final function!
Out_Final := Sigmoid(Out_1 * WEIGHTS_2[1,1] + Out_2 * WEIGHTS_2[2,1] + Out_3 * WEIGHTS_2[3,1] + Out_4 * WEIGHTS_2[4,1])

; REMEMBER: The sigmoid result below is to be interpreted like this: a number above 0.5 equals an answer of 1, and how close the number is to 1 is how certain the network is of its answer. A number below 0.5 equals an answer of 0, and how close the number is to 0 is how certain the network is of its answer.
msgbox % "The final network thinks the result is: " . Out_Final


; The final weights of the network are displayed next. They are what hold the underlying rule and provide the solution. Once these are calculated, there is nothing else to calculate: just apply the weights and you will get the result. That is why a neural network is expensive (in terms of processing power) to train but (usually) extremely light to run.
MSGBOX % "WEIGHT 1 OF NEURON 1 OF LAYER 1: " . WEIGHTS_1[1,1]
MSGBOX % "WEIGHT 2 OF NEURON 1 OF LAYER 1: " . WEIGHTS_1[2,1]
MSGBOX % "WEIGHT 3 OF NEURON 1 OF LAYER 1: " . WEIGHTS_1[3,1]

MSGBOX % "WEIGHT 1 OF NEURON 2 OF LAYER 1: " . WEIGHTS_1[1,2]
MSGBOX % "WEIGHT 2 OF NEURON 2 OF LAYER 1: " . WEIGHTS_1[2,2]
MSGBOX % "WEIGHT 3 OF NEURON 2 OF LAYER 1: " . WEIGHTS_1[3,2]

MSGBOX % "WEIGHT 1 OF NEURON 3 OF LAYER 1: " . WEIGHTS_1[1,3]
MSGBOX % "WEIGHT 2 OF NEURON 3 OF LAYER 1: " . WEIGHTS_1[2,3]
MSGBOX % "WEIGHT 3 OF NEURON 3 OF LAYER 1: " . WEIGHTS_1[3,3]

MSGBOX % "WEIGHT 1 OF NEURON 4 OF LAYER 1: " . WEIGHTS_1[1,4]
MSGBOX % "WEIGHT 2 OF NEURON 4 OF LAYER 1: " . WEIGHTS_1[2,4]
MSGBOX % "WEIGHT 3 OF NEURON 4 OF LAYER 1: " . WEIGHTS_1[3,4]

MSGBOX % "WEIGHT 1 OF NEURON 1 OF LAYER 2: " . WEIGHTS_2[1,1]
MSGBOX % "WEIGHT 2 OF NEURON 1 OF LAYER 2: " . WEIGHTS_2[2,1]
MSGBOX % "WEIGHT 3 OF NEURON 1 OF LAYER 2: " . WEIGHTS_2[3,1]
MSGBOX % "WEIGHT 4 OF NEURON 1 OF LAYER 2: " . WEIGHTS_2[4,1]

RETURN ; aaaand That's it !! :D The logical part of the ANN code ends here (the results are displayed above). Below are just the bodies of the functions that do the math (matricial multiplication, sigmoid function, etc). But you can have a look at them if you want, i will provide some explanation there too.


; The function below applies a sigmoid function to a single value and returns the results.
Sigmoid(x)
{
	return  1 / (1 + exp(-1 * x))
}


Return
; The function below applies the derivative of the sigmoid function to a single value and returns the results.
Derivative(x)
{
	Return x * (1 - x)
}

Return
; The function below applies the sigmoid function to all the members of a matrix and returns the results as a new matrix.
SIGMOID_OF_MATRIX(A)
{
	RESULT_MATRIX := Object()
	Loop % A.MaxIndex()
	{
		CURRENT_ROW := A_Index
		Loop % A[1].MaxIndex()
		{
			CURRENT_COLUMN := A_Index
			RESULT_MATRIX[CURRENT_ROW, CURRENT_COLUMN] := 1 / (1 + exp(-1 * A[CURRENT_ROW, CURRENT_COLUMN]))
		}
	}
	Return RESULT_MATRIX
}

Return
; The function below applies the derivative of the sigmoid function to all the members of a matrix and returns the results as a new matrix. 
DERIVATIVE_OF_SIGMOID_OF_MATRIX(A)
{
	RESULT_MATRIX := Object()
	Loop % A.MaxIndex()
	{
		CURRENT_ROW := A_Index
		Loop % A[1].MaxIndex()
		{
			CURRENT_COLUMN := A_Index
			RESULT_MATRIX[CURRENT_ROW, CURRENT_COLUMN] := A[CURRENT_ROW, CURRENT_COLUMN] * (1 - A[CURRENT_ROW, CURRENT_COLUMN])
		}
	}
	Return RESULT_MATRIX
}

Return
; The function below multiplies the individual members of two matrices with the same coordinates one by one (This is NOT equivalent to matrix multiplication).
MULTIPLY_MEMBER_BY_MEMBER(A,B)
{
	If ((A.MaxIndex() != B.MaxIndex()) OR (A[1].MaxIndex() != B[1].MaxIndex()))
	{
		msgbox, 0x10, Error, You cannot multiply matrices member by member unless both matrices are of the same size!
		Return
	}
	RESULT_MATRIX := Object()
	Loop % A.MaxIndex()
	{
		CURRENT_ROW := A_Index
		Loop % A[1].MaxIndex()
		{
			CURRENT_COLUMN := A_Index
			RESULT_MATRIX[CURRENT_ROW, CURRENT_COLUMN] := A[CURRENT_ROW, CURRENT_COLUMN] * B[CURRENT_ROW, CURRENT_COLUMN]
		}
	}
	Return RESULT_MATRIX
}

Return
; The function below transposes a matrix. I.E.: Member[2,1] becomes Member[1,2]. Matrix dimensions ARE affected unless it is a square matrix.
TRANSPOSE_MATRIX(A)
{
	TRANSPOSED_MATRIX := Object()
	Loop % A.MaxIndex()
	{
		CURRENT_ROW := A_Index
		Loop % A[1].MaxIndex()
		{
			CURRENT_COLUMN := A_Index
			TRANSPOSED_MATRIX[CURRENT_COLUMN, CURRENT_ROW] := A[CURRENT_ROW, CURRENT_COLUMN]
		}
	}
	Return TRANSPOSED_MATRIX
}

Return
; The function below adds a matrix to another.
ADD_MATRICES(A,B)
{
	If ((A.MaxIndex() != B.MaxIndex()) OR (A[1].MaxIndex() != B[1].MaxIndex()))
	{
		msgbox, 0x10, Error, You cannot add matrices unless they are of same size! (The number of rows and columns must be equal in both)
		Return
	}
	RESULT_MATRIX := Object()
	Loop % A.MaxIndex()
	{
		CURRENT_ROW := A_Index
		Loop % A[1].MaxIndex()
		{
			CURRENT_COLUMN := A_Index
			RESULT_MATRIX[CURRENT_ROW, CURRENT_COLUMN] := A[CURRENT_ROW,CURRENT_COLUMN] + B[CURRENT_ROW,CURRENT_COLUMN]
		}
	}
	Return RESULT_MATRIX
}

Return
; The function below deducts a matrix from another.
DEDUCT_MATRICES(A,B)
{
	If ((A.MaxIndex() != B.MaxIndex()) OR (A[1].MaxIndex() != B[1].MaxIndex()))
	{
		msgbox, 0x10, Error, You cannot subtract matrices unless they are of same size! (The number of rows and columns must be equal in both)
		Return
	}
	RESULT_MATRIX := Object()
	Loop % A.MaxIndex()
	{
		CURRENT_ROW := A_Index
		Loop % A[1].MaxIndex()
		{
			CURRENT_COLUMN := A_Index
			RESULT_MATRIX[CURRENT_ROW, CURRENT_COLUMN] := A[CURRENT_ROW,CURRENT_COLUMN] - B[CURRENT_ROW,CURRENT_COLUMN]
		}
	}
	Return RESULT_MATRIX
}

Return
; The function below multiplies two matrices according to matrix multiplication rules.
MULTIPLY_MATRICES(A,B)
{
	If (A[1].MaxIndex() != B.MaxIndex())
	{
		msgbox, 0x10, Error, Number of Columns in the first matrix must be equal to the number of rows in the second matrix.
		Return
	}
	RESULT_MATRIX := Object()
	Loop % A.MaxIndex() ; Rows of A
	{
		CURRENT_ROW := A_Index
		Loop % B[1].MaxIndex() ; Cols of B
		{
			CURRENT_COLUMN := A_Index
			RESULT_MATRIX[CURRENT_ROW, CURRENT_COLUMN]  := 0
			Loop % A[1].MaxIndex()
			{
				RESULT_MATRIX[CURRENT_ROW, CURRENT_COLUMN] += A[CURRENT_ROW, A_Index] * B[A_Index, CURRENT_COLUMN]
			}
		}
	}
	Return RESULT_MATRIX
}

Return
; The function below does a single step in matrix multiplication (THIS IS NOT USED HERE).
MATRIX_ROW_TIMES_COLUMN_MULTIPLY(A,B,RowA)
{
	If (A[RowA].MaxIndex() != B.MaxIndex())
	{
		msgbox, 0x10, Error, Number of Columns in the first matrix must be equal to the number of rows in the second matrix.
		Return
	}
	Result := 0
	Loop % A[RowA].MaxIndex()
	{
		Result += A[RowA, A_index] * B[A_Index, 1]
	}
	Return Result
}

For an interactive GUI Version of the code above, check this post by SpeedMaster.

For a class-style code with bias calculation and more options in the GUI, check this post by Nnnik.

With the code above, we have successfully implemented an instance of the vanilla form of an Artificial Neural Network (also called the multi-layer perceptron). This concludes our tutorial on the basics of ANNs. With that code you should have a nice starting point to implement new ANNs and achieve new results. Modify it, add to it, make it suited to your liking, or just solve new problems with your own ideas based on these concepts. I am leaving that freedom and opportunity to you now.
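
As one possible starting point for such modifications, the hand-written Out_1 to Out_4 lines of the testing part could be generalized into a reusable forward pass. The sketch below is only an illustration: PREDICT and LAYER_FORWARD are hypothetical helpers that are not part of the original script, and the weights are placeholders (in practice you would pass the trained WEIGHTS_1 and WEIGHTS_2).

```autohotkey
; Sketch: a generic forward pass for a two-layer network.
; W1 and W2 below are placeholder weights (assumptions), not trained values.
W1 := [[0.5, -0.5, 0.25, 1.0], [0.5, -0.5, 0.25, -1.0], [0.0, 0.1, -0.2, 0.3]]    ; 3 x 4
W2 := [[1.0], [-1.0], [0.5], [0.25]]                                               ; 4 x 1
MsgBox % "Output for [1, 1, 0]: " . PREDICT([1, 1, 0], W1, W2)
Return

; Runs one sample through both layers and returns the single final output.
PREDICT(Inputs, W1, W2)
{
	Hidden := LAYER_FORWARD(Inputs, W1)
	Final := LAYER_FORWARD(Hidden, W2)
	Return Final[1]
}

; Multiplies a row of values by a weight matrix and applies the sigmoid to each result.
LAYER_FORWARD(Row, Weights)
{
	Result := []
	Loop % Weights[1].MaxIndex()    ; one output per column (neuron)
	{
		Neuron := A_Index
		Sum := 0
		Loop % Row.MaxIndex()
		{
			Sum += Row[A_Index] * Weights[A_Index, Neuron]
		}
		Result[Neuron] := 1 / (1 + Exp(-1 * Sum))
	}
	Return Result
}
```

Because the layer sizes are read from the matrices themselves, the same two functions keep working if you change the number of inputs or hidden neurons.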

If you wish to learn more about the concepts involved in ANNs, this is a great video series to start: https://www.youtube.com/watch?v=aircAruvnKk
And if you wish to go through a step-by-step implementation of an ANN that recognizes handwritten digits, this is a great online book by Michael Nielsen: http://neuralnetworksanddeeplearning.com/

Also, feel free to post any questions regarding ANNs and we will try to find a solution.

05 Jan 2018, 05:32

Very interesting - if I could find the time to play around with this right now I would. However with finals just around the corner just a little bit of distraction could cause some serious issues for me.

Joe Glines · 05 Jan 2018, 11:17

Great overview Gio! I've been a data scientist since long before they were called that, and this is a wonderful, easy-to-follow overview of the topic!

Great job!

05 Jan 2018, 12:36

Thanks Joe

I am still writing, actually. I have just broken up the formulas in the code (to aid their interpretation and study). I am also planning on extending this tutorial a little bit.

05 Jan 2018, 13:08

Wow! Very nice, truly a valuable addition to this forum!

05 Jan 2018, 13:30

So just to clarify my understanding... The example equations you have in the second image, are these just arbitrary, to show that each neuron will tend towards its own 'equation' or rather 'thinking'? In some sense, these equations could be different from machine to machine, or rather 'model' to 'model'?

YouTube · 05 Jan 2018, 14:26

Nice write-up, I enjoyed reading it. I am a little unsure whether the example is training a small net, as I read a few times, or whether it is only a single neuron. After my first read I believe it to be just one neuron, am I wrong?

05 Jan 2018, 16:39

joedf wrote: So just to clarify my understanding... The example equations you have in the second image, are these just arbitrary, to show that each neuron will tend towards its own 'equation' or rather 'thinking'? In some sense, these equations could be different from machine to machine, or rather 'model' to 'model'?

You are right. The values of the weights in the 5-neuron network image are indeed just arbitrary examples to explain the concepts. I wrote them merely to illustrate a possible configuration. They can and will be very different in any specific net.

I am a little unsure whether the example is training a small net, as I read a few times, or whether it is only a single neuron. After my first read I believe it to be just one neuron, am I wrong?

Milo viewed this example as a single-neuron, single-layer network in which the lone neuron receives 3 inputs and processes them with 3 different weights to achieve 1 output. In my opinion this is mostly conceptual though: I have seen people explain nets in which each neuron in the first layer is interpreted as being just 1 input and 1 weight. In this sense, the example could be seen as a 3-neuron, single-layer network: the reinterpretation would not really change the formula. Milo's next tutorial is about multi-layered networks. I think these will have a much better boundary definition.
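
For reference, the single-neuron view can be written out in a few lines. This is only a sketch: the inputs, the weights and the Neuron_Output helper are made up for illustration. Whichever interpretation you prefer, the formula is the same weighted sum followed by a sigmoid.

```autohotkey
; Sketch: one neuron, three inputs, three weights, one output.
; The inputs and weights are arbitrary illustration values (assumptions).
INPUTS := [1, 0, 1]
WEIGHTS := [0.8, -0.4, 0.2]
MsgBox % "Neuron output: " . Neuron_Output(INPUTS, WEIGHTS)
Return

Neuron_Output(Inputs, Weights)
{
	Sum := 0
	For Index, Value in Inputs
	{
		Sum += Value * Weights[Index]    ; weighted sum of the inputs
	}
	Return 1 / (1 + Exp(-1 * Sum))       ; sigmoid of the sum
}
```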

05 Jan 2018, 18:26

I have added a second part to the tutorial. I will now consider it finished. Keep in mind this is a tutorial on the very basics; there is a lot more to the subject. But if anyone wants to discuss more advanced concepts about ANNs in this topic, feel free to do so.

The more we learn, the better!

In the meanwhile, this is a very fun ANN video to watch (A genetic algorithm learns how to fight!):
https://www.youtube.com/watch?v=u2t77mQmJiY

And another (Computer tries to replicate my voice!):
https://www.youtube.com/watch?v=jSsMqjMcRAg

Joe Glines · 09 Jan 2018, 10:44

This is a decent Intro to Machine Learning done by MIT professor John Guttag

10 Jan 2018, 00:15

Yes MIT open course ware! Thanks for sharing

12 Jan 2018, 19:54

I have updated the tutorial to contain a section on multi-layered neural networks. I figured I had to do it, otherwise it would just be far too basic.
I am still writing it, but there is enough for a good read already.

As always, feel free to post any questions.

Joe Glines · 13 Jan 2018, 08:39

@Gio, I'm curious: have you created ANNs in other languages? And, if so, how does AutoHotkey stack up in terms of performance? I'm not expecting to use AHK on "big data" but was just curious whether we should only be using it for learning or whether it is feasible to use it for a "real" program.

I've been using SPSS for over 20 years and it has a lot of stats built into it; however, the syntax language is archaic at doing some basic things. I can now call Python or R code from inside it, but I'd love to be able to just do it all in AutoHotkey...
Thanks again for your work!
Regards,
Joe

13 Jan 2018, 10:15

have you created ANN in other languages? And, if so, how does AutoHotkey stack up in terms of performance.

Not really, but I ran Milo's code (which is the basis of my AHK code) in Python 3.6.4 and the performance results were greatly in favour of his Python code (his code in Python took 5 seconds while mine in AHK took 40 seconds on the same laptop). That being said, the code is not exactly the same. Milo's code uses a very famous Python library called NumPy, which has probably been greatly optimized for matrix calculations, whereas I just wrote some blunt loop-based matrix multiplication functions myself, with no intention of providing excellent performance.
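
If you want to measure this kind of thing on your own machine, a rough sketch like the one below can time a loop-based matrix multiplication using the built-in A_TickCount variable. The matrix size of 60 and the helper names Random_Matrix and Multiply are assumptions for illustration; the timing will of course vary from computer to computer.

```autohotkey
; Sketch: time a loop-based matrix multiplication with A_TickCount.
; The matrix size (60) is an arbitrary assumption; adjust it to your machine.
SetBatchLines, -1
SIZE := 60
A := Random_Matrix(SIZE, SIZE)
B := Random_Matrix(SIZE, SIZE)

Start := A_TickCount
C := Multiply(A, B)
Elapsed := A_TickCount - Start
MsgBox % "Multiplying two " . SIZE . " x " . SIZE . " matrices took " . Elapsed . " ms"
Return

; Builds a Rows x Cols matrix filled with random values between -1 and 1.
Random_Matrix(Rows, Cols)
{
	M := []
	Loop % Rows
	{
		Row := A_Index
		Loop % Cols
		{
			Random, Value, -1.0, 1.0
			M[Row, A_Index] := Value
		}
	}
	Return M
}

; Plain triple-loop matrix multiplication (no size checks, for brevity).
Multiply(A, B)
{
	Result := []
	Loop % A.MaxIndex()
	{
		Row := A_Index
		Loop % B[1].MaxIndex()
		{
			Col := A_Index
			Sum := 0
			Loop % A[1].MaxIndex()
			{
				Sum += A[Row, A_Index] * B[A_Index, Col]
			}
			Result[Row, Col] := Sum
		}
	}
	Return Result
}
```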

I'm not expecting to use AHK on "big data" but was just curious if we should just be using it for learning or is it feasible to use it for a "real" program?

Artificial Neural Networks are a very broad subject and there is certainly enough room for development in AHK. Performance depends a lot on how you write the code and what you are trying to do. Aside from vanilla ANNs (or multi-layer perceptrons), there are Convolutional Neural Network and Recurrent Neural Network models, among others, which greatly improve performance for specific tasks, such as image and voice recognition. There is also the difference between a network creator code and a network itself (i.e. you could get a network creator code running in one language and then translate just the final network to AHK). And there is also the fact that sampling (collecting sufficient, high-quality labeled training samples) and perhaps other tasks (such as preventing overfitting) are usually what really takes the longest time when designing ANNs. In my opinion AHK IS a valid option for learning and even designing neural networks for almost everyone, and I would only think about using a more performance-oriented language if I actually stumbled upon a very specific situation where performance was really going to impair me in the short term. Even translating the code to another language if the need arises would not be that much of a hassle after all.
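
To illustrate the "creator code versus final network" point: the final network is nothing more than its weight matrices, so moving it around only requires writing those numbers out as text and reading them back. The sketch below is one hedged way to do that in memory. MATRIX_TO_TEXT and TEXT_TO_MATRIX are hypothetical helpers, the weights are placeholders, and the built-in Format() function assumes AutoHotkey v1.1.17 or later.

```autohotkey
; Sketch: turn a weight matrix into text and back.
; The weights below are placeholders (assumptions), not trained values.
WEIGHTS := [[0.25, -1.5], [3.0, 0.125]]
Text := MATRIX_TO_TEXT(WEIGHTS)
Restored := TEXT_TO_MATRIX(Text)
MsgBox % "Saved text:`n" . Text . "`n`nRestored member [2,1]: " . Restored[2, 1]
Return

; One line per row, members separated by commas.
MATRIX_TO_TEXT(Matrix)
{
	Text := ""
	For RowIndex, Row in Matrix
	{
		Line := ""
		For ColIndex, Value in Row
		{
			; "{:.17g}" keeps enough digits to restore the same floating-point value.
			Line .= (ColIndex > 1 ? "," : "") . Format("{:.17g}", Value)
		}
		Text .= (RowIndex > 1 ? "`n" : "") . Line
	}
	Return Text
}

TEXT_TO_MATRIX(Text)
{
	Matrix := []
	For RowIndex, Line in StrSplit(Text, "`n")
	{
		For ColIndex, Value in StrSplit(Line, ",")
		{
			Matrix[RowIndex, ColIndex] := Value + 0    ; "+ 0" converts the text back to a number
		}
	}
	Return Matrix
}
```

The resulting text could be written with FileAppend and loaded with FileRead, or pasted into a program written in another language.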

And thanks for the video Joe, it was a great intro to the history of machine learning

Joe Glines · 13 Jan 2018, 14:17

One of the reasons I was learning Python a few years ago was for doing statistical analysis. I played with NumPy and Pandas (and a few others) but, at the time, it was taking too long for me to perform my analysis compared to doing it in SPSS.

Regarding that Machine Learning video: I find it amusing because I've been utilizing tools like Linear Regression, Discriminant Analysis, Probit/Logit models, Regression Trees and CHAID (Chi-Square Automatic Interaction Detection) for 20+ years, but I'd never have considered them "machine learning". I guess it is in the eye of the beholder...

15 Jan 2018, 16:01

That is an interesting point indeed, Joe: the processing power requirements of statistical analysis on big data versus the processing power requirements of neural network simulation. Once we know that vanilla ANNs could in theory find the solution to any problem (if it has a solution, of course), being limited mostly by finite processing power and sample quality, we are led to conclude that whatever we can do to increase processing power should be the way to go for researching and implementing neural networks.

But if we compare ANN research and implementation with applying solid statistical models that basically just need raw processing power, there is a key difference that really changes the goal for ANNs.

ANNs are currently in a different state: while we do have a lot to thank the processing power acquired in the last decade or so for (which is what allowed the field of ANNs to flourish, with new people running simulations on their home computers), ANNs are currently in a state of continuous research for new models and implementations that provide either better efficiency or just new uses. The models of ANN implementations we have today are not settled at all: there is a lot of room for making changes and testing different things, or even just applying the same models to solve different problems never attempted before.

From these attempts, new ideas are coming to light, and models like Convolutional Neural Networks and Recurrent Neural Networks have evolved to efficiently solve problems that would just be too costly to solve using vanilla ANNs (you would need far too many perceptrons and layers to do the same thing) with nothing more than what a home computer can provide. And these CNNs are not a static thing: different implementations are getting different results over time.

This is the key difference: while you can get like 2x, 4x, or even 10x more processing power using different hardware or software (or even 1000x using a Google supercomputer or something like that), that is actually not too much when you consider the fact that ANN processing costs for the same model of ANN may increase exponentially depending on the task at hand (or rather how it is programmed), with new layers and an increased number of perceptrons per layer having a huge impact on processing requirements. That is why the most successful models to date for tasks like image and voice recognition are NOT vanilla ANNs, but rather modified models like CNNs or Recurrent Neural Networks. And even the actual implementations of these vary greatly between different codes. Through time, new implementations of the same type of models have provided much better results for tasks like handwritten character recognition (see the example chart here: https://en.wikipedia.org/wiki/MNIST_dat ... erformance).

So the current research focus in ANNs is divided mostly into:
1 - Using existing models for new things (like this guy trying to teach an ANN to play baroque music or this list of 30 out-of-the-box-thinking applications of ANNs);
2 - Finding new models and new architectures or just optimizing existing ones.

It is not a field that currently relies on finding more processing power to do the same things with the same tools. The focus is on changing the models. That is why some people say that implementing ANNs is currently somewhat of a craft. So, for example, when one does ANN research, he or she can quickly estimate the time required to run a simulation, and if this time is just too long, there is actually no other choice but getting back to the code and trying something else or changing something there. Brute force is not a key option unless you just want to see some code brought to its conclusion for a presentation or something like that. Most times you can even estimate the final results based on the curves produced by the first few hundreds or thousands of iterations if you want to.
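
As a rough sketch of what watching those curves can look like, the code below trains a single sigmoid neuron and records the mean error every 1,000 iterations. Everything here is an assumption made for illustration: the four samples, the starting weights, the iteration counts and the Train_And_Log helper. On a real network you would log the same figure from inside your training loop.

```autohotkey
; Sketch: record the mean error every 1,000 iterations while training one sigmoid neuron.
; The samples, starting weights and iteration counts are illustrative assumptions.
SetBatchLines, -1
SAMPLES := [[0, 0, 1], [1, 1, 1], [1, 0, 1], [0, 1, 1]]
EXPECTED := [0, 1, 1, 0]
MsgBox % Train_And_Log(SAMPLES, EXPECTED, [0.1, -0.1, 0.05], 5000, 1000)
Return

Train_And_Log(Samples, Expected, Weights, Iterations, Log_Every)
{
	History := ""
	Loop % Iterations
	{
		Iteration := A_Index
		Total_Error := 0
		Adjust := [0, 0, 0]    ; one accumulated adjustment per weight (3 inputs assumed)
		For Index, Sample in Samples
		{
			Sum := 0
			For Position, Value in Sample
			{
				Sum += Value * Weights[Position]
			}
			Out := 1 / (1 + Exp(-1 * Sum))
			Err := Expected[Index] - Out
			Total_Error += Abs(Err)
			For Position, Value in Sample
			{
				Adjust[Position] += Value * Err * Out * (1 - Out)    ; input * error * sigmoid derivative
			}
		}
		For Position, Value in Adjust
		{
			Weights[Position] += Value
		}
		If (Mod(Iteration, Log_Every) = 0)
		{
			History .= "Iteration " . Iteration . ": mean error " . (Total_Error / Samples.MaxIndex()) . "`n"
		}
	}
	Return History
}
```

If the logged error stops shrinking early on, that is usually a hint to go back and change the model instead of waiting for more iterations.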

Anyway, this post is of course NOT a denial of the need for more processing power. It is TRUE that more processing power would be great, but I hope I have been clear on why I think research and actual new implementations CAN be made with AHK.

An interesting read about the relation between different architectural designs of ANNs and the problem of available processing power: https://towardsdatascience.com/neural-n ... 6e5bad51ba

Joe Glines · 25 Feb 2018, 17:10

Here's a decent article discussing what machine learning is and frequent misconceptions of it
http://news.codecademy.com/what-is-machine-learning/

26 Feb 2018, 08:55

A very interesting article Joe, thanks for sharing

Also, this week saw the launch of Samsung's Galaxy S9. It comes to compete with Apple's iPhone X, and what is remarkable about both of them is that the breakthrough technological advancements of this new generation of smartphones are mostly based on AI and neural networks. There is little mention of hardware advancements in the spotlight, as it is almost all about the software and AI now.

And the ideas they brought up are remarkably simple (and powerful)!

Example 1: Samsung's Galaxy S9 translates text in images captured by the camera and blends the translation into the image (neural networks!).

Example 2: Samsung's Galaxy S9 uses face recognition to create custom emojis (neural networks!).

Example 3: Samsung's Galaxy S9 and Apple's iPhone X use face recognition to unlock your phone (neural networks!).

Also related: the iPhone X comes with a built-in Neural Engine (neural networks!).

Joe Glines · 21 Mar 2018, 08:22

Our webinar on Neural Networks was awesome! You can check out the recordings and resources shared here

Great job presenting Gio!
