The post Robot Localization III: The Kalman Filter appeared first on deep ideas.

This post deals with another solution to the continuous state space problem, the **Kalman Filter**, invented by Thiele, Swerling and Kalman. It has been used successfully in many applications, such as the mission to Mars or automatic missile guidance systems (cf. [NEGENBORN, abstract]). The classical application is radar tracking, but there are many other applications (cf. [NORVIG, pp. 588, 589]).

In its essence, it is an implementation of the Bayes Filter in which the belief is a normal (Gaussian) distribution and can therefore be represented by its parameters: a mean vector and a covariance matrix. In this representation, the mean vector is the expected state and the covariance matrix is a measure of uncertainty.

In order for the Kalman Filter to work, we need to make a few assumptions about the system we wish to describe (in addition to the Markov assumptions of the Bayes Filter). If these assumptions hold, the belief $bel(x_t)$ *will be normally distributed at each time step* $t$ and can thus be represented by a mean vector $\mu_t$ and a covariance matrix $\Sigma_t$. Conversely, if any of the three assumptions is violated, the belief will in general be non-Gaussian for $t \geq 1$ (cf. [RISTIC, p. 4]). In this sense, the assumptions are both necessary and sufficient for the Kalman Filter. In the next section, we will see how the Kalman Filter algorithm follows from these assumptions for one-dimensional state spaces. After that, we will take a look at the multi-dimensional algorithm. The assumptions are as follows:

- The transition model is linear Gaussian, which means that $x_{t+1}$ is a linear function of $x_t$ with added Gaussian noise:

  $$x_{t+1} = A_{t+1} \cdot x_t + \Delta_{t+1} + \epsilon_{t+1}$$

  where $A$ is a matrix, $\Delta$ is a translation vector and $\epsilon$ is a vector representing unpredictable Gaussian transition noise.
- The sensor model $P(e_t \vert x_t)$ is also linear Gaussian:

  $$e_{t+1} = B_{t+1} \cdot x_{t+1} + \Gamma_{t+1} + \zeta_{t+1}$$

  where $B$ is a matrix, $\Gamma$ is a translation vector and $\zeta$ is a vector representing unpredictable Gaussian sensor noise.
- The initial belief $bel(x_0)$ is normally distributed.

The assertion that the belief is always normally distributed is very important, since it keeps the belief update computationally tractable for arbitrary time steps. In the general case, i.e. for arbitrary transition and sensor distributions, a representation of the belief could, as we argued in chapter 2, grow unboundedly over time.

For simplicity, we’ll first assume that we are dealing with a one-dimensional state space (i.e. $x_t$ is just a real number, e.g. a position along a line). We will take a look at the multidimensional case later. The transition phase from time $t$ to $t + 1$ just adds some number $\delta_{t+1}$ to the state, plus some unpredictable Gaussian noise $\epsilon_{t+1}$ (as before, imagine a robot moving at a desired speed of $\delta$ per time step with some unpredictable random error):

$$x_{t+1} = x_t + \delta_{t+1} + \epsilon_{t+1}$$

Then our **transition model** is given by

$$P(x_{t+1} \, \vert \, x_t) = \mathcal{N}(x_t + \delta_{t+1}, \phi^2)$$

The variance $\phi^2$ acts as a measure of uncertainty, reflecting the transition noise $\epsilon$. In the robot example, assuming we are at position $x_t$ at time step $t$, the position at time step $t+1$ is a Gaussian cloud around an expected position of $x_t + \delta_{t+1}$ with a variance (uncertainty) of $\phi^2$.

Our **sensor model** is given by

$$P(e_{t+1} \, \vert \, x_{t+1}) = \mathcal{N}(x_{t+1}, \psi^2)$$

Again, the variance $\psi^2$ acts as a measure of uncertainty, this time for the measurement noise $\zeta$. In the robot example, assuming that we are at position $x_{t+1}$, the measurement that we get can be expected to be sampled from a Gaussian cloud around $x_{t+1}$ with a variance of $\psi^2$.
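To get a feeling for the two models, one can sample from them. The following is a minimal sketch using only Python's standard library; the function names, the fixed seed and the choice of four time steps are assumptions made for illustration, not part of the original derivation.

```python
import random


def sample_transition(x_t, delta, phi_sq, rng):
    """Draw x_{t+1} from N(x_t + delta, phi_sq)."""
    # random.gauss expects a standard deviation, hence the square root
    return rng.gauss(x_t + delta, phi_sq ** 0.5)


def sample_measurement(x, psi_sq, rng):
    """Draw a measurement e from N(x, psi_sq)."""
    return rng.gauss(x, psi_sq ** 0.5)


rng = random.Random(0)  # arbitrary seed, only for reproducibility
x = 0.0
trajectory, measurements = [x], []
for _ in range(4):
    x = sample_transition(x, delta=1.0, phi_sq=0.1, rng=rng)
    trajectory.append(x)
    measurements.append(sample_measurement(x, psi_sq=1.0, rng=rng))
```

The lists `trajectory` and `measurements` then play the roles of the true positions $x_t$ and the noisy observations $e_t$.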

Assuming that the belief at some time step $t$ is a normal distribution, i.e. $bel(x_t) = \mathcal{N}(\mu_t, \sigma_t^2)$, it can be shown that the **projected belief** $\overline{bel}(x_{t+1})$ is also a normal distribution with mean $\overline{\mu}_{t+1} = \mu_t + \delta_{t+1}$ and variance $\overline{\sigma}_{t+1}^2 = \sigma_{t}^2 + \phi^2$.

Considering the robot example, it should not surprise us that the expected position at time step $t+1$ is just the expected position at time step $t$ plus the expected distance $\delta_{t+1}$ that we wanted to move. Moreover, it seems reasonable that our new uncertainty in the belief, $\overline{\sigma}_{t+1}^2$, is given by the old uncertainty $\sigma_{t}^2$ plus the uncertainty that we get due to the transition $\phi^2$.

Now, assuming that $\overline{bel}(x_{t+1})$ is normally distributed, it can be shown that the **updated belief** $bel(x_{t+1})$ after receiving a measurement $e_{t+1}$ is a normal distribution as well, this time with mean $\mu_{t+1} = \overline{\mu}_{t+1} + k_{t+1} \cdot (e_{t+1} - \overline{\mu}_{t+1})$ and variance $\sigma_{t+1}^2 = (1 - k_{t+1}) \overline{\sigma}_{t+1}^2$ where $k_{t+1} = \frac{\overline{\sigma}_{t+1}^2}{\overline{\sigma}_{t+1}^2 + \psi^2}$.

We can see that the new mean is a weighted average of the new measurement and the projected mean, where the weights are proportional to the variance of the projected belief and the variance of the sensor noise, respectively. This makes intuitive sense: The importance of the new measurement increases with the uncertainty of the current belief, whilst the importance of the current belief increases with the uncertainty of the measurement.
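Substituting the definition of $k_{t+1}$ makes this weighting explicit. This is only a rearrangement of the update formula for the mean, not an additional result:

$$\mu_{t+1} = (1 - k_{t+1}) \, \overline{\mu}_{t+1} + k_{t+1} \, e_{t+1} = \frac{\psi^2 \, \overline{\mu}_{t+1} + \overline{\sigma}_{t+1}^2 \, e_{t+1}}{\overline{\sigma}_{t+1}^2 + \psi^2}$$

The projected mean is weighted by the sensor variance $\psi^2$ and the measurement by the projected variance $\overline{\sigma}_{t+1}^2$, so whichever of the two is more uncertain counts less.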

A proof of these statements can be found in [NEGENBORN, pp. 34 – 37].

Now that all the preparatory work is done, we can formulate the actual Kalman Filter algorithm. It is basically a variant of the Bayes Filter with the property that the beliefs $bel(x_t)$ and $\overline{bel}(x_t)$ are now represented by their parameterizations $(\mu_t, \sigma_t^2)$ and $(\overline{\mu}_t, \overline{\sigma}_t^2)$, respectively. As with the Bayes Filter, the correctness follows by induction.

One-Dimensional Kalman Filter

- $\overline{\mu}_{t+1} = \mu_t + \delta_{t+1}$
- $\overline{\sigma}_{t+1}^2 = \sigma_{t}^2 + \phi^2$
- $k_{t+1} = \frac{\overline{\sigma}_{t+1}^2}{\overline{\sigma}_{t+1}^2 + \psi^2}$
- $\mu_{t+1} = \overline{\mu}_{t+1} + k_{t+1} \cdot (e_{t+1} - \overline{\mu}_{t+1})$
- $\sigma_{t+1}^2 = (1 - k_{t+1}) \overline{\sigma}_{t+1}^2$
- return $\mu_{t+1}, \sigma_{t+1}^2$
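The steps above translate almost line by line into code. Here is a minimal sketch in plain Python; the function name `kalman_step_1d`, its argument names and the numbers in the usage line are assumptions chosen for illustration.

```python
def kalman_step_1d(mu, sigma_sq, delta, e, phi_sq, psi_sq):
    """One step of the one-dimensional Kalman Filter.

    mu, sigma_sq: mean and variance of the belief at time t
    delta:        intended movement between t and t+1
    e:            measurement received at time t+1
    phi_sq:       variance of the transition noise
    psi_sq:       variance of the sensor noise
    """
    # Projection (prediction) step
    mu_bar = mu + delta
    sigma_sq_bar = sigma_sq + phi_sq
    # Update step
    k = sigma_sq_bar / (sigma_sq_bar + psi_sq)  # Kalman gain
    mu_new = mu_bar + k * (e - mu_bar)
    sigma_sq_new = (1.0 - k) * sigma_sq_bar
    return mu_new, sigma_sq_new


# Illustrative numbers: no transition noise, equally uncertain belief and sensor
mu1, var1 = kalman_step_1d(mu=0.0, sigma_sq=1.0, delta=1.0, e=2.0,
                           phi_sq=0.0, psi_sq=1.0)
```

In this illustrative call, the projected belief and the sensor are equally uncertain, so the gain is 0.5 and the new mean lies halfway between the projected mean 1.0 and the measurement 2.0.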

The variable $k$ is often called the **Kalman gain** (cf. [NORVIG, p. 588]) and functions as a measure of how important the new measurement is. If the uncertainty of the projected belief is low, then the Kalman gain will be low and thus the new measurement will not have a big impact on the belief. Likewise, if the uncertainty of the measurement is high, the Kalman gain will be low; if the uncertainty of the measurement is low, the Kalman gain will be high.

The Kalman gain is first incorporated in the expectation update. First, the deviation of the measurement from the projected expectation, $e_{t+1} - \overline{\mu}_{t+1}$, is calculated, then it is weighted with the Kalman gain and finally it is added to the projected expectation. This has exactly the desired effect that the new measurement has an impact on the belief that is proportional to its importance. Depending on how much new information has been incorporated, the uncertainty decreases, which is implemented in the variance update.

We will now shed some light on this algorithm by applying it to a one-dimensional robot localization problem up to time step 4. The state, i.e. the robot’s location, is simply a real number. The robot believes that it starts out at $x_0 = 0$ with some uncertainty, which is reflected by a prior belief of $\mathcal{N}(\mu_0 = 0, \sigma_0^2 = 1.0)$.

We assume that the robot moves at a constant average speed of $\delta_t = 1$ with a transition noise variance of $\phi^2 = 0.1$. The positions of the robot shall be $x_0 = 0, x_1 = 0.4543, x_2 = 1.3752, x_3 = 2.2080, x_4 = 3.4944$. I sampled these positions randomly using the specified transition model. Of course, they are not known to the algorithm and they shall only be used for a later comparison with the resulting beliefs (and to create the measurements). We can see that the transition noise really had an impact here. For example, from time step 0 to 1, the robot only moved 0.4543 units when the expected distance was 1 unit.

In our example, the robot is able to sense its position with a measurement noise variance of $\psi^2 = 1.0$. This is a lot of noise: in expectation, only about 68.3% of the measurements are within a distance of 1 unit of the actual position (which is already a big interval), and the remaining 31.7% fall outside this interval. Let’s assume that we make the following measurements (which have been sampled from the sensor model using the actual positions specified above): $e_1 = 3.3558, e_2 = -0.0570, e_3 = 1.8155, e_4 = 3.7446$. We can see the obvious impact of the measurement noise: Although we were at position 0.4543 at time step 1, we measured the position 3.3558.
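The share of measurements that fall within one unit of the true position can be checked with the Gaussian error function. This is a small sketch using Python's `math.erf`; the helper name `prob_within` is made up for illustration.

```python
import math


def prob_within(distance, variance):
    """P(|e - x| < distance) for zero-mean Gaussian noise with the given variance."""
    sigma = math.sqrt(variance)
    return math.erf(distance / (sigma * math.sqrt(2.0)))


p_inside = prob_within(1.0, 1.0)   # roughly 0.68 for psi^2 = 1.0
p_outside = 1.0 - p_inside         # roughly 0.32
```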

The following figure shows the development of the belief for the first four time steps, both numerically and graphically. At each time step, the black graphs show the belief specified in the upper right-hand corner, whereas the red graphs show the measurement probabilities $P(e_t \, \vert \, x_t)$. The blue line shows the position $x_t$ and the green line the expected position, i.e. the mean of the belief distribution. Take some time to go over the graphs and do not let the mass of information confuse you. Once you have understood this example, you will be able to visualize the Kalman Filter, which helps a lot when using it.

We can see that even though the measurements have been very bad, we still arrive at a belief that is quite reasonable, with an error of only 0.144.
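The numbers of this example can be reproduced with a few lines of Python. The sketch below assumes the prior, noise variances and measurements stated above; the function name `kalman_filter_1d` and the variable names are made up for illustration.

```python
def kalman_filter_1d(mu, sigma_sq, deltas, measurements, phi_sq, psi_sq):
    """Run the one-dimensional Kalman Filter and return all (mean, variance) pairs."""
    history = []
    for delta, e in zip(deltas, measurements):
        # Projection step
        mu_bar = mu + delta
        sigma_sq_bar = sigma_sq + phi_sq
        # Update step
        k = sigma_sq_bar / (sigma_sq_bar + psi_sq)
        mu = mu_bar + k * (e - mu_bar)
        sigma_sq = (1.0 - k) * sigma_sq_bar
        history.append((mu, sigma_sq))
    return history


measurements = [3.3558, -0.0570, 1.8155, 3.7446]
history = kalman_filter_1d(mu=0.0, sigma_sq=1.0, deltas=[1.0] * 4,
                           measurements=measurements, phi_sq=0.1, psi_sq=1.0)

final_mu, final_var = history[-1]
error = abs(final_mu - 3.4944)  # distance to the true position x_4
print(round(error, 3))          # 0.144
```

Note how the first measurement pulls the mean to about 2.23 although the robot is actually at 0.4543; the later measurements gradually correct this.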

My next article will be about the multi-dimensional Kalman Filter, i.e. the situation when we have a multi-dimensional state vector $x_t$, e.g. a 2D or 3D position, along with varying speed.

The post Dealing with Unbalanced Classes in Machine Learning appeared first on deep ideas.

Unbalanced classes create two problems:

- The accuracy (i.e. the ratio of test samples for which we predicted the correct class) is no longer a good measure of the model performance. A model that just predicts “not cancer” every time will yield a 95% accuracy, even though it is a bad (and even dangerous) model that does not yield any insight or scientific advancement, despite the fact that “95% accuracy” sounds like something good. In addition, it’s hard to get an intuition for how good a model with 96%, 97% or 98% accuracy really is.
- The training process might arrive at a local optimum that always predicts “not cancer”, making it hard to further improve the model.
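To make the first problem concrete, here is a minimal sketch. It assumes a hypothetical test set of 100 samples of which 5 are cancer cases (which matches the 95% figure above); all variable names are made up for illustration.

```python
# Hypothetical test set: label 1 = cancer, label 0 = not cancer
y_true = [1] * 5 + [0] * 95

# A "model" that predicts "not cancer" for every sample
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
detected_cases = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(accuracy)        # 0.95
print(detected_cases)  # 0
```

The accuracy looks good, yet not a single cancer case is detected.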

Fortunately, these problems are not so difficult to solve. Here are a few ways to tackle them.

1. If possible, you could collect more data for the underrepresented classes to match the number of samples in the overrepresented classes. This is probably the most rewarding approach, but it is also the hardest and most time-consuming, if not downright impossible. In the cancer example, there is a good reason that we have way more non-cancer samples than cancer samples: These are easier to obtain, since there are more people in the world who haven’t developed cancer.

2. Artificially increase the number of training samples for the underrepresented classes by creating copies. While this is the easiest solution, it wastes time and computing resources. In the cancer example, we would almost have to double the size of the dataset in order to achieve a 50:50 share between the classes, which also doubles training time without adding any new information.

3. Similar to 2, but create augmented copies of the underrepresented classes. For example, in the case of images, create slightly rotated, shifted or flipped versions of the original images. This has the positive side-effect of making the model more robust to unseen examples. However, it only does so for the underrepresented classes. Ideally, you would want to do this for all classes, but then the classes are unbalanced again and we’re back where we started.

4. Remove training samples from the overrepresented classes so that the number of training samples for all classes is the same. This solves our problem and reduces training time, but it makes our model worse. After all, we want to use as much labelled data as we possibly can, even if this causes unbalanced classes. I don’t recommend this solution.
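The copy-based approach described above can be sketched in a few lines. This is only an illustration under stated assumptions: a binary problem, samples and labels stored in plain Python lists, and a hypothetical helper name `oversample_minority`.

```python
import random


def oversample_minority(samples, labels, minority_label, rng):
    """Duplicate randomly chosen minority samples until both classes have equal size.

    Assumes a binary problem with at least one minority sample.
    """
    minority = [s for s, l in zip(samples, labels) if l == minority_label]
    n_majority = len(labels) - len(minority)
    n_extra = max(n_majority - len(minority), 0)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return samples + extra, labels + [minority_label] * n_extra


# Hypothetical data set: 5 cancer samples (label 1), 95 non-cancer samples (label 0)
samples = list(range(100))
labels = [1] * 5 + [0] * 95
new_samples, new_labels = oversample_minority(samples, labels, 1, random.Random(0))
```

With the 95:5 split of the cancer example, the data set grows from 100 to 190 samples, which illustrates the near doubling mentioned above.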

The sensitivity tells us the probability that we detect cancer, given that the patient really has cancer. It is thus a measure of how good we are at correctly diagnosing people who have cancer.

$$sensitivity = Pr(detect\, cancer \; \vert \; cancer) = \frac{\text{true positives}}{\text{positives}}$$

The specificity tells us the probability that we do not detect cancer, given that the patient doesn’t have cancer. It measures how good we are at not causing people to believe that they have cancer if in fact they do not.

$$specificity = Pr(\lnot \, detect\, cancer \; \vert \; \lnot \, cancer) = \frac{\text{true negatives}}{\text{negatives}}$$

A model that always predicts cancer will have a sensitivity of 1 and a specificity of 0. A model that never predicts cancer will have a sensitivity of 0 and a specificity of 1. An ideal model should have both a sensitivity of 1 and a specificity of 1. In reality, however, this is unlikely to be achievable. Therefore, we should look for a model that achieves a good tradeoff between specificity and sensitivity. So which one of the two is more important? This can’t be said in general. It highly depends on the application.
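The two extreme models can be checked directly. The following sketch computes both measures from plain label lists; it assumes binary labels (1 = cancer, 0 = no cancer) with at least one sample of each class, and the function name is made up for illustration.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, 1 = cancer."""
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    true_negatives = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    positives = sum(1 for t in y_true if t == 1)
    negatives = len(y_true) - positives
    return true_positives / positives, true_negatives / negatives


y_true = [1] * 5 + [0] * 95  # hypothetical test set

always_cancer = sensitivity_specificity(y_true, [1] * 100)  # (1.0, 0.0)
never_cancer = sensitivity_specificity(y_true, [0] * 100)   # (0.0, 1.0)
```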

If you build a photo-based skin cancer detection app, then a high sensitivity is probably more important than a high specificity, since you want to cause people who might have cancer to get themselves checked by a doctor. Specificity is a little less important here, but still, if you detect cancer too often, people might stop using your app since they unnecessarily get annoyed and scared.

Now suppose that our desired tradeoff between sensitivity and specificity is given by a number $t \in [0, 1]$ where $t = 1$ means that we only pay attention to sensitivity, $t = 0$ means we only pay attention to specificity and $t = 0.5$ means that we regard both to be equally important. In order to incorporate the desired tradeoff into the training process, we need the samples of the different classes to have a different contribution to the loss. To achieve this, we can simply multiply the contribution of the cancer samples to the loss by

$$\frac{\text{number of non-cancer samples}}{\text{number of cancer samples}} \cdot t$$

In Keras, the class weights can easily be incorporated into the loss by adding the following parameter to the fit function (assuming that 1 is the cancer class):

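```python
# The weight 1.0 for the non-cancer class (0) is spelled out explicitly,
# since some Keras versions expect an entry for every class.
class_weight={ 0: 1.0, 1: n_non_cancer_samples / n_cancer_samples * t }
```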

Now, while we train, we want to monitor the sensitivity and specificity. Here is how to do this in Keras. In other frameworks, the implementation should be similar (for instance, you could replace all the K calls by numpy calls).

```python
from keras import backend as K
from keras.optimizers import RMSprop


def sensitivity(y_true, y_pred):
    # Share of the actual positives that are predicted as positive
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    return true_positives / (possible_positives + K.epsilon())


def specificity(y_true, y_pred):
    # Share of the actual negatives that are predicted as negative
    true_negatives = K.sum(K.round(K.clip((1-y_true) * (1-y_pred), 0, 1)))
    possible_negatives = K.sum(K.round(K.clip(1-y_true, 0, 1)))
    return true_negatives / (possible_negatives + K.epsilon())
```

```python
model.compile(
    loss='binary_crossentropy',
    optimizer=RMSprop(0.001),
    metrics=[sensitivity, specificity]
)
```

If we have more than two classes, we can generalize sensitivity and specificity to a “per-class accuracy”:

$$perClassAccuracy(C) = Pr(detect\, C \; \vert \; C)$$

In order to train for maximum per-class accuracy, we have to specify class weights that are inversely proportional to the size of the class:

```python
class_weight={
    0: 1.0/n_samples_0,
    1: 1.0/n_samples_1,
    2: 1.0/n_samples_2,
    ...
}
```
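If the labels are available as a list of class ids, such a dictionary can be built automatically. Here is a small sketch using `collections.Counter`; the helper name `inverse_frequency_weights` and the toy labels are assumptions for illustration.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Map each class id to 1 / (number of samples of that class)."""
    counts = Counter(labels)
    return {class_id: 1.0 / n for class_id, n in counts.items()}


# Toy labels: four samples of class 0, two of class 1, one of class 2
weights = inverse_frequency_weights([0, 0, 0, 0, 1, 1, 2])
print(weights)  # {0: 0.25, 1: 0.5, 2: 1.0}
```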

Here is a Keras implementation of the per-class accuracy, which I adapted from jdehesa at Stack Overflow.

```python
from keras import backend as K

INTERESTING_CLASS_ID = 0  # Choose the class of interest


def single_class_accuracy(y_true, y_pred):
    class_id_true = K.argmax(y_true, axis=-1)
    class_id_preds = K.argmax(y_pred, axis=-1)
    # Select the samples whose true class is the class of interest, so that
    # the result matches Pr(detect C | C). Masking on class_id_preds instead
    # would measure how often a prediction of C is correct.
    accuracy_mask = K.cast(K.equal(class_id_true, INTERESTING_CLASS_ID), 'int32')
    class_acc_tensor = K.cast(K.equal(class_id_true, class_id_preds), 'int32') * accuracy_mask
    class_acc = K.sum(class_acc_tensor) / K.maximum(K.sum(accuracy_mask), 1)
    return class_acc
```
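For checking such a metric outside of Keras, the definition $Pr(detect\, C \; \vert \; C)$ can be evaluated on plain lists of class ids. This is a sketch with a made-up function name and toy data, not the Keras metric itself.

```python
def per_class_accuracy(true_ids, pred_ids, class_id):
    """Share of the samples of class `class_id` that are predicted as `class_id`."""
    relevant = [(t, p) for t, p in zip(true_ids, pred_ids) if t == class_id]
    if not relevant:
        return 0.0  # convention chosen here for classes without samples
    return sum(1 for _, p in relevant if p == class_id) / len(relevant)


true_ids = [0, 0, 1, 1, 1, 2]
pred_ids = [0, 1, 1, 1, 0, 2]

acc_0 = per_class_accuracy(true_ids, pred_ids, 0)  # 1 of 2 samples detected
acc_1 = per_class_accuracy(true_ids, pred_ids, 1)  # 2 of 3 samples detected
acc_2 = per_class_accuracy(true_ids, pred_ids, 2)  # 1 of 1 samples detected
```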

The post Robot Localization II: The Histogram Filter appeared first on deep ideas.

The **Histogram Filter** is the most straightforward solution to represent continuous beliefs. We simply divide $dom(x_t)$ into $n$ disjoint bins $b_0, \dots, b_{n-1}$ such that $\cup_i b_{i} = dom(x_t)$. Then we define a new state $x_t^\prime \in \{0, \dots, n - 1\}$ where $x_t^\prime = i$ if and only if $x_t \in b_i$. Since $x_t^\prime$ has a discrete, finite state space, we can use the discrete Bayes Filter to calculate $bel(x_t^\prime)$.

$bel(x_t^\prime)$ is then an approximation of $bel(x_t)$: For each bin $b_i$, it gives us the probability that $x_t$ is in that bin. The more bins we use, the more accurate the approximation becomes, with the downside of increasing computational complexity.

To make this clearer, we shall apply the Histogram Filter to a global localization example as displayed in the following image:

A self-driving car lives in a one-dimensional, cyclic world that is 5 meters wide. By cyclic, we mean that if it is in the rightmost cell and moves one step to the right, it’s back in the leftmost cell. The robot’s position at each time step is given as $pos_t \in [0, 5)$, which is the only state variable. It has a sensor that is, under uncertainty, able to tell the color of the wall next to it. We assume that the car is constantly moving right under noise, at an expected speed of one meter per time step.

In order to apply the Histogram Filter, we choose the following decomposition of the state space: $b_0 = [0, 1)$, $b_1 = [1, 2)$, $b_2 = [2, 3)$, $b_3 = [3, 4)$, $b_4 = [4, 5)$. This way, the position can be measured as a discrete variable $pos_t^\prime \in \{0, \dots, 4\}$, which is an estimate of the true, continuous position. Each discrete position corresponds to exactly one of the distinguished grid cells in the above image.
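The mapping from the continuous position to the discrete one is a simple binning operation. Here is a minimal sketch, assuming equally wide bins and the cyclic 5-meter world from the example; the function name `bin_index` and its parameters are made up for illustration.

```python
def bin_index(position, world_width=5.0, n_bins=5):
    """Map a continuous position in the cyclic world to the index of its bin."""
    bin_width = world_width / n_bins
    wrapped = position % world_width  # cyclic world: 5.3 m is the same as 0.3 m
    # min() guards against floating-point rounding at the upper boundary
    return min(int(wrapped // bin_width), n_bins - 1)


print(bin_index(0.99))  # 0
print(bin_index(1.0))   # 1
print(bin_index(4.5))   # 4
print(bin_index(5.3))   # 0
```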

We can now specify the transition and sensor models. We assume that the car intends to move exactly one grid cell to the right at each time step, but that the inaccuracy of the motor causes it to move 2 grid cells in 5% of the cases, not move at all in 5% of the cases and move exactly 1 grid cell in 90% of the cases. This results in the following transition model:

$$
\begin{aligned}
P(pos_t^\prime = (x + 2) \bmod 5 \; \vert \; pos_{t-1}^\prime = x) &= 0.05\\
P(pos_t^\prime = (x + 1) \bmod 5 \; \vert \; pos_{t-1}^\prime = x) &= 0.9\\
P(pos_t^\prime = x \; \vert \; pos_{t-1}^\prime = x) &= 0.05
\end{aligned}
$$
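The transition model can be written down as a matrix in which row $x$ holds the distribution $P(pos_t^\prime \; \vert \; pos_{t-1}^\prime = x)$. The following sketch builds it with plain Python lists; the names `MOVE_PROBS` and `transition_matrix` are assumptions for illustration.

```python
N_CELLS = 5
# Probability of moving 0, 1 or 2 cells to the right in one time step
MOVE_PROBS = {0: 0.05, 1: 0.9, 2: 0.05}


def transition_matrix(n_cells=N_CELLS, move_probs=MOVE_PROBS):
    """Row x contains P(new cell | old cell = x) for the cyclic world."""
    matrix = [[0.0] * n_cells for _ in range(n_cells)]
    for x in range(n_cells):
        for step, p in move_probs.items():
            matrix[x][(x + step) % n_cells] += p  # modulo implements the wrap-around
    return matrix


T = transition_matrix()
print(T[0])  # [0.05, 0.9, 0.05, 0.0, 0.0]
print(T[4])  # [0.9, 0.05, 0.0, 0.0, 0.05]
```

Every row sums to 1, since the car always ends up in exactly one of the five cells.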

As for the sensors, we assume that in 90% of the cases the measured color is correct and in 10% of the cases it is incorrect, yielding the following sensor model:

$$
\begin{aligned}
P(MeasuredColor_t = Blue \; \vert \; pos_t^\prime \in \{0, 2, 3\}) &= 0.9\\
P(MeasuredColor_t = Orange \; \vert \; pos_t^\prime \in \{0, 2, 3\}) &= 0.1\\
P(MeasuredColor_t = Blue \; \vert \; pos_t^\prime \in \{1, 4\}) &= 0.1\\
P(MeasuredColor_t = Orange \; \vert \; pos_t^\prime \in \{1, 4\}) &= 0.9
\end{aligned}
$$

Let’s now use the discrete Bayes Filter to calculate the car’s belief for three time steps where the sensor measurements are Orange, Blue and Orange in that order. We assume that the car starts at the very left (but it does not know that it does) and travels exactly one grid cell to the right per time step (which it does not know either). We can represent the belief as a 5-dimensional row vector $bel(pos_t^\prime) = (bel_{t,0}, bel_{t,1}, bel_{t,2}, bel_{t,3}, bel_{t,4})$ where $bel_{t,i}$ represents the probability that the car is in cell $i$ at time step $t$.

The car has no prior knowledge about its position. Thus, it starts out with the following belief:

$bel(pos_0^\prime) = (0.2, 0.2, 0.2, 0.2, 0.2)$

First, it projects the previous belief to the current time step:

$\overline{bel}(pos_1^\prime) = \sum_{pos_0^\prime} P(pos_1^\prime \; \vert \; pos_0^\prime) \cdot bel(pos_0^\prime)$

$= (0.05, 0.9, 0.05, 0.0, 0.0) \cdot 0.2 + (0.0, 0.05, 0.9, 0.05, 0.0) \cdot 0.2$

$+ (0.0, 0.0, 0.05, 0.9, 0.05) \cdot 0.2 + (0.05, 0.0, 0.0, 0.05, 0.9) \cdot 0.2$

$+ (0.9, 0.05, 0.0, 0.0, 0.05) \cdot 0.2 = (0.2, 0.2, 0.2, 0.2, 0.2)$

This results in the same belief as before, which shouldn’t surprise us, since each cell was equally likely to be the car’s position at time $t = 0$ and therefore, since the robot just moved blindly, each cell is still equally likely to be its position at time $t = 1$.

Now the robot updates the projected belief with the sensor input:

$bel(pos_1^\prime) = \eta \cdot P(MeasuredColor_1 = Orange \; \vert \; pos_1^\prime) \cdot \overline{bel}(pos_1^\prime)$

$= \eta \cdot (0.1, 0.9, 0.1, 0.1, 0.9) \cdot (0.2, 0.2, 0.2, 0.2, 0.2)$

$= \eta \cdot (0.02, 0.18, 0.02, 0.02, 0.18)$

$= (0.04762, 0.42857, 0.04762, 0.04762, 0.42857)$

where the last step follows by dividing the vector by the sum over all vector values so that the probabilities sum up to 1. We can see that each of the two orange cells is equally likely to have caused the sensor measurement. Thus, the robot currently has two salient theories on where it might be.

$\overline{bel}(pos_2^\prime) = \sum_{pos_1^\prime} P(pos_2^\prime \; \vert \; pos_1^\prime) \cdot bel(pos_1^\prime)$

$= (0.39048, 0.08571, 0.39048, 0.06667, 0.06667)$

$bel(pos_2^\prime) = \eta \cdot P(MeasuredColor_2 = Blue \; \vert \; pos_2^\prime) \cdot \overline{bel}(pos_2^\prime)$

$= \eta \cdot (0.9, 0.1, 0.9, 0.9, 0.1) \cdot (0.39048, 0.08571, 0.39048, 0.06667, 0.06667)$

$= (0.45165, 0.01102, 0.45165, 0.07711, 0.00857)$

$\overline{bel}(pos_3^\prime) = \sum_{pos_2^\prime} P(pos_3^\prime \; \vert \; pos_2^\prime) \cdot bel(pos_2^\prime)$

$= (0.03415, 0.40747, 0.05508, 0.41089, 0.09241)$

$bel(pos_3^\prime) = \eta \cdot P(MeasuredColor_3 = Orange \; \vert \; pos_3^\prime) \cdot \overline{bel}(pos_3^\prime)$

$= \eta \cdot (0.1, 0.9, 0.1, 0.1, 0.9) \cdot (0.03415, 0.40747, 0.05508, 0.41089, 0.09241)$

$= (0.00683, 0.73358, 0.01102, 0.08219, 0.16637)$
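The whole calculation can be reproduced with a short script. This is a minimal sketch of the discrete Bayes Filter for this specific example, assuming the transition and sensor models stated above; the function and variable names are made up for illustration.

```python
# Wall color next to each cell, as implied by the sensor model
COLORS = ["Blue", "Orange", "Blue", "Blue", "Orange"]
# Probability of moving 0, 1 or 2 cells to the right in one time step
MOVE_PROBS = {0: 0.05, 1: 0.9, 2: 0.05}
P_CORRECT = 0.9  # probability that the sensor reports the true color


def predict(belief):
    """Project the belief one time step ahead using the transition model."""
    n = len(belief)
    projected = [0.0] * n
    for x, p_x in enumerate(belief):
        for step, p_move in MOVE_PROBS.items():
            projected[(x + step) % n] += p_move * p_x
    return projected


def update(projected, measured_color):
    """Weight the projected belief by the sensor likelihood and normalize."""
    unnormalized = [
        (P_CORRECT if COLORS[i] == measured_color else 1.0 - P_CORRECT) * p
        for i, p in enumerate(projected)
    ]
    eta = 1.0 / sum(unnormalized)
    return [eta * p for p in unnormalized]


belief = [0.2] * 5  # uniform prior: no knowledge about the position
history = []
for color in ["Orange", "Blue", "Orange"]:
    belief = update(predict(belief), color)
    history.append(belief)

print([round(p, 5) for p in belief])
```

Up to rounding, the entries of `history` agree with the beliefs $bel(pos_1^\prime)$, $bel(pos_2^\prime)$ and $bel(pos_3^\prime)$ computed by hand.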

The disadvantage of the Histogram Filter is obvious: We are not able to tell the probability of each possible state. We are only able to tell the probability that the state is in a certain region of the state space. This disadvantage might be circumvented by using a very fine-grained decomposition of the state space, but this drastically increases the computational complexity.

Continue with the next part: Robot Localization III: The Kalman Filter

The post Robot Localization I: Recursive Bayesian Estimation appeared first on deep ideas.

The methods that we will learn are generic in nature, in that they can be used for various other tasks that involve rational decision making in the face of uncertainty. We will, for the main part, deal with **filtering**, which is a general method for estimating variables from noisy observations over time. In particular, we will explain the **Bayes Filter** and some of its variants, namely the **Histogram Filter**, the **Kalman Filter** and the **Particle Filter**. We will show the benefits and shortcomings of each of these algorithms and see how they can be applied to the robot localization problem.

The traditional approach in reasoning over time involves strict logical inference. In order for this to work, a few assumptions have to be made about the environment we wish to make decisions in. For instance, the environment has to be **fully observable**, which means that at any point in time we can exactly measure each aspect of the environment that is relevant to our decision making. Additionally, the environment needs to be **deterministic**, which means that, given the state of the environment at a certain point in time and a decision we choose, the resulting state of the environment is already determined – there is no randomness whatsoever. Last but not least, the environment has to be **static**, which basically means that it waits for us to make our decision before it changes.

None of these assumptions hold in realistic environments. We can never measure every aspect of an environment that might have an influence on the decision making. We can, however, use sensors to measure a small portion of the environment, but even this small portion we cannot measure with complete certainty. We call such environments **partially observable**.

Whether realistic environments are deterministic or not is actually an unanswered philosophical question. At least to humans and agents, they appear to be non-deterministic, because even though we know physical laws that allow us to describe most natural processes, there are just too many influential factors that we are unable to model precisely (e.g. wind turbulence causing a seemingly random change in the trajectory of a flying ball). Regardless of the nondeterminism, we can usually tell what is *likely* to happen and what is *unlikely* to happen. Thus, we call realistic environments **stochastic**. Moreover, realistic environments are **dynamic** as opposed to static: they are always changing. For a more thorough treatment of the nature of environments cf. [NORVIG, pp. 40 – 46].

All of these properties of realistic environments result in uncertainty about the state of the world. It is a big challenge to make rational decisions in the face of uncertainty. Humans do a great job at this every day. Even though we can never know the true state of the world and predict what is going to happen next and how we should act to achieve a desired outcome, we still manage to achieve many of our goals remarkably well. We do this by maintaining a **belief** about the state of the world at a certain point in time, which we arrive at by both prediction and observation. This belief can be thought of as a probability distribution over all the possible states of the world, conditioned on our observations. Given a belief, we can, for each possible decision, determine the probabilities of each possible outcome. After that, we choose the decisions that are most likely to achieve a desired goal state, maximize a performance measure, or the like. This behavior can reasonably be called rational. Of course, we do not actually maintain precise probability distributions in our brains and carry out calculations, but this is a way of imagining how this cognitive ability of ours roughly works and it gives us a first idea of how it can be implemented algorithmically.

It is a difficult but interesting task to implement such a behavior for autonomous agents. The purpose of this text is to give an insight into how the first half can be done – the task of maintaining a belief about the state of an environment that is updated over time through making predictions according to a model of how the system develops, interpreting periodically arriving, noisy observations (more specifically, sensor measurements) and incorporating them into the belief.

Robot localization is one of the most fundamental problems in mobile robotics. There are multiple instances of the localization problem with different difficulties (cf. [NEGENBORN, pp. 9 – 11]). In this article, we shall deal with the problem that the robot is given a map of the environment and then either needs to keep track of its position when the initial position is known, or localize itself from scratch when it could theoretically be anywhere.

One might use methods like GPS for positioning, but in many scenarios it is not accurate enough. Self-driving cars, for example, need an accuracy of a few centimeters to be viable for road traffic. As everyone with a car navigator knows, the accuracy of GPS can be grim. Therefore, it is not always an option. Since there is no reliable sensor to measure a position directly, we need to rely on other observations and infer the actual position from them. A possible way to do so would be to install cameras, use pattern recognition to spot landmarks whose positions on the map are known, determine the distances of the landmarks and then use trilateration to determine the robot’s position.
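To illustrate the trilateration step, here is a minimal sketch for the idealized two-dimensional case. It assumes exactly three landmarks that are not collinear and noise-free distances; the function name `trilaterate` and the example coordinates are made up for illustration.

```python
import math


def trilaterate(landmarks, distances):
    """Return the (x, y) position with the given distances to three landmarks."""
    (x1, y1), (x2, y2), (x3, y3) = landmarks
    d1, d2, d3 = distances
    # Subtracting the circle equation of landmark 1 from those of landmarks
    # 2 and 3 yields two linear equations of the form a*x + b*y = c
    a1, b1 = 2 * (x2 - x1), 2 * (y2 - y1)
    c1 = d1**2 - d2**2 + x2**2 - x1**2 + y2**2 - y1**2
    a2, b2 = 2 * (x3 - x1), 2 * (y3 - y1)
    c2 = d1**2 - d3**2 + x3**2 - x1**2 + y3**2 - y1**2
    det = a1 * b2 - a2 * b1
    if abs(det) < 1e-12:
        raise ValueError("landmarks must not be collinear")
    # Solve the 2x2 linear system with Cramer's rule
    return (c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det


landmarks = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
true_position = (1.0, 1.0)
distances = [math.hypot(true_position[0] - lx, true_position[1] - ly)
             for lx, ly in landmarks]
estimate = trilaterate(landmarks, distances)  # approximately (1.0, 1.0)
```

With noisy distances, the three circles generally do not intersect in a single point, which is one reason why a filtering approach is needed.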

It is reasonable to assume that the distance sensors are noisy. It becomes even more difficult when we assume that the robot is moving through the world, because movement is usually noisy as well: Even though the robot can control its average speed, motors are subjected to an unmodeled inaccuracy, resulting in unpredictable speed variations. As we can see, this is a situation as described in the previous section: The robot cannot infer its exact position from sensor data and, even if it does know its exact position at a certain point in time, it does not know it for certain anymore a moment later. This is due to the fact that the model it uses to describe the environment cannot describe the marginal factors that cause the motor to be inaccurate. As such, this problem is a good example for filtering and will therefore be used to elucidate the algorithms presented in this article.

Before we can deal with the concrete filter algorithms, we have to lay a theoretical foundation. In this article, we will model the world in such a way that all the changes in the environment take place at discrete, equidistant time steps $t \in \mathbb{N}_0$, where sensor measurements arrive at every time step $t \geq 1$. Modeling uncertainty over continuous time is more difficult, since it involves stochastic differential equations. The discrete-time model can be seen as an approximation of the continuous case (cf. [NORVIG, p. 567]).

At each point in time $t$, we can characterize a dynamic system by a state vector $x_t$, which we simply call the **state**. This state vector contains the so-called **state variables** that are necessary to describe the system. We assume that it contains the same state variables at each time step. We define the so-called **state space** $dom(x_t)$ as the set of all the possible values that $x_t$ might take. If we consider a robot moving on a plane, the state could be $x_t = (X_t, Y_t, \dot{X}_t, \dot{Y}_t)$ where $X_t$ and $Y_t$ refer to the robot’s current position and $\dot{X}_t$ and $\dot{Y}_t$ to its movement speed in the X and Y direction, respectively. In this case, the state space would be $dom(x_t) = \mathbb{R}^4$.

For each environment, there are virtually infinitely many possible state vectors, where additional state variables generally make the description of the environment more precise, with the downside of increasing the computational complexity of maintaining a belief. For example, if we consider the robot on a plane again, we could include the wind direction and force in the state vector to account for variations in the robot’s movement that are caused by the wind.

A state is called **complete** if it includes all the information that is necessary to predict the future of the system. In realistic examples, the state is usually incomplete. For example, if we assume that there are human beings interfering with the robot on the plane, then the state would have to include data that makes it possible to predict their decisions, which is practically impossible. Even in situations where we could in principle include all the influencing factors in the state, it is still often preferable not to include them to reduce computational complexity. In practice, the algorithms described in this article have turned out to be robust to incomplete states. A rule of thumb is to include enough state variables to make unmodeled effects approximately random (cf. [THRUN, p. 33]).

As alluded to in the introduction, the state $x_t$ is usually **unobservable**, which means that we cannot measure it directly. Instead, we have sensors that generate a measurement $e_t$ at each time step $t \geq 1$, which is a vector of arbitrary dimension. This measurement vector contains noisy sensor measurements that are caused by the state. In our modeling, $e_t$ always contains the same measurement variables. If we have a GPS sensor, then this measurement vector could consist of the measured X and Y coordinates. It is important to realize that these measured coordinates are generally not the same as the actual coordinates. Instead, they are *caused* by the actual coordinates but are subject to measurement noise due to the inaccuracy of GPS.

As we said, the state $x_t$ is unobservable. All we can do is maintain a belief $bel(x_t)$, given the observations. The process of determining the belief from observations is called **filtering** or **state estimation** (cf. [NORVIG, p. 570]). In mathematical terms, the belief is a probability distribution over all possible states, conditioned on the observations so far: $bel(x_t) := P(x_t \mid e_{1:t})$, where we use $e_{1:t}$ as a short-hand notation for $(e_1, e_2, \dots, e_t)$.

We also define $\overline{bel}(x_t) := P(x_t \mid e_{1:t-1})$, which is the **projected** or **predicted** belief, i.e. the probability distribution over all the possible states at time $t$, given only past observations.

As we can see, the number of measurements we have to condition on in order to determine the belief increases unboundedly over time. This means that we would have to store all the measurements, which is impossible with a limited memory. Additionally, the time needed to compute the belief would increase unboundedly, since we have to consider all the measurements so far. If we want to have a computationally tractable method for calculating the belief at arbitrary points in time, we have to find a function $f$ such that $bel(x_{t+1}) = f(bel(x_t), e_{t+1})$. This means that in order to calculate the belief at a certain time step, we take the belief of the previous time step, project it to the new time step and then update it in accordance with new evidence. Such a method is called **recursive estimation** (cf. [NORVIG, p. 571]). The **Bayes Filter** is an algorithm for doing this. But before we can formulate the algorithm and prove its correctness, we have to specify how the world evolves over time and how we interpret sensor input. Also, as we will see in the next sections, we have to make some assumptions about the system in order to arrive at a recursive formulation.

As stated in the introduction, realistic environments are not deterministic but stochastic: given a state $x_t$, we cannot tell what the state $x_{t+1}$ will be. Nevertheless, we can tell how *likely* each of the possible states $x_{t+1}$ is, given the state $x_t$. In mathematical terms, we can specify the conditional probability distribution $P(x_{t+1} \mid x_t)$. We call this distribution the **transition model**, since it is a model of how the environment transitions from one time step to the next.

Analogously, due to the partial observability of the environment (in particular, the inaccuracy of the sensors), we cannot tell which state causes exactly which sensor measurement, since there is always some measurement noise. However, we can tell how likely each possible sensor measurement $e_t$ is, given the state $x_t$. In mathematical terms, we can specify $P(e_t \mid x_t)$, which we call the **sensor model**. Given a sensor measurement $e_t$, it tells us how likely each state is to cause this measurement.
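To make the two models more tangible, here is a minimal sketch for a hypothetical one-dimensional robot. The Gaussian noise, the parameter values and all function names (`transition_density`, `sensor_density`, `sample_transition`) are assumptions chosen for illustration; the text so far does not prescribe any particular form for either model.

```python
import math
import random

# Assumed parameters for a hypothetical 1-D robot (not from the article).
VELOCITY = 1.0          # distance the robot intends to move per time step
TRANSITION_STD = 0.5    # standard deviation of the movement noise
SENSOR_STD = 2.0        # standard deviation of the position sensor noise


def gaussian_pdf(value, mean, std):
    """Density of a normal distribution with the given mean and std."""
    z = (value - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def transition_density(x_next, x):
    """Transition model P(x_{t+1} | x_t): move by VELOCITY plus noise."""
    return gaussian_pdf(x_next, x + VELOCITY, TRANSITION_STD)


def sensor_density(e, x):
    """Sensor model P(e_t | x_t): the measurement is the position plus noise."""
    return gaussian_pdf(e, x, SENSOR_STD)


def sample_transition(x, rng):
    """Draw a successor state x_{t+1} from the transition model."""
    return rng.gauss(x + VELOCITY, TRANSITION_STD)


rng = random.Random(0)
x = 0.0
for t in range(1, 4):
    x = sample_transition(x, rng)
    e = rng.gauss(x, SENSOR_STD)  # noisy measurement caused by the state
    print(f"t={t}  state={x:.2f}  measurement={e:.2f}  "
          f"P(e|x)={sensor_density(e, x):.3f}")
```

Note that the simulation has access to the true state `x`, whereas a filter would only ever see the measurements `e`.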

We will see examples for transition and sensor models in the following sections.

In order to be able to arrive at a recursive formula for maintaining the belief $bel(x_t)$, we have to make so-called **Markov assumptions** about both the transition model and the sensor model. We will see in the next section that these two assumptions allow us to arrive at a method to calculate the belief recursively.

For the transition model, the Markov assumption states that, given the state $x_t$, all states $x_{t+j}$ with $j \geq 1$ are conditionally independent of $x_{0:t-1}$ (cf. [DEGROOT, p. 188, 189]). This gives us $P(x_{t+1} \mid x_{0:t}) = P(x_{t+1} \mid x_t)$. Intuitively speaking, this assumption means that if we know the state at a certain point in time, then no previous states give us additional knowledge about the future.

We also make a sensor Markov assumption as follows: $P(e_{t+1} \mid x_{t+1}, e_{1:t}) = P(e_{t+1} \mid x_{t+1})$. This means that if we know the state $x_{t+1}$, then no sensor measurements from the past tell us anything more about the probabilities of each possible sensor measurement $e_{t+1}$.

As we stated in section 3.2, we want a method to calculate $bel(x_{t+1})$ from $bel(x_t)$ and $e_{t+1}$. We can do this in two consecutive steps. First, we calculate the projected belief $\overline{bel}(x_{t+1})$ from $bel(x_t)$. This step is usually called **projection**: We project the belief of the previous time step to the current time step. We can do this in the following way (a proof for this statement can be found in [NORVIG, p. 572]):

$$\overline{bel}(x_{t+1}) = \int_{x_t} P(x_{t+1} \mid x_t) bel(x_t)$$

The process of calculating $bel(x_{t+1})$ from $\overline{bel}(x_{t+1})$ is called **update**: We update the projected belief with the new evidence $e_{t+1}$. This can be done as follows:

$$bel(x_{t+1}) = \eta P(e_{t+1} \mid x_{t+1}) \overline{bel}(x_{t+1})$$

In this formula, $P(e_{t+1} \mid x_{t+1})$ can be obtained from the sensor model. $\eta$ has the function of a normalizing constant. This means that we do not need to calculate it directly from its definition. In the discrete case, it follows from the fact that the probabilities need to sum up to 1. In the continuous case, it follows from the fact that the probability density function needs to integrate to 1 (cf. [DEGROOT, p. 105]).
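As a small worked illustration of how $\eta$ falls out of the normalization requirement, consider a discrete case with three states. The numbers below are made up for this sketch and do not come from any example in the text.

```python
# Hypothetical numbers: a projected belief over three states and the
# sensor model values P(e | x) for one received measurement e.
projected_belief = [0.5, 0.3, 0.2]
sensor_likelihood = [0.1, 0.6, 0.3]

# Unnormalized update: P(e | x) * projected belief, state by state.
unnormalized = [l * b for l, b in zip(sensor_likelihood, projected_belief)]

# eta is simply the factor that makes the result sum to 1.
eta = 1.0 / sum(unnormalized)
belief = [eta * u for u in unnormalized]

print(eta, belief)
```

The unnormalized values are 0.05, 0.18 and 0.06, which sum to 0.29, so $\eta = 1/0.29 \approx 3.45$ and the updated belief is roughly (0.17, 0.62, 0.21).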

For the recursive formulation to work, we need a prior belief $bel(x_0)$. Most commonly, we have no knowledge beforehand, in which case we should assign equal probabilities to each possible state. If we know the state at the beginning and need to keep track of it, we should use a point mass distribution. If we only have partial knowledge, we could use some other distribution.

The Bayes filter algorithm for calculating $bel(x_{t+1})$ from $bel(x_t)$ and $e_{t+1}$ can now be formulated as follows (cf. [THRUN, p. 27]):

Continuous Bayes Filter

- $\overline{bel}(x_{t+1}) = \int_{x_t} P(x_{t+1} \mid x_t) bel(x_t)$
- $bel(x_{t+1}) = \eta P(e_{t+1} \mid x_{t+1}) \overline{bel}(x_{t+1})$

Under the assumption that $bel(x_0)$ has been initialized correctly, the correctness of this algorithm follows by induction, since we already showed that $bel(x_{t+1})$ is correctly calculated from $bel(x_t)$.

In principle, we now have a method to calculate the belief at each time step. The question arises, however, how we should represent the belief distribution. For finite state spaces, we can simply replace the integral with a sum over all possible $x_t$ and represent the belief as a finite table. We call this modified version the **Discrete Bayes Filter** (cf. [THRUN, pp. 86, 87]). We will see a concrete example for the discrete Bayes Filter in the next section.

Discrete Bayes Filter

- $\overline{bel}(x_{t+1}) = \sum_{x_t} P(x_{t+1} \mid x_t) bel(x_t)$
- $bel(x_{t+1}) = \eta P(e_{t+1} \mid x_{t+1}) \overline{bel}(x_{t+1})$
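The two steps of the Discrete Bayes Filter can be sketched in a few lines of Python. The world used here (a robot on a ring of three cells with a door sensor), the probabilities and the function names are hypothetical and only serve to exercise the two formulas; this is not the example used later in the series.

```python
def project(belief, transition):
    """Projection: bel_bar(x') = sum_x P(x' | x) * bel(x)."""
    n = len(belief)
    return [sum(transition[x][x_next] * belief[x] for x in range(n))
            for x_next in range(n)]


def update(projected, likelihood):
    """Update: bel(x') = eta * P(e | x') * bel_bar(x')."""
    unnormalized = [l * p for l, p in zip(likelihood, projected)]
    eta = 1.0 / sum(unnormalized)
    return [eta * u for u in unnormalized]


def bayes_filter_step(belief, transition, sensor, measurement):
    """One recursive step: bel(x_t), e_{t+1} -> bel(x_{t+1})."""
    projected = project(belief, transition)
    likelihood = [sensor[x][measurement] for x in range(len(belief))]
    return update(projected, likelihood)


# Assumed world: a robot on a ring of 3 cells.
# transition[x][x_next] = P(x_next | x): move one cell clockwise with
# probability 0.8, stay put with probability 0.2.
transition = [
    [0.2, 0.8, 0.0],
    [0.0, 0.2, 0.8],
    [0.8, 0.0, 0.2],
]
# sensor[x][e] = P(e | x): only cell 0 has a door; the sensor reports
# "wall" (e=0) or "door" (e=1) and is right 90% of the time.
sensor = [
    [0.1, 0.9],
    [0.9, 0.1],
    [0.9, 0.1],
]

belief = [1 / 3, 1 / 3, 1 / 3]   # uniform prior bel(x_0)
for e in [0, 0, 1]:              # measurements: wall, wall, door
    belief = bayes_filter_step(belief, transition, sensor, e)
print(belief)
```

After the three measurements, most of the probability mass sits on cell 0, the only cell with a door. Only the current belief table is stored between steps, which is the point of the recursive formulation.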

It becomes more difficult if we consider continuous state spaces. In this case, the belief becomes a probability density function (from now on abbreviated p.d.f.) over all possible states. The general way to represent such a function is by a symbolic formula. The problem arises that an exact representation of a formula for the belief function could, in the general case, grow without bounds over time (cf. [NORVIG, p. 585]). Additionally, the integration step becomes more and more complex and some p.d.f.s are not guaranteed to be integrable offhand. We are going to see three different solutions to this problem, all of which introduce a different way of representing the belief distribution: The Histogram Filter, the Kalman Filter and the Particle Filter.

Continue with Part II: The Histogram Filter.

[NORVIG] Peter Norvig, Stuart Russell (2010) *Artificial Intelligence – A Modern Approach*. 3rd edition, Prentice Hall International

[THRUN] Sebastian Thrun, Wolfram Burgard, Dieter Fox (2005) *Probabilistic Robotics*

[NEGENBORN] Rudy Negenborn (2003) *Robot Localization and Kalman Filters*

[DEGROOT] Morris DeGroot, Mark Schervish (2012) *Probability and Statistics.* 4th edition, Addison-Wesley

[BESSIERE] Pierre Bessière, Christian Laugier, Roland Siegwart (2008) *Probabilistic Reasoning and Decision Making in Sensory-Motor Systems*

The post Robot Localization I: Recursive Bayesian Estimation appeared first on deep ideas.

The post “Can Computers Think?” -“No, but…” appeared first on deep ideas.

In his essay *Can Computers Think?* [11], Searle gives his own definition of strong artificial intelligence, which he subsequently tries to refute. His definition is as follows:

One could summarise this view […] by saying that the mind is to the brain, as the program is to the computer hardware.

Searle’s first attempt at refuting the possibility of strong artificial intelligence is based on the insight that mental states have, by definition, a certain **semantic content** or **meaning**. Programs, on the other hand, are purely formal and syntactical, i.e. a sequence of symbols that do not have a meaning in themselves. Therefore, a program could not be equivalent to a mind. A formal reconstruction of this argument looks as follows:

- Syntax is not sufficient for semantics
- Programs are completely characterized by their formal, syntactical structure
- Human minds have semantic contents
- Therefore, programs are not sufficient for creating a mind

Searle emphasizes the fact that his argument is based solely on the property that programs are defined formally, regardless of which physical system is used to run the program. Therefore, it does not state that it is impossible for us today to create a strong artificial intelligence, but that this is generally impossible for any conceivable machine in the future, regardless of how fast it is or which other properties it might have.

In order to make his first premise more plausible (“Syntax is not sufficient for semantics”), Searle describes a thought experiment – the **Chinese Room**. Assume there were a program that is capable of answering Chinese questions in Chinese. No matter which question you pose in Chinese, it gives you an appropriate answer that a human Chinese speaker might also give. Searle now tries to argue that a computer running this program doesn’t actually *understand* Chinese in the same sense as a Chinese human being understands Chinese.

To this end, he assumes that the formal instructions of the program are carried out by a person who does not understand Chinese. This person is locked in a room, and the Chinese questions are passed into the room as a sequence of symbols. The room contains baskets with many other Chinese symbols, along with a list of formal instructions, which are purely syntactical rules that tell the person how to produce an answer to the question by assembling the symbols from the baskets. The answers generated by these instructions are then passed out of the room by the person. The person is not aware that the symbols that are passed into the room are questions and the symbols that are passed out of the room are answers to these questions. He just blindly carries out the instructions strictly and correctly. And these instructions generate meaningful Chinese sentences that are answers to the questions and that couldn’t be distinguished from the answers a real Chinese-speaking person would give.

Now Searle draws attention to the fact that the person in the room doesn’t *understand* Chinese simply by following formal instructions for generating answers. He continues to argue that a computer running a program that generates Chinese answers to Chinese questions therefore also doesn’t *understand* Chinese. Since this experiment could be generalized to arbitrary tasks, Searle concludes that computers are inherently incapable of understanding anything.

There are numerous objections to the Chinese Room argument by various authors. Many of these arguments are similar in nature. In the following, I will present the most commonly presented ones, including answers to these objections by Searle himself.

One of the most commonly raised objections is that even though the person in the Chinese Room does not understand Chinese, the *system as a whole* does – the room with all its constituents, including the person. This objection is often called the Systems Reply and there are various versions of it.

For example, artificial intelligence researcher, entrepreneur and author Ray Kurzweil says in [5] that the person is only an executive unit and that its properties are not to be confused with the properties of the system. If one looks at the room as an overall system, the fact that the person does not understand Chinese doesn’t entail that this also holds for the room.

Cognitive scientist Margaret Boden argues in [1] that the human brain is not the carrier of intelligence, but rather that it *causes* intelligence. Analogously, the person in the room *causes* an understanding of Chinese to arise, even though it does not understand Chinese itself.

Searle responds to the Systems Reply with the semantic argument: Even the system as a whole couldn’t go from syntax to semantics and, hence, couldn’t understand the meaning of the Chinese symbols. In [9], he adds that the person in the room could theoretically memorize all the formal rules and perform all the computations in its head. Then, he argues, the person is the entire system, could answer Chinese questions without help and perhaps even lead Chinese conversations, but still wouldn’t understand Chinese since it only carries out formal rules and can’t associate a meaning with the formal symbols.

Similar to the Systems Reply, the Virtual Mind Reply states that the person does not understand Chinese, but that a running system could create new entities that differ from both the person and the system as a whole. The understanding of Chinese could be a new entity of this sort. This standpoint is argued for by artificial intelligence researcher Marvin Minsky in [15] and philosopher Tim Maudlin in [6]. Maudlin notes that Searle didn’t provide an adequate answer to this reply thus far.

Another reply changes the thought experiment in such a way that the program is put into a robot that can perceive the world through sensors (like cameras or microphones) and interact with the world via effectors (like motors or loudspeakers). This causal interaction with the environment, the argument goes, is a guarantee that the robot understands Chinese, since the formal symbols are endowed with semantics this way – namely objects in the real world. This view presupposes an externalist semantics. This reply is raised, for example, by Margaret Boden in [1].

Searle responds to this argument in [17] with the semantic argument: The robot still only has a computer as its brain and couldn’t go from syntax to semantics. He makes this more plausible by adapting the thought experiment such that the Chinese Room itself is integrated into a robot as its central processing unit. The Chinese symbols would then be generated by sensors and passed into the room. Analogously, the symbols passed out of the room would control the effectors. Even though the robot interacts with the external world this way, the person in the room still doesn’t understand the meaning of the symbols.

Some authors, e.g. philosophers Patricia and Paul Churchland in [2], suggest that one should imagine that instead of manipulating the Chinese symbols, a computer should simulate the neuronal firings in the brain of a Chinese person. Since the computer operates in exactly the same way as a brain, the argument goes, it must understand Chinese.

Searle responds to this argument in [10]. He argues that one could also simulate the neuronal structures by a system of water pipes and valves and put it into the Chinese Room. The person in the room then has instructions on how to guide the water through the pipes in order to simulate the brain of a Chinese person. Still, he says, no understanding of Chinese is generated.

Now I present my own reply, which I have coined the **Emergence Reply**.

I grant that Searle’s arguments prove that a mind cannot be *equated* with a computer program. This is immediately obvious from the semantic argument: Since a mind has properties that a program does not have (namely semantic content), a program cannot be equal to a mind. Hence, it refutes the possibility of strong artificial intelligence by his own definition.

However, one can phrase another definition of strong artificial intelligence which, as I will argue, is not affected by Searle’s arguments:

A system exhibits **strong artificial intelligence** if it can *create* a mind as an *emergent phenomenon* by running a program.

I explicitly include any type of system, regardless of the material from which it is made – be it a computer, a Chinese Room or a gigantic hall of falling dominos or beer cans that simulate a Turing machine.

I will not try to argue for the possibility of strong artificial intelligence according to this definition. It is doubtful whether this is even possible. However, I will argue why this definition is not affected by Searle’s arguments.

In my proposed definition, no analogy between the program and the mind created by the program is demanded. Therefore, the semantic argument becomes obsolete: Even though a program as a syntactical construct doesn’t create semantics (and therefore couldn’t be equal to a mind), it doesn’t follow that a program can’t create semantic contents *in the course of its execution*.

Moreover, this definition doesn’t state that the computer hardware is the *carrier* of the mental processes. The hardware is not enabled to think this way. Rather, the computer creates the mental processes as an emergent phenomenon, similarly to how the brain creates mental processes as an emergent phenomenon. So, if one considers the question in the title of Searle’s original essay “Can Computers Think?”, the answer would be “No, but they might *create* thinking.”

How a mind can be created through the execution of a program, and what sort of ontological existence this mind would have, is a discussion topic of its own. In order to make this more plausible, imagine a program that exactly simulates the trajectories and interactions of elementary particles in the brain of a Chinese speaker. This way, the program does not only create the same outputs for the same inputs as the Chinese speaker’s brain, but proceeds *completely analogously*. There is no immediate way to exclude the possibility that the simulated brain can create a mind in exactly the same way as a real brain can. The only assumption here is that the physical processes in a brain are deterministic. There are some theories claiming that a mind requires non-deterministic quantum phenomena that can’t be simulated algorithmically. One such theory is presented by physicist Sir Roger Penrose in [7], who has founded the Penrose Institute to explore this possibility. If such theories turn out to be true, then this would be a strong argument against the possibility of strong artificial intelligence.

As regards the Chinese Room Argument, it convincingly shows that the fact that a system gives the impression of understanding something doesn’t entail that it really understands it. Not every program that the person in the Chinese Room could execute in order to converse in Chinese does in fact create understanding. This is an important insight that refutes some common misconceptions, like the belief that IBM’s Deep Blue *understands* chess in the same way as a human does, or that Apple’s Siri *understands* spoken language. Deep Blue just calculates the payoff of certain moves, and Siri just transcribes one sequence of numbers into another (albeit in a sophisticated way). This definitely doesn’t create understanding or a mind.

Moreover, the Chinese Room Argument shows that the Turing Test is no reliable indicator of strong artificial intelligence. In this test, described by Alan Turing in [12], a human subject should converse with an unknown entity and decide whether it is talking to another human or a computer, solely based on the answers that the entity gives. If the computer repeatedly manages to trick the subject, we call it intelligent. This test only measures how good a computer is at giving the impression of being intelligent without making any restrictions as to how the computer does it internally, which, as we argued already, is an important factor in determining whether a computer really exhibits strong artificial intelligence.

Additionally, Searle’s argument shows that it is not the hardware itself that understands Chinese. Even if hardware running a program creates a mind that understands Chinese, the person in the Chinese Room is the hardware and doesn’t understand Chinese.

It does not, however, refute the possibility that the hardware can create a mind that understands Chinese by executing the program. Assume there is a program that answers Chinese questions and creates mental processes that exhibit an understanding of the Chinese questions and answers. This assumption cannot be refuted by the Chinese Room Argument. If we let the person in the room execute the program via pen and paper, it is correct that the person doesn’t understand Chinese. But the person is only the hardware in this case. Its mind does not equal the mind that is created by the execution of the program.

It might seem intuitively implausible that arithmetical operations carried out with pen and paper could give rise to a mind. But this can be made more plausible by assuming, as before, that the neuronal processes in the brain are simulated in the form of these arithmetical operations. The belief that a mind could not arise in such a way may be a false intuition. There is no immediately obvious logical reason to exclude this possibility. The same holds for Searle’s system of water pipes, falling dominos, beer cans or other unorthodox hardware. If one assumes that computer hardware can create a mind, one must grant that this is also possible for other, more exotic mechanical systems.

Whether it is indeed possible to create a mind by the execution of a program is still an open question. Maybe Roger Penrose turns out to be right that consciousness is a natural phenomenon that can’t be created by the deterministic interaction of particles. Are organisms really just algorithms? How can the parallel firing of tens of billions of neurons give rise to consciousness and a mind? As of now, neuroscience has not the slightest idea. However, I would say with some certainty that this question cannot be answered by thought experiments alone.

If you liked this article, you may also be interested in my article Gödel’s Incompleteness Theorem And Its Implications For Artificial Intelligence.

[1] Boden, Margaret A: *Escaping from the Chinese Room*. University of Sussex, School of Cognitive Sciences, 1987.

[2] Churchland, Paul M and Patricia Smith Churchland: *Could a Machine Think?* Machine Intelligence: Perspectives on the Computational Model, 1:102, 2012.

[3] Cole, David: *The Chinese Room Argument*. In: Zalta, Edward N. (editor): The Stanford Encyclopedia of Philosophy. Summer 2013. http://plato.stanford.edu/archives/sum2013/entries/chinese-room/.

[4] Dennett, Daniel C: *Fast thinking*. 1987.

[5] Kurzweil, Ray: *Locked in his Chinese Room*. Are We Spiritual Machines: Ray Kurzweil vs. the Critics of Strong AI, 2002.

[6] Maudlin, Tim: *Computation and consciousness*. The Journal of Philosophy, pp 407–432, 1989.

[7] Penrose, Roger: *The Emperor’s New Mind* (1990). Vintage, London.

[8] Russell, Stuart Jonathan et al.: *Artificial Intelligence: A Modern Approach*. Prentice hall Englewood Cliffs, 1995.

[9] Searle, John: *The Chinese Room Argument*. Encyclopedia of Cognitive Science, 2001.

[10] Searle, John R: *Minds, brains, and programs*. Behavioral and brain sciences, 3(03):417–424, 1980.

[11] Searle, John R: *Minds, brains, and science*. Harvard University Press, 1984.

[12] Turing, Alan M: *Computing machinery and intelligence*. Mind, pp 433–460, 1950.

The post Gödel’s Incompleteness Theorem And Its Implications For Artificial Intelligence appeared first on deep ideas.

This text gives an overview of Gödel’s Incompleteness Theorem and its implications for artificial intelligence. Specifically, we deal with the question whether Gödel’s Incompleteness Theorem shows that human intelligence could not be recreated by a traditional computer.

Sections 2 and 3 feature an introduction to axiomatic systems, including a brief description of their historical development and thus the background of Gödel’s Theorem. These sections provide the basic knowledge required to fully understand Gödel’s Theorem and its significance for the history of mathematics – a necessary condition for understanding the arguments to follow. Section 4 features a thorough description of Gödel’s Theorem and outlines the basic idea of its proof. Sections 5 and 6 deal with arguments advocating the view that intelligence has a non-algorithmic component on the grounds of Gödel’s Theorem. In addition to a detailed account of the arguments, these sections also feature a selection of prominent objections to these arguments raised by other authors. The last section comprises a discussion of the arguments and my own objections.

At the beginning of the 20th century, the mathematical community suffered from a crisis regarding the very foundations of mathematics, triggered by the discovery of various paradoxes that called into question the reliability of mathematical intuition and the notion of proof. At that time, some fields of mathematics were grounded on a rigorous formal basis, called an **axiomatic system** (or interchangeably formal system), whereas other fields relied on a certain degree of intuitive insight.

In a formal way, an axiomatic system is a set of propositions, expressed in a formal language, called axioms. These axioms represent statements that are assumed to be true without proof. The set of axioms is equipped with a set of inference rules which can be used to derive other propositions, called theorems, by applying them to the axioms. Applying the rules of inference boils down to replacing expressions by certain other expressions according to precise syntactical rules. The axioms and the set of inference rules are ideally chosen in such a way that they are intuitively evident. This way, the truth of a complex, non-obvious statement can be accepted by accepting the truth of the axioms and sequentially applying the inference rules until the complex statement in question is deduced.

An early, prominent example of such an axiomatic system is the Euclidean geometry described by the ancient Greek philosopher Euclid in c. 300 BC (an English translation can be found in [Euc02]). It consists of 5 axioms making trivial statements about points, lines and circles (e.g. that any two points could be connected by a line). From these axioms, Euclid derived 48 non-trivial geometric propositions solely by means of logical inference and without making use of informal geometric intuition or perception.

Up until modern times, geometry was the only branch of mathematics that was predicated on such a sound axiomatic basis, whereas research and applications in other branches were carried out without a rigid formal notion about which types of inference were allowed and which statements were assumed to be intuitively evident. This was due to the fact that, for most practical purposes, mathematicians saw no need for doing so. However, this changed with the discovery of various paradoxes around the turn of the 20th century. In 1901, the British mathematician Bertrand Russell put forward what later came to be known as **Russell’s paradox** (cf. [Gri04]). This paradox showed an inherent flaw in the informal set theory proposed by German mathematician Georg Cantor, according to which every definable collection of distinct elements is a set. Russell defined the set R of all sets that do not contain themselves, symbolically:

$$R = \{x \; \mid \; x \not\in x\}$$

According to Cantor, R is a valid set. The paradox arises when one asks the question whether R contains itself. If R contains itself, then by definition it does not contain itself. If, on the other hand, it does not contain itself then it contains itself by definition. Symbolically:

$$R \in R \; \iff R \not\in R$$

Therefore, the question whether R contains itself has no well-defined answer. This example shows that the notion of a set defined by Cantor is flawed, even though it seems to be intuitively reasonable. Examples like this led many mathematicians to recognize that intuition is not a safe guide and that there was a need to supply all branches of mathematics with an axiomatic system that would be sufficient to formally derive all true propositions, a standpoint later termed **formalism** (cf. [NN01] p. 3). Over time, more and more branches, both new and old, were equipped with sets of axioms (e.g. the Zermelo-Fraenkel set theory, cf. [Fra25]).
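The self-reference at the heart of Russell’s paradox can be imitated in code. The following sketch encodes a “set” as a membership predicate, which is an assumption made purely for illustration and not how set theory is formalized. Under this encoding, asking whether R contains itself never settles on an answer.

```python
# Naive set theory, sketched with predicates: a "set" is a function that
# answers whether a given object is a member. This encoding is an
# illustrative assumption, not part of the original argument.

def R(x):
    """Russell's set: x is a member iff x is not a member of itself."""
    return not x(x)


def russell_has_answer():
    """Ask whether R contains itself; report whether an answer comes back."""
    try:
        R(R)  # R(R) = not R(R) = not not R(R) = ...
        return True
    except RecursionError:
        return False


print(russell_has_answer())  # prints False
```

The endless recursion is only an analogy: in the paradox the problem is a logical contradiction, whereas here the evaluation simply fails to terminate until Python gives up.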

It is worth noting that axiomatic systems and formal proofs do not require an intuitive understanding of the entities described or the nature of the proven statements. Consider the following example:

Axiomatic system 1.

1. Every member of P is contained in exactly two members of L.
2. Every member of L contains exactly two members of P.
3. Every two members of L share exactly one member of P.

This axiomatic system makes statements about some abstract sets L and P, and even though we can understand the axioms per se, we do not associate any meaning with the symbols and we do not have any intuition about the overall structure of L and P. Still, we can deduce theorems from these axioms. For example, it can be shown that every three members of L contain exactly three members of P. Even though the axioms were given informally, they can be translated into second-order logic and the proof for the theorem can be carried out using rules that just replace certain sequences of symbols with other symbols. This way, the proof could be carried out by a computer simply by iteratively applying symbol replacement rules on meaningless sequences of symbols until the theorem is obtained. It is then clear that the theorem follows from the axioms without any intuition as to what the theorem or the axioms actually represent.

A prominent representative of the formalist standpoint was David Hilbert, who initiated what was later termed **Hilbert’s Program** (cf. [Zac15]). Hilbert advocated the view that all fields of mathematics should be grounded on an axiomatic basis. Furthermore he demanded that every such system should be proven to be consistent, which means that it is impossible to deduce two contradictory theorems from the axioms.

Proving the inconsistency of an axiomatic system can be done by deducing a contradiction. The question that Hilbert wanted to address, however, was how to prove the consistency, i.e. how to prove the impossibility to deduce a contradiction. One way to do so is to find an interpretation of the axioms, such that they form true statements about some part of reality or some abstract concept of our intuition. A possible model for the axiomatic system 1 is given in the following image:

When we interpret the set P as the corners of a triangle and the set L as its edges, then the axioms are invested with meaning and we can verify beyond doubt that all axioms represent true statements about the model by verifying them for each individual element. This can be done easily since there are only finitely many elements. This proves the consistency of the system, because no contradiction can be deduced from true premises.
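The element-by-element verification described above can be sketched in a few lines of Python. This is only an illustration: the corner labels `a`, `b`, `c` and the function names are assumptions made for the example, and each edge is represented as the set of its two corners.

```python
from itertools import combinations

# Hypothetical labels: three corners and the three edges joining them.
P = {"a", "b", "c"}
L = {frozenset("ab"), frozenset("bc"), frozenset("ac")}

def axiom_1(P, L):
    # Every member of P is contained in exactly two members of L.
    return all(sum(p in line for line in L) == 2 for p in P)

def axiom_2(P, L):
    # Every member of L contains exactly two members of P.
    return all(len(line & P) == 2 for line in L)

def axiom_3(P, L):
    # Every two members of L share exactly one member of P.
    return all(len(l1 & l2 & P) == 1 for l1, l2 in combinations(L, 2))

def is_model(P, L):
    # A model must satisfy all three axioms.
    return axiom_1(P, L) and axiom_2(P, L) and axiom_3(P, L)

print(is_model(P, L))  # True
```

Because both sets are finite, the check terminates after inspecting every element and every pair. A square with its four edges, by contrast, would fail the third axiom, since opposite edges share no corner.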

However, there are axiomatic systems for which the model-based approach to proving their consistency is open to dispute. If, for example, the axioms require the model to contain an infinite number of elements, then it is impossible to verify the truth of the axioms beyond doubt, since the truth can no longer be verified for each individual element. Moreover, the model-based approach actually only reduces the consistency of one system to the consistency of another system. As regards the triangle example, we established the consistency of the axioms by verifying them for the triangle, but in doing so we implicitly assumed the consistency of geometry. Therefore, we have only shown that if geometry is consistent, then our axiomatic system is also consistent; we have given what is called a **relative proof of consistency**.

Hilbert urged mathematicians to find **absolute proofs of consistency**, i.e. proofs that establish the consistency of an axiomatic system without presupposing the consistency of another axiomatic system. Absolute proofs of consistency use structural properties of the axioms and inference rules in order to show that no contradictions can be derived; they are not proofs within the formal axiomatic system itself, but rather proofs about the system. They are, so to speak, proofs in some meta-system. To better understand the concept of a meta-system, consider the statement “‘$p \vee p \rightarrow p$’ is a tautology”. This is not a statement within propositional logic, but a statement in some meta-system *about* propositional logic, and it can be proved within that meta-system.
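One way to make the meta-level character of the tautology statement concrete is a truth-table check. The sketch below plays the role of the meta-system: it is ordinary Python reasoning *about* a propositional formula rather than a derivation inside propositional logic. The helper names `implies` and `is_tautology` are assumptions made for this example.

```python
from itertools import product

def implies(a, b):
    # Material implication: a -> b is false only when a is true and b is false.
    return (not a) or b

def is_tautology(formula, num_vars):
    # Evaluate the formula under every assignment of truth values.
    return all(formula(*values)
               for values in product([False, True], repeat=num_vars))

# The formula p v p -> p from the text, written as a Python function.
p_or_p_implies_p = lambda p: implies(p or p, p)

print(is_tautology(p_or_p_implies_p, 1))  # True
```

The same function rejects formulas that are not tautologies, for example `p ∨ q → p`, which is false when `p` is false and `q` is true.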

Absolute proofs of consistency have successfully been established for some axiomatic systems, e.g. propositional logic (cf. [NN01] p. 45). This led Hilbert to believe that such a proof could be found for any consistent axiomatic system, which is where Gödel’s Incompleteness Theorem comes into play: amongst other things, it shows that this is impossible for most axiomatic systems.

First presented in [Göd31], Gödel’s Incompleteness Theorem actually comprises two related but distinct theorems, which roughly state the following (cf. [Raa15]):

1. Any consistent formal [axiomatic] system F within which a certain amount of elementary arithmetic can be carried out is incomplete; i.e. there are statements of the language of F which can neither be proved nor disproved in F.

2. For any consistent system F within which a certain amount of elementary arithmetic can be carried out, the consistency of F cannot be proved in F itself.

The first of these two theorems is often referred to simply as Gödel’s Incompleteness Theorem.

Let us further elaborate these statements. The first theorem basically states that all axiomatic systems that are expressive enough to perform elementary arithmetic contain statements that can neither be proved nor disproved within the system itself, i.e. neither the statements nor their negations can be obtained by iteratively applying the inference rules to the axioms.

To say that a system is capable of performing arithmetic means that it either contains the natural numbers along with addition and multiplication, or that natural number arithmetic can be translated into the system such that the system mimics arithmetic in one way or another.

The second theorem states that the question of whether or not an axiomatic system is consistent belongs to those statements that cannot be proved within the system. Note that this does not mean that a proof showing the consistency of the system in question could not be given in some meta-system. However, if the consistency cannot be shown within the system itself, then a proof within the meta-system has to make inferences which cannot be modeled within the system itself. Such methods would then be open to dispute, because the consistency of the meta-system is not established. A proof of its consistency would require us to use even more elaborate methods of proof within some meta-meta-system, resulting in an infinite regress. Therefore, absolute proofs of consistency, as envisioned by Hilbert, cannot be given for axiomatic systems that are capable of doing arithmetic.

The implications of Gödel’s incompleteness theorems came as a shock to the mathematical community. For instance, they imply that there are true statements that could never be proved, and thus we can never know with certainty if they are true or if at some point they turn out to be false. They also imply that no absolute proofs of consistency could be given. Hence, the entirety of mathematics might be inconsistent and we cannot know for sure whether at some point a contradiction might occur which renders all of mathematics invalid.

For some of the following arguments, it is necessary to have a rough understanding of the ideas underlying the proof of Gödel’s theorem. At the core of the proof lies a sophisticated method of mapping the symbols of arithmetic (like =, +, ×, …), formulas within arithmetic (like ‘∃x(x = y+1)’) and proofs within arithmetic (i.e. sequences of formulas) onto a unique natural number (called **Gödel number**) in such a way that the original symbol, formula or proof could be reconstructed from that number. This allows us to express statements about arithmetic (like “The first sign of ‘∃x(x = y+1)’ is the existential quantifier”) as a formula within arithmetic itself (i.e. by stating that the Gödel number g of ‘∃x(x = y+1)’ has a certain property, expressible as an arithmetical formula F(g), that is only possessed by the Gödel numbers of statements beginning with the existential quantifier), effectively allowing arithmetic to talk about itself.
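The idea can be illustrated with a simplified prime-power coding. This is a sketch, not Gödel's original assignment: the symbol table below is hypothetical and covers only the symbols of the example formula. The i-th symbol of a formula becomes the exponent of the i-th prime, so the formula can be recovered by factoring.

```python
# Hypothetical symbol table; Gödel's own assignment of codes differs.
SYMBOLS = ["∃", "x", "y", "(", ")", "=", "+", "1"]
CODE = {s: i + 1 for i, s in enumerate(SYMBOLS)}   # codes start at 1
SYMBOL = {c: s for s, c in CODE.items()}

def primes():
    # Yield 2, 3, 5, ... by trial division (fine for short formulas).
    found = []
    n = 2
    while True:
        if all(n % p for p in found):
            found.append(n)
            yield n
        n += 1

def godel_number(formula):
    # Encode the i-th symbol as the exponent of the i-th prime.
    g = 1
    for p, s in zip(primes(), formula):
        g *= p ** CODE[s]
    return g

def decode(g):
    # Recover the symbols by reading off the prime exponents in order.
    # Only meaningful for numbers produced by godel_number.
    out = []
    for p in primes():
        if g == 1:
            break
        e = 0
        while g % p == 0:
            g //= p
            e += 1
        out.append(SYMBOL[e])
    return "".join(out)

def starts_with_exists(g):
    # Arithmetical property: the exponent of 2 in g equals CODE["∃"] (= 1),
    # i.e. g is divisible by 2 but not by 4.
    return g % 2 == 0 and g % 4 != 0

g = godel_number("∃x(x=y+1)")
print(decode(g) == "∃x(x=y+1)")   # True
print(starts_with_exists(g))      # True
```

Under this coding, the meta-level statement "the formula begins with the existential quantifier" turns into a plain divisibility condition on `g`, which is the kind of translation the proof relies on.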

Next, Gödel defined a statement G which states “G cannot be proved within arithmetic”, and showed how it could be translated into a formula within arithmetic using Gödel numbering (this formula G has come to be referred to in the literature as the Gödelian formula). G yields a contradiction similar to Russell’s paradox: If it could be proved within the system, then it would be false, hence the system would be inconsistent. Assuming that arithmetic is consistent, it follows that G cannot be shown within arithmetic and thus it follows that G is true. So G is an example of a formula that is true but cannot be proved in the system, which proves the first Incompleteness Theorem.

Various objections against the possibility of artificial intelligence have been raised on the grounds of Gödel’s incompleteness theorems, which have come to be referred to as **Gödelian Arguments**. The following sections give an overview of two of the most prominent arguments, along with several objections to these arguments.

An early argument stems from the British philosopher John Lucas, put forward in a scientific paper with the title *Minds, Machines and Gödel* ([Luc61]). Lucas argues that, by definition, cybernetical machines (which include computers in particular) are instantiations of a formal system. He backs up his claim by arguing that a machine has only finitely many types of operations it can perform and likewise only a finite number of initial assumptions built into the system. Thus, the initial assumptions could be represented as symbolic axioms within some formal system and the possible types of operations could be represented by formal rules of inference. Hence, every operation that a machine performs could be represented by formulas representing the state of the machine before and after the operation and stating which inference rule was used to get from the first formula to the second. In this manner the entire sequence of operations performed by the machine could be represented as a proof within a formal system and therefore the types of outputs that the machine could produce correspond to the theorems that can be proved within this formal system.

Now, since human minds can do arithmetic, a formal system F that adequately models the mind would also have to be capable of doing arithmetic, hence there are true statements that the machine is incapable of producing, but which the mind can. Lucas states that the Gödelian formula G(F) is an example of this: By following Gödel’s proof, the human mind knows that G(F) is true, but, as shown by Gödel, G(F) cannot be proved within the formal system F and consequently cannot be produced by the machine as being true. He concludes that a machine could then not be an adequate model of the mind and that the mind and machines are essentially different, since there exist true statements that the mind can know to be true but a machine cannot know to be true.

Following his argument, Lucas addresses some possible objections to his point and tries to refute them.

The first objection addressed by Lucas is that if a formal system F is not capable of constructing G(F), an extended, more adequate machine could be constructed that is indeed capable of producing G(F) and everything that follows from it. But then, he argues, this new machine will correspond to a different formal system F′ with other axioms or other rules of inference and this formal system will again have a Gödelian formula G(F′) that the machine is incapable of producing but the human mind can see to be true. If the machine was again modified to be able to produce G(F′), resulting in a new formal system F′′, then again a new formula G(F′′) could be constructed and so forth, ad infinitum. This way, no matter how many times the machine gets improved, there will always be a formula that it is incapable of producing but the human mind knows to be true.

The second objection he addresses is related: Since the construction of the Gödelian formula G(F) is a mechanizable procedure, the machine could be programmed in such a way that, in addition to its standard operations, it is capable of going through the Gödelian procedure to produce the Gödelian formula G(F) from the rest of the formal system, adding it to the formal system, then going through the procedure again to produce the Gödelian formula G(F′) of the strengthened formal system, adding it again, etc. This, as Lucas says, would correspond to a formal system that, in addition to its standard axioms, contains an infinite sequence of additional axioms, each one being the Gödelian formula of the system with the axioms that came before. Lucas objects to this argument by referring to a proof given by Gödel in a lecture at the Institute for Advanced Study, Princeton, N.J., U.S.A. in 1934. In this lecture, Gödel showed that even for formal systems that contain such an infinite sequence of Gödelian formulas as axioms, a formula could be constructed that the human mind can see to be true but that cannot be proved within the system. The intuition behind this proof, as Lucas points out, is the fact that the infinite sequence of axioms would have to be specified by some finite procedure and thus a finite formal system could be constructed that precisely models the infinite formal system.

Lucas also addresses an objection raised by Hartley Rogers in [Rog87]. Rogers claims that a machine modeling a mind should allow for non-inductive inferences. Specifically, he suggests that a machine should maintain a list of propositions neither proved nor disproved and occasionally add them to its list of axioms. If at some point their inclusion leads to a contradiction, they should be dropped again. This way, the machine could produce a formula as true even though it could not be proved from its axioms, which would render Lucas’ argument invalid. Lucas replies to this argument by stating that such a machine must choose the formulas it accepts to be true without proof at random, because a deterministic procedure would again make the whole system an axiomatic system for which an unprovable formula could be constructed that we can see to be true. Such a system, Lucas argues, would not be a good model of the human mind, because the formulas it randomly accepts as true could be wrong, even if they are consistent with the axioms.

Rogers also calls attention to the fact that the Gödelian argument is only applicable if we know that the machine is consistent. Human beings, he argues, might be inconsistent and, hence, an inconsistent machine could be a model of the human mind. Lucas answers by stating that even though human beings are inconsistent in certain situations, this is not the same type of inconsistency as in a formal system. A formal inconsistency allows one to derive every sentence and its negation as true. However, when humans arrive at contradictory conclusions, they do not stick to this contradiction, but rather try to resolve it. In this sense human beings are self-correcting. Lucas continues to argue that a self-correcting machine would still be subject to the Gödelian argument, and that only a fundamentally inconsistent machine, which would allow one to derive all formulas as true, could escape the Gödelian argument.

Lucas concludes his essay by stating that the characteristic attribute of human minds is the ability to step outside the system. Minds, he argues, are not constrained to operate within a single formal system, but rather they can switch between systems, reason about a system, reason about the fact that they reason about a system, etc. Machines, on the other hand, are constrained to operate within a single formal system that they could not escape. Thus, he argues, it is this ability that makes human minds inherently different from machines.

The following objections are not addressed in [Luc61], but have been voiced by other authors in reaction to Lucas’ argument.

In the book *Artificial Intelligence: A Modern Approach* ([RN03]), Peter Norvig and Stuart Russell, two artificial intelligence researchers, argue that a computer could be programmed to try out an arbitrary number of different formal systems, or even invent new formal systems. This way, the computer could produce the Gödel sentence of one system S by switching to another, more powerful system T and carrying out the proof of S’s Gödel sentence in T.

Further, they try to reduce Lucas’ argument to absurdity by pointing out that the brain is a deterministic physical device operating according to physical laws and in consequence also constitutes a formal system. Therefore, they argue, Lucas’ argument could be used to show that human minds could not simulate human minds, which is a contradiction. Thus, they conclude that Lucas’ argument must be flawed.

Paul Benacerraf presents an objection to Lucas’ argument in [Ben67]. He draws attention to the fact that in order to produce the Gödel sentence of a formal system, one must have a profound understanding of the system’s axioms and inference rules. Constructing the Gödel sentence for arithmetic might be simple, but Benacerraf claims that if the human mind could be simulated by a formal system, then this formal system would be so complex that a human being could never understand it to the extent that he would be able to construct its Gödel sentence. Therefore, Benacerraf concludes that Lucas’ argument does not actually prove that the human mind could not be simulated by a formal system, but rather that it proves a disjunction: Either the human mind could not be simulated by a formal system, or such a formal system would be so complex that a human being could not fully understand it.

In his book *Gödel, Escher, Bach* [Hof79], physicist and cognitive scientist Douglas Hofstadter expands on the objections described above. As demonstrated in Lucas’ refutation of the Extended Machine Objection, adding a procedure to produce the Gödel formula G(F) does not refute his argument since this corresponds to a new system F′ with a new Gödel formula G(F′) that it is unable to produce. No matter how many times the capability of producing the Gödel formula of the system obtained so far is added, the resulting system will always have a new Gödel formula G(F′…′) that it is unable to produce. Hofstadter argues that if this process of adding the capability to produce the Gödel formula is carried out sequentially, the resulting system becomes more and more complex with every step. He claims that at some point the system is so complex that human beings would be unable to produce the Gödel formula. At this point, he concludes, neither the system F′…′ nor the human being that the system models can produce the Gödel formula and, therefore, the human being does not have more power than the system.

Hofstadter takes the view that a program that models human thought needs to be able to switch between systems in an arbitrary fashion. Rather than being constrained to operating within a certain system, it must always be able to jump out of the current system into a meta-system, eventually allowing the system to reflect about itself, to reflect about the fact that it reflects about itself, and so forth. This, he argues, would require the program to be able to understand and modify its own source code.

An argument similar to Lucas’ argument has been proposed by Roger Penrose in a book with the title *The Emperor’s New Mind* ([Pen89]), in which Penrose claims to overcome the objections that were raised against Lucas’ argument. His argument has later been refined and extended in his book *Shadows of the Mind* ([Pen94]). Here, we shall deal with this refined version of the argument.

Penrose attempts to show that mathematical insight cannot be simulated algorithmically by leading this assumption to a contradiction. He defines mathematical insight as the means by which mathematicians generate mathematical propositions and their proofs and are able to follow and understand each other’s proofs.

Penrose’s argument can be reconstructed as follows:

1. Assume (for the sake of contradiction) that there is some formal system F that captures the thought processes required for mathematical insight.

2. Then, according to Gödel’s theorem, F cannot prove its own consistency.

3. We, as human beings, can see that F is consistent.

4. Therefore, since F captures our reasoning, F could prove that F is consistent.

5. This is a contradiction and, therefore, such a system F could not exist.

The strong assumption in this otherwise logically sound argument is 3, that we can see that F is consistent. Penrose argues for 3 in two different ways (labeled as 3a and 3b in the following):

3a.1: We, as human beings, know that we are consistent.

3a.2: Therefore, if we know that F captures our reasoning, we know that F is consistent.

Penrose recognises that this argument rests on the assumption that we could know that F captures our reasoning. He therefore extends his argument to capture the case where we do not know that F captures our reasoning, but we can still see that F is consistent.

3b.1: By definition, F consists of a set of axioms and inference rules.

3b.2: Each individual axiom could be verified by us, since if F is able to see their truth, then so are we.

3b.3: Furthermore, the validity of the inference rules could also be verified by us, since it would be implausible to believe that human reasoning relies on dubious inference rules.

3b.4: Therefore, since we know that the axioms are true and that the inference rules are valid, we know that F is consistent.

Later in his book, Penrose addresses the question of why his argument is not applicable to human brains. To this end, he presents four possible views on the question of how human consciousness and reasoning come into existence (cf. [Pen94]):

A: All thinking is computation; in particular, feelings of conscious awareness are evoked merely by the carrying out of appropriate computations.

B: Awareness is a feature of the brain’s physical action; and whereas any physical action can be simulated computationally, computational simulation cannot by itself evoke awareness.

C: Appropriate physical action evokes awareness, but this physical action cannot even be properly simulated computationally.

D: Awareness cannot be explained by physical, computational, or any other scientific terms.

Penrose himself embraces position (C). He points out that all the physical laws presently known to us are algorithmic in nature, and therefore argues that there must be non-algorithmic physical phenomena yet to be discovered. He hypothesizes that these phenomena are based on the interaction between quantum mechanics and general relativity.

In [McC95], Daryl McCullough stresses a few loose ends in Penrose’s argument. For instance, he points out that there is an ambiguity in the definition of F, and that there are actually three different ways to interpret F:

1. F represents the mathematician’s inherent reasoning ability.

2. F represents a snapshot of the mathematician’s brain at a certain point in time, such that it includes both his inherent reasoning ability and the empirical knowledge that the mathematician acquired during his lifetime.

3. F represents the maximum of what could ever be known by the mathematician through reasoning and empirical knowledge.

McCullough argues that this distinction becomes important when dealing with the question whether the mathematician could know that his reasoning powers are captured by F. If the mathematician learns this fact empirically, then this knowledge is not reflected by F, and therefore Penrose’s original argument would be invalid. However, he acknowledges that an argument analogous to Penrose’s argument goes through for an extended system F′, which is F extended by the axiom that one’s reasoning powers are captured by F.

McCullough also addresses the fact that Penrose’s argument rests on the assumption that human reasoning is consistent and that human beings can be sure of their own consistency. He argues that this assumption is not beyond doubt and presents a thought experiment in order to show how inconsistencies could turn up even during careful and justified reasoning. He proposes to imagine an interrogator asking questions that can be answered by yes or no, and an experimental subject that can answer these questions by pressing a ‘yes’ button or a ‘no’ button. If the interrogator asks the question “Will you push the ‘no’ button?”, then this question cannot be answered truthfully. The subject knows that the true answer is ‘no’, but he cannot communicate this answer by pressing the ‘no’ button. McCullough now extends this thought experiment and assumes that the subject’s brain is attached to a device that is able to read if the subject’s mind is in a ‘yes’ or ‘no’ state of belief and correspondingly flashes a light labeled ‘yes’ or ‘no’. If the interrogator now poses the question “Will the ‘no’ light flash?”, the subject has no way of holding a belief without communicating it. Now, if the subject’s beliefs are consistent, the answer to the question is “no”, but the subject cannot correctly believe the answer to be “no”, and therefore he cannot correctly believe that he is consistent. Thus, no matter how much careful thought humans give to producing their answer, and no matter how intelligent they are, they cannot be sure of their own consistency.

McCullough concludes that the only indubitable logical consequence of Penrose’s argument is that if human reasoning can be captured by a formal system F, then there is no way to be certain that F is consistent. This, he argues, is not a limitation on what formal systems could achieve in comparison to human beings, but rather a general insight about a limitation in one’s ability to reason about one’s own reasoning process.
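The structure of the light-flashing version of the thought experiment can be checked by brute force. The sketch below is a toy model under stated assumptions: the subject holds exactly one of the two beliefs, and the device flashes the light that matches this belief. The function names are made up for this illustration.

```python
def no_light_flashes(belief):
    # Assumption: the device flashes the light matching the subject's belief.
    return belief == "no"

def true_answer(belief):
    # Correct answer to "Will the 'no' light flash?" given that belief.
    return "yes" if no_light_flashes(belief) else "no"

# Collect every belief that would coincide with the correct answer.
correct_beliefs = [b for b in ("yes", "no") if b == true_answer(b)]
print(correct_beliefs)  # []
```

Neither belief matches the correct answer, which is the self-referential trap the thought experiment describes: whichever belief the subject settles on is falsified by the act of holding it.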

Another answer to Penrose’s argument has been provided by Australian philosopher David Chalmers in [Cha95]. Chalmers argues that it is inadequate to assume that a computational procedure simulating the human mind would consist of a set of axioms and inference rules. He claims that even in today’s AI research, there are examples of computational procedures that are not decomposable into axioms and rules of inference, e.g. neural networks. Chalmers acknowledges that, according to a theorem by William Craig (cf. [Cra53]), for every algorithm we can find an axiomatic system that produces the same output. But this system would be rather complex, casting doubt on Penrose’s assumption that its inference rules could be verified by human thought. Thus, Chalmers concludes that Penrose’s argument only applies to rule-based systems (like automatic theorem provers), but not to all computational procedures.

Apart from this, Chalmers also claims that the assumption that we are knowably consistent already leads to a contradiction in itself. He tries to prove formally that any system that knows about its own consistency is inconsistent. To this end, he introduces the symbol B to represent the system’s belief, where B(n) corresponds to the statement that the system believes that the statement with Gödel number n is true. Further, he introduces ⊢ A to denote that the system knows A. Now he makes the following assumptions:

1. If the system knows A, then it knows that it believes that A is true.

Formally: If ⊢ A then ⊢ B(A).

2. The system knows that it can use modus ponens in its reasoning.

Formally: ⊢ B(A₁) ∧ B(A₁ → A₂) → B(A₂)

3. The system knows the fact described in 1.

Formally: ⊢ B(A) → B(B(A))

4. The system is capable of doing arithmetic.

5. The system knows that it is consistent.

Formally: ⊢ ¬B(false)

From these assumptions, the first four of which he deems to be reasonable, Chalmers formally proves that the system must be inconsistent by making use of Gödel’s theorem. Thus, the assumption 5 that the system knows that it is consistent could not hold for any system fulfilling the premises 1 through 4. He therefore concludes that the contradiction that arises in Penrose’s argument is due to the false assumption that humans are knowably consistent, rather than the allegedly false assumption that human thought could not be captured by a formal system.
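The derivation can be sketched along the lines of the standard proof of Gödel’s second theorem. The following is a reconstruction under the stated assumptions, not necessarily Chalmers’ exact presentation. It assumes that the system can carry out ordinary propositional reasoning, that assumption 4 supplies (via Gödel’s diagonal construction) a sentence $G$ asserting its own unbelievability, and it writes $\bot$ for *false*.

$$
\begin{aligned}
&(1)\quad \vdash G \leftrightarrow \neg B(G) && \text{diagonal construction, using assumption 4} \\
&(2)\quad \vdash B(G \rightarrow \neg B(G)) && \text{from (1) by assumption 1} \\
&(3)\quad \vdash B(G) \rightarrow B(\neg B(G)) && \text{from (2) by assumption 2} \\
&(4)\quad \vdash B(G) \rightarrow B(B(G)) && \text{assumption 3} \\
&(5)\quad \vdash B(G) \rightarrow B(\bot) && \text{from (3), (4) by assumptions 1 and 2} \\
&(6)\quad \vdash \neg B(G) && \text{from (5) and assumption 5} \\
&(7)\quad \vdash G && \text{from (1) and (6)} \\
&(8)\quad \vdash B(G) && \text{from (7) by assumption 1}
\end{aligned}
$$

Step (5) uses the tautology $B(G) \rightarrow (\neg B(G) \rightarrow \bot)$, which the system knows and therefore believes by assumption 1; applying assumption 2 twice combines it with $B(B(G))$ and $B(\neg B(G))$. Lines (6) and (8) contradict each other, so a system satisfying assumptions 1 through 5 is inconsistent.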

In this text, we have learned about axiomatic systems and their historical background. We have learned about the mathematicians’ endeavour to formalize mathematics and prove its consistency and we have seen how Gödel’s theorem implied that this is impossible and that there are true statements whose truth cannot be proved. We have seen how Lucas and Penrose argued that it is impossible to capture human thought by means of an axiomatic system on the grounds of Gödel’s theorem, and that therefore artificial intelligence is impossible. We have also dealt with objections to these arguments by other authors and thus we have seen that the Gödelian arguments are not generally accepted.

From my point of view, Lucas’ argument seems rather unconvincing. If a formal system F captures human thought, I agree with Benacerraf that the system would be so complex that it would be highly doubtful that a human being could see the truth of the Gödel sentence G(F). But even if this were possible, it does not show that humans are different from machines and can do something that machines cannot. As shown in [Amm97], it is indeed possible to prove the Gödel sentence algorithmically, as long as the system performing the proof does not correspond to the system whose Gödel sentence is proven. Thus, if a human being can see the truth of the Gödel sentence G(F), this can be viewed as being analogous to some formal system F′ proving the Gödel sentence G(F), which is indeed possible. Therefore, in seeing the truth of the Gödel sentence G(F), the human mind does not do something which is generally impossible for machines. Lucas’ argument would only go through if he could show that a human being is able to see the truth of his own Gödel sentence, because this is what machines cannot do. It is questionable, however, what is meant by the Gödel sentence of a human being in the first place if we do not regard human beings as formal systems to begin with.

As for Penrose’s argument, I am not convinced by fact 3 that a human being could see that a formal system F capturing his thought processes is consistent. In his first argument 3a, Penrose argues that we as human beings know that we are consistent and therefore, since F captures our reasoning, we know that F is consistent. I think that this argument is based on an ambiguity in the definition of consistency. When talking about formal systems, consistency has a very clear definition: It means that no contradictory theorems could be derived within the system. However, it is unclear what it means for a human being to be consistent. Since Penrose does not regard human beings as formal systems, we cannot apply the same definition of consistency.

A reasonable definition would be that a human being is consistent if and only if he does not believe in two contradictory sentences, which is most likely what Penrose means when talking about the consistency of a human being. This is equivalent to saying that the human’s *belief system* is a consistent system. A human’s belief system is the formal system whose theorems correspond to the sentences that the human believes to be true. Its axioms are what the human originally believes to be true, and its inference rules are the logical inferences that the human makes in order to deduce new beliefs. But a human’s belief system is not equivalent to the human’s mind. Rather, the belief system is one of the results of human thought – a model that humans use to judge the truth of statements. Therefore, if our belief system is consistent, and we know that our belief system is consistent, this does not mean that we know that a formal system simulating our mind (including but not limited to our belief system) is also consistent. This renders the argument 3a invalid.

Apart from this, it is doubtful whether a human’s belief system is necessarily consistent in the first place. Consistent belief systems are usually only held by careful thinkers that are acquainted with the rules of logic, and there are undoubtedly many humans with inconsistent belief systems resulting from logical fallacies. Therefore, Penrose’s argument could show at best that the minds of careful thinkers could not be simulated computationally, whereas it does not apply to the minds of non-careful thinkers with inconsistent belief systems.

In 3b, Penrose argues for 3 by stating that humans could verify the consistency of F by verifying its axioms and inference rules. In this argument, Penrose makes the implicit assumption that the theorems of F correspond to the statements that the simulated human mind believes to be true. This becomes evident in the way he argues for 3b.2 and 3b.3. So again Penrose fails to distinguish between the human mind and the human’s belief system, making 3b invalid for the same reason as 3a. A formal system might be able to simulate human thought without any obvious relation between the system’s theorems and the statements that the resulting mind believes to be true. Thus, Penrose’s argument shows at best that a formal system that captures human thought could not correspond to the human’s belief system, since this would lead to a contradiction. However, it does nothing to show that human thought could not be simulated by any formal system, as long as the theorems of this formal system do not correspond to the statements that the resulting mind believes to be true.

Up to this day, Gödelian arguments are continuously debated, and there is no consensus among philosophers and researchers as to whether they are true or false. To the best of my knowledge, Lucas and Penrose have not explicitly addressed the described objections in their publications. Still, they have not backed down from their arguments. There are many more objections to their arguments in the literature, but listing them all would be beyond the scope of this text. Lucas maintains a list of criticisms of the Gödelian argument on his website at http://users.ox.ac.uk/~jrlucas/Godel/referenc.html, referencing 78 sources as of August 2017.

If you have your own ideas on the matter, please leave a comment. If you’d like to read more about the possibility of strong artificial intelligence, read my article “Can Computers Think” -“No, but…” If you’d like to stay informed about more blog posts on artificial intelligence and the philosophy of mind, subscribe to deep ideas by Email.

[Amm97] Kurt Ammon. An automatic proof of Gödel’s incompleteness theorem. *Artificial Intelligence*, 95, 1997. Elsevier.

[Ben67] Paul Benacerraf. God, the devil, and Gödel. *The Monist*, 51, 1967. Oxford University Press.

[Cha95] David J Chalmers. Minds, machines, and mathematics. *Psyche*, 2, 1995. Hindawi Publishing Corporation.

[Cra53] William Craig. On axiomatizability within a system. *The Journal of Symbolic Logic*, 18, 1953. Association for Symbolic Logic.

[Euc02] Euclid. *Euclid’s Elements*. 2002. Green Lion Press.

[Fra25] Adolf Fraenkel. Untersuchungen über die Grundlagen der Mengenlehre. *Mathematische Zeitschrift*, 22, 1925. Springer.

[Göd31] Kurt Gödel. Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I. *Monatshefte für Mathematik und Physik*, 38, 1931. Springer.

[Gri04] Nicholas Griffin. The prehistory of Russell’s paradox. In *One hundred years of Russell’s paradox*. 2004. de Gruyter.

[Hof79] Douglas R. Hofstadter. *Gödel, Escher, Bach: An Eternal Golden Braid*. 1979. Basic Books, Inc.

[Luc61] John Lucas. Minds, machines and Gödel. *Philosophy*, 36, 1961. Cambridge University Press.

[McC95] Daryl McCullough. Can humans escape Gödel? *Psyche*, 2, 1995. Hindawi Publishing Corporation.

[NN01] Ernest Nagel and James R. Newman. *Gödel’s proof.* 2001. New York University Press.

[Pen89] Roger Penrose. *The Emperor’s New Mind*. 1989. Oxford University Press.

[Pen94] Roger Penrose. *Shadows Of The Mind*. 1994. Oxford University Press.

[Raa15] Panu Raatikainen. Gödel’s Incompleteness Theorems. In *The Stanford Encyclopedia of Philosophy*. Spring 2015 edition, 2015.

[RN03] Stuart J. Russell and Peter Norvig. *Artificial Intelligence: A Modern Approach*. 2003. Pearson Education.

[Rog87] Hartley Rogers, Jr. *Theory of recursive functions and effective computability*. 1987. MIT Press.

*The Stanford Encyclopedia of Philosophy*. Summer 2015 edition, 2015.

The post Gödel’s Incompleteness Theorem And Its Implications For Artificial Intelligence appeared first on deep ideas.

The post Deep Learning From Scratch V: Multi-Layer Perceptrons appeared first on deep ideas.

- Part I: Computational Graphs
- Part II: Perceptrons
- Part III: Training criterion
- Part IV: Gradient Descent and Backpropagation
- Part V: Multi-Layer Perceptrons

Many real-world classes that we encounter in machine learning are not linearly separable. This means that there does not exist any line with all the points of the first class on one side of the line and all the points of the other class on the other side. Let’s illustrate this with an example.

In [49]:

```
import numpy as np
import matplotlib.pyplot as plt

# Create two clusters of red points centered at (0, 0) and (1, 1), respectively.
red_points = np.concatenate((
    0.2*np.random.randn(25, 2) + np.array([[0, 0]]*25),
    0.2*np.random.randn(25, 2) + np.array([[1, 1]]*25)
))
# Create two clusters of blue points centered at (0, 1) and (1, 0), respectively.
blue_points = np.concatenate((
    0.2*np.random.randn(25, 2) + np.array([[0, 1]]*25),
    0.2*np.random.randn(25, 2) + np.array([[1, 0]]*25)
))
# Plot them
plt.scatter(red_points[:,0], red_points[:,1], color='red')
plt.scatter(blue_points[:,0], blue_points[:,1], color='blue')
plt.show()
```

As we can see, it is impossible to draw a line that separates the blue points from the red points. Instead, our decision boundary has to have a rather complex shape. This is where multi-layer perceptrons come into play: They allow us to train a decision boundary of a more complex shape than a straight line.

As their name suggests, multi-layer perceptrons (MLPs) are composed of multiple perceptrons stacked one after the other in a layer-wise fashion. Let’s look at a visualization of the computational graph:

As we can see, the input is fed into the first layer, which is a multidimensional perceptron with a weight matrix $W_1$ and bias vector $b_1$. The output of that layer is then fed into the second layer, which is again a perceptron with another weight matrix $W_2$ and bias vector $b_2$. This process continues for each of the $L$ layers. We refer to the last layer as the **output layer** and to every other layer as a **hidden layer**.

An MLP with one hidden layer computes the function

$$\sigma(\sigma(X \, W_1 + b_1) W_2 + b_2) \,,$$

an MLP with two hidden layers computes the function

$$\sigma(\sigma(\sigma(X \, W_1 + b_1) W_2 + b_2) \, W_3 + b_3) \,,$$

and, generally, an MLP with $L-1$ hidden layers computes the function

$$\sigma(\sigma( \cdots \sigma(\sigma(X \, W_1 + b_1) W_2 + b_2) \cdots) \, W_L + b_L) \,.$$
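To make the nesting concrete, here is a minimal plain-NumPy sketch of the same function. It does not use the computational graph library from this series; `sigmoid_fn` and `mlp_forward` are hypothetical helper names introduced only for illustration, and the layer sizes are arbitrary.

```python
import numpy as np

def sigmoid_fn(z):
    """Element-wise logistic sigmoid (hypothetical helper, not the library's `sigmoid` op)."""
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(X, weights, biases):
    """Apply sigma(h W_l + b_l) layer by layer, starting from h = X."""
    h = X
    for W, b in zip(weights, biases):
        h = sigmoid_fn(h @ W + b)
    return h

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 2))                             # 5 points in 2 dimensions
weights = [rng.normal(size=(2, 3)), rng.normal(size=(3, 2))]  # W_1, W_2
biases = [rng.normal(size=3), rng.normal(size=2)]             # b_1, b_2
out = mlp_forward(X_demo, weights, biases)
print(out.shape)
```

With two weight matrices, the loop evaluates exactly the one-hidden-layer expression $\sigma(\sigma(X \, W_1 + b_1) W_2 + b_2)$; adding further entries to the two lists adds further layers.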

Using the library we have built, we can now easily implement multi-layer perceptrons without further work.

In [52]:

```
# Create a new graph
Graph().as_default()

# Create training input placeholder
X = placeholder()

# Create placeholder for the training classes
c = placeholder()

# Build a hidden layer
W_hidden = Variable(np.random.randn(2, 2))
b_hidden = Variable(np.random.randn(2))
p_hidden = sigmoid( add(matmul(X, W_hidden), b_hidden) )

# Build the output layer
W_output = Variable(np.random.randn(2, 2))
b_output = Variable(np.random.randn(2))
p_output = softmax( add(matmul(p_hidden, W_output), b_output) )

# Build cross-entropy loss
J = negative(reduce_sum(reduce_sum(multiply(c, log(p_output)), axis=1)))

# Build minimization op
minimization_op = GradientDescentOptimizer(learning_rate = 0.03).minimize(J)

# Build placeholder inputs
feed_dict = {
    X: np.concatenate((blue_points, red_points)),
    c:
        [[1, 0]] * len(blue_points)
        + [[0, 1]] * len(red_points)
}

# Create session
session = Session()

# Perform 1000 gradient descent steps
for step in range(1000):
    J_value = session.run(J, feed_dict)
    if step % 100 == 0:
        print("Step:", step, " Loss:", J_value)
    session.run(minimization_op, feed_dict)

# Print final result
W_hidden_value = session.run(W_hidden)
print("Hidden layer weight matrix:\n", W_hidden_value)
b_hidden_value = session.run(b_hidden)
print("Hidden layer bias:\n", b_hidden_value)
W_output_value = session.run(W_output)
print("Output layer weight matrix:\n", W_output_value)
b_output_value = session.run(b_output)
print("Output layer bias:\n", b_output_value)
```

Let’s now visualize the decision boundary:

In [53]:

```
# Visualize classification boundary
xs = np.linspace(-2, 2)
ys = np.linspace(-2, 2)
pred_classes = []
for x in xs:
    for y in ys:
        pred_class = session.run(p_output,
                                 feed_dict={X: [[x, y]]})[0]
        pred_classes.append((x, y, pred_class.argmax()))
xs_p, ys_p = [], []
xs_n, ys_n = [], []
# `cls` is the predicted class index (named so that it does not shadow the placeholder `c`)
for x, y, cls in pred_classes:
    if cls == 0:
        xs_n.append(x)
        ys_n.append(y)
    else:
        xs_p.append(x)
        ys_p.append(y)
plt.plot(xs_p, ys_p, 'ro', xs_n, ys_n, 'bo')
plt.show()
```

As we can see, we have learned a rather complex decision boundary. If we use more layers, the decision boundary can become arbitrarily complex, allowing us to learn classification patterns that are impossible for a human being to spot, especially in higher dimensions.

Congratulations on making it this far! You have learned the foundations of building neural networks from scratch, and in contrast to most machine learning practitioners, you now know how it all works under the hood and why it is done the way it is done.

Let’s recap what we have learned. We started out by considering **computational graphs** in general, and we saw how to build them and how to compute their output. We then moved on to describe **perceptrons**, which are linear classifiers that assign a probability to each output class by squashing the output of $w^Tx+b$ through a **sigmoid** (or **softmax**, in the case of multiple classes). Following that, we saw how to judge how good a classifier is via a loss function, the **cross-entropy loss**, the minimization of which is equivalent to **maximum likelihood**. In the next step, we saw how to minimize the loss via **gradient descent**, i.e. by iteratively stepping in the direction of the negative gradient. We then introduced **backpropagation** as a means of computing the derivative of the loss with respect to each node by performing a breadth-first search and multiplying according to the chain rule. We used all that we’ve learned to train a good linear classifier for the red/blue example dataset. Finally, we learned about **multi-layer perceptrons** as a means of learning non-linear decision boundaries, implemented an MLP with one hidden layer and successfully trained it on a dataset that is not linearly separable.

You now know all the fundamentals for training arbitrary neural networks. As a next step, you should learn about the following topics (Google is your friend):

- The difference between training loss and test loss
- Overfitting and underfitting
- Regularization and early stopping
- Dropout
- Convolutional neural networks
- Recurrent neural networks
- Autoencoders
- Deep Generative Models

All of these topics are dealt with in the book “Deep Learning” by Ian Goodfellow, Yoshua Bengio and Aaron Courville, which I highly recommend to everyone. A free online version of the book can be found at http://www.deeplearningbook.org/. Since this book is very math-oriented, it is probably a good idea to get some hands-on experience in parallel, which the book itself does not provide. Therefore, I’d recommend reading TensorFlow or Keras tutorials alongside it.

More blog posts on deep learning are coming soon. You can either subscribe to deep ideas by Email or subscribe to my Facebook page to stay updated.

The post Deep Learning From Scratch IV: Gradient Descent and Backpropagation appeared first on deep ideas.

- Part I: Computational Graphs
- Part II: Perceptrons
- Part III: Training criterion
- Part IV: Gradient Descent and Backpropagation
- Part V: Multi-Layer Perceptrons

Generally, if we want to find the minimum of a function, we set the derivative to zero and solve for the parameters. It turns out, however, that it is impossible to obtain a closed-form solution for $W$ and $b$. Instead, we iteratively search for a minimum using a method called **gradient descent**.

As a visual analogy, imagine yourself standing on a mountain and trying to find the way down. At every step, you walk in the direction of the steepest descent, since this direction is the most promising one for reaching the bottom.

If taking steep steps seems a little dangerous to you, imagine that you are a mountain goat (which are amazing rock climbers).

Gradient descent operates in a similar way when trying to find the minimum of a function: It starts at a random location in parameter space and then iteratively reduces the error $J$ until it reaches a local minimum. At each step of the iteration, it determines the direction of steepest descent and takes a step along that direction. This process is depicted for the 1-dimensional case in the following image.

As you might remember, the direction of steepest ascent of a function at a certain point is given by the gradient at that point. Therefore, the direction of steepest descent is given by the negative of the gradient. So now we have a rough idea how to minimize $J$:

1. Start with random values for $W$ and $b$
2. Compute the gradients of $J$ with respect to $W$ and $b$
3. Take a small step along the direction of the negative gradient
4. Go back to step 2
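As a minimal sketch of this loop, consider a one-dimensional function whose derivative we can write down by hand. The function name `gradient_descent_1d`, the starting point and the step count are assumptions chosen for illustration; a fixed starting point replaces the random initialization so that the result is reproducible.

```python
def gradient_descent_1d(grad_f, x0, learning_rate=0.1, num_steps=100):
    """Repeatedly step against the derivative grad_f, starting at x0."""
    x = x0
    for _ in range(num_steps):
        # Take a small step along the direction of the negative gradient
        x -= learning_rate * grad_f(x)
    return x

# Example: f(x) = (x - 3)^2 has derivative f'(x) = 2 * (x - 3) and its minimum at x = 3
x_min = gradient_descent_1d(lambda x: 2.0 * (x - 3.0), x0=-5.0)
print(x_min)
```

For this particular function, every step shrinks the distance to the minimum by a constant factor of $1 - 2 \cdot 0.1 = 0.8$, so the iterate approaches $x = 3$ geometrically.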

Let’s implement an operation that minimizes the value of a node using gradient descent. We require the user to specify the magnitude of the step along the gradient as a parameter called `learning_rate`.

In [37]:

```
class GradientDescentOptimizer:

    def __init__(self, learning_rate):
        self.learning_rate = learning_rate

    def minimize(self, loss):
        learning_rate = self.learning_rate

        class MinimizationOperation(Operation):
            def compute(self):
                # Compute gradients
                grad_table = compute_gradients(loss)

                # Iterate all variables
                for node in grad_table:
                    if isinstance(node, Variable):
                        # Retrieve gradient for this variable
                        grad = grad_table[node]

                        # Take a step along the direction of the negative gradient
                        node.value -= learning_rate * grad

        return MinimizationOperation()
```

The following image depicts an example iteration of gradient descent. We start out with a random separating line (marked as 1), take a step, arrive at a slightly better line (marked as 2), take another step, and another step, and so on until we arrive at a good separating line.

In our implementation of gradient descent, we have used a function `compute_gradients(loss)` that computes the gradient of a $loss$ operation in our computational graph with respect to the output of every other node $n$ (i.e. the direction of change for $n$ along which the loss increases the most). We now need to figure out how to compute gradients.

Consider the following computational graph:

By the chain rule, we have

$$\frac{\partial e}{\partial a} = \frac{\partial e}{\partial b} \cdot \frac{\partial b}{\partial a} = \frac{\partial e}{\partial c} \cdot \frac{\partial c}{\partial b} \cdot \frac{\partial b}{\partial a} = \frac{\partial e}{\partial d} \cdot \frac{\partial d}{\partial c} \cdot \frac{\partial c}{\partial b} \cdot \frac{\partial b}{\partial a}$$

As we can see, in order to compute the gradient of $e$ with respect to $a$, we can start at $e$ and go backwards towards $a$, computing the gradient of every node’s output with respect to its input along the way until we reach $a$. Then, we multiply them all together.

Now consider the following scenario:

In this case, $a$ contributes to $e$ along two paths: The path $a$, $b$, $d$, $e$ and the path $a$, $c$, $d$, $e$. Hence, the total derivative of $e$ with respect to $a$ is given by:

$$
\begin{aligned}
\frac{\partial e}{\partial a}
&= \frac{\partial e}{\partial d} \cdot \frac{\partial d}{\partial a} \\
&= \frac{\partial e}{\partial d} \cdot \left( \frac{\partial d}{\partial b} \cdot \frac{\partial b}{\partial a} + \frac{\partial d}{\partial c} \cdot \frac{\partial c}{\partial a} \right) \\
&= \frac{\partial e}{\partial d} \cdot \frac{\partial d}{\partial b} \cdot \frac{\partial b}{\partial a} + \frac{\partial e}{\partial d} \cdot \frac{\partial d}{\partial c} \cdot \frac{\partial c}{\partial a}
\end{aligned}
$$

This gives us an intuition for the general algorithm that computes the gradient of the loss with respect to another node: We perform a backwards breadth-first search starting from the loss node. At each node $n$ that we visit, we do the following for each of its consumers:

- retrieve the gradient $G$ of the loss with respect to the output of the consumer
- multiply $G$ by the gradient of the consumer’s output with respect to $n$’s output

And then we sum those gradients over all consumers.
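The summation over paths can be checked numerically on a small example. The concrete functions below ($b = 2a$, $c = a^2$, $d = b \cdot c$, $e = d^2$) are assumptions chosen only for illustration; they give a graph with the same two-path shape as the one discussed above.

```python
# Forward pass through the hypothetical graph a -> (b, c) -> d -> e
a = 1.5
b = 2.0 * a
c = a ** 2
d = b * c
e = d ** 2

# Local derivatives along each edge
de_dd = 2.0 * d
dd_db, db_da = c, 2.0        # path a -> b -> d
dd_dc, dc_da = b, 2.0 * a    # path a -> c -> d

# Sum the products of the local derivatives over both paths
de_da = de_dd * dd_db * db_da + de_dd * dd_dc * dc_da

# Compare against a central finite difference of e as a function of a
def e_of(a_value):
    return ((2.0 * a_value) * a_value ** 2) ** 2

h = 1e-6
numeric = (e_of(a + h) - e_of(a - h)) / (2 * h)
print(de_da, numeric)
```

Here $e = 4a^6$, so the exact derivative is $24a^5$, which the sum over the two paths reproduces.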

As a prerequisite to implementing backpropagation, we need to specify a function for each operation that computes the gradients with respect to the inputs of that operation, given the gradients with respect to the output. Let’s define a decorator `@RegisterGradient(operation_name)` for this purpose:

In [38]:

```
# A dictionary that will map operations to gradient functions
_gradient_registry = {}


class RegisterGradient:
    """A decorator for registering the gradient function for an op type.
    """

    def __init__(self, op_type):
        """Creates a new decorator with `op_type` as the Operation type.
        Args:
          op_type: The name of an operation
        """
        # Look up the operation class by its name
        self._op_type = eval(op_type)

    def __call__(self, f):
        """Registers the function `f` as gradient function for `op_type`."""
        _gradient_registry[self._op_type] = f
        return f
```

Let’s assume for now that the `_gradient_registry` dictionary is already filled with gradient computation functions for all of our operations. We can now implement backpropagation:

In [39]:

```
from queue import Queue


def compute_gradients(loss):
    # grad_table[node] will contain the gradient of the loss w.r.t. the node's output
    grad_table = {}

    # The gradient of the loss with respect to the loss is just 1
    grad_table[loss] = 1

    # Perform a breadth-first search, backwards from the loss
    visited = set()
    queue = Queue()
    visited.add(loss)
    queue.put(loss)

    while not queue.empty():
        node = queue.get()

        # If this node is not the loss
        if node != loss:
            #
            # Compute the gradient of the loss with respect to this node's output
            #
            grad_table[node] = 0

            # Iterate all consumers
            for consumer in node.consumers:
                # Retrieve the gradient of the loss w.r.t. consumer's output
                lossgrad_wrt_consumer_output = grad_table[consumer]

                # Retrieve the function which computes gradients with respect to
                # consumer's inputs given gradients with respect to consumer's output.
                consumer_op_type = consumer.__class__
                bprop = _gradient_registry[consumer_op_type]

                # Get the gradient of the loss with respect to all of consumer's inputs
                lossgrads_wrt_consumer_inputs = bprop(consumer, lossgrad_wrt_consumer_output)

                if len(consumer.input_nodes) == 1:
                    # If there is a single input node to the consumer, lossgrads_wrt_consumer_inputs is a scalar
                    grad_table[node] += lossgrads_wrt_consumer_inputs
                else:
                    # Otherwise, lossgrads_wrt_consumer_inputs is an array of gradients for each input node
                    # Retrieve the index of node in consumer's inputs
                    node_index_in_consumer_inputs = consumer.input_nodes.index(node)
                    # Get the gradient of the loss with respect to node
                    lossgrad_wrt_node = lossgrads_wrt_consumer_inputs[node_index_in_consumer_inputs]
                    # Add to total gradient
                    grad_table[node] += lossgrad_wrt_node

        #
        # Append each input node to the queue
        #
        if hasattr(node, "input_nodes"):
            for input_node in node.input_nodes:
                if input_node not in visited:
                    visited.add(input_node)
                    queue.put(input_node)

    # Return gradients for each visited node
    return grad_table
```

For each of our operations, we now need to define a function that turns a gradient of the loss with respect to the operation’s output into a list of gradients of the loss with respect to each of the operation’s inputs. Computing a gradient with respect to a matrix can be somewhat tedious. Therefore, the details have been omitted and I just present the results. You may skip this section and still understand the overall picture.

If you want to comprehend how to arrive at the results, the general approach is as follows:

- Find the partial derivative of each output value with respect to each input value (this can be a tensor of a rank greater than 2, i.e. neither scalar nor vector nor matrix, involving a lot of summations)
- Compute the gradient of the loss with respect to the node’s inputs given a gradient with respect to the node’s output by applying the chain rule. This is now a tensor of the same shape as the input tensor, so if the input is a matrix, the result is also a matrix
- Rewrite this result as a sequence of matrix operations in order to compute it efficiently. This step can be somewhat tricky.

`add`

Given a gradient $G$ with respect to $a + b$, the gradient with respect to $a$ is given by $G$ and the gradient with respect to $b$ is also given by $G$, provided that $a$ and $b$ are of the same shape. If $a$ and $b$ are of different shapes, e.g. one matrix $a$ with 100 rows and one row vector $b$, we assume that $b$ is added to each row of $a$. In this case, the gradient computation is a little more involved, but I will not spell out the details here.

In [40]:

```
@RegisterGradient("add")
def _add_gradient(op, grad):
    """Computes the gradients for `add`.
    Args:
      op: The `add` `Operation` that we are differentiating
      grad: Gradient with respect to the output of the `add` op.
    Returns:
      Gradients with respect to the input of `add`.
    """
    a = op.inputs[0]
    b = op.inputs[1]

    # Sum out the dimensions along which a was broadcast
    grad_wrt_a = grad
    while np.ndim(grad_wrt_a) > len(a.shape):
        grad_wrt_a = np.sum(grad_wrt_a, axis=0)
    for axis, size in enumerate(a.shape):
        if size == 1:
            grad_wrt_a = np.sum(grad_wrt_a, axis=axis, keepdims=True)

    # Sum out the dimensions along which b was broadcast
    grad_wrt_b = grad
    while np.ndim(grad_wrt_b) > len(b.shape):
        grad_wrt_b = np.sum(grad_wrt_b, axis=0)
    for axis, size in enumerate(b.shape):
        if size == 1:
            grad_wrt_b = np.sum(grad_wrt_b, axis=axis, keepdims=True)

    return [grad_wrt_a, grad_wrt_b]
```
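For the broadcast case, the upshot is that the gradient with respect to the row vector is the sum of the rows of $G$, because the same vector contributes to every row of the output. The following standalone sketch checks this with finite differences; the shapes and the surrogate loss $\sum_{i,j} G_{i,j} (a + b)_{i,j}$ are assumptions for illustration and the code does not depend on the graph library.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 3))   # matrix
b = rng.normal(size=3)        # row vector, broadcast to every row of a
G = rng.normal(size=(4, 3))   # upstream gradient w.r.t. a + b

# Gradients as computed by the rule above
grad_wrt_a = G
grad_wrt_b = G.sum(axis=0)    # sum over the rows along which b was broadcast

# Surrogate scalar loss whose gradient w.r.t. (a + b) is exactly G
def loss(b_value):
    return np.sum(G * (a + b_value))

# Central finite differences for each entry of b
h = 1e-6
numeric = np.array([
    (loss(b + h * np.eye(3)[k]) - loss(b - h * np.eye(3)[k])) / (2 * h)
    for k in range(3)
])
print(grad_wrt_b, numeric)
```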

`matmul`

Given a gradient $G$ with respect to $AB$, the gradient with respect to $A$ is given by $GB^T$ and the gradient with respect to $B$ is given by $A^TG$.

In [41]:

```
@RegisterGradient("matmul")
def _matmul_gradient(op, grad):
    """Computes the gradients for `matmul`.
    Args:
      op: The `matmul` `Operation` that we are differentiating
      grad: Gradient with respect to the output of the `matmul` op.
    Returns:
      Gradients with respect to the input of `matmul`.
    """
    A = op.inputs[0]
    B = op.inputs[1]

    return [grad.dot(B.T), A.T.dot(grad)]
```
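If you would like to convince yourself of these two formulas without working through the index notation, a finite-difference check is a quick option. The sketch below is independent of the graph library; `numeric_grad` is a hypothetical helper, and the surrogate loss $\sum_{i,j} G_{i,j} (AB)_{i,j}$ is chosen so that its gradient with respect to $AB$ is exactly $G$.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 2))
G = rng.normal(size=(3, 2))   # upstream gradient w.r.t. A @ B

# Analytic gradients from the formulas above
grad_A = G @ B.T
grad_B = A.T @ G

def numeric_grad(f, M, h=1e-6):
    """Central finite-difference gradient of the scalar function f at matrix M."""
    grad = np.zeros_like(M)
    for idx in np.ndindex(M.shape):
        step = np.zeros_like(M)
        step[idx] = h
        grad[idx] = (f(M + step) - f(M - step)) / (2 * h)
    return grad

num_A = numeric_grad(lambda M: np.sum(G * (M @ B)), A)
num_B = numeric_grad(lambda M: np.sum(G * (A @ M)), B)
print(np.abs(grad_A - num_A).max(), np.abs(grad_B - num_B).max())
```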

`sigmoid`

Given a gradient $G$ with respect to $\sigma(a)$, the gradient with respect to $a$ is given by $G \cdot \sigma(a) \cdot (1 - \sigma(a))$.

In [51]:

```
@RegisterGradient("sigmoid")
def _sigmoid_gradient(op, grad):
    """Computes the gradients for `sigmoid`.
    Args:
      op: The `sigmoid` `Operation` that we are differentiating
      grad: Gradient with respect to the output of the `sigmoid` op.
    Returns:
      Gradients with respect to the input of `sigmoid`.
    """
    sigmoid = op.output

    return grad * sigmoid * (1 - sigmoid)
```
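The factor `sigmoid * (1 - sigmoid)` used in the code is the derivative of the sigmoid expressed through its own output. A small standalone check against finite differences, where `sigmoid_fn` is a hypothetical helper and not the library's `sigmoid` operation:

```python
import numpy as np

def sigmoid_fn(a):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-3.0, 3.0, 7)

# Derivative expressed through the sigmoid's own output
analytic = sigmoid_fn(a) * (1.0 - sigmoid_fn(a))

# Central finite differences
h = 1e-6
numeric = (sigmoid_fn(a + h) - sigmoid_fn(a - h)) / (2 * h)
print(np.abs(analytic - numeric).max())
```

The derivative peaks at $a = 0$, where $\sigma(0) = 0.5$ and the slope is $0.25$.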

`softmax`

In [42]:

```
@RegisterGradient("softmax")
def _softmax_gradient(op, grad):
    """Computes the gradients for `softmax`.
    Args:
      op: The `softmax` `Operation` that we are differentiating
      grad: Gradient with respect to the output of the `softmax` op.
    Returns:
      Gradients with respect to the input of `softmax`.
    """
    softmax = op.output

    return (grad - np.reshape(
        np.sum(grad * softmax, 1),
        [-1, 1]
    )) * softmax
```
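Written as a formula, the code above computes $\left(G - \sum_j (G \odot S)_{\cdot, j}\right) \odot S$ for each row, where $S$ is the softmax output. The following standalone sketch compares this expression with finite differences; `softmax_fn` is a hypothetical row-wise softmax written for this check, not the library's `softmax` operation.

```python
import numpy as np

def softmax_fn(Z):
    """Row-wise softmax (shifted by the row maximum for numerical stability)."""
    e = np.exp(Z - Z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(2)
Z = rng.normal(size=(3, 4))   # softmax input
G = rng.normal(size=(3, 4))   # upstream gradient w.r.t. the softmax output
S = softmax_fn(Z)

# Same expression as in _softmax_gradient
analytic = (G - np.sum(G * S, axis=1, keepdims=True)) * S

# Central finite differences of the surrogate loss sum(G * softmax(Z))
h = 1e-6
numeric = np.zeros_like(Z)
for idx in np.ndindex(Z.shape):
    step = np.zeros_like(Z)
    step[idx] = h
    numeric[idx] = (np.sum(G * softmax_fn(Z + step))
                    - np.sum(G * softmax_fn(Z - step))) / (2 * h)
print(np.abs(analytic - numeric).max())
```

Each row of the result sums to zero, which reflects the fact that adding the same constant to all inputs of a softmax does not change its output.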

`log`

Given a gradient $G$ with respect to $\log(x)$, the gradient with respect to $x$ is given by $\frac{G}{x}$.

In [43]:

```
@RegisterGradient("log")
def _log_gradient(op, grad):
    """Computes the gradients for `log`.
    Args:
      op: The `log` `Operation` that we are differentiating
      grad: Gradient with respect to the output of the `log` op.
    Returns:
      Gradients with respect to the input of `log`.
    """
    x = op.inputs[0]
    return grad/x
```

`multiply`

Given a gradient $G$ with respect to $A \odot B$, the gradient with respect to $A$ is given by $G \odot B$ and the gradient with respect to $B$ is given by $G \odot A$.

In [44]:

```
@RegisterGradient("multiply")
def _multiply_gradient(op, grad):
    """Computes the gradients for `multiply`.
    Args:
      op: The `multiply` `Operation` that we are differentiating
      grad: Gradient with respect to the output of the `multiply` op.
    Returns:
      Gradients with respect to the input of `multiply`.
    """
    A = op.inputs[0]
    B = op.inputs[1]

    return [grad * B, grad * A]
```

`reduce_sum`

Given a gradient $G$ with respect to the output of `reduce_sum`, the gradient with respect to the input $A$ is given by repeating $G$ along the specified axis.

In [45]:

```
@RegisterGradient("reduce_sum")
def _reduce_sum_gradient(op, grad):
    """Computes the gradients for `reduce_sum`.
    Args:
      op: The `reduce_sum` `Operation` that we are differentiating
      grad: Gradient with respect to the output of the `reduce_sum` op.
    Returns:
      Gradients with respect to the input of `reduce_sum`.
    """
    A = op.inputs[0]

    # Shape of the output with the reduced axis kept as size 1
    output_shape = np.array(A.shape)
    output_shape[op.axis] = 1
    # Number of repetitions needed along each axis to get back to A's shape
    tile_scaling = A.shape // output_shape
    grad = np.reshape(grad, output_shape)
    return np.tile(grad, tile_scaling)
```

`negative`

Given a gradient $G$ with respect to $-x$, the gradient with respect to $x$ is given by $-G$.

In [46]:

```
@RegisterGradient("negative")
def _negative_gradient(op, grad):
    """Computes the gradients for `negative`.
    Args:
      op: The `negative` `Operation` that we are differentiating
      grad: Gradient with respect to the output of the `negative` op.
    Returns:
      Gradients with respect to the input of `negative`.
    """
    return -grad
```

Let’s now test our implementation to determine the optimal weights for our perceptron.

In [47]:

```
# Create a new graph
Graph().as_default()

X = placeholder()
c = placeholder()

# Initialize weights randomly
W = Variable(np.random.randn(2, 2))
b = Variable(np.random.randn(2))

# Build perceptron
p = softmax( add(matmul(X, W), b) )

# Build cross-entropy loss
J = negative(reduce_sum(reduce_sum(multiply(c, log(p)), axis=1)))

# Build minimization op
minimization_op = GradientDescentOptimizer(learning_rate = 0.01).minimize(J)

# Build placeholder inputs
feed_dict = {
    X: np.concatenate((blue_points, red_points)),
    c:
        [[1, 0]] * len(blue_points)
        + [[0, 1]] * len(red_points)
}

# Create session
session = Session()

# Perform 100 gradient descent steps
for step in range(100):
    J_value = session.run(J, feed_dict)
    if step % 10 == 0:
        print("Step:", step, " Loss:", J_value)
    session.run(minimization_op, feed_dict)

# Print final result
W_value = session.run(W)
print("Weight matrix:\n", W_value)
b_value = session.run(b)
print("Bias:\n", b_value)
```

In [48]:

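```
# Plot the learned decision boundary, i.e. the line on which both classes
# receive the same score:
# x * (W[0][0] - W[0][1]) + y * (W[1][0] - W[1][1]) + (b[0] - b[1]) = 0
w_diff = W_value[:, 0] - W_value[:, 1]
b_diff = b_value[0] - b_value[1]
x_axis = np.linspace(-4, 4, 100)
y_axis = -w_diff[0]/w_diff[1] * x_axis - b_diff/w_diff[1]
plt.plot(x_axis, y_axis)

# Add the red and blue points
plt.scatter(red_points[:,0], red_points[:,1], color='red')
plt.scatter(blue_points[:,0], blue_points[:,1], color='blue')
plt.show()
```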

If you have any questions, feel free to leave a comment. Otherwise, continue with the next part: V: Multi-Layer Perceptrons

The post Deep Learning From Scratch III: Training criterion appeared first on deep ideas.

- Part I: Computational Graphs
- Part II: Perceptrons
- Part III: Training criterion
- Part IV: Gradient Descent and Backpropagation
- Part V: Multi-Layer Perceptrons

Great, so now we are able to classify points using a linear classifier and compute the probability that the point belongs to a certain class, provided that we know the appropriate parameters for the weight matrix $W$ and bias $b$. The natural question that arises is how to come up with appropriate values for these. In the red/blue example, we just looked at the training points and guessed a line that nicely separated the training points. But generally we do not want to specify the separating line by hand. Rather, we just want to supply the training points to the computer and let it come up with a good separating line on its own. But how do we judge whether a separating line is good or bad?

Ideally, we want to find a line that makes as few errors as possible. For every point $x$ and class $c(x)$ drawn from the true but unknown data-generating distribution $p_\text{data}(x, c(x))$, we want to minimize the probability that our perceptron classifies it incorrectly – the **probability of misclassification**:

$$\underset{W, b}{\operatorname{argmin}} \; p(\hat{c}(x) \neq c(x) \mid x, c(x) \sim p_\text{data} )$$

Generally, we do not know the data-generating distribution $p_\text{data}$, so it is impossible to compute the exact probability of misclassification. Instead, we are given a finite list of $N$ **training points** consisting of the values of $x$ with their corresponding classes. In the following, we represent the list of training points as a matrix $X \in \mathbb{R}^{N \times d}$ where each row corresponds to one training point and each column to one dimension of the input space. Moreover, we represent the true classes as a matrix $c \in \mathbb{R}^{N \times C}$ where $c_{i, j} = 1$ if the $i$-th training sample has class $j$, and $c_{i, j} = 0$ otherwise. Similarly, we represent the predicted classes as a matrix $\hat{c} \in \mathbb{R}^{N \times C}$ where $\hat{c}_{i, j} = 1$ if the $i$-th training sample has the predicted class $j$, and $\hat{c}_{i, j} = 0$ otherwise. Finally, we represent the output probabilities of our model as a matrix $p \in \mathbb{R}^{N \times C}$ where $p_{i, j}$ contains the probability that the $i$-th training sample belongs to the $j$-th class.

We could use the training data to find a classifier that minimizes the **misclassification rate** on the training samples:

$$

\underset{W, b}{\operatorname{argmin}} \frac{1}{N} \sum_{i = 1}^N I(\hat{c}_i \neq c_i)

$$

However, it turns out that finding a linear classifier that minimizes the misclassification rate is an intractable problem, i.e. its computational complexity is exponential in the number of input dimensions, rendering it impractical. Moreover, even if we have found a classifier that minimizes the misclassification rate on the training samples, it might be possible to make the classifier more robust to unseen samples by pushing the classes further apart, even if this does not reduce the misclassification rate on the training samples.
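Evaluating the misclassification rate for a given classifier is easy; it is only the minimization over $W$ and $b$ that is hard. As a small illustration, the sketch below computes the rate from a matrix of output probabilities and a one-hot class matrix. The four probability rows are made-up numbers, and the `_demo` names are hypothetical.

```python
import numpy as np

# Made-up output probabilities p (N = 4 samples, C = 2 classes)
p_demo = np.array([[0.9, 0.1],
                   [0.4, 0.6],
                   [0.2, 0.8],
                   [0.7, 0.3]])

# True classes c in one-hot form: samples 0 and 1 have class 0, samples 2 and 3 have class 1
c_demo = np.array([[1, 0],
                   [1, 0],
                   [0, 1],
                   [0, 1]])

# Predicted class = class with the highest probability in each row
predicted = p_demo.argmax(axis=1)
true = c_demo.argmax(axis=1)

# Fraction of samples whose predicted class differs from the true class
misclassification_rate = np.mean(predicted != true)
print(misclassification_rate)
```

Note that the rate only changes when a prediction flips from one class to another, so it is piecewise constant in the parameters and provides no gradient to follow.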

An alternative is to use maximum likelihood estimation, where we try to find the parameters that maximize the probability of the training data:

\begin{align}
\underset{W, b}{\operatorname{argmax}} \; p(\hat{c} = c)
&= \underset{W, b}{\operatorname{argmax}} \prod_{i=1}^N p(\hat{c}_i = c_i) \\
&= \underset{W, b}{\operatorname{argmax}} \prod_{i=1}^N \prod_{j=1}^C p_{i, j}^{I(c_i = j)} \\
&= \underset{W, b}{\operatorname{argmax}} \prod_{i=1}^N \prod_{j=1}^C p_{i, j}^{c_{i, j}} \\
&= \underset{W, b}{\operatorname{argmax}} \log \prod_{i=1}^N \prod_{j=1}^C p_{i, j}^{c_{i, j}} \\
&= \underset{W, b}{\operatorname{argmax}} \sum_{i=1}^N \sum_{j=1}^C c_{i, j} \cdot \log \, p_{i, j} \\
&= \underset{W, b}{\operatorname{argmin}} - \sum_{i=1}^N \sum_{j=1}^C c_{i, j} \cdot \log \, p_{i, j} \\
&= \underset{W, b}{\operatorname{argmin}} J
\end{align}

We refer to $J = - \sum_{i=1}^N \sum_{j=1}^C c_{i, j} \cdot \log \, p_{i, j}$ as the **cross-entropy loss**. We want to minimize $J$.
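To see the connection between $J$ and the likelihood on concrete numbers, here is a plain-NumPy sketch that evaluates the double sum directly. The probabilities are made up, and `cross_entropy` and the `_demo` variables are hypothetical names that are not part of the graph library.

```python
import numpy as np

def cross_entropy(c, p):
    """J = -sum_i sum_j c[i, j] * log(p[i, j]) for one-hot classes c and probabilities p."""
    return -np.sum(c * np.log(p))

# Made-up output probabilities and one-hot true classes for N = 4 samples
p_demo = np.array([[0.9, 0.1],
                   [0.4, 0.6],
                   [0.2, 0.8],
                   [0.7, 0.3]])
c_demo = np.array([[1, 0],
                   [1, 0],
                   [0, 1],
                   [0, 1]])

J_demo = cross_entropy(c_demo, p_demo)

# Only the probability assigned to the true class of each sample contributes
likelihood = 0.9 * 0.4 * 0.8 * 0.3
print(J_demo, -np.log(likelihood))
```

Since $J$ is the negative logarithm of the likelihood of the training classes, lowering $J$ and raising the likelihood are the same thing.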

We can view $J$ as yet another operation in our computational graph that takes the input data $X$, the true classes $c$ and our predicted probabilities $p$ (which are the output of the $\sigma$ operation) as input and computes a real number designating the loss:

We can build up $J$ from various more primitive operations. Using the element-wise matrix multiplication $\odot$, we can rewrite $J$ as follows:

$$
- \sum_{i=1}^N \sum_{j=1}^C (c \odot \log \, p)_{i, j}
$$

Going from the inside out, we can see that we need to implement the following operations:

- $\log$: The element-wise logarithm of a matrix or vector
- $\odot$: The element-wise product of two matrices
- $\sum_{j=1}^C$: Sum over the columns of a matrix
- $\sum_{i=1}^N$: Sum over the rows of a matrix
- $-$: Taking the negative

Let’s implement these operations.

This computes the element-wise logarithm of a tensor.

In [32]:

```
class log(Operation):
    """Computes the natural logarithm of x element-wise.
    """

    def __init__(self, x):
        """Construct log
        Args:
          x: Input node
        """
        super().__init__([x])

    def compute(self, x_value):
        """Compute the output of the log operation
        Args:
          x_value: Input value
        """
        return np.log(x_value)
```

This computes the element-wise product of two tensors of the same shape.

In [33]:

```
class multiply(Operation):
    """Returns x * y element-wise.
    """

    def __init__(self, x, y):
        """Construct multiply
        Args:
          x: First multiplicand node
          y: Second multiplicand node
        """
        super().__init__([x, y])

    def compute(self, x_value, y_value):
        """Compute the output of the multiply operation
        Args:
          x_value: First multiplicand value
          y_value: Second multiplicand value
        """
        return x_value * y_value
```

We’ll implement the summation over rows, columns, etc. in a single operation where we specify an `axis`. This way, we can use the same method for all types of summations. For example, `axis = 0` sums over the rows (yielding one value per column), `axis = 1` sums over the columns (yielding one value per row), etc. This is exactly what `numpy.sum` does.

In [34]:

```
class reduce_sum(Operation):
    """Computes the sum of elements across dimensions of a tensor.
    """

    def __init__(self, A, axis = None):
        """Construct reduce_sum

        Args:
          A: The tensor to reduce.
          axis: The dimensions to reduce. If `None` (the default), reduces all dimensions.
        """
        super().__init__([A])
        self.axis = axis

    def compute(self, A_value):
        """Compute the output of the reduce_sum operation

        Args:
          A_value: Input tensor value
        """
        return np.sum(A_value, self.axis)
```
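To make the `axis` convention concrete, here is a small stand-alone NumPy sketch. It calls `np.sum` directly rather than going through the `reduce_sum` operation, and the matrix `A` is an arbitrary illustrative value.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])

col_totals = np.sum(A, axis=0)  # collapses the rows: one total per column
row_totals = np.sum(A, axis=1)  # collapses the columns: one total per row
total = np.sum(A)               # axis=None sums every element

print(col_totals)  # [5 7 9]
print(row_totals)  # [ 6 15]
print(total)       # 21
```

For the loss, `axis=1` first sums over the $C$ classes of each point, and the outer sum without an axis then adds up the per-point values.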

This computes the element-wise negative of a tensor.

In [35]:

```
class negative(Operation):
    """Computes the negative of x element-wise.
    """

    def __init__(self, x):
        """Construct negative

        Args:
          x: Input node
        """
        super().__init__([x])

    def compute(self, x_value):
        """Compute the output of the negative operation

        Args:
          x_value: Input value
        """
        return -x_value
```

Using these operations, we can now compute $J = - \sum_{i=1}^N \sum_{j=1}^C (c \odot \log p)_{i, j}$ as follows:

`J = negative(reduce_sum(reduce_sum(multiply(c, log(p)), axis=1)))`

Let’s now compute the loss of our red/blue perceptron.

In [36]:

```
# Create a new graph
Graph().as_default()
X = placeholder()
c = placeholder()
W = Variable([
    [1, -1],
    [1, -1]
])
b = Variable([0, 0])
p = softmax( add(matmul(X, W), b) )

# Cross-entropy loss
J = negative(reduce_sum(reduce_sum(multiply(c, log(p)), axis=1)))

session = Session()
print(session.run(J, {
    X: np.concatenate((blue_points, red_points)),
    c:
        [[1, 0]] * len(blue_points)
        + [[0, 1]] * len(red_points)
}))
```
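The same loss can be cross-checked without the graph classes. The sketch below is a plain-NumPy version under a few assumptions: the arrays `blue` and `red` are freshly generated stand-ins for `blue_points` and `red_points` (with a fixed seed chosen here for reproducibility), so the printed value will differ from the one produced by the session above.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed, an assumption of this sketch
blue = rng.standard_normal((50, 2)) + 2   # stand-in for blue_points
red = rng.standard_normal((50, 2)) - 2    # stand-in for red_points

X = np.concatenate((blue, red))
c = np.array([[1, 0]] * len(blue) + [[0, 1]] * len(red))
W = np.array([[1, -1],
              [1, -1]])
b = np.array([0, 0])

# Softmax over each row of the scores XW + b
scores = X @ W + b
exp_scores = np.exp(scores)
p = exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Cross-entropy loss, mirroring negative(reduce_sum(reduce_sum(..., axis=1)))
J = -np.sum(np.sum(c * np.log(p), axis=1))
print(J)
```

Because the two clusters are well separated by the line $y = -x$, most points receive a probability close to 1 for their true class, so the loss should be small but positive.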

If you have any questions, feel free to leave a comment. Otherwise, continue with the next part: IV: Gradient Descent and Backpropagation

The post Deep Learning From Scratch III: Training criterion appeared first on deep ideas.

The post Deep Learning From Scratch II: Perceptrons appeared first on deep ideas.

- Part I: Computational Graphs
- Part II: Perceptrons
- Part III: Training criterion
- Part IV: Gradient Descent and Backpropagation
- Part V: Multi-Layer Perceptrons

Perceptrons are a miniature form of neural network and a basic building block of more complex architectures. Before going into the details, let’s motivate them by an example. Assume that we are given a dataset consisting of 100 points in the plane. Half of the points are red and half of the points are blue.

In [22]:

```
import numpy as np
import matplotlib.pyplot as plt
# Create red points centered at (-2, -2)
red_points = np.random.randn(50, 2) - 2*np.ones((50, 2))
# Create blue points centered at (2, 2)
blue_points = np.random.randn(50, 2) + 2*np.ones((50, 2))
# Plot them
plt.scatter(red_points[:,0], red_points[:,1], color='red')
plt.scatter(blue_points[:,0], blue_points[:,1], color='blue')
plt.show()
```

As we can see, the red points are centered at $(-2, -2)$ and the blue points are centered at $(2, 2)$. Now, having seen this data, we can ask ourselves whether there is a way to determine if a point should be red or blue. For example, if someone asks us what the color of the point $(3, 2)$ should be, we’d best respond with blue. Even though this point was not part of the data we have seen, we can infer this since it is located in the blue region of the space.

But what is the general rule to determine if a point is more likely to be blue than red? Apparently, we can draw a line $y = -x$ that nicely separates the space into a red region and a blue region:

In [24]:

```
# Plot a line y = -x
x_axis = np.linspace(-4, 4, 100)
y_axis = -x_axis
plt.plot(x_axis, y_axis)
# Add the red and blue points
plt.scatter(red_points[:,0], red_points[:,1], color='red')
plt.scatter(blue_points[:,0], blue_points[:,1], color='blue')
plt.show()
```

We can implicitly represent this line using a **weight vector** $w$ and a **bias** $b$. The line then corresponds to the set of points $x$ where

$$w^T x + b = 0.$$

In the case above, we have $w = (1, 1)^T$ and $b = 0$. Now, in order to test whether the point is blue or red, we just have to check whether it is above or below the line. This can be achieved by checking the sign of $w^T x + b$. If it is positive, then $x$ is above the line. If it is negative, then $x$ is below the line. Let’s perform this test for our example point $(3, 2)^T$:

$$
\begin{pmatrix} 1 & 1 \end{pmatrix} \cdot \begin{pmatrix} 3 \\ 2 \end{pmatrix} = 5
$$

Since 5 > 0, we know that the point is above the line and, therefore, should be classified as blue.

In general terms, a **classifier** is a function $\hat{c} : \mathbb{R}^d \rightarrow \{1, 2, \ldots, C\}$ that maps a point onto one of $C$ classes. A **binary classifier** is a classifier where $C = 2$, i.e. we have two classes. A **perceptron** with weight $w \in \mathbb{R}^d$ and bias $b \in \mathbb{R}$ is a binary classifier where

$$
\hat{c}(x) =
\begin{cases}
1, & \text{if } w^T x + b \geq 0 \\
2, & \text{if } w^T x + b < 0
\end{cases}
$$

$\hat{c}$ partitions $\mathbb{R}^d$ into two half-spaces, each corresponding to one of the two classes. In the 2-dimensional example above, the partitioning is along a line. In general, the partitioning is along a $(d-1)$-dimensional hyperplane.
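The decision rule can be sketched in a few lines of NumPy. The function name `perceptron_classify` is a hypothetical helper introduced here for illustration, and the second test point $(-1, -3)$ is an arbitrary example; class 1 plays the role of blue and class 2 the role of red.

```python
import numpy as np

def perceptron_classify(x, w, b):
    """Return class 1 if w^T x + b >= 0, otherwise class 2."""
    return 1 if np.dot(w, x) + b >= 0 else 2

# Weights and bias of the line y = -x from the red/blue example
w = np.array([1, 1])
b = 0

print(perceptron_classify(np.array([3, 2]), w, b))    # 1
print(perceptron_classify(np.array([-1, -3]), w, b))  # 2
```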

Depending on the application, we may be interested not only in determining the most likely class of a point, but also in the probability with which it belongs to that class. Note that the higher the value of $w^T x + b$, the farther the point lies from the separating line and, therefore, the higher our confidence that it belongs to the blue class. But this value can be arbitrarily high. In order to turn it into a probability, we need to “squash” it to lie between 0 and 1. One way to do this is by applying the **sigmoid** function $\sigma$:

$$p(\hat{c}(x) = 1 \mid x) = \sigma(w^T x + b)$$

where $$\sigma(a) = \frac{1}{1 + e^{-a}}$$

Let’s take a look at what the sigmoid function looks like:

In [25]:

```
# Create an interval from -5 to 5 in steps of 0.01
a = np.arange(-5, 5, 0.01)
# Compute corresponding sigmoid function values
s = 1 / (1 + np.exp(-a))
# Plot them
plt.plot(a, s)
plt.grid(True)
plt.show()
```

As we can see, the sigmoid function assigns a probability of 0.5 to values where $w^T x + b = 0$ (i.e. points on the line) and asymptotes towards 1 the higher the value of $w^T x + b$ becomes, and towards 0 the lower it becomes, which is exactly what we want.

Let’s now define the sigmoid function as an operation, since we’ll need it later:

In [26]:

```
class sigmoid(Operation):
    """Returns the sigmoid of x element-wise.
    """

    def __init__(self, a):
        """Construct sigmoid

        Args:
          a: Input node
        """
        super().__init__([a])

    def compute(self, a_value):
        """Compute the output of the sigmoid operation

        Args:
          a_value: Input value
        """
        return 1 / (1 + np.exp(-a_value))
```

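The entire computational graph of the perceptron now consists of three operations: a matmul of $w$ and $x$, an add of the result and $b$, and a sigmoid applied to the sum.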

Using what we have learned, we can now build a perceptron for the red/blue example in Python.

In [27]:

```
# Create a new graph
Graph().as_default()
x = placeholder()
w = Variable([1, 1])
b = Variable(0)
p = sigmoid( add(matmul(w, x), b) )
```

Let’s use this perceptron to compute the probability that $(3, 2)^T$ is a blue point:

In [28]:

```
session = Session()
print(session.run(p, {
    x: [3, 2]
}))
```
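The value the session computes can be reproduced by hand with plain NumPy, which is a useful check that the graph is wired up as intended. This sketch bypasses the graph classes entirely.

```python
import numpy as np

w = np.array([1, 1])
b = 0
x = np.array([3, 2])

a = np.dot(w, x) + b           # 1*3 + 1*2 + 0 = 5
p_blue = 1 / (1 + np.exp(-a))  # sigmoid(5)
print(p_blue)                  # roughly 0.9933
```

So the perceptron is about 99.3% confident that $(3, 2)^T$ is a blue point.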

So far, we have used the perceptron as a binary classifier, telling us the probability $p$ that a point $x$ belongs to one of two classes. The probability of $x$ belonging to the respective other class is then given by $1-p$. Generally, however, we have more than two classes. For example, when classifying an image, there may be numerous output classes (dog, chair, human, house, …). We can extend the perceptron to compute multiple output probabilities.

Let $C$ denote the number of output classes. Instead of a weight vector $w$, we introduce a weight matrix $W \in \mathbb{R}^{d \times C}$. Each column of the weight matrix contains the weights of a separate linear classifier – one for each class. Instead of the dot product $w^T x$, we compute $x \, W$ (treating $x$ as a row vector), which returns a vector in $\mathbb{R}^C$, each of whose entries can be seen as the output of the dot product for a different column of the weight matrix. To this, we add a bias vector $b \in \mathbb{R}^C$, containing a distinct bias for each output class. This then yields a vector in $\mathbb{R}^C$ containing one score for each of the $C$ classes.

While this procedure may seem complicated, the matrix multiplication actually just performs multiple linear classifications in parallel, one for each of the $C$ classes – each one with its own separating line, given by a weight vector (one column of $W$) and a bias (one entry of $b$).
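The following NumPy sketch illustrates this equivalence: the single product $xW + b$ gives the same numbers as evaluating each column's linear classifier on its own. The weights and the input point are illustrative values for a two-class setup.

```python
import numpy as np

x = np.array([3, 2])
W = np.array([[1, -1],
              [1, -1]])   # column 0 and column 1 are two separate weight vectors
b = np.array([0, 0])

# All classifiers at once
a = x @ W + b

# One classifier per column, evaluated separately
per_class = np.array([np.dot(W[:, j], x) + b[j] for j in range(W.shape[1])])

print(a, per_class)
```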

While the original perceptron yielded a single scalar value that we squashed through a sigmoid to obtain a probability between 0 and 1, the multi-class perceptron yields a vector $a \in \mathbb{R}^C$. The higher the $i$-th entry of $a$, the higher our confidence that the input point belongs to the $i$-th class. We would like to turn $a$ into a vector of probabilities, such that the probability for every class lies between 0 and 1 and the probabilities for all classes sum up to 1.

A common way to do this is to use the **softmax function**, which is a generalization of the sigmoid to multiple output classes:

$$\sigma(a)_i = \frac{e^{a_i}}{\sum_{j = 1}^C e^{a_j}}$$

In [54]:

```
class softmax(Operation):
    """Returns the softmax of a.
    """

    def __init__(self, a):
        """Construct softmax

        Args:
          a: Input node
        """
        super().__init__([a])

    def compute(self, a_value):
        """Compute the output of the softmax operation

        Args:
          a_value: Input value
        """
        return np.exp(a_value) / np.sum(np.exp(a_value), axis = 1)[:,None]
```
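One practical caveat that goes beyond the operation above: `np.exp` overflows for large inputs. A common remedy is to subtract the row-wise maximum before exponentiating, which leaves the result unchanged because the common factor cancels in the fraction. The sketch below shows this variant as a plain function; the name `stable_softmax` is a hypothetical helper and not part of the code in this series.

```python
import numpy as np

def stable_softmax(a):
    """Row-wise softmax of a 2-D array, shifted by the row maximum for stability."""
    shifted = a - np.max(a, axis=1, keepdims=True)
    exp_shifted = np.exp(shifted)
    return exp_shifted / np.sum(exp_shifted, axis=1, keepdims=True)

a = np.array([[5.0, -5.0],
              [1000.0, 1000.0]])  # the second row would overflow a naive np.exp
probs = stable_softmax(a)

print(probs)
print(probs.sum(axis=1))  # every row sums to 1
```

For the small scores in the red/blue example the plain version is sufficient; the shifted form mainly matters once the scores grow large.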

The matrix form allows us to feed in more than one point at a time. That is, instead of a single point $x$, we could feed in a matrix $X \in \mathbb{R}^{N \times d}$ containing one point per row (i.e. $N$ rows of $d$-dimensional points). We refer to such a matrix as a **batch**. Instead of $xW$, we compute $XW$. This returns an $N \times C$ matrix, each of whose rows contains $xW$ for one point $x$. To each row, we add the bias vector $b$, which is now a $1 \times C$ row vector. The whole procedure thus computes a function $f : \mathbb{R}^{N \times d} \rightarrow \mathbb{R}^{N \times C}$ where $f(X) = \sigma(XW + b)$ and $\sigma$ is applied to each row separately. In the computational graph, $X$ and $W$ feed into a matmul operation, its result and $b$ feed into an add operation, and the softmax is applied to the sum.

Let’s now generalize our red/blue perceptron to allow for batch computation and multiple output classes.

In [30]:

```
# Create a new graph
Graph().as_default()
X = placeholder()
# Create a weight matrix for 2 output classes:
# One with a weight vector (1, 1) for blue and one with a weight vector (-1, -1) for red
W = Variable([
    [1, -1],
    [1, -1]
])
b = Variable([0, 0])
p = softmax( add(matmul(X, W), b) )
```

In [31]:

```
# Create a session and run the perceptron on our blue/red points
session = Session()
output_probabilities = session.run(p, {
    X: np.concatenate((blue_points, red_points))
})
# Print the first 10 lines, corresponding to the probabilities of the first 10 points
print(output_probabilities[:10])
```
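To see the shapes involved in the batch computation, here is a small plain-NumPy sketch. The sizes `N`, `d` and `C` and the values in `X` and `b` are arbitrary illustrative choices; the point is that the bias vector is broadcast across all rows and that each row of the batch result matches the single-point computation.

```python
import numpy as np

N, d, C = 4, 2, 2
X = np.arange(N * d, dtype=float).reshape(N, d)  # N points, one per row
W = np.array([[1.0, -1.0],
              [1.0, -1.0]])                       # d x C weight matrix
b = np.array([0.5, -0.5])                         # one bias per class

scores = X @ W + b   # b is broadcast across all N rows
print(X.shape, W.shape, b.shape, scores.shape)

# Row i of the batch result equals the single-point computation for X[i]
print(np.allclose(scores[2], X[2] @ W + b))
```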

If you have any questions, feel free to leave a comment. Otherwise, continue with the next part: III: Training criterion
