The post The relationship between future-related thinking and mood appeared first on deep ideas.
In a study we are currently conducting at Ruhr-University Bochum, we aim to examine the extent to which particular kinds of thoughts about the future stay the same or change from week to week over the course of a few weeks, and how this relates to changes in people’s mood over the same period.
The study involves a first survey, taking about 15 minutes, and then 3 short surveys (about 5 mins each) over the following three weeks.
If you want to help us advance the scientific knowledge in this area (or win one of 8 Amazon vouchers worth 20 Euros), please click on this link: https://bochumpsych.eu.qualtrics.com/jfe/form/SV_204xPSFCmghhJR3
The post Deep Learning From Scratch VI: TensorFlow appeared first on deep ideas.
It is now time to say goodbye to our toy library and to get professional by switching to the actual TensorFlow.
As we’ve learned already, TensorFlow conceptually works exactly the same as our implementation. So why not just stick to our own implementation? There are a couple of reasons:
TensorFlow is the product of years of effort in providing efficient implementations for all the algorithms relevant to our purposes. Fortunately, there are experts at Google whose everyday job is to optimize these implementations. We do not need to know all of these details. We only have to know what the algorithms do conceptually (which we do now) and how to call them.
TensorFlow allows us to train our neural networks on the GPU (graphical processing unit), resulting in an enormous speedup through massive parallelization.
Google is now building Tensor Processing Units (TPUs), integrated circuits specifically built to run and train TensorFlow graphs, resulting in yet another enormous speedup.
TensorFlow comes pre-equipped with a lot of neural network architectures that would be cumbersome to build on our own.
TensorFlow comes with a high-level API called Keras that allows us to build neural network architectures far more easily than by defining the computational graph by hand, as we did up until now. We will learn more about Keras in a later lesson.
So let’s get started. Installing TensorFlow is very easy.
pip install tensorflow
If we want GPU acceleration, we have to install the package tensorflow-gpu instead:
pip install tensorflow-gpu
In our code, we import it as follows:
import tensorflow as tf
Since the syntax we are used to from the previous sections mimics the TensorFlow syntax, we already know how to use TensorFlow. We only have to make the following changes:
- Add tf. to the front of all our function calls and classes
- Add session.run(tf.global_variables_initializer()) after building the graph

The rest is exactly the same. Let’s recreate the multi-layer perceptron from the previous section using TensorFlow:
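The original listing is not reproduced in this version of the post, so here is a sketch of what the multi-layer perceptron might look like. The two point clouds are an assumed stand-in for the dataset of the previous section, and the code targets the TensorFlow 1.x graph API that this post is based on (under TensorFlow 2 that API is still available via tf.compat.v1):

```python
import numpy as np
import tensorflow.compat.v1 as tf  # the post predates TF 2; the 1.x graph API lives here now
tf.disable_v2_behavior()

np.random.seed(0)

# Toy two-class dataset: an assumed stand-in for the red/blue point clouds
# of the previous section.
red_points = np.random.randn(50, 2) - 2
blue_points = np.random.randn(50, 2) + 2

X = tf.placeholder(dtype=tf.float64, shape=(None, 2))  # input points
c = tf.placeholder(dtype=tf.float64, shape=(None, 2))  # one-hot class labels

# Hidden layer
W_hidden = tf.Variable(np.random.randn(2, 2))
b_hidden = tf.Variable(np.random.randn(2))
p_hidden = tf.sigmoid(tf.matmul(X, W_hidden) + b_hidden)

# Output layer
W_output = tf.Variable(np.random.randn(2, 2))
b_output = tf.Variable(np.random.randn(2))
p_output = tf.nn.softmax(tf.matmul(p_hidden, W_output) + b_output)

# Cross-entropy loss and a gradient descent training operation
J = -tf.reduce_sum(c * tf.log(p_output))
minimization_op = tf.train.GradientDescentOptimizer(0.01).minimize(J)

feed_dict = {
    X: np.concatenate((blue_points, red_points)),
    c: [[1, 0]] * len(blue_points) + [[0, 1]] * len(red_points),
}

session = tf.Session()
session.run(tf.global_variables_initializer())
losses = []
for step in range(100):
    losses.append(session.run(J, feed_dict))
    session.run(minimization_op, feed_dict)
```

Note how the structure is identical to our own implementation: build the graph, create a session, initialize the variables, then alternate between evaluating the loss and running the minimization operation.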
In the next lesson, we will learn about Keras, a high-level API on top of TensorFlow that allows us to define and train neural networks more abstractly, without having to specify the internal composition of all the operations every time. You can either subscribe to deep ideas by Email or subscribe to my Facebook page to stay updated.
The post Connectionist Models of Cognition appeared first on deep ideas.
In this video, I give an introduction to the field of computational cognitive modeling (i.e. modeling minds through algorithms) in general, and connectionist modeling (i.e. using artificial neural networks for the modeling) in particular. We deal with the following topics:
The post Robot Localization IV: The Particle Filter appeared first on deep ideas.
The last filtering algorithm we are going to discuss is the Particle Filter. It is also an instance of the Bayes Filter and in some ways superior to both the Histogram Filter and the Kalman Filter. Like the Kalman Filter, it is capable of handling continuous state spaces. Unlike the Kalman Filter, however, it is capable of approximately representing arbitrary belief distributions, not only normal distributions. It is therefore suitable for non-linear dynamic systems as well.
The idea of the Particle Filter is to approximate the belief $bel(x_t)$ as a set of $n$ so-called particles $p_t^{[i]} \in dom(x_t)$: $\chi_t := \{ p_t^{[1]}, p_t^{[2]}, …, p_t^{[n]} \}$. Each of these particles is a concrete guess of the actual state vector. At each time step the particles are randomly sampled from the state space in such a way that $P(p_t^{[i]} \in \chi_t)$ is proportional to $P(x_t = p_t^{[i]} \, \vert \, e_{1:t})$.
This means that the probability of a particle being included in $\chi_t$ is proportional to the probability of it being the correct representation of the state, given the sensor measurements so far. This way, the update step can be thought of as a process similar to the evolutionary mechanism of natural selection: Strong theories that are compatible with the new measurement are likely to live on and reproduce, whereas poor theories are likely to die out. As a result, the particles tend to cluster around strong theories. We will see a visual example of this later.
We take the same approach as we did with all the previous Bayes Filters. First, we calculate a particle representation of $\overline{bel}(x_{t+1})$ from $\chi_t$, which we denote $\overline{\chi}_{t+1}$: For each particle $p_t^{[i]} \in \chi_t$, we sample a new particle $\overline{p}_{t+1}^{[i]}$ from the distribution $P(x_{t+1} \, \vert \, x_t = p_t^{[i]})$, which can be obtained from the transition model. We put all these new particles into the set $\overline{\chi}_{t+1}$.
As an example, let’s consider a moving robot in one dimension. The state contains only one variable, the location. From time $t$ to $t + 1$ the robot has moved an expected distance of 1 meter to the right with Gaussian movement noise. In this case we would just add 1 to the locations of all the particles plus a random number that is sampled from the transition model.
Now we calculate the particle representation of $bel(x_{t+1})$, namely $\chi_{t+1}$, from $\overline{\chi}_{t+1}$. The key idea here is to assign a so-called importance weight, denoted $\omega^{[i]}$, to each of the particles in $\overline{\chi}_{t+1}$. This importance weight is the probability $P(e_{t+1} \, \vert \, x_{t+1} = \overline{p}_{t+1}^{[i]})$, a measure of how compatible the particle $\overline{p}_{t+1}^{[i]}$ is with the new measurement $e_{t+1}$; it can be obtained from the sensor model. $\chi_{t+1}$ is then constructed by randomly picking $n$ particles from $\overline{\chi}_{t+1}$ with a probability proportional to their weights. The same particle may be picked multiple times. This procedure is called resampling.
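As a concrete illustration, here is a minimal sketch of one full Particle Filter step for a one-dimensional state, with Gaussian transition and sensor models (the function name and parameter names are my own choices, not from the original text):

```python
import numpy as np

def particle_filter_step(particles, delta, phi_sq, psi_sq, measurement, rng):
    """One Particle Filter update for a 1-D state.

    particles: current particle set chi_t (array of positions)
    delta:     expected movement, phi_sq: transition noise variance
    psi_sq:    sensor noise variance, measurement: e_{t+1}
    """
    n = len(particles)
    # Prediction: sample each new particle from P(x_{t+1} | x_t = p_t[i])
    predicted = particles + delta + rng.normal(0.0, np.sqrt(phi_sq), size=n)
    # Importance weights: compatibility with the measurement, P(e_{t+1} | x_{t+1})
    weights = np.exp(-0.5 * (measurement - predicted) ** 2 / psi_sq)
    weights /= weights.sum()
    # Resampling: draw n particles with probability proportional to their weight
    return rng.choice(predicted, size=n, p=weights)
```

Starting from, say, a uniform prior over $[0, 5]$ and calling this function once per time step with the incoming measurements yields particle sets whose spread shrinks as evidence accumulates.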
We elucidate the Particle Filter with a localization example that’s similar to the Kalman Filter example, i.e. we use the same transition and sensor models as well as the same position and measurement chains. Since the particles are drawn from the state space, they are simply real numbers. This time, we start with a uniform distribution over the interval $[0, 5]$. In this instance, we use 30 particles. For obvious reasons, a numerical representation of the particle sets at each time step will not be given, but a graphical representation can be seen in the following figure. Each of the black/gray lines represents one or more particles. Since multiple particles can fall on the same pixel, the opacities of the lines are proportional to the number of particles on that pixel. Again, the blue line represents the actual position and the red graph represents $P(x_t \, \vert \, e_t)$.
In this series of articles, we have introduced the Bayes Filter as a means to maintain a belief about the state of a system over time and periodically update it according to how the state evolves and which observations are made. We came across the problem that, for a continuous state space, the belief could generally not be represented in a computationally tractable way. We saw three solutions to this problem, all of which have their advantages and disadvantages.
The first solution, the Histogram Filter, solves the problem by slicing the state space into a finite amount of bins and representing the belief as a discrete probability distribution over these bins. This allows us to approximately represent arbitrary probability distributions.
The second solution, the Kalman Filter, assumes the transition and sensor models to be linear Gaussians and the initial belief to be Gaussian, which makes it inapplicable for non-linear dynamic systems, at least in its original form. As we showed, this assumption results in the fact that the belief distribution is always a Gaussian and can thus be represented by a mean and a variance only, which is very memory efficient.
The last solution, the Particle Filter, solves the problem by representing the belief as a finite set of guesses at the state, which are approximately distributed according to the actual belief distribution and are therefore a good representation for it. Like the Histogram Filter, it is able to represent arbitrary belief distributions, with the difference that the state space is not binned and therefore the approximation is more accurate.
[NORVIG] Peter Norvig, Stuart Russell (2010) Artificial Intelligence – A Modern Approach. 3rd edition, Prentice Hall International
[THRUN] Sebastian Thrun, Wolfram Burgard, Dieter Fox (2005) Probabilistic Robotics
[NEGENBORN] Rudy Negenborn (2003) Robot Localization and Kalman Filters
[DEGROOT] Morris DeGroot, Mark Schervish (2012) Probability and Statistics. 4th edition, Addison-Wesley
[BESSIERE] Pierre Bessière, Christian Laugier, Roland Siegwart (2008) Probabilistic Reasoning and Decision Making in Sensory-Motor Systems
The post Robot Localization III: The Kalman Filter appeared first on deep ideas.
This post deals with another solution to the continuous state space problem: the Kalman Filter, invented by Thiele, Swerling and Kalman. It has been used successfully in many applications, such as missions to Mars or automatic missile guidance systems (cf. [NEGENBORN, abstract]). The classical application is radar tracking, but there is a vast number of other applications (cf. [NORVIG, pp. 588, 589]).
In its essence, it is an implementation of the Bayes Filter in which the belief is a normal (Gaussian) distribution and can therefore be represented by its parameters: a mean vector and a covariance matrix. In this representation, the mean vector is the expected state and the covariance matrix is a measure of uncertainty.
In order for the Kalman Filter to work, we need to make a few assumptions about the system we wish to describe (in addition to the Markov assumptions of the Bayes Filter). If these assumptions hold, the belief $bel(x_t)$ will be normally distributed at each time step $t$ and can thus be represented by a mean vector $\mu_t$ and a covariance matrix $\Sigma_t$. It is also true that if any of the three assumptions is violated, the belief will in general be non-Gaussian for $t \geq 1$ (cf. [RISTIC, p. 4]). Thus, these assumptions are necessary and sufficient conditions for the Kalman Filter. In the next section, we will see how the Kalman Filter algorithm follows from these assumptions for one-dimensional state spaces. After that, we will take a look at the multi-dimensional algorithm. The assumptions are as follows:
- The transition model is a linear Gaussian, i.e. the next state is a linear function of the previous state plus Gaussian noise.
- The sensor model is a linear Gaussian, i.e. the measurement is a linear function of the state plus Gaussian noise.
- The initial belief $bel(x_0)$ is normally distributed.
The assertion that the belief is always normally distributed is very important, since it ensures the computational tractability of the belief update for arbitrary time steps: in the general case, i.e. for arbitrary transition and sensor distributions, a representation of the belief could, as we argued earlier in this series, grow unboundedly over time.
For simplicity, we’ll first assume that we are dealing with a one-dimensional state space (i.e. $x_t$ is just a real number, e.g. a position along a line). We will take a look at the multidimensional case later. The transition phase from time $t$ to $t + 1$ just adds some number $\delta_{t+1}$ to the state, plus some unpredictable Gaussian noise $\epsilon_{t+1}$ (as before, imagine a robot moving at a desired speed of $\delta$ per time step with some unpredictable random error):
$$x_{t+1} = x_t + \delta_{t+1} + \epsilon_{t+1}$$
Then our transition model is given by
$$P(x_{t+1} \, \vert \, x_t) = \mathcal{N}(x_t + \delta_{t+1}, \phi^2)$$
The variance $\phi^2$ acts as a measure of uncertainty, reflecting the transition noise $\epsilon$. In the robot example, assuming we are at position $x_t$ at time step $t$, the position at time step $t+1$ is a Gaussian cloud around an expected position of $x_t + \delta_{t+1}$ with a variance (uncertainty) of $\phi^2$.
Our sensor model is given by
$$P(e_{t+1} \, \vert \, x_{t+1}) = \mathcal{N}(x_{t+1}, \psi^2)$$
Again, the variance $\psi^2$ acts as a measure of uncertainty, this time for the measurement noise $\zeta$. In the robot example, assuming that we are at position $x_{t+1}$, the measurement that we get can be expected to be sampled from a Gaussian cloud around $x_{t+1}$ with a variance of $\psi^2$
Assuming that the belief at some time step $t$ is a normal distribution, i.e. $bel(x_t) = \mathcal{N}(\mu_t, \sigma_t^2)$, it can be shown that the projected belief $\overline{bel}(x_{t+1})$ is also a normal distribution with mean $\overline{\mu}_{t+1} = \mu_t + \delta_{t+1}$ and variance $\overline{\sigma}_{t+1}^2 = \sigma_{t}^2 + \phi^2$.
Considering the robot example, it should not surprise us that the expected position at time step $t+1$ is just the expected position at time step $t$ plus the expected distance $\delta_{t+1}$ that we wanted to move. Moreover, it seems reasonable that our new uncertainty in the belief, $\overline{\sigma}_{t+1}^2$, is given by the old uncertainty $\sigma_{t}^2$ plus the uncertainty that we get due to the transition $\phi^2$.
Now, assuming that $\overline{bel}(x_{t+1})$ is normally distributed, it can be shown that the updated belief $bel(x_{t+1})$ after receiving a measurement $e_{t+1}$ is a normal distribution as well, this time with mean $\mu_{t+1} = \overline{\mu}_{t+1} + k_{t+1} \cdot (e_{t+1} - \overline{\mu}_{t+1})$ and variance $\sigma_{t+1}^2 = (1 - k_{t+1}) \overline{\sigma}_{t+1}^2$, where $k_{t+1} = \frac{\overline{\sigma}_{t+1}^2}{\overline{\sigma}_{t+1}^2 + \psi^2}$.
We can see that the new mean is a weighted average of the new measurement and the old mean, where the weights are determined by the uncertainty of the projected belief and the sensor noise, respectively. This makes intuitive sense: The importance of the new measurement increases with the uncertainty of the current belief, whilst the importance of the current belief increases with the uncertainty of the measurement.
A proof of these statements can be found in [NEGENBORN, pp. 34 – 37].
Now that all the preparatory work is done, we can formulate the actual Kalman Filter algorithm. It is basically a variant of the Bayes Filter with the property that the beliefs $bel(x_t)$ and $\overline{bel}(x_t)$ are now represented by their parameterizations $(\mu_t, \sigma_t^2)$ and $(\overline{\mu}_t, \overline{\sigma}_t^2)$, respectively. As with the Bayes Filter, the correctness follows by induction.
One-Dimensional Kalman Filter
- $\overline{\mu}_{t+1} = \mu_t + \delta_{t+1}$
- $\overline{\sigma}_{t+1}^2 = \sigma_{t}^2 + \phi^2$
- $k_{t+1} = \frac{\overline{\sigma}_{t+1}^2}{\overline{\sigma}_{t+1}^2 + \psi^2}$
- $\mu_{t+1} = \overline{\mu}_{t+1} + k_{t+1} \cdot (e_{t+1} - \overline{\mu}_{t+1})$
- $\sigma_{t+1}^2 = (1 - k_{t+1}) \overline{\sigma}_{t+1}^2$
- return $\mu_{t+1}, \sigma_{t+1}^2$
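The update rules above translate directly into code. The following is a minimal sketch (the function name is mine); feeding it the four measurements from the example further below reproduces the reported final error of about 0.144:

```python
def kalman_filter_1d(mu, sigma_sq, delta, phi_sq, psi_sq, e):
    """One step of the one-dimensional Kalman Filter.

    mu, sigma_sq: parameters of bel(x_t)
    delta:        expected movement, phi_sq: transition noise variance
    psi_sq:       sensor noise variance, e: new measurement e_{t+1}
    """
    # Prediction: project the belief through the transition model
    mu_bar = mu + delta
    sigma_sq_bar = sigma_sq + phi_sq
    # Kalman gain: how much weight the new measurement gets
    k = sigma_sq_bar / (sigma_sq_bar + psi_sq)
    # Correction: incorporate the measurement
    mu_new = mu_bar + k * (e - mu_bar)
    sigma_sq_new = (1 - k) * sigma_sq_bar
    return mu_new, sigma_sq_new
```

Note that the variance update never depends on the measurement itself, only on the noise parameters.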
The variable $k$ is often called the Kalman gain (cf. [NORVIG, p. 588]) and functions as a measure of how important the new measurement is. If the uncertainty of the projected belief is low, then the Kalman gain will be low and thus the new measurement will not have a big impact on the belief. Additionally, if the uncertainty of the measurement is high, the Kalman gain will be low as well and if it is low, the Kalman gain will be high.
The Kalman gain is first incorporated in the expectation update. First, the deviation of the measurement from the expectation, $e_{t+1} - \overline{\mu}_{t+1}$, is calculated; then it is weighted with the Kalman gain and finally added to the expectation. This has exactly the desired effect that the new measurement has an impact on the belief proportional to its importance. Depending on how much new information has been incorporated, the uncertainty decreases, which is implemented in the variance update.
We will now shed some light on this algorithm by applying it to a one-dimensional robot localization problem up to time step 4. The state, i.e. the robot’s location, is simply a real number. The robot believes that it starts out at $x_0 = 0$ with some uncertainty, which is reflected by a prior belief of $\mathcal{N}(\mu_0 = 0, \sigma_0^2 = 1.0)$.
We assume that the robot moves at a constant average speed of $\delta_t = 1$ with a transition noise of $\phi^2 = 0.1$. The positions of the robot shall be $x_0 = 0, x_1 = 0.4543, x_2 = 1.3752, x_3 = 2.2080, x_4 = 3.4944$. I sampled these positions randomly using the specified transition model. Of course, they are not known to the algorithm and they shall only be used for a later comparison with the resulting beliefs (and to create the measurements). We can see that the transition noise really had an impact here. For example, from time step 0 to 1, the robot only moved 0.4543 units when the expected distance was 1 unit.
In our example, the robot is able to sense its position with a measurement noise of $\psi^2 = 1.0$. This is a lot of noise, considering that it means that, in expectation, only about 68.2% of the measurements are within a distance of 1 unit of the actual position (which is already a big interval), while the remaining 31.8% of the measurements fall even outside this interval. Let’s assume that we make the following measurements (which have been sampled from the sensor model using the actual positions specified above): $e_1 = 3.3558, e_2 = −0.0570, e_3 = 1.8155, e_4 = 3.7446$. We can see the obvious impact of the measurement noise: Although we were at position 0.4543 at time step 1, we measured the position 3.3558.
The following figure shows the development of the belief for the first four time steps, both numerically and graphically. At each time step, the black graphs show the belief specified in the upper right-hand corner, whereas the red graphs show the measurement probabilities $P(e_t \, \vert \, x_t)$. The blue line shows the position $x_t$ and the green line the expected position, i.e. the mean of the belief distribution. Take some time to go over the graphs and do not let the mass of information confuse you. After having understood this example, you are able to visualize the Kalman Filter, which helps a lot when using it.
We can see that even though the measurements have been very bad, we still arrive at a belief that is quite reasonable, with an error of only 0.144.
Let’s now take a glimpse at the multi-dimensional situation, which looks a little scary but really is completely analogous to the one-dimensional case.
As before, the transition model has to be a linear Gaussian ($x_{t+1} = A_{t+1} x_t + \Delta_{t+1} + \epsilon_{t+1}$), so our transition probabilities are given by $P(x_{t+1} \, \vert \, x_t) = \mathcal{N}(A_{t+1} x_t + \Delta_{t+1}, \Phi_{t+1})$. Our uncertainty is now reflected by a covariance matrix $\Phi_{t+1}$ instead of a variance along a single dimension.
The sensor model has to be a linear Gaussian as well ($e_{t+1} = C_{t+1} x_{t+1} + \zeta_{t+1}$), so our measurement probabilities are given analogously by $P(e_{t+1} \, \vert \, x_{t+1}) = \mathcal{N}(C_{t+1} x_{t+1}, \Psi_{t+1})$.
With these models defined, we can now state the multi-dimensional Kalman Filter algorithm (cf. [THRUN, p. 42]).
Multi-Dimensional Kalman Filter
- $\overline{\mu}_{t+1} = A_{t+1} \mu_t + \Delta_{t+1}$
- $\overline{\Sigma}_{t+1} = A_{t+1} \Sigma_t A_{t+1}^T + \Phi_{t+1}$
- $K_{t+1} = \overline{\Sigma}_{t+1} C_{t+1}^T (C_{t+1} \overline{\Sigma}_{t+1} C_{t+1}^T + \Psi_{t+1})^{-1}$
- $\mu_{t+1} = \overline{\mu}_{t+1} + K_{t+1} \cdot (e_{t+1} - C_{t+1} \overline{\mu}_{t+1})$
- $\Sigma_{t+1} = (I - K_{t+1} C_{t+1}) \overline{\Sigma}_{t+1}$
- return $\mu_{t+1}, \Sigma_{t+1}$
It is worth noting that $\overline{\Sigma}_t$, $\Sigma_t$ and $K_t$ are independent of the measurements and can therefore be calculated in advance, which reduces the amount of computation that has to be done “live”.
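In NumPy, one step of the multi-dimensional algorithm can be sketched as follows, using the notation defined above (the function name is mine, not from the original post):

```python
import numpy as np

def kalman_filter(mu, Sigma, A, Delta, Phi, C, Psi, e):
    """One step of the multi-dimensional Kalman Filter.

    mu, Sigma:     mean vector and covariance matrix of bel(x_t)
    A, Delta, Phi: transition model x_{t+1} = A x_t + Delta + noise, noise ~ N(0, Phi)
    C, Psi:        sensor model e_{t+1} = C x_{t+1} + noise, noise ~ N(0, Psi)
    e:             new measurement
    """
    # Prediction: project the belief through the transition model
    mu_bar = A @ mu + Delta
    Sigma_bar = A @ Sigma @ A.T + Phi
    # Kalman gain
    K = Sigma_bar @ C.T @ np.linalg.inv(C @ Sigma_bar @ C.T + Psi)
    # Correction: incorporate the measurement
    mu_new = mu_bar + K @ (e - C @ mu_bar)
    Sigma_new = (np.eye(len(mu)) - K @ C) @ Sigma_bar
    return mu_new, Sigma_new
```

With 1×1 matrices, this reduces exactly to the one-dimensional algorithm above.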
When we are dealing with linear Gaussian systems, the Kalman Filter is the way to go, since it is very efficient, easy to implement and exact if the three assumptions really hold. There are, however, only very few systems that really behave like this. Normally, the state transition process is nonlinear, which means it cannot be described by a simple matrix multiplication.
The so-called extended Kalman Filter attempts to overcome this issue. The idea here is that if the state transition process is approximately linear in regions close to $\mu_t$, then a Gaussian belief is a reasonable approximation. If the system behaves nonlinearly in regions close to the mean, the extended Kalman Filter yields bad results.
A different solution is the so-called switching Kalman Filter, which works by running multiple instances of the Kalman Filter in parallel, where each of them uses a different transition model. The overall belief is then calculated as a weighted sum of the belief distributions of the different instances, where the weight is a measure of how compatible this particular instance is with the measurements.
Continue with IV: The Particle Filter.
The post Dealing with Unbalanced Classes in Machine Learning appeared first on deep ideas.
Unbalanced classes create two problems: the overrepresented classes dominate the loss, so the model learns to favor them at the expense of the underrepresented ones; and the overall accuracy becomes misleading, since a model that always predicts the majority class already achieves a high score.
Fortunately, these problems are not so difficult to solve. Here are a few ways to tackle them.
1. If possible, you could collect more data for the underrepresented classes to match the number of samples in the overrepresented classes. This is probably the most rewarding approach, but it is also the hardest and most time-consuming, if not downright impossible. In the cancer example, there is a good reason that we have way more non-cancer samples than cancer samples: These are easier to obtain, since there are more people in the world who haven’t developed cancer.
2. Artificially increase the number of training samples for the underrepresented classes by creating copies. While this is the easiest solution, it wastes time and computing resources. In the cancer example, we would almost have to double the size of the dataset in order to achieve a 50:50 share between the classes, which also doubles training time without adding any new information.
3. Similar to 2, but create augmented copies of the underrepresented classes. For example, in the case of images, create slightly rotated, shifted or flipped versions of the original images. This has the positive side-effect of making the model more robust to unseen examples. However, it only does so for the underrepresented classes. Ideally, you would want to do this for all classes, but then the classes are unbalanced again and we’re back where we started.
4. Remove training samples from the overrepresented classes so that the number of training samples for all classes is the same. This solves our problem and reduces training time, but it makes our model worse. After all, we want to use as much labelled data as we possibly can, even if this causes unbalanced classes. I don’t recommend this solution.
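The augmented copies from solution 3 can be produced with a few array operations. Here is a minimal NumPy sketch for grayscale images, using random horizontal flips and small shifts (the function and its parameters are illustrative, not from the original post; Keras offers a ready-made version of this idea in its ImageDataGenerator):

```python
import numpy as np

def augment_minority(images, n_extra, seed=0):
    """Create n_extra augmented copies of images (shape: [n, height, width])
    using random horizontal flips and small horizontal shifts."""
    rng = np.random.default_rng(seed)
    picks = rng.choice(len(images), size=n_extra)  # sample source images with replacement
    out = []
    for i in picks:
        img = images[i]
        if rng.random() < 0.5:
            img = img[:, ::-1]                 # horizontal flip
        shift = int(rng.integers(-2, 3))
        img = np.roll(img, shift, axis=1)      # shift by up to 2 pixels
        out.append(img)
    return np.stack(out)
```

The augmented copies are then appended to the training set for the underrepresented class.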
The sensitivity tells us the probability that we detect cancer, given that the patient really has cancer. It is thus a measure of how good we are at correctly diagnosing people who have cancer.
$$sensitivity = Pr(detect\, cancer \; \vert \; cancer) = \frac{\text{true positives}}{\text{positives}}$$
The specificity tells us the probability that we do not detect cancer, given that the patient doesn’t have cancer. It measures how good we are at not causing people to believe that they have cancer if in fact they do not.
$$specificity = Pr(\lnot \, detect\, cancer \; \vert \; \lnot \, cancer) = \frac{\text{true negatives}}{\text{negatives}}$$
A model that always predicts cancer will have a sensitivity of 1 and a specificity of 0. A model that never predicts cancer will have a sensitivity of 0 and a specificity of 1. An ideal model should have both a sensitivity of 1 and a specificity of 1. In reality, however, this is unlikely to be achievable. Therefore, we should look for a model that achieves a good tradeoff between specificity and sensitivity. So which one of the two is more important? This can’t be said in general. It highly depends on the application.
If you build a photo-based skin cancer detection app, then a high sensitivity is probably more important than a high specificity, since you want to cause people who might have cancer to get themselves checked by a doctor. Specificity is a little less important here, but still, if you detect cancer too often, people might stop using your app since they unnecessarily get annoyed and scared.
Now suppose that our desired tradeoff between sensitivity and specificity is given by a number $t \in [0, 1]$ where $t = 1$ means that we only pay attention to sensitivity, $t = 0$ means we only pay attention to specificity and $t = 0.5$ means that we regard both to be equally important. In order to incorporate the desired tradeoff into the training process, we need the samples of the different classes to have a different contribution to the loss. To achieve this, we can simply multiply the contribution of the cancer samples to the loss by
$$\frac{\text{number of non-cancer samples}}{\text{number of cancer samples}} \cdot t$$
In Keras, the class weights can easily be incorporated into the loss by adding the following parameter to the fit function (assuming that 1 is the cancer class):
class_weight={ 1: n_non_cancer_samples / n_cancer_samples * t }
Now, while we train, we want to monitor the sensitivity and specificity. Here is how to do this in Keras. In other frameworks, the implementation should be similar (for instance, you could replace all the K calls by numpy calls).
from keras import backend as K

def sensitivity(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    return true_positives / (possible_positives + K.epsilon())

def specificity(y_true, y_pred):
    true_negatives = K.sum(K.round(K.clip((1 - y_true) * (1 - y_pred), 0, 1)))
    possible_negatives = K.sum(K.round(K.clip(1 - y_true, 0, 1)))
    return true_negatives / (possible_negatives + K.epsilon())
model.compile(
    loss='binary_crossentropy',
    optimizer=RMSprop(0.001),
    metrics=[sensitivity, specificity]
)
If we have more than two classes, we can generalize sensitivity and specificity to a “per-class accuracy”:
$$perClassAccuracy(C) = Pr(detect\, C \; \vert \; C)$$
In order to train for maximum per-class accuracy, we have to specify class weights that are inversely proportional to the size of the class:
class_weight={
    0: 1.0 / n_samples_0,
    1: 1.0 / n_samples_1,
    2: 1.0 / n_samples_2,
    ...
}
Here is a Keras implementation of the per-class accuracy, which I adapted from jdehesa at Stackoverflow.
INTERESTING_CLASS_ID = 0  # Choose the class of interest

def single_class_accuracy(y_true, y_pred):
    class_id_true = K.argmax(y_true, axis=-1)
    class_id_preds = K.argmax(y_pred, axis=-1)
    accuracy_mask = K.cast(K.equal(class_id_preds, INTERESTING_CLASS_ID), 'int32')
    class_acc_tensor = K.cast(K.equal(class_id_true, class_id_preds), 'int32') * accuracy_mask
    class_acc = K.sum(class_acc_tensor) / K.maximum(K.sum(accuracy_mask), 1)
    return class_acc
If you have any questions, feel free to leave a comment. If you want to stay updated about new machine learning articles, you can either subscribe to deep ideas by Email, subscribe to my Facebook page or follow me on Twitter.
The post Robot Localization II: The Histogram Filter appeared first on deep ideas.
The Histogram Filter is the most straightforward solution for representing continuous beliefs. We simply divide $dom(x_t)$ into $n$ disjoint bins $b_0, …, b_{n−1}$ such that $\cup_i b_{i} = dom(x_t)$. Then we define a new state $x_t^\prime \in \{0, …, n − 1\}$ where $x_t^\prime = i$ if and only if $x_t \in b_i$. Since $x_t^\prime$ has a discrete, finite state space, we can use the discrete Bayes Filter to calculate $bel(x_t^\prime)$.
$bel(x_t^\prime)$ is an approximation for $bel(x_t)$ then: For each bin $b_i$, it gives us the probability that $x_t$ is in that bin. The more bins we use, the more accurate the approximation becomes, with the downside of increasing computational complexity.
To make this more clear, we shall apply the Histogram Filter to a global localization example as displayed in the following image:
A self-driving car lives in a one-dimensional, cyclic world that is 5 meters wide. By cyclic, we mean that if it is in the rightmost cell and moves one step to the right, it’s back in the leftmost cell. The robot’s position at each time step is given as $pos_t \in [0, 5)$, which is the only state variable. It has a sensor that is, under uncertainty, able to tell the color of the wall next to it. We assume that the car is constantly moving right under noise, at an expected speed of one meter per time step.
In order to apply the Histogram Filter, we choose the following decomposition of the state space: $b_0 = [0, 1)$, $b_1 = [1, 2)$, $b_2 = [2, 3)$, $b_3 = [3, 4)$, $b_4 = [4, 5)$. This way, the position can be measured as a discrete variable $pos_t^\prime \in \{0, …, 4\}$, which is an estimate of the true, continuous position. Each discrete position corresponds to exactly one of the distinguished grid cells in the above image.
We can now specify the transition and sensor models. We assume that the car intends to move exactly one grid cell to the right at each time step, but that the inaccuracy of the motor causes it to move 2 grid cells in 5% of the cases, not move at all in 5% of the cases and move exactly 1 grid cell in 90% of the cases. This results in the following transition model:
$$
P(pos_t^\prime = (x + 2) \bmod 5 \; \vert \; pos_{t-1}^\prime = x) = 0.05\\
P(pos_t^\prime = (x + 1) \bmod 5 \; \vert \; pos_{t-1}^\prime = x) = 0.9\\
P(pos_t^\prime = x \; \vert \; pos_{t-1}^\prime = x) = 0.05
$$
As for the sensors, we assume that in 90% of the cases the measured color is correct and in 10% of the cases it is incorrect, yielding the following sensor model:
$$
P(MeasuredColor_t = Blue \; \vert \; pos_t^\prime \in \{0, 2, 3\}) = 0.9\\
P(MeasuredColor_t = Orange \; \vert \; pos_t^\prime \in \{0, 2, 3\}) = 0.1\\
P(MeasuredColor_t = Blue \; \vert \; pos_t^\prime \in \{1, 4\}) = 0.1\\
P(MeasuredColor_t = Orange \; \vert \; pos_t^\prime \in \{1, 4\}) = 0.9
$$
Let’s now use the discrete Bayes Filter to calculate the car’s belief for three time steps where the sensor measurements are Orange, Blue and Orange, in that order. We assume that the car starts at the very left (but it does not know that it does) and travels exactly one grid cell to the right per time step (which it does not know either). We can represent the belief as a 5-dimensional row vector $bel(pos_t^\prime) = (bel_{t,0}, bel_{t,1}, bel_{t,2}, bel_{t,3}, bel_{t,4})$ where $bel_{t,i}$ represents the probability that the car is in cell $i$ at time step $t$.
The car has no prior knowledge about its position. Thus, it starts out with the following belief:
$bel(pos_0^\prime) = (0.2, 0.2, 0.2, 0.2, 0.2)$
First, it projects the previous belief to the current time step:
$\overline{bel}(pos_1^\prime) = \sum_{pos_0^\prime} P(pos_1^\prime \; \vert \; pos_0^\prime) \cdot bel(pos_0^\prime)$
$= (0.05, 0.9, 0.05, 0.0, 0.0) \cdot 0.2 + (0.0, 0.05, 0.9, 0.05, 0.0) \cdot 0.2$
$+ (0.0, 0.0, 0.05, 0.9, 0.05) \cdot 0.2 + (0.05, 0.0, 0.0, 0.05, 0.9) \cdot 0.2$
$+ (0.9, 0.05, 0.0, 0.0, 0.05) \cdot 0.2 = (0.2, 0.2, 0.2, 0.2, 0.2)$
This results in the same belief as before, which shouldn’t surprise us, since each cell was equally likely to be the car’s position at time $t = 0$ and therefore, since the robot just moved blindly, each cell is still equally likely to be its position at time $t = 1$.
Now the robot updates the projected belief with the sensor input:
$bel(pos_1^\prime) = \eta \cdot P(MeasuredColor_1 = Orange \; \vert \; pos_1^\prime) \cdot \overline{bel}(pos_1^\prime)$
$= \eta \cdot (0.1, 0.9, 0.1, 0.1, 0.9) \cdot (0.2, 0.2, 0.2, 0.2, 0.2)$
$= \eta \cdot (0.02, 0.18, 0.02, 0.02, 0.18)$
$= (0.04762, 0.42857, 0.04762, 0.04762, 0.42857)$
where the last step follows by dividing the vector by the sum over all its entries so that the probabilities sum up to 1. We can see that each of the two orange cells is equally likely to have caused the sensor measurement. Thus, the robot currently has two salient hypotheses about where it might be.
$\overline{bel}(pos_2^\prime) = \sum_{pos_1^\prime} P(pos_2^\prime \; \vert \; pos_1^\prime) \cdot bel(pos_1^\prime)$
$= (0.39048, 0.08571, 0.39048, 0.06667, 0.06667)$
$bel(pos_2^\prime) = \eta \cdot P(MeasuredColor_2 = Orange \; \vert \; pos_2^\prime) \cdot \overline{bel}(pos_2^\prime)$
$= \eta \cdot (0.9, 0.1, 0.9, 0.9, 0.1) \cdot (0.39048, 0.08571, 0.39048, 0.06667, 0.06667)$
$= (0.45165, 0.01102, 0.45165, 0.07711, 0.00857)$
$\overline{bel}(pos_3^\prime) = \sum_{pos_2^\prime} P(pos_3^\prime \; \vert \; pos_2^\prime) \cdot bel(pos_2^\prime)$
$= (0.03415, 0.40747, 0.05508, 0.41089, 0.09241)$
$bel(pos_3^\prime) = \eta \cdot P(MeasuredColor_3 = Orange \; \vert \; pos_3^\prime) \cdot \overline{bel}(pos_3^\prime)$
$= \eta \cdot (0.1, 0.9, 0.1, 0.1, 0.9) \cdot (0.03415, 0.40747, 0.05508, 0.41089, 0.09241)$
$= (0.00683, 0.73358, 0.01102, 0.08219, 0.16637)$
We can see that after 3 time steps the robot is already about 73% certain that it is in the second grid cell. After another 3 time steps of travelling right and sensing the correct colors, it is 94% certain about its position.
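The entire three-step calculation can be reproduced in a few lines of Python. This is a sketch using NumPy; the cell colors, transition probabilities and sensor probabilities are taken directly from the example above, while the function and variable names are my own:

```python
import numpy as np

N = 5
colors = ["Blue", "Orange", "Blue", "Blue", "Orange"]  # cell colors from the example

# Transition model: T[i, j] = P(pos_t = j | pos_{t-1} = i)
T = np.zeros((N, N))
for i in range(N):
    T[i, i] = 0.05             # motor stalls, car does not move
    T[i, (i + 1) % N] = 0.90   # intended move of one cell
    T[i, (i + 2) % N] = 0.05   # motor overshoots by one cell

def predict(bel):
    """Projection step: project bel(pos_{t-1}) to a belief over pos_t."""
    return bel @ T

def update(bel_bar, measured_color):
    """Update step: weight by the sensor model, then normalize (the eta step)."""
    likelihood = np.array([0.9 if c == measured_color else 0.1 for c in colors])
    bel = likelihood * bel_bar
    return bel / bel.sum()

bel = np.full(N, 0.2)  # uniform prior: the car has no idea where it is
for z in ["Orange", "Blue", "Orange"]:
    bel = update(predict(bel), z)

print(np.round(bel, 5))  # -> [0.00683 0.73358 0.01102 0.08219 0.16637]
```

The final belief matches the hand calculation above, with the largest mass on the second grid cell. Running the loop for further time steps with correct measurements concentrates the belief further.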
The disadvantage of the Histogram Filter is obvious: we cannot tell the probability of each possible state, only the probability that the state lies in a certain region of the state space. This can be mitigated by using a very fine-grained decomposition of the state space, but doing so drastically increases the computational complexity.
Continue with the next part: Robot Localization III: The Kalman Filter
The post Robot Localization II: The Histogram Filter appeared first on deep ideas.
The post Robot Localization I: Recursive Bayesian Estimation appeared first on deep ideas.
The methods that we will learn are generic in nature, in that they can be used for various other tasks that involve rational decision making in the face of uncertainty. We will, for the main part, deal with filtering, which is a general method for estimating variables from noisy observations over time. In particular, we will explain the Bayes Filter and some of its variants – the Histogram Filter, the Kalman Filter and the Particle Filter. We will show the benefits and shortcomings of each of these algorithms and see how they can be applied to the robot localization problem.
The traditional approach in reasoning over time involves strict logical inference. In order for this to work, a few assumptions have to be made about the environment we wish to make decisions in. For instance, the environment has to be fully observable, which means that at any point in time we can exactly measure each aspect of the environment that is relevant to our decision making. Additionally, the environment needs to be deterministic, which means that, given the state of the environment at a certain point in time and a decision we choose, the resulting state of the environment is already determined – there is no randomness whatsoever. Last but not least, the environment has to be static, which basically means that it waits for us to make our decision before it changes.
None of these assumptions hold in realistic environments. We can never measure every aspect of an environment that might have an influence on the decision making. We can, however, use sensors to measure a small portion of the environment, but even this small portion we cannot measure with complete certainty. We call such environments partially observable.
Whether realistic environments are deterministic or not is actually an unanswered philosophical question. To humans and artificial agents, at least, they appear non-deterministic: even though we know physical laws that allow us to describe most natural processes, there are just too many influential factors that we are unable to model precisely (e.g. wind turbulence causing a seemingly random change in the trajectory of a flying ball). Despite this non-determinism, we can usually tell what is likely to happen and what is unlikely to happen. Thus, we call realistic environments stochastic. Moreover, realistic environments are dynamic as opposed to static – they are always changing. For a more thorough treatment of the nature of environments cf. [NORVIG, pp. 40 – 46].
All of these properties of realistic environments result in uncertainty about the state of the world. It is a big challenge to make rational decisions in the face of uncertainty. Humans do a great job at this every day. Even though we can never know the true state of the world and predict what is going to happen next and how we should act to achieve a desired outcome, we still manage to achieve many of our goals remarkably well. We do this by maintaining a belief about the state of the world at a certain point in time, which we arrive at by both prediction and observation. This belief can be thought of as a probability distribution over all the possible states of the world, conditioned by our observations. Given a belief, we can, for each possible decision, determine the probabilities of each possible outcome. After that, we choose the decisions that are most probable to achieve a desired goal state, maximize a performance measure, or the like. This behavior can reasonably be called rational. Of course, we do not actually maintain precise probability distributions in our brains and carry out calculations, but this is a way of imagining how this cognitive ability of ours roughly works and it gives us a first idea of how it can be implemented algorithmically.
It is a difficult but interesting task to implement such a behavior for autonomous agents. The purpose of this text is to give an insight into how the first half can be done – the task of maintaining a belief about the state of an environment that is updated over time through making predictions according to a model of how the system develops, interpreting periodically arriving, noisy observations (more specifically, sensor measurements) and incorporating them into the belief.
Robot localization is one of the most fundamental problems in mobile robotics. There are multiple instances of the localization problem with different difficulties (cf. [NEGENBORN, pp. 9 – 11]). In this article, we shall deal with the problem that the robot is given a map of the environment and then either needs to keep track of its position when the initial position is known, or localize itself from scratch when it could theoretically be anywhere.
One might use methods like GPS for positioning, but in many scenarios it is not accurate enough. Self-driving cars, for example, need accuracy within a few centimeters to be viable for road traffic. As everyone with a car navigation system knows, GPS accuracy can be poor, so it is not always an option. Since there is no reliable sensor that measures position directly, we need to rely on other observations and infer the actual position from them. A possible way to do so would be to install cameras, use pattern recognition to spot landmarks whose positions on the map are known, determine the distances to the landmarks and then use trilateration to determine the robot’s position.
It is reasonable to assume that the distance sensors are noisy. It becomes even more difficult when we assume that the robot is moving through the world, because movement is usually noisy as well: even though the robot can control its average speed, motors are subject to unmodeled inaccuracies, resulting in unpredictable speed variations. As we can see, this is a situation as described in the previous section: the robot cannot infer its exact position from sensor data and, even if it does know its exact position at a certain point in time, it does not know it for certain anymore a moment later. This is because the model it uses to describe the environment cannot capture the marginal factors that cause the motor to be inaccurate. As such, this problem is a good example for filtering and will therefore be used to elucidate the algorithms presented in this article.
Before we can deal with the concrete filter algorithms, we have to lay a theoretical foundation. In this article, we will model the world in such a way that all the changes in the environment take place at discrete, equidistant time steps $t \in \mathbb{N}_0$, where sensor measurements arrive at every time step $t \geq 1$. Modeling uncertainty over continuous time is more difficult, since it involves stochastic differential equations. The discrete-time model can be seen as an approximation to the continuous case. [NORVIG, p. 567]
At each point in time $t$, we can characterize a dynamic system by a state vector $x_t$, which we simply call the state. This state vector contains the so-called state variables that are necessary to describe the system. We assume that it contains the same state variables at each time step. We define the so-called state space $dom(x_t)$ as the set of all the possible values that $x_t$ might take. If we consider a moving robot on a plane, the state could be $x_t = (X_t, Y_t, \dot{X}_t, \dot{Y}_t)$ where $X_t$ and $Y_t$ refer to the robot’s current position and $\dot{X}_t$ and $\dot{Y}_t$ to its movement speed in the X and Y direction, respectively. In this case, the state space would be $dom(x_t) = \mathbb{R}^4$.
For each environment, there are virtually infinitely many possible state vectors, where additional state variables generally make the description of the environment more precise, with the downside of increasing the computational complexity of maintaining a belief. For example, if we consider the robot on a plane again, we could include the wind direction and force in the state vector to account for variations in the robot’s movement that are caused by the wind.
A state is called complete if it includes all the information that is necessary to predict the future of the system. In realistic examples, the state is usually incomplete. For example, if we assume that there are human beings interfering with the robot on the plane, then the state would have to include data that makes it possible to predict their decisions, which is practically impossible. Even in situations where we could in principle include all the influencing factors in the state, it is still often preferable not to include them to reduce computational complexity. In practice, the algorithms described in this article have turned out to be robust to incomplete states. A rule of thumb is to include enough state variables to make unmodeled effects approximately random. [THRUN, p. 33]
As alluded to in the introduction, the state $x_t$ is usually unobservable, which means that we cannot measure it directly. Instead, we have sensors that generate a measurement $e_t$ at each time step $t \geq 1$, which is a vector of arbitrary dimension. This measurement vector contains noisy sensor measurements that are caused by the state. In our modeling, $e_t$ always contains the same measurement variables. If we have a GPS sensor, then this measurement vector could consist of the measured X and Y coordinates. It is important to realize that these measured coordinates are generally not the same as the actual coordinates. Instead, they are caused by the actual coordinates but underlie a certain measurement noise due to the inaccuracy of GPS.
As we said, the state $x_t$ is unobservable. All we can do is maintain a belief $bel(x_t)$, given the observations. The process of determining the belief from observations is called filtering or state estimation (cf. [NORVIG, p. 570]). In mathematical terms, the belief is a probability distribution over all possible states, conditioned by the observations so far: $bel(x_t) := P(x_t \mid e_{1:t})$, where we use $e_{1:t}$ as a short-hand notation for $(e_1, e_2, …, e_t)$.
We also define $\overline{bel}(x_t) := P(x_t \mid e_{1:t−1})$, which is the projected or predicted belief, i.e. the probability distribution over all the possible states at time $t$, given only past observations.
As we can see, the number of measurements we have to condition by in order to determine the belief increases unboundedly over time. This means that we would have to store all the measurements, which is impossible with a limited memory. Additionally, the time needed to compute the belief would increase unboundedly, since we have to consider all the measurements so far. If we want a computationally tractable method for calculating the belief at arbitrary points in time, we have to find a function $f$ such that $bel(x_{t+1}) = f(bel(x_t), e_{t+1})$. This means that in order to calculate the belief at a certain time step, we take the belief of the previous time step, project it to the new time step and then update it in accordance with new evidence. Such a method is called recursive estimation (cf. [NORVIG, p. 571]). The Bayes Filter is an algorithm for doing this. But before we can formulate the algorithm and prove its correctness, we have to specify how the world evolves over time and how we interpret sensor input. Also, as we will see in the next sections, we have to make some assumptions about the system in order to arrive at a recursive formulation.
As stated in the introduction, realistic environments are non-deterministic but stochastic – given a state $x_t$, we cannot tell what the state $x_{t+1}$ will be. Regardless of that, we can tell how likely each of the possible states $x_{t+1}$ is, given the state $x_t$. In mathematical terms, we can specify the conditional probability distribution $P(x_{t+1} \mid x_t)$. We call this distribution the transition model, since it is a model of how the environment transitions from one time step to the next.
Analogously, due to the partial observability of the environment (in particular, the inaccuracy of the sensors), we cannot tell which state causes exactly which sensor measurement, since there is always some measurement noise. However, we can tell how likely each possible sensor measurement $e_t$ is, given the state $x_t$. In mathematical terms, we can specify $P(e_t \mid x_t)$, which we call the sensor model. Given a sensor measurement $e_t$, it tells us how likely each state is to cause this measurement.
We will see examples for transition and sensor models in the following sections.
In order to be able to arrive at a recursive formula for maintaining the belief $bel(x_t)$, we have to make so-called Markov assumptions about both the transition model and the sensor model. We will see in the next section that these two assumptions allow us to arrive at a method to calculate the belief recursively.
For the transition model, the Markov assumption states that, given the state $x_t$, all states $x_{t+j}$ with $j \geq 1$ are conditionally independent of $x_{0:t−1}$ (cf. [DEGROOT, pp. 188, 189]). This gives us $P(x_{t+1} \mid x_{0:t}) = P(x_{t+1} \mid x_t)$. Intuitively speaking, this assumption means that if we know the state at a certain point in time, then no previous states give us additional knowledge about the future.
We also make a sensor Markov assumption as follows: $P(e_{t+1} \mid x_{t+1}, e_{1:t}) = P(e_{t+1} \mid x_{t+1})$. This means that if we know the state $x_{t+1}$, then no sensor measurements from the past tell us anything more about the probabilities of each possible sensor measurement $e_{t+1}$.
As we stated in section 3.2, we want a method to calculate $bel(x_{t+1})$ from $bel(x_t)$ and $e_{t+1}$. We can do this in two consecutive steps. First, we calculate the projected belief $\overline{bel}(x_{t+1})$ from $bel(x_t)$. This step is usually called projection: we project the belief of the previous time step to the current time step. We can do this in the following way (a proof for this statement can be found in [NORVIG, p. 572]):
$$
\overline{bel}(x_{t+1}) = \int P(x_{t+1} \mid x_t) \, bel(x_t) \, dx_t
$$
The process of calculating $bel(x_{t+1})$ from $\overline{bel}(x_{t+1})$ is called update: We update the projected belief with the new evidence $e_{t+1}$. This can be done as follows:
$$
bel(x_{t+1}) = \eta P(e_{t+1} \mid x_{t+1}) \overline{bel}(x_{t+1})
$$
In this formula, $P(e_{t+1} \mid x_{t+1})$ can be obtained from the sensor model. $\eta$ has the function of a normalizing constant. This means that we do not need to calculate it directly from its definition. In the discrete case, it follows from the fact that the probabilities need to sum up to 1. In the continuous case, it follows from the fact that the probability density function needs to integrate to 1 (cf. [DEGROOT, p. 105]).
For the recursive formulation to work, we need a prior belief $bel(x_0)$. Most commonly, we have no knowledge beforehand, in which case we should assign equal probabilities to each possible state. If we know the state at the beginning and need to keep track of it, we should use a point mass distribution. If we only have partial knowledge, we could use some other distribution.
The Bayes filter algorithm for calculating $bel(x_{t+1})$ from $bel(x_t)$ and $e_{t+1}$ can now be formulated as follows (cf. [THRUN, p. 27]):
Continuous Bayes Filter
- $\overline{bel}(x_{t+1}) = \int P(x_{t+1} \mid x_t) \, bel(x_t) \, dx_t$
- $bel(x_{t+1}) = \eta P(e_{t+1} \mid x_{t+1}) \overline{bel}(x_{t+1})$
Under the assumption that $bel(x_0)$ has been initialized correctly, the correctness of this algorithm follows by induction, since we already showed that $bel(x_{t+1})$ is correctly calculated from $bel(x_t)$.
In principle, we now have a method to calculate the belief at each time step. The question arises, however, how we should represent the belief distribution. For finite state spaces, we can simply replace the integral with a sum over all possible $x_t$ and represent the belief as a finite table. We call this modified version the Discrete Bayes Filter (cf. [THRUN, pp. 86, 87]). We will see a concrete example for the discrete Bayes Filter in the next section.
Discrete Bayes Filter
- $\overline{bel}(x_{t+1}) = \sum_{x_t} P(x_{t+1} \mid x_t) \, bel(x_t)$
- $bel(x_{t+1}) = \eta \, P(e_{t+1} \mid x_{t+1}) \, \overline{bel}(x_{t+1})$
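As a minimal sketch (the function name and the toy numbers below are my own, for illustration only), the two steps of the discrete Bayes filter translate almost directly into code: the projection is a matrix-vector product with the transition model, and the update is an element-wise product with the sensor likelihood followed by normalization:

```python
import numpy as np

def discrete_bayes_filter_step(bel, T, likelihood):
    """One recursive step: compute bel(x_{t+1}) from bel(x_t) and e_{t+1}.

    bel        -- belief over the N discrete states, shape (N,)
    T          -- transition model, T[i, j] = P(x_{t+1} = j | x_t = i)
    likelihood -- sensor model evaluated at the new measurement e_{t+1},
                  likelihood[j] = P(e_{t+1} | x_{t+1} = j)
    """
    bel_bar = bel @ T                # projection: the sum over x_t
    bel_new = likelihood * bel_bar   # update with the new evidence
    return bel_new / bel_new.sum()   # eta: make the probabilities sum to 1

# Toy usage with made-up numbers: two states with a slightly sticky transition
T = np.array([[0.8, 0.2],
              [0.3, 0.7]])
likelihood = np.array([0.6, 0.2])  # P(e | x) for the measurement we received
bel = discrete_bayes_filter_step(np.array([0.5, 0.5]), T, likelihood)
```

Note that the normalizing constant $\eta$ never has to be computed from its definition: dividing by the sum of the unnormalized belief is enough.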
It becomes more difficult if we consider continuous state spaces. In this case, the belief becomes a probability density function (from now on abbreviated p.d.f.) over all possible states. The general way to represent such a function is by a symbolic formula. The problem is that an exact representation of a formula for the belief function could, in the general case, grow without bounds over time (cf. [NORVIG, p. 585]). Additionally, the integration step becomes more and more complex, and some p.d.f.s have no closed-form integral at all. We are going to see three different solutions to this problem, all of which introduce a different way of representing the belief distribution: the Histogram Filter, the Kalman Filter and the Particle Filter.
Continue with Part II: The Histogram Filter.
[NORVIG] Peter Norvig, Stuart Russell (2010) Artificial Intelligence – A Modern Approach. 3rd edition, Prentice Hall International
[THRUN] Sebastian Thrun, Wolfram Burgard, Dieter Fox (2005) Probabilistic Robotics
[NEGENBORN] Rudy Negenborn (2003) Robot Localization and Kalman Filters
[DEGROOT] Morris DeGroot, Mark Schervish (2012) Probability and Statistics. 4th edition, Addison-Wesley
[BESSIERE] Pierre Bessière, Christian Laugier, Roland Siegwart (2008) Probabilistic Reasoning and Decision Making in Sensory-Motor Systems
The post Why the Chinese Room Argument is Flawed appeared first on deep ideas.
In his essay Can Computers Think? [11], Searle gives his own definition of strong artificial intelligence, which he subsequently tries to refute. His definition is as follows:
One could summarise this view […] by saying that the mind is to the brain, as the program is to the computer hardware.
Searle’s first attempt at refuting the possibility of strong artificial intelligence is based on the insight that mental states have, by definition, a certain semantic content or meaning. Programs, on the other hand, are purely formal and syntactical, i.e. a sequence of symbols that do not have a meaning in themselves. Therefore, a program could not be equivalent to a mind. A formal reconstruction of this argument looks as follows:
Searle emphasizes the fact that his argument is based solely on the property that programs are defined formally, regardless of which physical system is used to run the program. Therefore, it does not state that it is impossible for us today to create a strong artificial intelligence, but that this is generally impossible for any conceivable machine in the future, regardless of how fast it is or which other properties it might have.
In order to make his first premise more plausible (“Syntax is not sufficient for semantics”), Searle describes a thought experiment – the Chinese Room. Assume there were a program that is capable of answering Chinese questions in Chinese. No matter which question you pose in Chinese, it gives you an appropriate answer that a human Chinese speaker might also give. Searle now tries to argue that a computer running this program doesn’t actually understand Chinese in the same sense as a Chinese human being understands Chinese.
To this end, he assumes that the formal instructions of the program are carried out by a person who does not understand Chinese. This person is locked in a room, and the Chinese questions are passed into the room as a sequence of symbols. The room contains baskets with many other Chinese symbols, along with a list of formal instructions – purely syntactical rules that tell the person how to produce an answer to the question by assembling the symbols from the baskets. The answers generated by these instructions are then passed out of the room by the person. The person is not aware that the symbols passed into the room are questions and the symbols passed out of the room are answers to these questions. He just carries out the instructions strictly and correctly. And these instructions generate meaningful Chinese sentences as answers to the questions, sentences that could not be distinguished from the answers a real Chinese-speaking person would give.
Now Searle draws attention to the fact that the person in the room doesn’t understand Chinese simply by following formal instructions for generating answers. He continues to argue that a computer running a program that generates Chinese answers to Chinese questions therefore also doesn’t understand Chinese. Since this experiment could be generalized to arbitrary tasks, Searle concludes that computers are inherently incapable of understanding something.
There are numerous objections to the Chinese Room argument by various authors. Many of these arguments are similar in nature. In the following, I will present the most commonly presented ones, including answers to these objections by Searle himself.
One of the most commonly raised objections is that even though the person in the Chinese Room does not understand Chinese, the system as a whole does – the room with all its constituents, including the person. This objection is often called the Systems Reply and there are various versions of it.
For example, artificial intelligence researcher, entrepreneur and author Ray Kurzweil says in [5] that the person is only an executive unit and that its properties are not to be confused with the properties of the system. If one looks at the room as an overall system, the fact that the person does not understand Chinese doesn’t entail that this also holds for the room.
Cognitive scientist Margaret Boden argues in [1] that the human brain is not the carrier of intelligence, but rather that it causes intelligence. Analogously, the person in the room causes an understanding of Chinese to arise, even though it does not understand Chinese itself.
Searle responds to the Systems Reply with the semantic argument: Even the system as a whole couldn’t go from syntax to semantics and, hence, couldn’t understand the meaning of the Chinese symbols. In [9], he adds that the person in the room could theoretically memorize all the formal rules and perform all the computations in its head. Then, he argues, the person is the entire system, could answer Chinese questions without help and perhaps even lead Chinese conversations, but still wouldn’t understand Chinese since it only carries out formal rules and can’t associate a meaning with the formal symbols.
Similar to the Systems Reply, the Virtual Mind Reply states that the person does not understand Chinese, but that a running system could create new entities that differ from both the person and the system as a whole. The understanding of Chinese could be a new entity of this sort. This standpoint is argued for by artificial intelligence researcher Marvin Minsky in [15] and philosopher Tim Maudlin in [6]. Maudlin notes that Searle has so far not provided an adequate answer to this reply.
Another reply changes the thought experiment in such a way that the program is put into a robot that can perceive the world through sensors (like cameras or microphones) and interact with the world via effectors (like motors or loudspeakers). This causal interaction with the environment, the argument goes, is a guarantee that the robot understands Chinese, since the formal symbols are endowed with semantics this way – namely objects in the real world. This view presupposes an externalist semantics. This reply is raised, for example, by Margaret Boden in [1].
Searle responds to this argument in [17] with the semantic argument: The robot still only has a computer as its brain and couldn’t go from syntax to semantics. He makes this more plausible by adapting the thought experiment such that the Chinese Room itself is integrated into a robot as its central processing unit. The Chinese symbols would then be generated by sensors and passed into the room. Analogously, the symbols passed out of the room would control the effectors. Even though the robot interacts with the external world this way, the person in the room still doesn’t understand the meaning of the symbols.
Some authors, e.g. philosophers Patricia and Paul Churchland in [2], suggest that one should imagine that instead of manipulating the Chinese symbols, a computer should simulate the neuronal firings in the brain of a Chinese person. Since the computer operates in exactly the same way as a brain, the argument goes, it must understand Chinese.
Searle responds to this argument in [10]. He argues that one could also simulate the neuronal structures by a system of water pipes and valves and put it into the Chinese Room. The person in the room then has instructions on how to guide the water through the pipes in order to simulate the brain of a Chinese person. Still, he says, no understanding of Chinese is generated.
Now I present my own reply, which I call the Emergence Reply.
I grant that Searle’s arguments prove that a mind cannot be equated with a computer program. This is immediately obvious from the semantic argument: since a mind has properties that a program does not have (namely semantic content), a program cannot be equal to a mind. Hence, the argument refutes the possibility of strong artificial intelligence by Searle’s own definition.
However, one can phrase another definition of strong artificial intelligence which, as I will argue, is not affected by Searle’s arguments:
A system exhibits strong artificial intelligence if it can create a mind as an emergent phenomenon by running a program.
I explicitly include any type of system, regardless of the material from which it is made – be it a computer, a Chinese Room or a gigantic hall of falling dominos or beer cans that simulate a Turing machine.
I will not try to argue for the possibility of strong artificial intelligence according to this definition. It is doubtful whether this is even possible. However, I will argue why this definition is not affected by Searle’s arguments.
In my proposed definition, no analogy between the program and the mind created by the program is demanded. Therefore, the semantic argument becomes obsolete: Even though a program as a syntactical construct doesn’t create semantics (and therefore couldn’t be equal to a mind), it doesn’t follow that a program can’t create semantic contents in the course of its execution.
Moreover, this definition doesn’t state that the computer hardware is the carrier of the mental processes. The hardware is not enabled to think this way. Rather, the computer creates the mental processes as an emergent phenomenon, similarly to how the brain creates mental processes as an emergent phenomenon. So, if one considers the question in the title of Searle’s original essay “Can Computers Think?”, the answer would be “No, but they might create thinking.”
How a mind can be created through the execution of a program, and what sort of ontological existence this mind would have, is a discussion topic of its own. In order to make this more plausible, imagine a program that exactly simulates the trajectories and interactions of the elementary particles in the brain of a Chinese speaker. This way, the program does not only create the same outputs for the same inputs as the Chinese speaker’s brain, but proceeds completely analogously. There is no immediate way to exclude the possibility that the simulated brain creates a mind in exactly the same way as a real brain does. The only assumption here is that the physical processes in a brain are deterministic. There are some theories claiming that a mind requires non-deterministic quantum phenomena that can’t be simulated algorithmically. One such theory is presented by physicist Sir Roger Penrose in [7], who has founded the Penrose Institute to explore this possibility. If such theories turn out to be true, then this would be a strong argument against the possibility of strong artificial intelligence.
As regards the Chinese Room Argument, it convincingly shows that the fact that a system gives the impression of understanding something doesn’t entail that it really understands it. Not every program that the person in the Chinese Room could execute in order to converse in Chinese does in fact create understanding. This is an important insight that refutes some common misconceptions, such as the belief that IBM’s Deep Blue understands chess in the same way as a human does, or that Apple’s Siri understands spoken language. Deep Blue just calculates the payoff of certain moves, and Siri just transcribes one sequence of numbers into another (albeit in a sophisticated way). This definitely doesn’t create understanding or a mind.
Moreover, the Chinese Room Argument shows that the Turing Test is not a reliable indicator of strong artificial intelligence. In this test, described by Alan Turing in [12], a human subject converses with an unknown entity and has to decide, solely based on the answers that the entity gives, whether it is talking to another human or a computer. If the computer repeatedly manages to trick the subject, we call it intelligent. This test only measures how good a computer is at giving the impression of being intelligent, without making any restrictions as to how the computer does it internally, which, as we argued already, is an important factor in determining whether a computer really exhibits strong artificial intelligence.
Additionally, Searle's argument shows that it is not the hardware itself that understands Chinese. Even if hardware running a program creates a mind that understands Chinese, the person in the Chinese Room is the hardware, and the person doesn't understand Chinese.
It does not, however, refute the possibility that the hardware can create a mind that understands Chinese by executing the program. Assume there is a program that answers Chinese questions and creates mental processes that exhibit an understanding of the Chinese questions and answers. This assumption cannot be refuted by the Chinese Room Argument. If we let the person in the room execute the program via pen and paper, it is correct that the person doesn't understand Chinese. But the person is only the hardware in this case. The person's mind is not identical to the mind created by the execution of the program.
It might seem intuitively implausible that arithmetical operations carried out with pen and paper could give rise to a mind. But this can be made more plausible by assuming, as before, that the neuronal processes in the brain are simulated in the form of these arithmetical operations. The intuition that a mind could not arise in such a way may simply be false; there is no immediately obvious logical reason to exclude this possibility. The same holds for Searle's system of water pipes, beer-can dominoes and other unorthodox hardware. If one assumes that computer hardware can create a mind, one must grant that this is also possible for other, more exotic mechanical systems.
Whether it is indeed possible to create a mind by the execution of a program is still an open question. Perhaps Roger Penrose will turn out to be right that consciousness is a natural phenomenon that can't be created by the deterministic interaction of particles. Are organisms really just algorithms? How can the parallel firing of tens of billions of neurons give rise to consciousness and a mind? As of now, neuroscience does not have the slightest idea. However, I would say with some certainty that this question cannot be answered by thought experiments alone.
If you liked this article, you may also be interested in my article Gödel’s Incompleteness Theorem And Its Implications For Artificial Intelligence.
[1] Boden, Margaret A.: Escaping from the Chinese Room. University of Sussex, School of Cognitive Sciences, 1987.
[2] Churchland, Paul M. and Patricia Smith Churchland: Could a Machine Think? Machine Intelligence: Perspectives on the Computational Model, 1:102, 2012.
[3] Cole, David: The Chinese Room Argument. In: Zalta, Edward N. (ed.): The Stanford Encyclopedia of Philosophy. Summer 2013. http://plato.stanford.edu/archives/sum2013/entries/chinese-room/.
[4] Dennett, Daniel C.: Fast Thinking. 1987.
[5] Kurzweil, Ray: Locked in his Chinese Room. Are We Spiritual Machines? Ray Kurzweil vs. the Critics of Strong AI, 2002.
[6] Maudlin, Tim: Computation and Consciousness. The Journal of Philosophy, pp. 407–432, 1989.
[7] Penrose, Roger: The Emperor's New Mind. Vintage, London, 1990.
[8] Russell, Stuart Jonathan et al.: Artificial Intelligence: A Modern Approach. Prentice Hall, Englewood Cliffs, 1995.
[9] Searle, John: The Chinese Room Argument. Encyclopedia of Cognitive Science, 2001.
[10] Searle, John R.: Minds, Brains, and Programs. Behavioral and Brain Sciences, 3(03):417–424, 1980.
[11] Searle, John R.: Minds, Brains, and Science. Harvard University Press, 1984.
[12] Turing, Alan M.: Computing Machinery and Intelligence. Mind, pp. 433–460, 1950.
The post Why the Chinese Room Argument is Flawed appeared first on deep ideas.
The post Gödel’s Incompleteness Theorem And Its Implications For Artificial Intelligence appeared first on deep ideas.
This text gives an overview of Gödel’s Incompleteness Theorem and its implications for artificial intelligence. Specifically, we deal with the question of whether Gödel’s Incompleteness Theorem shows that human intelligence could not be recreated by a traditional computer.
Sections 2 and 3 feature an introduction to axiomatic systems, including a brief description of their historical development and thus the background of Gödel’s Theorem. These sections provide the basic knowledge required to fully understand Gödel’s Theorem and its significance for the history of mathematics – a necessary condition for understanding the arguments to follow. Section 4 features a thorough description of Gödel’s Theorem and outlines the basic idea of its proof. Sections 5 and 6 deal with arguments advocating the view that intelligence has a non-algorithmic component on the grounds of Gödel’s Theorem. In addition to a detailed account of the arguments, these sections also feature a selection of prominent objections to these arguments raised by other authors. The last section comprises a discussion of the arguments and my own objections.
At the beginning of the 20th century, the mathematical community suffered from a crisis regarding the very foundations of mathematics, triggered by the discovery of various paradoxes that called into question the reliability of mathematical intuition and the notion of proof. At that time, some fields of mathematics were grounded on a rigorous formal basis, called an axiomatic system (or interchangeably formal system), whereas other fields relied on a certain degree of intuitive insight.
Formally, an axiomatic system is a set of propositions expressed in a formal language, called axioms. These axioms represent statements that are assumed to be true without proof. The set of axioms is equipped with a set of inference rules, which can be used to derive other propositions, called theorems, by applying the rules to the axioms. Applying the rules of inference boils down to replacing expressions by certain other expressions according to precise syntactical rules. The axioms and the inference rules are ideally chosen in such a way that they are intuitively evident. This way, the truth of a complex, non-obvious statement can be accepted by accepting the truth of the axioms and sequentially applying the inference rules until the complex statement in question is deduced.
An early, prominent example of such an axiomatic system is the Euclidean geometry described by the ancient Greek mathematician Euclid in c. 300 BC (an English translation can be found in [Euc02]). It consists of 5 axioms making trivial statements about points, lines and circles (e.g. that any two points can be connected by a line). From these axioms, Euclid derived 48 non-trivial geometric propositions solely by means of logical inference and without making use of informal geometric intuition or perception.
Up until modern times, geometry was the only branch of mathematics that was predicated on such a sound axiomatic basis, whereas research and applications in other branches were carried out without a rigorous formal notion of which types of inference were allowed and which statements were assumed to be intuitively evident. This was because, for most practical purposes, mathematicians saw no need for one. However, this changed with the discovery of various paradoxes around the turn of the 20th century. In 1901, the British mathematician Bertrand Russell put forward what later came to be known as Russell’s paradox (cf. [Gri04]). This paradox showed an inherent flaw in the informal set theory proposed by German mathematician Georg Cantor, according to which every definable collection of distinct elements is a set. Russell defined the set R of all sets that do not contain themselves, symbolically:
$$R = \{x \; \mid \; x \not\in x\}$$
According to Cantor, R is a valid set. The paradox arises when one asks the question whether R contains itself. If R contains itself, then by definition it does not contain itself. If, on the other hand, it does not contain itself then it contains itself by definition. Symbolically:
$$R \in R \; \iff R \not\in R$$
Therefore, the question whether R contains itself has no well-defined answer. This example shows that the notion of a set defined by Cantor is flawed, even though it seems intuitively reasonable. Examples like this led many mathematicians to recognize that intuition is not a safe guide and that there was a need to supply all branches of mathematics with an axiomatic system that would be sufficient to formally derive all true propositions, a standpoint later termed formalism (cf. [NN01] p. 3). Over time, more and more branches, both new and old, were equipped with sets of axioms (e.g. the Zermelo-Fraenkel set theory, cf. [Fra25]).
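Russell's definition can even be mimicked in code. The sketch below is an illustration of my own, not part of the original discussion: it models "x contains itself" as a predicate applied to itself, so that asking whether R contains R chases its own negation forever.

```python
import sys

# Model a "set" as a predicate; membership x ∈ y becomes y(x).
# R = {x | x ∉ x} then becomes: R(x) = not x(x).
def R(x):
    return not x(x)

# Whether R contains itself would be the value of R(R) = not R(R).
# No boolean satisfies b == (not b), and the evaluation never terminates.
sys.setrecursionlimit(1000)
try:
    answer = R(R)
except RecursionError:
    answer = None  # the question has no well-defined answer

print(answer)  # None
```

The non-terminating evaluation is the computational shadow of the paradox: the definition is syntactically admissible, but no truth value is consistent with it.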
It is worth noting that axiomatic systems and formal proofs do not require an intuitive understanding of the entities described or the nature of the proven statements. Consider the following example:
Axiomatic system 1.
1. Every member of P is contained in exactly two members of L.
2. Every member of L contains exactly two members of P.
3. Every two members of L share exactly one member of P.
This axiomatic system makes statements about some abstract sets L and P, and even though we can understand the axioms per se, we do not associate any meaning with the symbols and we do not have any intuition about the overall structure of L and P. Still, we can deduce theorems from these axioms. For example, it can be shown that every three members of L contain exactly three members of P. Even though the axioms were given informally, they can be translated into second-order logic, and the proof of the theorem can be carried out using rules that simply replace certain sequences of symbols with other symbols. This way, the proof could be carried out by a computer by iteratively applying symbol-replacement rules to meaningless sequences of symbols until the theorem is obtained. It is then clear that the theorem follows from the axioms without any intuition as to what the theorem or the axioms actually represent.
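The idea that proofs are mere symbol manipulation can be illustrated with a toy rewriting system. The axioms and the single rule below are a minimal example of my own, not the system from the text: theorems are strings, and the only inference rule turns the strings X and X>Y into the string Y, with no interpretation attached to any symbol.

```python
from itertools import product

# Axioms are uninterpreted strings; ">" is just a character, not "implies".
axioms = {"p", "p>q", "q>r"}

def derive(theorems):
    """Apply the single syntactic rule: from X and X>Y, add Y."""
    new = set(theorems)
    for a, b in product(theorems, repeat=2):
        if b.startswith(a + ">"):
            new.add(b[len(a) + 1:])
    return new

# Iterate until no new strings appear (a fixed point).
theorems = set(axioms)
while (extended := derive(theorems)) != theorems:
    theorems = extended

print(sorted(theorems))  # ['p', 'p>q', 'q', 'q>r', 'r']
```

The machine derives the "theorem" r without any notion of what p, q or r mean, exactly as described above.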
A prominent representative of the formalist standpoint was David Hilbert, who initiated what was later termed Hilbert’s Program (cf. [Zac15]). Hilbert advocated the view that all fields of mathematics should be grounded on an axiomatic basis. Furthermore, he demanded that every such system be proven to be consistent, which means that it is impossible to deduce two contradictory theorems from the axioms.
Proving the inconsistency of an axiomatic system can be done by deducing a contradiction. The question that Hilbert wanted to address, however, was how to prove consistency, i.e. how to prove the impossibility of deducing a contradiction. One way to do so is to find an interpretation of the axioms such that they form true statements about some part of reality or some abstract concept of our intuition. A possible model for axiomatic system 1 is a triangle:
When we interpret the set P as the corners of a triangle and the set L as its edges, then the axioms are invested with meaning and we can verify beyond doubt that all axioms represent true statements about the model by verifying them for each individual element. This can be done easily since there are only finitely many elements. This proves the consistency of the system, because no contradiction can be deduced from true premises.
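Because the model is finite, this verification is literally exhaustive, and it is short enough to state in code (a sketch assuming corners A, B, C as the set P and the three edges as the set L):

```python
from itertools import combinations

P = {"A", "B", "C"}                       # corners of the triangle
L = [{"A", "B"}, {"B", "C"}, {"C", "A"}]  # edges of the triangle

# Axiom 1: every member of P is contained in exactly two members of L.
assert all(sum(p in l for l in L) == 2 for p in P)
# Axiom 2: every member of L contains exactly two members of P.
assert all(len(l) == 2 for l in L)
# Axiom 3: every two members of L share exactly one member of P.
assert all(len(l1 & l2) == 1 for l1, l2 in combinations(L, 2))

print("the triangle satisfies all three axioms")
```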
However, there are axiomatic systems for which the model-based approach to proving their consistency is open to dispute. If, for example, the axioms require the model to contain an infinite number of elements, then it is impossible to verify the truth of the axioms beyond doubt, since the truth can no longer be verified for each individual element. Moreover, the model-based approach actually only reduces the consistency of one system to the consistency of another system. As regards the triangle example, we established the consistency of the axioms by verifying them for the triangle, but in doing so we implicitly assumed the consistency of geometry. Therefore, we have only shown that if geometry is consistent, then our axiomatic system is also consistent; we have given what is called a relative proof of consistency.
Hilbert urged mathematicians to find absolute proofs of consistency, i.e. proofs that establish the consistency of an axiomatic system without presupposing the consistency of another axiomatic system. Absolute proofs of consistency use structural properties of the axioms and inference rules in order to show that no contradictions can be derived; they are not proofs within the formal axiomatic system itself, but rather proofs about the system. They are, so to speak, proofs in some meta-system. To better understand the concept of a meta-system, consider the statement "'$p \vee p \rightarrow p$' is a tautology". This is not a statement within propositional logic, but a statement in some meta-system about propositional logic, and it can be proved within that meta-system.
Absolute proofs of consistency have successfully been established for some axiomatic systems, e.g. propositional logic (cf. [NN01] p. 45). This led Hilbert to believe that such a proof could be found for any consistent axiomatic system, which is where Gödel’s Incompleteness Theorem comes into play: amongst other things, it shows that this is impossible for most axiomatic systems of interest.
First presented in [Göd31], Gödel’s Incompleteness Theorem is actually comprised of two related but distinct theorems, which roughly state the following (cf. [Raa15]):
1. Any consistent formal [axiomatic] system F within which a certain amount of elementary arithmetic can be carried out is incomplete; i.e. there are statements of the language of F which can neither be proved nor disproved in F.
2. For any consistent system F within which a certain amount of elementary arithmetic can be carried out, the consistency of F cannot be proved in F itself.
The first of these two theorems is often referred to simply as Gödel’s Incompleteness Theorem.
Let us elaborate on these statements. The first theorem basically states that all axiomatic systems that are expressive enough to perform elementary arithmetic contain statements that can neither be proved nor disproved within the system itself, i.e. neither the statements nor their negations can be obtained by iteratively applying the inference rules to the axioms.
To say that a system is capable of performing arithmetic means that it either contains the natural numbers along with addition and multiplication, or that natural number arithmetic can be translated into the system such that the system mimics arithmetic in one way or another.
The second theorem states that the question of whether or not an axiomatic system is consistent belongs to those statements that cannot be proved within the system. Note that this does not mean that a proof showing the consistency of the system in question could not be given in some meta-system. However, if the consistency cannot be shown within the system itself, then a proof within the meta-system has to make inferences which cannot be modeled within the system itself. Such methods would then be open to dispute, because the consistency of the meta-system is not established. A proof of its consistency would require us to use even more elaborate methods of proof within some meta-meta-system, resulting in an infinite regress. Therefore, absolute proofs of consistency, as envisioned by Hilbert, cannot be given for axiomatic systems that are capable of doing arithmetic.
The implications of Gödel’s incompleteness theorems came as a shock to the mathematical community. For instance, they imply that there are true statements that can never be proved, so we can never know with certainty whether they are true or whether they will at some point turn out to be false. They also imply that no absolute proofs of consistency can be given. Hence, the entirety of mathematics might be inconsistent, and we cannot know for sure whether at some point a contradiction might occur that renders all of mathematics invalid.
For some of the following arguments, it is necessary to have a rough understanding of the ideas underlying the proof of Gödel’s theorem. At the core of the proof lies a sophisticated method of mapping the symbols of arithmetic (like =, +, ×, …), formulas within arithmetic (like ’∃x(x = y+1)’) and proofs within arithmetic (i.e. sequences of formulas) onto unique natural numbers (called Gödel numbers) in such a way that the original symbol, formula or proof can be reconstructed from that number. This makes it possible to express statements about arithmetic (like ”The first sign of ’∃x(x = y+1)’ is the existential quantifier”) as formulas within arithmetic itself (i.e. by stating that the Gödel number g of ’∃x(x = y + 1)’ has a certain property, expressible as an arithmetical formula F(g), that is possessed only by the Gödel numbers of statements beginning with the existential quantifier), effectively allowing arithmetic to talk about itself.
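A toy version of such a numbering can make the idea concrete. In the sketch below the symbol codes are chosen arbitrarily for illustration (Gödel's actual assignment differs): a formula is encoded as a product of prime powers, so unique factorization lets us recover the formula from its number.

```python
# Assign each symbol a positive code (an illustrative assumption).
CODES = {"0": 1, "=": 2, "+": 3, "s": 4, "(": 5, ")": 6}
SYMS = {code: sym for sym, code in CODES.items()}

def primes():
    """Yield 2, 3, 5, 7, ... by trial division."""
    found, n = [], 2
    while True:
        if all(n % p for p in found):
            found.append(n)
            yield n
        n += 1

def godel_number(formula):
    """Encode the k-th symbol as the k-th prime raised to that symbol's code."""
    g = 1
    for p, sym in zip(primes(), formula):
        g *= p ** CODES[sym]
    return g

def decode(g):
    """Recover the formula from its number via unique prime factorization."""
    out = []
    for p in primes():
        if g == 1:
            break
        e = 0
        while g % p == 0:
            g, e = g // p, e + 1
        out.append(SYMS[e])
    return "".join(out)

n = godel_number("0=0")  # 2^1 * 3^2 * 5^1 = 90
print(n, decode(n))      # 90 0=0
```

The round trip is what matters: because the mapping is invertible, any statement *about* a formula can be rephrased as an arithmetical statement about its number.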
Next, Gödel defined a statement G which states ”G cannot be proved within arithmetic”, and showed how it could be translated into a formula within arithmetic using Gödel numbering (this formula has come to be referred to in the literature as the Gödelian formula). G yields a situation similar to Russell’s paradox: if it could be proved within the system, then it would be false, and hence the system would be inconsistent. Assuming that arithmetic is consistent, it follows that G cannot be proved within arithmetic, and thus that G is true. So G is an example of a formula that is true but cannot be proved in the system, which proves the first Incompleteness Theorem.
Various objections against the possibility of artificial intelligence have been raised on the grounds of Gödel’s incompleteness theorems, which have come to be referred to as Gödelian Arguments. The following sections give an overview of two of the most prominent arguments, along with several objections to these arguments.
An early argument stems from the British philosopher John Lucas, put forward in a scientific paper with the title Minds, Machines and Gödel ([Luc61]). Lucas argues that, by definition, cybernetical machines (which includes computers in particular) are instantiations of a formal system. He backs up his claim by arguing that a machine has only finitely many types of operations it can perform and likewise only a finite number of initial assumptions built into the system. Thus, the initial assumptions could be represented as symbolic axioms within some formal system and the possible types of operations could be represented by formal rules of inference. Hence, every operation that a machine performs could be represented by formulas representing the state of the machine before and after the operation and stating which inference rule was used to get from the first formula to the second. In this manner the entire sequence of operations performed by the machine could be represented as a proof within a formal system and therefore the types of outputs that the machine could produce correspond to the theorems that can be proved within this formal system.
Now, since human minds can do arithmetic, a formal system F that adequately models the mind would also have to be capable of doing arithmetic; hence there are true statements that the machine is incapable of producing, but which the mind can produce. Lucas states that the Gödelian formula G(F) is an example of this: by following Gödel’s proof, the human mind knows that G(F) is true, but, as shown by Gödel, G(F) cannot be proved within the formal system F and consequently cannot be produced by the machine as being true. He concludes that a machine could then not be an adequate model of the mind and that the mind and machines are essentially different, since there exist true statements that the mind can know to be true but a machine cannot.
Following his argument, Lucas addresses some possible objections to his point and tries to refute them.
The first objection addressed by Lucas is that if a formal system F is not capable of constructing G(F), an extended, more adequate machine could be constructed that is indeed capable of producing G(F) and everything that follows from it. But then, he argues, this new machine will correspond to a different formal system F′ with other axioms or other rules of inference and this formal system will again have a Gödelian formula G(F′) that the machine is incapable of producing but the human mind can see to be true. If the machine was again modified to be able to produce G(F′), resulting in a new formal system F′′, then again a new formula G(F′′) could be constructed and so forth, ad infinitum. This way, no matter how many times the machine gets improved, there will always be a formula that it is incapable of producing but the human mind knows to be true.
The second objection he addresses is related: Since the construction of the Gödelian formula G(F) is a mechanizable procedure, the machine could be programmed in such a way that, in addition to its standard operations, it is capable of going through the Gödelian procedure to produce the Gödelian formula G(F) from the rest of the formal system, adding it to the formal system, then going through the procedure again to produce the Gödelian formula G(F′) of the strengthened formal system, adding it again, and so on. This, as Lucas says, would correspond to a formal system that, in addition to its standard axioms, contains an infinite sequence of additional axioms, each one being the Gödelian formula of the system with the axioms that came before. Lucas objects to this argument by referring to a proof given by Gödel in a lecture at the Institute for Advanced Study, Princeton, N.J., U.S.A. in 1934. In this lecture, Gödel showed that even for formal systems that contain such an infinite sequence of Gödelian formulas as axioms, a formula could be constructed that the human mind can see to be true but that cannot be proved within the system. The intuition behind this proof, as Lucas points out, is the fact that the infinite sequence of axioms would have to be specified by some finite procedure, and thus a finite formal system could be constructed that precisely models the infinite formal system.
Lucas also addresses an objection raised by Hartley Rogers in [Rog87]. Rogers claims that a machine modeling a mind should allow for non-inductive inferences. Specifically, he suggests that a machine should maintain a list of propositions that are neither proved nor disproved and occasionally add one of them to its list of axioms. If at some point the inclusion of a proposition leads to a contradiction, it is dropped again. This way, the machine could produce a formula as true even though it could not be proved from its axioms, which would render Lucas’ argument invalid. Lucas replies to this argument by stating that such a machine must choose the formulas it accepts to be true without proof at random, because a deterministic procedure would again make the whole system an axiomatic system for which an unprovable formula could be constructed that we can see to be true. Such a system, Lucas argues, would not be a good model of the human mind, because the formulas it randomly accepts as true could be wrong, even if they are consistent with the axioms.
Rogers also calls attention to the fact that the Gödelian argument is only applicable if we know that the machine is consistent. Human beings, he argues, might be inconsistent, and hence an inconsistent machine could be a model of the human mind. Lucas answers by stating that even though human beings are inconsistent in certain situations, this is not the same type of inconsistency as in a formal system. A formal inconsistency makes it possible to derive every sentence and its negation as true. However, when humans arrive at contradictory conclusions, they do not stick to the contradiction, but rather try to resolve it. In this sense human beings are self-correcting. Lucas continues to argue that a self-correcting machine would still be subject to the Gödelian argument, and that only a fundamentally inconsistent machine, in which every formula is derivable as true, could escape the Gödelian argument.
Lucas concludes his essay by stating that the characteristic attribute of human minds is the ability to step outside the system. Minds, he argues, are not constrained to operate within a single formal system, but rather they can switch between systems, reason about a system, reason about the fact that they reason about a system, etc. Machines, on the other hand, are constrained to operate within a single formal system that they could not escape. Thus, he argues, it is this ability that makes human minds inherently different from machines.
The following objections are not addressed in [Luc61], but have been voiced by other authors in reaction to Lucas’ argument.
In the book Artificial Intelligence: A Modern Approach ([RN03]), Stuart Russell and Peter Norvig, two artificial intelligence researchers, argue that a computer could be programmed to try out an arbitrary number of different formal systems, or even invent new formal systems. This way, the computer could produce the Gödel sentence of one system S by switching to another, more powerful system T and carrying out the proof of S’s Gödel sentence in T.
Further, they try to reduce Lucas’ argument to absurdity by pointing out that the brain is a deterministic physical device operating according to physical laws and in consequence also constitutes a formal system. Therefore, they argue, Lucas’ argument could be used to show that human minds could not simulate human minds, which is a contradiction. Thus, they conclude that Lucas’ argument must be flawed.
Paul Benacerraf presents an objection to Lucas’ argument in [Ben67]. He draws attention to the fact that in order to produce the Gödel sentence of a formal system, one must have a profound understanding of the system’s axioms and inference rules. Constructing the Gödel sentence for arithmetic might be simple, but Benacerraf claims that if the human mind could be simulated by a formal system, then this formal system would be so complex that a human being could never understand it to the extent of being able to construct its Gödel sentence. Therefore, Benacerraf concludes that Lucas’ argument does not actually prove that the human mind could not be simulated by a formal system, but rather proves a disjunction: either the human mind could not be simulated by a formal system, or such a formal system would be so complex that a human being could not fully understand it.
In his book Gödel, Escher, Bach [Hof79], physicist and cognitive scientist Douglas Hofstadter builds on the objections described above. As demonstrated in Lucas’ refutation of the Extended Machine Objection, adding a procedure to produce the Gödel formula G(F) does not refute his argument, since this corresponds to a new system F′ with a new Gödel formula G(F′) that it is unable to produce. No matter how many times the capability of producing the Gödel formula of the system obtained so far is added, the resulting system will always have a new Gödel formula G(F′…′) that it is unable to produce. Hofstadter argues that if this process of adding the capability to produce the Gödel formula is carried out sequentially, the resulting system becomes more and more complex with every step. He claims that at some point the system is so complex that human beings would be unable to produce its Gödel formula. At this point, he concludes, neither the system F′…′ nor the human being that the system models can produce the Gödel formula, and therefore the human being does not have more power than the system.
Hofstadter takes the view that a program that models human thought needs to be able to switch between systems in an arbitrary fashion. Rather than being constrained to operating within a certain system, it must always be able to jump out of the current system into a meta-system, eventually allowing the system to reflect about itself, to reflect about the fact that it reflects about itself, and so forth. This, he argues, would require the program to be able to understand and modify its own source code.
An argument similar to Lucas’ argument has been proposed by Roger Penrose in a book with the title The Emperor’s New Mind ([Pen89]), in which Penrose claims to overcome the objections that were raised against Lucas’ argument. His argument was later refined and extended in his book Shadows of the Mind ([Pen94]). Here, we shall deal with this refined version of the argument.
Penrose attempts to show that mathematical insight cannot be simulated algorithmically by deriving a contradiction from the assumption that it can. He defines mathematical insight as the means by which mathematicians generate mathematical propositions and their proofs and are able to follow and understand each other’s proofs.
Penrose’s argument can be reconstructed as follows:
1. Assume (for the sake of contradiction) that there is some formal system F that captures the thought processes required for mathematical insight.
2. Then, according to Gödel’s theorem, F cannot prove its own consistency.
3. We, as human beings, can see that F is consistent.
4. Therefore, since F captures our reasoning, F could prove that F is consistent.
5. This is a contradiction and, therefore, such a system F could not exist.
The strong assumption in this otherwise logically sound argument is 3, that we can see that F is consistent. Penrose argues for 3 in two different ways (labeled 3a and 3b in the following).
3a.1: We, as human beings, know that we are consistent.
3a.2: Therefore, if we know that F captures our reasoning, we know that F is consistent.
Penrose recognizes that this argument rests on the assumption that we could know that F captures our reasoning. He therefore extends his argument to cover the case where we do not know that F captures our reasoning, but can still see that F is consistent.
3b.1: By definition, F consists of a set of axioms and inference rules.
3b.2: Each individual axiom could be verified by us, since if F can see that it is true, then so can we.
3b.3: Furthermore, the validity of the inference rules could also be verified by us, since it would be implausible to believe that human reasoning relies on dubious inference rules.
3b.4: Therefore, since we know that the axioms are true and that the inference rules are valid, we know that F is consistent.
Later in his book, Penrose addresses the question why his argument is not applicable to human brains. To this end, he presents four possible views on the question of how human consciousness and reasoning comes into existence (cf. [Pen94]):
A: All thinking is computation; in particular, feelings of conscious awareness are evoked merely by the carrying out of appropriate computations.
B: Awareness is a feature of the brain’s physical action; and whereas any physical action can be simulated computationally, computational simulation cannot by itself evoke awareness.
C: Appropriate physical action evokes awareness, but this physical action cannot even be properly simulated computationally.
D: Awareness cannot be explained by physical, computational, or any other scientific terms.
Penrose himself embraces position (C). He points out that all the physical laws presently known to us are algorithmic in nature, and therefore argues that there must be non-algorithmic physical phenomena yet to be discovered. He hypothesizes that these phenomena are based on the interaction between quantum mechanics and general relativity.
In [McC95], Daryl McCullough points out a few loose ends in Penrose’s argument. For instance, he notes that there is an ambiguity in the definition of F, and that there are actually three different ways to interpret F:
1. F represents the mathematician’s inherent reasoning ability.
2. F represents a snapshot of the mathematician’s brain at a certain point in time, such that it includes both his inherent reasoning ability and the empirical knowledge that the mathematician acquired during his lifetime.
3. F represents the maximum of what could ever be known by the mathematician through reasoning and empirical knowledge.
McCullough argues that this distinction becomes important when dealing with the question of whether the mathematician could know that his reasoning powers are captured by F. If the mathematician learns this fact empirically, then this knowledge is not reflected by F, and therefore Penrose's original argument would be invalid. However, he acknowledges that an argument analogous to Penrose's goes through for an extended system F′, which is F extended by the axiom that one's reasoning powers are captured by F.
McCullough also addresses the fact that Penrose's argument rests on the assumption that human reasoning is consistent and that human beings can be sure of their own consistency. He argues that this assumption is not beyond doubt and presents a thought experiment to show how inconsistencies could turn up even during careful and justified reasoning. He proposes to imagine an interrogator asking questions that can be answered with yes or no, and an experimental subject who answers by pressing a 'yes' button or a 'no' button. If the interrogator asks "Will you push the 'no' button?", this question cannot be answered truthfully: the subject knows that the true answer is 'no', but he cannot communicate this answer by pressing the 'no' button. McCullough then extends the thought experiment by assuming that the subject's brain is attached to a device that can read whether the subject's mind is in a 'yes' or 'no' state of belief and correspondingly flashes a light labeled 'yes' or 'no'. If the interrogator now asks "Will the 'no' light flash?", the subject has no way of holding a belief without communicating it. If the subject's beliefs are consistent, the answer to the question is 'no', but the subject cannot correctly believe the answer to be 'no', and therefore he cannot correctly believe that he is consistent. Thus, no matter how much careful thought humans give to producing their answers, and no matter how intelligent they are, they cannot be sure of their own consistency.
McCullough concludes that the only indisputable logical consequence of Penrose's argument is that if human reasoning can be captured by a formal system F, then there is no way to be certain that F is consistent. This, he argues, is not a limitation on what formal systems can achieve in comparison to human beings, but rather a general insight about a limitation in one's ability to reason about one's own reasoning process.
Another answer to Penrose's argument has been provided by the Australian philosopher David Chalmers in [Cha95]. Chalmers argues that it is inadequate to assume that a computational procedure simulating the human mind would consist of a set of axioms and inference rules. He claims that even in today's AI research there are examples of computational procedures that are not decomposable into axioms and rules of inference, e.g. neural networks. Chalmers acknowledges that, according to a theorem by William Craig (cf. [Cra53]), for every algorithm we can find an axiomatic system that produces the same output. But this system would be rather complex, casting doubt on Penrose's assumption that its inference rules could be verified by human thought. Thus, Chalmers concludes that Penrose's argument applies only to rule-based systems (like automatic theorem provers), but not to all computational procedures.
Apart from this, Chalmers also claims that the assumption that we are knowably consistent already leads to a contradiction in itself. He attempts to prove formally that any system that knows about its own consistency is inconsistent. To this end, he introduces the symbol B to represent the system's belief, where B(n) corresponds to the statement that the system believes the statement with Gödel number n to be true (below, B(A) abbreviates B applied to the Gödel number of A). Further, he introduces ⊢ A to denote that the system knows A. Now he makes the following assumptions:
1. If the system knows A, then it knows that it believes that A is true.
Formally: If ⊢ A then ⊢ B(A).
2. The system knows that it can use modus ponens in its reasoning.
Formally: ⊢ B(A1) ∧ B(A1 → A2) → B(A2)
3. The system knows the fact described in 1.
Formally: ⊢ B(A) → B(B(A))
4. The system is capable of doing arithmetic.
5. The system knows that it is consistent.
Formally: ⊢ ¬B(false)
From these assumptions, the first four of which he deems to be reasonable, Chalmers formally proves, by making use of Gödel's theorem, that the system must be inconsistent. Thus, assumption 5, that the system knows that it is consistent, cannot hold for any system fulfilling premises 1 through 4. He therefore concludes that the contradiction arising in Penrose's argument is due to the false assumption that humans are knowably consistent, rather than the allegedly false assumption that human thought cannot be captured by a formal system.
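Chalmers' proof follows the pattern of Gödel's second incompleteness theorem. The following is a reconstruction of its key steps from assumptions 1–5 (Chalmers' own presentation may differ in detail); the diagonal lemma, available by assumption 4, supplies the self-referential sentence:

```latex
\begin{align*}
&\text{By the diagonal lemma, fix } G \text{ with } \vdash G \leftrightarrow \neg B(G).\\
&\vdash B(G) \to B(\neg B(G)) && \text{by 1 and 2, since } \vdash G \to \neg B(G)\\
&\vdash B(G) \to B(B(G)) && \text{by 3}\\
&\vdash B(G) \to B(\mathrm{false}) && \text{by 1 and 2, since } B(G) \wedge \neg B(G) \to \mathrm{false}\\
&\vdash \neg B(G) && \text{by 5: } \vdash \neg B(\mathrm{false})\\
&\vdash G,\ \text{hence } \vdash B(G) && \text{by the diagonal property and 1}\\
&\text{So } \vdash B(G) \text{ and } \vdash \neg B(G)\text{: the system is inconsistent.}
\end{align*}
```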
In this text, we have learned about axiomatic systems and their historical background. We have learned about the mathematicians’ endeavour to formalize mathematics and prove its consistency and we have seen how Gödel’s theorem implied that this is impossible and that there are true statements whose truth cannot be proved. We have seen how Lucas and Penrose argued that it is impossible to capture human thought by means of an axiomatic system on the grounds of Gödel’s theorem, and that therefore artificial intelligence is impossible. We have also dealt with objections to these arguments by other authors and thus we have seen that the Gödelian arguments are not generally accepted.
From my point of view, Lucas' argument seems rather unconvincing. If a formal system F captures human thought, I agree with Benacerraf that the system would be so complex that it is highly doubtful that a human being could see the truth of the Gödel sentence G(F). But even if this were possible, it would not show that humans are different from machines and can do something that machines cannot. As shown in [Amm97], it is indeed possible to prove a Gödel sentence algorithmically, as long as the system performing the proof is not the system whose Gödel sentence is proven. Thus, if a human being can see the truth of the Gödel sentence G(F), this can be viewed as analogous to some formal system F′ proving the Gödel sentence G(F), which is indeed possible. Therefore, in seeing the truth of the Gödel sentence G(F), the human mind does not do anything that is generally impossible for machines. Lucas' argument would only go through if he could show that a human being is able to see the truth of his own Gödel sentence, because this is what machines cannot do. It is questionable, however, what the Gödel sentence of a human being even means if we do not regard human beings as formal systems in the first place.
As for Penrose's argument, I am not convinced by premise 3, that a human being could see that a formal system F capturing his thought processes is consistent. In his first argument, 3a, Penrose argues that we as human beings know that we are consistent and that therefore, since F captures our reasoning, we know that F is consistent. I think that this argument is based on an ambiguity in the definition of consistency. When talking about formal systems, consistency has a very clear definition: it means that no contradictory theorems can be derived within the system. However, it is unclear what it means for a human being to be consistent. Since Penrose does not regard human beings as formal systems, we cannot apply the same definition of consistency.
A reasonable definition would be that a human being is consistent if and only if he does not believe in two contradictory sentences, which is most likely what Penrose means when talking about the consistency of a human being. This is equivalent to saying that the human’s belief system is a consistent system. A human’s belief system is the formal system whose theorems correspond to the sentences that the human believes to be true. Its axioms are what the human originally believes to be true, and its inference rules are the logical inferences that the human makes in order to deduce new beliefs. But a human’s belief system is not equivalent to the human’s mind. Rather, the belief system is one of the results of human thought – a model that humans use to judge the truth of statements. Therefore, if our belief system is consistent, and we know that our belief system is consistent, this does not mean that we know that a formal system simulating our mind (including but not limited to our belief system) is also consistent. This renders the argument 3a invalid.
Apart from this, it is doubtful whether a human's belief system is necessarily consistent in the first place. Consistent belief systems are usually held only by careful thinkers who are acquainted with the rules of logic, and there are undoubtedly many humans whose belief systems are inconsistent as a result of logical fallacies. Therefore, Penrose's argument could show at best that the minds of careful thinkers cannot be simulated computationally, whereas it does not apply to the minds of less careful thinkers with inconsistent belief systems.
In 3b, Penrose argues for 3 by stating that humans could verify the consistency of F by verifying its axioms and inference rules. Here, Penrose makes the implicit assumption that the theorems of F correspond to the statements that the simulated human mind believes to be true; this becomes evident in the way he argues for 3b.2 and 3b.3. So again Penrose fails to distinguish between the human mind and the human's belief system, making 3b invalid for the same reason as 3a. A formal system might be able to simulate human thought without any obvious relation between the system's theorems and the statements that the resulting mind believes to be true. Thus, Penrose's argument shows at best that a formal system that captures human thought could not correspond to the human's belief system, since this would lead to a contradiction. It does nothing, however, to show that human thought could not be simulated by a formal system whose theorems do not correspond to the statements that the resulting mind believes to be true.
To this day, Gödelian arguments continue to be debated, and there is no consensus among philosophers and researchers as to whether they are sound. To the best of my knowledge, Lucas and Penrose have not explicitly addressed the objections described above in their publications; still, they have not backed down from their arguments. There are many more objections to their arguments in the literature, but listing them all would be beyond the scope of this text. Lucas maintains a list of criticisms of the Gödelian argument on his website at http://users.ox.ac.uk/~jrlucas/Godel/referenc.html, referencing 78 sources as of August 2017.
If you have your own ideas on the matter, please leave a comment. If you'd like to read more about the possibility of strong artificial intelligence, read my article "Can Computers Think?" – "No, but…". If you'd like to stay informed about more blog posts on artificial intelligence and the philosophy of mind, subscribe to deep ideas by email.
[Amm97] Kurt Ammon. An automatic proof of Gödel’s incompleteness theorem. Artificial Intelligence, 95, 1997. Elsevier.
[Ben67] Paul Benacerraf. God, the devil, and Gödel. The Monist, 51, 1967. Oxford University Press.
[Cha95] David J. Chalmers. Minds, machines, and mathematics. Psyche, 2, 1995. Hindawi Publishing Corporation.
[Cra53] William Craig. On axiomatizability within a system. The Journal of Symbolic Logic, 18, 1953. Association for Symbolic Logic.
[Euc02] Euclid. Euclid’s Elements. 2002. Green Lion Press.
[Fra25] Adolf Fraenkel. Untersuchungen über die Grundlagen der Mengenlehre. Mathematische Zeitschrift, 22, 1925. Springer.
[Göd31] Kurt Gödel. Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I. Monatshefte für Mathematik und Physik, 38, 1931. Springer.
[Gri04] Nicholas Griffin. The prehistory of Russell’s paradox. In One hundred years of Russell’s paradox. 2004. de Gruyter.
[Hof79] Douglas R. Hofstadter. Gödel, Escher, Bach: An Eternal Golden Braid. 1979. Basic Books, Inc.
[Luc61] John Lucas. Minds, machines and Gödel. Philosophy, 36, 1961. Cambridge University Press.
[McC95] Daryl McCullough. Can humans escape Gödel? Psyche, 2, 1995. Hindawi Publishing Corporation.
[NN01] Ernest Nagel and James R. Newman. Gödel’s proof. 2001. New York University Press.
[Pen89] Roger Penrose. The Emperor’s New Mind. 1989. Oxford University Press.
[Pen94] Roger Penrose. Shadows Of The Mind. 1994. Oxford University Press.
[Raa15] Panu Raatikainen. Gödel’s Incompleteness Theorems. In The Stanford Encyclopedia of Philosophy. Spring 2015 edition, 2015.
[RN03] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. 2003. Pearson Education.
[Rog87] Hartley Rogers, Jr. Theory of recursive functions and effective computability. 1987. MIT Press.
[Zac15] Richard Zach. Hilbert’s Program. In The Stanford Encyclopedia of Philosophy. Summer 2015 edition, 2015.
The post Gödel’s Incompleteness Theorem And Its Implications For Artificial Intelligence appeared first on deep ideas.