Curse-of-Dimensionality
q        4     4     6     6     10    10     20     20     20
N        100   1000  100   1000  1000  10000  10000  10^6   10^10
d(q,N)   0.42  0.23  0.71  0.48  0.91  0.72   1.51   1.20   0.76
• Random sample of size N from a uniform distribution in the q-dimensional unit hypercube
• Diameter of a neighborhood (K = 1) using Euclidean distance: $d(q,N) = O\!\left(N^{-1/q}\right)$
As dimensionality increases, the distance from the closest point increases faster.
Large $d(q,N)$ ⇒ highly biased estimations
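To see this effect numerically, here is a minimal sketch (an added illustration with assumed sample sizes, not taken from the slides) that estimates the nearest-neighbour distance in the unit hypercube by simulation:

```python
# Empirical look at the curse of dimensionality: average distance from a random
# query point to its nearest neighbour among N uniform samples in [0, 1]^q.
import numpy as np

def mean_nn_distance(q, N, trials=20, seed=0):
    """Average Euclidean nearest-neighbour distance in the q-dim unit hypercube."""
    rng = np.random.default_rng(seed)
    dists = []
    for _ in range(trials):
        X = rng.random((N, q))                      # N uniform samples in [0, 1]^q
        query = rng.random(q)                       # a random query point
        dists.append(np.linalg.norm(X - query, axis=1).min())
    return float(np.mean(dists))

for q in (4, 10, 20):
    print(q, round(mean_nn_distance(q, N=1000), 2))   # grows roughly like N**(-1/q)
```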
Curse-of-Dimensionality
In high dimensional spaces data become extremely sparse and are far apart from each other.
The curse of dimensionality affects any estimation problem with high dimensionality.
Curse-of-Dimensionality
It is a serious problem in many real-world applications:
Microarray data: 3,000-4,000 genes
Documents: 10,000-20,000 words in the dictionary
Images, face recognition, etc.
How can we deal with the curse of dimensionality?
Curse-of-Dimensionality
Effective techniques applicable to high dimensional spaces exist. The reasons are twofold:
• Real data are often confined to regions of lower dimensionality
• Real data typically exhibit smoothness properties (at least locally), so local interpolation techniques can be used to make predictions
Example: covariance matrix of a two-dimensional dataset (a 2×2 symmetric matrix, shown on the slide).
The covariance measures the extent to which the two variables vary together. If they are independent, the covariance vanishes.
Covariance matrix of $\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$ with mean $\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}$:

$$
\Sigma = E\!\left[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T\right]
= E\!\left[\begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{pmatrix}\begin{pmatrix} x_1-\mu_1 & x_2-\mu_2 \end{pmatrix}\right]
= E\!\left[\begin{pmatrix} (x_1-\mu_1)^2 & (x_1-\mu_1)(x_2-\mu_2) \\ (x_2-\mu_2)(x_1-\mu_1) & (x_2-\mu_2)^2 \end{pmatrix}\right]
$$

$$
= \frac{1}{N}\sum_{i=1}^{N}\begin{pmatrix} (x_{i1}-\mu_1)^2 & (x_{i1}-\mu_1)(x_{i2}-\mu_2) \\ (x_{i2}-\mu_2)(x_{i1}-\mu_1) & (x_{i2}-\mu_2)^2 \end{pmatrix}
= \begin{pmatrix} \frac{1}{N}\sum_i (x_{i1}-\mu_1)^2 & \frac{1}{N}\sum_i (x_{i1}-\mu_1)(x_{i2}-\mu_2) \\ \frac{1}{N}\sum_i (x_{i2}-\mu_2)(x_{i1}-\mu_1) & \frac{1}{N}\sum_i (x_{i2}-\mu_2)^2 \end{pmatrix}
$$

The diagonal entries are the variances of $x_1$ and $x_2$; the off-diagonal entries are the covariances.
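As a quick check of the formula above, here is a minimal sketch (with assumed synthetic 2-D data) that computes the sample covariance matrix directly:

```python
# The formula above matches a direct computation of the sample covariance matrix.
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.8], [0.8, 2.0]], size=500)

mu = X.mean(axis=0)                       # (mu_1, mu_2)
Sigma = (X - mu).T @ (X - mu) / len(X)    # (1/N) * sum_i (x_i - mu)(x_i - mu)^T
print(Sigma)                              # close to np.cov(X.T, bias=True)
```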
[Examples: 2-D datasets with their sample covariance matrices, illustrating different degrees of positive and negative correlation between the two coordinates]
Dimensionality Reduction
• Many dimensions are often interdependent (correlated);
We can:
• Reduce the dimensionality of problems;
• Transform interdependent coordinates into significant and independent ones;
Bayesian Probabilities
A key issue in pattern recognition is uncertainty. It is due to incomplete and/or ambiguous information, i.e. finite and noisy data.
Probability theory and decision theory provide the tools to make optimal predictions given the limited available information.
In particular, the Bayesian interpretation of probability allows us to quantify uncertainty and make precise revisions of uncertainty in light of new evidence.
Bayes’ Theorem
$p(Y=y)$ is the prior probability: it expresses the probability before we observe any data.
$p(Y=y \mid X=x)$ is the posterior probability: it expresses the probability after we have observed the data.
The effect of the observed data is captured through the conditional probability $p(X=x \mid Y=y)$:
$$p(Y=y \mid X=x) = \frac{p(X=x \mid Y=y)\; p(Y=y)}{p(X=x)}$$
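A small numeric illustration (the prior and likelihood values are assumed, not from the slides) of how Bayes' theorem revises a prior in light of new evidence:

```python
# Bayes' theorem: posterior = likelihood * prior / evidence.
prior = {"disease": 0.01, "healthy": 0.99}
likelihood_pos_test = {"disease": 0.95, "healthy": 0.05}   # p(test = + | Y)

evidence = sum(likelihood_pos_test[y] * prior[y] for y in prior)      # p(X = +)
posterior = {y: likelihood_pos_test[y] * prior[y] / evidence for y in prior}
print(posterior)   # the prior 0.01 is revised to roughly 0.16 after a positive test
```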
Curve fitting re-visited
We can adopt a Bayesian approach when estimating the parameters for polynomial curve fitting.
$p(\mathbf{w})$ captures our assumptions about $\mathbf{w}$ before observing the data. The effect of the observed data D is captured by the conditional probability $p(D \mid \mathbf{w})$. Bayes' theorem allows us to evaluate the uncertainty in $\mathbf{w}$ after we have observed the data D (in the form of the posterior probability):
$$p(\mathbf{w}\mid D) = \frac{p(D\mid\mathbf{w})\; p(\mathbf{w})}{p(D)}$$
$p(D\mid\mathbf{w})$ is the likelihood function. Maximum likelihood approach: set $\mathbf{w}$ to the value that maximizes $p(D\mid\mathbf{w})$.
Curve fitting re-visited: ML approach
Training data: $\mathbf{x} = (x_1, \ldots, x_N)^T$, $\mathbf{t} = (t_1, \ldots, t_N)^T$
We can express our uncertainty over the value of the target variable using a probability distribution.
Assumption: given a value of x, the corresponding value of t has a Gaussian distribution with mean equal to
$$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j$$
Thus:
$$p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\!\left(t \mid y(x, \mathbf{w}),\, \beta^{-1}\right)$$
Curve fitting re-visited: ML approach
We use the training data $\{\mathbf{x}, \mathbf{t}\}$ to estimate $\mathbf{w}$ and $\beta$ by maximum likelihood.
Assuming the data are drawn independently, the likelihood function can be written as the product of the marginal distributions:
$$p(\mathbf{t}\mid\mathbf{x},\mathbf{w},\beta) = \prod_{n=1}^{N} \mathcal{N}\!\left(t_n \mid y(x_n,\mathbf{w}),\, \beta^{-1}\right)$$
Curve fitting re-visited: ML approach
Gaussian distribution:
$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\; e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad \text{here with } \mu = y(x,\mathbf{w}),\; \sigma^2 = \beta^{-1}$$
Therefore the log likelihood is
$$\ln p(\mathbf{t}\mid\mathbf{x},\mathbf{w},\beta) = \ln \prod_{n=1}^{N}\sqrt{\frac{\beta}{2\pi}}\; e^{-\frac{\beta}{2}\left(t_n - y(x_n,\mathbf{w})\right)^2} = -\frac{\beta}{2}\sum_{n=1}^{N}\left(y(x_n,\mathbf{w}) - t_n\right)^2 + \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi)$$
Curve fitting re-visited: ML approach
Maximum likelihood solution for the polynomial coefficients: maximize the log likelihood
$$\ln p(\mathbf{t}\mid\mathbf{x},\mathbf{w},\beta) = -\frac{\beta}{2}\sum_{n=1}^{N}\left(y(x_n,\mathbf{w}) - t_n\right)^2 + \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi)$$
with respect to $\mathbf{w}$.
It is equivalent to minimizing the negative log likelihood:
$$\mathbf{w}_{ML} = \arg\min_{\mathbf{w}}\left\{\frac{1}{2}\sum_{n=1}^{N}\left(t_n - y(x_n,\mathbf{w})\right)^2\right\}$$
Thus: the sum-of-squares error function results from maximizing the likelihood under the assumption of a Gaussian noise distribution.
Curve fitting re-visited: ML approach
Maximum likelihood solution for the parameter β: maximize the log likelihood
$$\ln p(\mathbf{t}\mid\mathbf{x},\mathbf{w},\beta) = -\frac{\beta}{2}\sum_{n=1}^{N}\left(y(x_n,\mathbf{w}) - t_n\right)^2 + \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi)$$
with respect to β:
$$\frac{1}{\beta_{ML}} = \frac{1}{N}\sum_{n=1}^{N}\left(y(x_n,\mathbf{w}_{ML}) - t_n\right)^2$$
Curve fitting re-visited: ML approach
We now have the maximum likelihood solutions for the parameters: $\mathbf{w}_{ML}$, $\beta_{ML}$.
We can now make predictions for new values of x by using the resulting probability distribution over t (the predictive distribution):
$$p(t\mid x, \mathbf{w}_{ML}, \beta_{ML}) = \mathcal{N}\!\left(t \mid y(x, \mathbf{w}_{ML}),\, \beta_{ML}^{-1}\right)$$
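A minimal sketch (assuming synthetic sin(2πx) data and polynomial degree M = 3, neither of which is specified on the slides) of the maximum likelihood fit and the resulting predictive distribution:

```python
# w_ML minimizes the sum-of-squares error; 1/beta_ML is the mean squared residual.
import numpy as np

rng = np.random.default_rng(0)
N, M = 20, 3
x = rng.random(N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

Phi = np.vander(x, M + 1, increasing=True)           # design matrix: phi_j(x) = x**j
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)       # least squares = ML under Gaussian noise
beta_ml = 1.0 / np.mean((Phi @ w_ml - t) ** 2)       # ML estimate of the noise precision

# Predictive distribution for a new input x0: N(t | y(x0, w_ml), 1/beta_ml)
x0 = 0.5
mean = np.polyval(w_ml[::-1], x0)
print(mean, 1.0 / beta_ml)
```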
Maximum a Posteriori (MAP) approach
Let us introduce a prior distribution over the polynomial coefficients $\mathbf{w}$.
Recall: Gaussian distribution of a D-dimensional vector x:
$$\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}\,|\boldsymbol{\Sigma}|^{1/2}}\; \exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$$
Prior distribution:
$$p(\mathbf{w}\mid\alpha) = \mathcal{N}(\mathbf{w}\mid\mathbf{0},\,\alpha^{-1}\mathbf{I}) = \left(\frac{\alpha}{2\pi}\right)^{(M+1)/2}\exp\!\left(-\frac{\alpha}{2}\,\mathbf{w}^T\mathbf{w}\right)$$
Using Bayes' theorem:
$$p(\mathbf{w}\mid\mathbf{x},\mathbf{t},\alpha,\beta) \propto p(\mathbf{t}\mid\mathbf{x},\mathbf{w},\beta)\; p(\mathbf{w}\mid\alpha)$$
MAP approach
Maximum a Posteriori solution for the parameters $\mathbf{w}$: maximize the posterior distribution
$$p(\mathbf{w}\mid\mathbf{x},\mathbf{t},\alpha,\beta) \propto p(\mathbf{t}\mid\mathbf{x},\mathbf{w},\beta)\; p(\mathbf{w}\mid\alpha)$$
It is equivalent to minimizing the negative log posterior:
$$\mathbf{w}_{MAP} = \arg\min_{\mathbf{w}}\left\{\frac{\beta}{2}\sum_{n=1}^{N}\left(y(x_n,\mathbf{w}) - t_n\right)^2 + \frac{\alpha}{2}\,\mathbf{w}^T\mathbf{w}\right\}$$
Thus: maximizing the posterior distribution is equivalent to minimizing the regularized sum-of-squares error function.
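A minimal sketch (reusing the same assumed synthetic data as before) showing that the MAP estimate is the regularized least-squares solution with regularization coefficient λ = α/β:

```python
# Closed form of argmin_w  beta/2 * ||Phi w - t||^2 + alpha/2 * ||w||^2.
import numpy as np

rng = np.random.default_rng(0)
N, M, alpha, beta = 20, 3, 5e-3, 25.0        # alpha, beta are assumed values
x = rng.random(N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

Phi = np.vander(x, M + 1, increasing=True)
lam = alpha / beta
w_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)
print(w_map)
```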
Decision Theory
• Decision theory, when combined with probability theory, allows us to make optimal decisions in situations involving uncertainty
• Training data: input vector x, target vector t
• Inference: learn the joint probability distribution $p(\mathbf{x}, \mathbf{t})$
• Decision step: make the optimal decision
Decision Theory
Classification example: medical diagnosis problem
• x: set of pixel intensities in an image
• Two classes:
 – $C_1$: absence of cancer
 – $C_2$: presence of cancer
• Inference step: estimate $p(\mathbf{x}, C_k)$
• Decision step: given x, predict $C_k$ so that a measure of error is minimized according to the given probabilities
Decision Theory
How do probabilities play a role in decision making?
• Decision step: given x, predict $C_k$
Thus, we are interested in the posterior
$$p(C_k\mid\mathbf{x}) = \frac{p(\mathbf{x}\mid C_k)\; p(C_k)}{p(\mathbf{x})}$$
Intuitively: we want to minimize the chance of assigning x to the wrong class. Thus, choose the class that gives the higher posterior probability.
Minimizing the misclassification rate
• Goal: minimize the number of misclassifications
We need to find a rule that assigns each input vector x to one of the possible classes $C_k$.
Such a rule divides the input space into regions $\mathcal{R}_k$ so that all points in $\mathcal{R}_k$ are assigned to $C_k$.
Boundaries between regions are called decision boundaries.
Minimizing the misclassification rate
• Goal: minimize the number of misclassifications
$$p(\text{mistake}) = p(\mathbf{x}\in\mathcal{R}_1, C_2) + p(\mathbf{x}\in\mathcal{R}_2, C_1) = \int_{\mathcal{R}_1} p(\mathbf{x}, C_2)\,d\mathbf{x} + \int_{\mathcal{R}_2} p(\mathbf{x}, C_1)\,d\mathbf{x}$$
• Assign x to the class that gives the smaller value of the integrand:
 – Choose $C_1$ if $p(\mathbf{x}, C_1) > p(\mathbf{x}, C_2)$
 – Choose $C_2$ if $p(\mathbf{x}, C_2) > p(\mathbf{x}, C_1)$
Minimizing the misclassification rate
– Choose $C_1$ if $p(\mathbf{x}, C_1) > p(\mathbf{x}, C_2)$
– Choose $C_2$ if $p(\mathbf{x}, C_2) > p(\mathbf{x}, C_1)$
Since $p(\mathbf{x}, C_k) = p(C_k\mid\mathbf{x})\; p(\mathbf{x})$, this is equivalent to:
– Choose $C_1$ if $p(C_1\mid\mathbf{x}) > p(C_2\mid\mathbf{x})$
– Choose $C_2$ if $p(C_2\mid\mathbf{x}) > p(C_1\mid\mathbf{x})$
Minimizing the misclassification rate
Optimal decision boundary: $\hat{x} = x_0$
Minimizing the misclassification rate
Thus: choose the class $C_k$ that gives the largest posterior $p(C_k\mid\mathbf{x})$.
General case of K classes:
$$p(\text{correct}) = \sum_{k=1}^{K} p(\mathbf{x}\in\mathcal{R}_k, C_k) = \sum_{k=1}^{K}\int_{\mathcal{R}_k} p(\mathbf{x}, C_k)\,d\mathbf{x}$$
Minimizing the expected loss
Some mistakes are more costly than others.
Loss function (cost function): overall measure of loss incurred in taking any of the available decisions.
$L_{kj}$: loss incurred when we assign x to class $C_j$ and the true class is $C_k$.

Example loss matrix (rows: true class, columns: decision):
              decide cancer   decide normal
  cancer            0              1000
  normal            1                 0

The optimal solution is the one that minimizes the loss function.
Minimizing the expected loss
The loss function depends on the true class, which is unknown. The uncertainty about the true class is expressed through the joint probability $p(\mathbf{x}, C_k)$.
We therefore minimize the expected loss:
$$E[L] = \sum_k \sum_j \int_{\mathcal{R}_j} L_{kj}\; p(\mathbf{x}, C_k)\, d\mathbf{x}$$
For each x we should minimize
$$\sum_k L_{kj}\; p(\mathbf{x}, C_k) = \left[\sum_k L_{kj}\; p(C_k\mid\mathbf{x})\right] p(\mathbf{x})$$
Minimizing the expected loss
For each x we should minimize
$$\sum_k L_{kj}\; p(\mathbf{x}, C_k) = \left[\sum_k L_{kj}\; p(C_k\mid\mathbf{x})\right] p(\mathbf{x})$$
Thus, to minimize the expected loss: assign each x to the class j that minimizes $\sum_k L_{kj}\; p(C_k\mid\mathbf{x})$.
The Reject Option
Inference and Decision
Inference stage: use the training data to learn a model for $p(C_k\mid\mathbf{x})$.
Decision stage: use the posterior probabilities to make optimal class assignments.
Generative Methods
Solve the inference problem of estimating the class-conditional densities $p(\mathbf{x}\mid C_k)$ for each class $C_k$.
Infer the prior class probabilities $p(C_k)$.
Use Bayes' theorem to find the class posterior probabilities:
$$p(C_k\mid\mathbf{x}) = \frac{p(\mathbf{x}\mid C_k)\; p(C_k)}{p(\mathbf{x})}, \qquad \text{where } p(\mathbf{x}) = \sum_k p(\mathbf{x}\mid C_k)\; p(C_k)$$
Use decision theory to determine class membership for each new input x.
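A minimal sketch (assuming 1-D Gaussian class-conditional densities with made-up parameters) of the generative approach:

```python
# Model p(x | C_k) and p(C_k), then apply Bayes' theorem to get p(C_k | x).
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

priors = {"C1": 0.3, "C2": 0.7}
class_conditionals = {"C1": (0.0, 1.0), "C2": (2.0, 1.0)}   # (mean, std) per class

def posterior(x):
    joint = {k: gaussian_pdf(x, *class_conditionals[k]) * priors[k] for k in priors}
    evidence = sum(joint.values())        # p(x) = sum_k p(x | C_k) p(C_k)
    return {k: v / evidence for k, v in joint.items()}

print(posterior(0.5))   # p(C_1 | x) and p(C_2 | x) at x = 0.5
```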
Discriminative Methods
Solve directly the inference problem of estimating the class posterior probabilities $p(C_k\mid\mathbf{x})$.
Use decision theory to determine class membership for each new input x.
Discriminant Functions
Find a function $f(\mathbf{x})$ which maps each input directly onto a class label. Probabilities play no role here.
The class membership of each new input x is given directly by $f(\mathbf{x})$.
Example
Linear Models for Classification
Classification: given an input vector x, assign it to one of K classes $C_k$, where k = 1, …, K.
The input space is divided into decision regions whose boundaries are called decision boundaries or decision surfaces.
Linear models: decision surfaces are linear functions of the input vector x. They are defined by (D-1)-dimensional hyperplanes within the D-dimensional input space.
Linear Models for Classification
For regression: $y(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0$
For classification, we want to predict class labels, or more generally class posterior probabilities.
We transform the linear function of w using a nonlinear function f(·), so that $y(\mathbf{x}) = f(\mathbf{w}^T\mathbf{x} + w_0)$.
These are called Generalized Linear Models.
Linear Discriminant Functions
Two classes: $y(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0$
If $y(\mathbf{x}) \ge 0$ assign x to $C_1$, otherwise assign x to $C_2$.
Decision boundary: $y(\mathbf{x}) = 0$
Linear Discriminant Functions
Geometrical properties:
Decision boundary: $y(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0 = 0$
Let $\mathbf{x}_1, \mathbf{x}_2$ be two points which lie on the decision boundary:
$$y(\mathbf{x}_1) = \mathbf{w}^T\mathbf{x}_1 + w_0 = 0,\quad y(\mathbf{x}_2) = \mathbf{w}^T\mathbf{x}_2 + w_0 = 0 \;\Rightarrow\; \mathbf{w}^T(\mathbf{x}_1 - \mathbf{x}_2) = 0$$
w represents the orthogonal direction to the decision boundary
Geometrical properties (cont'd)
Let $\mathbf{w}^{*} = \mathbf{w}/\lVert\mathbf{w}\rVert$. For a point $\mathbf{x}_0$ on the decision boundary, $y(\mathbf{x}_0) = 0$, i.e. $\mathbf{w}^T\mathbf{x}_0 = -w_0$.
The projection of $\mathbf{x}_0$ onto the direction $\mathbf{w}$ is
$$\mathbf{w}^{*T}\mathbf{x}_0 = \frac{\mathbf{w}^T\mathbf{x}_0}{\lVert\mathbf{w}\rVert} = \frac{-w_0}{\lVert\mathbf{w}\rVert}$$
Thus $-w_0/\lVert\mathbf{w}\rVert$ is the signed orthogonal distance of the origin from the decision surface.
Linear Discriminant Functions
Multiple classes
one-versus-the-rest: K-1 classifiers, each of which solves a two-class problem of separating points of $C_k$ from points not in that class.
Linear Discriminant Functions
Multiple classes
one-versus-one: K(K-1)/2 binary discriminant functions, one for every possible pair of classes.
Linear Discriminant Functions
Multiple classes
Solution: consider a single K-class discriminant comprising K linear functions of the form
$$y_k(\mathbf{x}) = \mathbf{w}_k^T\mathbf{x} + w_{k0}$$
Assign a point x to class $C_k$ if $y_k(\mathbf{x}) > y_j(\mathbf{x})\;\; \forall j \ne k$.
The decision boundary between class $C_k$ and class $C_j$ is given by $y_k(\mathbf{x}) = y_j(\mathbf{x})$, i.e.
$$(\mathbf{w}_k - \mathbf{w}_j)^T\mathbf{x} + (w_{k0} - w_{j0}) = 0$$
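A minimal sketch (with assumed random parameters, purely for illustration) of the K-class linear discriminant decision rule:

```python
# y_k(x) = w_k^T x + w_k0; x is assigned to the class with the largest y_k(x).
import numpy as np

rng = np.random.default_rng(0)
K, D = 3, 2
W = rng.normal(size=(K, D))      # one weight vector w_k per class
w0 = rng.normal(size=K)          # one bias w_k0 per class

x = np.array([0.5, -1.0])
scores = W @ x + w0              # y_k(x) for k = 1..K
print(int(np.argmax(scores)))    # index of the predicted class
```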
Linear Discriminant Functions
Two approaches:
Fisher’s linear discriminant
Perceptron algorithm
Fisher's Linear Discriminant
One way to view a linear classification model is in terms of dimensionality reduction.
Two-class case:
Suppose we project x onto one dimension: $y = \mathbf{w}^T\mathbf{x}$
Set a threshold t: if $y \ge t$ assign x to $C_1$, otherwise assign x to $C_2$.
Fisher’s Linear Discriminant
• Find an orientation along which the projected samples are well separated;
• This is exactly the goal of linear discriminant analysis (LDA);
• In other words: we are after the linear projection that best separates the data, i.e. best discriminates data of different classes.
How can we find such a discriminant direction?
LDA
• Training samples $\{(\mathbf{x}_i, C_i)\}_{i=1}^{N}$, with $\mathbf{x}_i \in \mathbb{R}^q$ and $C_i \in \{C_1, C_2\}$
• $N_1$ samples of class $C_1$
• $N_2$ samples of class $C_2$
• Consider $\mathbf{w} \in \mathbb{R}^q$ with $\lVert\mathbf{w}\rVert = 1$
• Then $\mathbf{w}^T\mathbf{x}$ is the projection of x along the direction of w
• We want the projections $\mathbf{w}^T\mathbf{x}$ where $\mathbf{x}\in C_1$ separated from the projections $\mathbf{w}^T\mathbf{x}$ where $\mathbf{x}\in C_2$
LDA
• A measure of the separation between the projected points is the difference of the sample means:
$$\mathbf{m}_i = \frac{1}{N_i}\sum_{\mathbf{x}\in C_i}\mathbf{x} \qquad \text{(sample mean of class } C_i\text{)}$$
$$m_i = \frac{1}{N_i}\sum_{\mathbf{x}\in C_i}\mathbf{w}^T\mathbf{x} = \mathbf{w}^T\mathbf{m}_i \qquad \text{(sample mean of the projected points)}$$
$$m_1 - m_2 = \mathbf{w}^T(\mathbf{m}_1 - \mathbf{m}_2)$$
We wish to make the above difference as large as we can. In addition…
LDA
• To obtain good separation of the projected data we really want the difference between the means to be large relative to some measure of the standard deviation of each class:
$$s_i^2 = \sum_{\mathbf{x}\in C_i}\left(\mathbf{w}^T\mathbf{x} - m_i\right)^2 \qquad \text{(scatter of the projected samples of class } C_i\text{)}$$
$$s_1^2 + s_2^2 \qquad \text{(total within-class scatter of the projected samples)}$$
Fisher linear discriminant analysis:
$$\arg\max_{\mathbf{w}}\; \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2}$$
LDA
LDA
To obtain
$$J(\mathbf{w}) = \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2}$$
as an explicit function of $\mathbf{w}$, we define the following matrices:
$$\mathbf{S}_i = \sum_{\mathbf{x}\in C_i}(\mathbf{x}-\mathbf{m}_i)(\mathbf{x}-\mathbf{m}_i)^T, \qquad \mathbf{S}_W = \mathbf{S}_1 + \mathbf{S}_2 \qquad \text{(within-class scatter matrix)}$$
Then:
$$s_i^2 = \sum_{\mathbf{x}\in C_i}\left(\mathbf{w}^T\mathbf{x} - \mathbf{w}^T\mathbf{m}_i\right)^2 = \sum_{\mathbf{x}\in C_i}\mathbf{w}^T(\mathbf{x}-\mathbf{m}_i)(\mathbf{x}-\mathbf{m}_i)^T\mathbf{w} = \mathbf{w}^T\mathbf{S}_i\,\mathbf{w}$$
LDA
So $s_1^2 = \mathbf{w}^T\mathbf{S}_1\mathbf{w}$ and $s_2^2 = \mathbf{w}^T\mathbf{S}_2\mathbf{w}$. Thus:
$$s_1^2 + s_2^2 = \mathbf{w}^T(\mathbf{S}_1 + \mathbf{S}_2)\mathbf{w} = \mathbf{w}^T\mathbf{S}_W\,\mathbf{w}$$
Similarly:
$$(m_1 - m_2)^2 = \left(\mathbf{w}^T\mathbf{m}_1 - \mathbf{w}^T\mathbf{m}_2\right)^2 = \mathbf{w}^T(\mathbf{m}_1-\mathbf{m}_2)(\mathbf{m}_1-\mathbf{m}_2)^T\mathbf{w} = \mathbf{w}^T\mathbf{S}_B\,\mathbf{w}$$
where
$$\mathbf{S}_B = (\mathbf{m}_1-\mathbf{m}_2)(\mathbf{m}_1-\mathbf{m}_2)^T \qquad \text{(between-class scatter matrix)}$$
LDA
We have obtained:
$$s_1^2 + s_2^2 = \mathbf{w}^T\mathbf{S}_W\,\mathbf{w}, \qquad (m_1 - m_2)^2 = \mathbf{w}^T\mathbf{S}_B\,\mathbf{w}$$
Thus:
$$J(\mathbf{w}) = \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2} = \frac{\mathbf{w}^T\mathbf{S}_B\,\mathbf{w}}{\mathbf{w}^T\mathbf{S}_W\,\mathbf{w}}, \qquad \arg\max_{\mathbf{w}}\; \frac{\mathbf{w}^T\mathbf{S}_B\,\mathbf{w}}{\mathbf{w}^T\mathbf{S}_W\,\mathbf{w}}$$
LDA
We observe that
$$\mathbf{S}_B\,\mathbf{w} = (\mathbf{m}_1-\mathbf{m}_2)(\mathbf{m}_1-\mathbf{m}_2)^T\mathbf{w}$$
where $(\mathbf{m}_1-\mathbf{m}_2)^T\mathbf{w}$ is a scalar, so $\mathbf{S}_B\mathbf{w}$ is always in the direction of $(\mathbf{m}_1-\mathbf{m}_2)$.
$$J(\mathbf{w}) = \frac{\mathbf{w}^T\mathbf{S}_B\,\mathbf{w}}{\mathbf{w}^T\mathbf{S}_W\,\mathbf{w}} \quad \text{is maximized when} \quad (\mathbf{w}^T\mathbf{S}_B\,\mathbf{w})\,\mathbf{S}_W\mathbf{w} = (\mathbf{w}^T\mathbf{S}_W\,\mathbf{w})\,\mathbf{S}_B\mathbf{w}$$
Since only the direction of w matters, the solution is
$$\mathbf{w} = \mathbf{S}_W^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$$
LDA
Projection onto the line joining the class means
LDA
Solution of LDA
LDA
$$\mathbf{w} = \mathbf{S}_W^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$$
• Gives the linear function with the maximum ratio of between-class scatter to within-class scatter.
• The problem, e.g. classification, has been reduced from a q-dimensional problem to a more manageable one-dimensional problem.
• Optimal for multivariate normal class-conditional densities.
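A minimal sketch (with assumed two-class synthetic data) of Fisher's linear discriminant and a simple midpoint threshold on the projection:

```python
# w = S_W^{-1} (m1 - m2), then classify by thresholding the 1-D projection w^T x.
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=100)   # class C1
X2 = rng.multivariate_normal([2, 2], [[1.0, 0.5], [0.5, 1.0]], size=100)   # class C2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter
w = np.linalg.solve(S_W, m1 - m2)                          # Fisher direction

threshold = w @ (m1 + m2) / 2.0          # a simple midpoint threshold on the projection
x_new = np.array([1.8, 2.1])
label = "C1" if w @ x_new >= threshold else "C2"
print(label)
```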
LDA
• The analysis can be extended to multiple classes.
• LDA is a linear technique for dimensionality reduction: it projects the data along directions that can be expressed as linear combinations of the input features.
• Non-linear extensions of LDA exist (e.g., generalized LDA).
• The "appropriate" transformation depends on the data and on the task we want to perform on the data. Note that LDA uses class labels.
The Perceptron Algorithm
Perceptron (Frank Rosenblatt, 1957)
• First learning algorithm for neural networks;
• Originally introduced for character classification, where each character is represented as an image;
Perceptron (contd.)
[Figure: perceptron with input units $x_1, \ldots, x_n$ and a single output unit]
Total input to the output node: $\sum_{j=1}^{n} w_j x_j$
The output unit applies the activation function
$$H(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases}$$
Perceptron: Learning Algorithm
• Goal: we want to define a learning algorithm for the weights in order to compute a mapping from the inputs to the outputs;
• Example: two class character recognition problem.
– Training set: set of images representing either the character ‘a’ or the character ‘b’ (supervised learning);
– Learning Task: Learn the weights so that when a new unlabelled image comes in, the network can predict its label.
– Settings:
   Class 'a' → 1 (class C1)
   Class 'b' → 0 (class C2)
   n input units (intensity level of a pixel)
   1 output unit
The perceptron needs to learn $f: \mathbb{R}^n \to \{0, 1\}$
Perceptron: Learning Algorithm
The algorithm proceeds as follows:
• Initial random setting of weights;
• The input is a random sequence $\{\mathbf{x}^{(k)}\}_{k\in\mathbb{N}}$;
• For each element of class C1, if output = 1 (correct) do nothing, otherwise update weights;
• For each element of class C2, if output = 0 (correct) do nothing, otherwise update weights.
Perceptron: Learning Algorithm
A bit more formally:
$\mathbf{x} = (x_1, x_2, \ldots, x_n)$, $\mathbf{w} = (w_1, w_2, \ldots, w_n)$, $\theta$: threshold of the output unit
$$\mathbf{w}^T\mathbf{x} = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$$
Output is 1 if $\mathbf{w}^T\mathbf{x} - \theta \ge 0$.
To eliminate the explicit dependence on $\theta$, add an extra input $x_{n+1} = 1$ with weight $-\theta$. Output is then 1 if
$$\hat{\mathbf{w}}^T\hat{\mathbf{x}} = \sum_{i=1}^{n+1} w_i x_i \ge 0$$
Perceptron: Learning Algorithm
• We want to learn values of the weights so that the perceptron correctly discriminates elements of C1 from elements of C2:
• Given x in input, if x is classified correctly the weights are unchanged, otherwise:
$$\mathbf{w}' = \begin{cases} \mathbf{w} + \mathbf{x} & \text{if an element of class } C_1 \text{ was classified as in } C_2 \text{ (output 0)} \\ \mathbf{w} - \mathbf{x} & \text{if an element of class } C_2 \text{ was classified as in } C_1 \text{ (output 1)} \end{cases}$$
Perceptron: Learning Algorithm
• 1st case: the correct answer is 1, i.e. $\hat{\mathbf{w}}^T\hat{\mathbf{x}} \ge 0$ for $\mathbf{x}\in C_1$, but x was classified in $C_2$: we have instead $\hat{\mathbf{w}}^T\hat{\mathbf{x}} < 0$.
We want to get closer to the correct answer, i.e. $\mathbf{x}^T\mathbf{w} < \mathbf{x}^T\mathbf{w}'$:
$$\mathbf{x}^T\mathbf{w}' = \mathbf{x}^T(\mathbf{w} + \mathbf{x}) = \mathbf{x}^T\mathbf{w} + \lVert\mathbf{x}\rVert^2 \ge \mathbf{x}^T\mathbf{w}$$
because $\lVert\mathbf{x}\rVert^2 \ge 0$, so the condition is verified.
Perceptron: Learning Algorithm
• 2nd case: the correct answer is 0, i.e. $\hat{\mathbf{w}}^T\hat{\mathbf{x}} < 0$ for $\mathbf{x}\in C_2$, but x was classified in $C_1$: we have instead $\hat{\mathbf{w}}^T\hat{\mathbf{x}} \ge 0$.
We want to get closer to the correct answer, i.e. $\mathbf{x}^T\mathbf{w} > \mathbf{x}^T\mathbf{w}'$:
$$\mathbf{x}^T\mathbf{w}' = \mathbf{x}^T(\mathbf{w} - \mathbf{x}) = \mathbf{x}^T\mathbf{w} - \lVert\mathbf{x}\rVert^2 \le \mathbf{x}^T\mathbf{w}$$
because $\lVert\mathbf{x}\rVert^2 \ge 0$, so the condition is verified.
The previous rule allows the network to get closer to the correct answer when it makes an error.
Perceptron: Learning Algorithm
• In summary:
1. A random sequence $\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(k)}, \ldots$ is generated such that $\mathbf{x}^{(i)} \in C_1 \cup C_2$
2. If $\mathbf{x}^{(k)}$ is correctly classified, then $\mathbf{w}^{(k+1)} = \mathbf{w}^{(k)}$; otherwise
$$\mathbf{w}^{(k+1)} = \begin{cases} \mathbf{w}^{(k)} + \mathbf{x}^{(k)} & \text{if } \mathbf{x}^{(k)} \in C_1 \\ \mathbf{w}^{(k)} - \mathbf{x}^{(k)} & \text{if } \mathbf{x}^{(k)} \in C_2 \end{cases}$$
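A minimal sketch (with an assumed toy dataset, the OR function) of the perceptron learning rule above, using the augmented input so the threshold is learned as a weight:

```python
# Perceptron learning rule: add x on a missed C1 example, subtract x on a missed C2 example.
import numpy as np

def train_perceptron(X, y, epochs=100):
    """X: (N, n) inputs; y: labels, 1 for C1 and 0 for C2."""
    X_hat = np.hstack([X, np.ones((len(X), 1))])     # append constant input 1
    w = np.zeros(X_hat.shape[1])
    for _ in range(epochs):
        for x, t in zip(X_hat, y):
            out = 1 if w @ x >= 0 else 0
            if t == 1 and out == 0:
                w = w + x                             # element of C1 misclassified
            elif t == 0 and out == 1:
                w = w - x                             # element of C2 misclassified
    return w

# Linearly separable toy example: the OR function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])
w = train_perceptron(X, y)
print([(1 if w @ np.append(x, 1) >= 0 else 0) for x in X])   # [0, 1, 1, 1]
```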
Perceptron: Learning Algorithm
Does the learning algorithm converge?
Convergence theorem: regardless of the initial choice of weights, if the two classes are linearly separable, i.e. there exists $\mathbf{w}$ such that
$$\begin{cases} \hat{\mathbf{w}}^T\hat{\mathbf{x}} \ge 0 & \text{if } \mathbf{x}\in C_1 \\ \hat{\mathbf{w}}^T\hat{\mathbf{x}} < 0 & \text{if } \mathbf{x}\in C_2 \end{cases}$$
then the learning rule will find such a solution after a finite number of steps.
Representational Power of Perceptrons
• Marvin Minsky and Seymour Papert, "Perceptrons", 1969:
  "The perceptron can solve only problems with linearly separable classes."
• Examples of linearly separable Boolean functions: AND, OR
  [Figures: 2-D plots of the AND and OR functions, each separable by a single line]
Representational Power of Perceptrons
Perceptron that computes the AND function
[Figure: two inputs x1, x2 with weights 1, 1 and bias -1.5]
Perceptron that computes the OR function
[Figure: two inputs x1, x2 with weights 1, 1 and bias -0.5]
Representational Power of Perceptrons
• Example of a non-linearly-separable Boolean function: EX-OR
  [Figure: 2-D plot of the EX-OR function; no single line separates the two classes]
The EX-OR function cannot be computed by a perceptron.