Enhancing Decision-Making with k-Nearest Neighbors with Amazon SageMaker
In this post, we shall delve into k-Nearest Neighbors(k-NN) algorithm, one of the simplest yet very powerful algorithm, that can be used to solve very complex problems. We will start with some basic concepts, which always play an integral role irrespective of which algorithm we pick and that will help us set the premise before we dive deep into k-NN. We shall then proceed to solve some problems on Amazon SageMaker with the build-in k-NN, index based algorithm.
k-Nearest Neighbors(k-NN)
algorithm. One of the simplest yet very powerful algorithm, that can be used to solve very complex problems. We will start with some basic concepts, which always play an integral role irrespective of which algorithm we pick and that will help us set the premise before we dive deep into k-NN. We shall then proceed to solve some problems on Amazon SageMaker with the build-in k-NN, index based algorithm.Classification
and Regression
can be considered as two different types of prediction problems which comes under Supervised Machine Learning.Classification
are problems wherein the class/target labels have finite set of outcomes. Let's take an example of Amazon Fine Food dataset wherein we have numerous customer reviews and each review has a class label of positive
or negative
. Now the problem in our hand is, we need to train a model such that, given a new review text, our model should be able identify/predict whether the review is positive
or negative
.classification
problem.regression
problems, the class/target label have infinite set of outcome, or in other words it can be any real number. For example, imagine that we have a dataset of students with input features as "Age, Gender and Race" and class/target label as Height
. And now, we need to train a model such that, once the model is trained, it should be able to predict the Height
of a student, given her/his Age, Gender and Race.Height
could be any real number(e.g. 170 cm, 172.65 cm, and so on ...), which cannot be defined by a finite set of possibilities. So, we can call these kind of problems as regression
problem.positive
or negative
, which defines that a particular user review can be either a positive review or a negative review. So, the first thing that we generally do, is we convert each review into a vector(you can think of a vector as an array of numbers), and we can always represent a vector as a column vector
or a row vector
. Data scientists and machine learning practitioner uses both, but if someone is simply referring a data point as vector, it's safe to assume, they are referring to a column vector
. Having said that, most of the time we will have the context which will help us to identify that a given vector is a column vector or row vector.row
would represent one review
. And we can do so, only after we convert each review vector into a row vector, and we know we can do so by taking a transpose of the vector.vector
and matrix
.1 to n
, such that Xi belongs to Rd
, thats is Xi is a d
dimensional vector where d
can be any real values and Yi belongs to {0, 1}
.class A
or class B
.Q
which we give to the model, and we need to know which class this new query point Q
belongs to class A
or class B
.Q
may belong to class B
as all the points which are closed neighbors of the query point Q
belongs to class B
.S
, and this time lets say we pick k=4
, so now we can see below, that out of 4 neighbors of query point S
, three of them belongs to class B
and one of them belongs to class A
, So, k-NN would conclude that the query point S
belongs to class B
, as the majority of the neighbors belongs to class B
- Finds k nearest neighbors of the query point
q
- Notes the
class labels
for all those neighbors - Takes the
majority vote
out of all the class labels, and whichever class gets the maximum vote, it declares the same as the class label for the query pointq
majority vote
it takes either mean
or median
of all the class labels, as in case of regression class labels would be any real number. And statistically median
would be little less prone to outliers/noise.Euclidian Distance
. So, it's basically the shortest distance between two data points or vectors.L2 Norm
of a vector, in simple terms, we generally refer them as:- Euclidian Distance, when we consider 2
points
- L2 Norm, when we consider 2
vectors
L1 Norm
of a vector.euclidian(L2 Norm)
and manhattan(L1 Norm)
. We also have a generalisation of these two L1
and L2
Norms, which is called Lp
Norm, also know as Minkowski
distance, where p could be any number.- When p=1, Minkowski distance becomes Manhattan distance
- When p=2, Minkowski distance becomes Euclidian distance
- We refer to
Distance
, when we think about 2 points - We refer to
Norm
, when we think about 2 vectors - Manhattan Distance between 2 points is same as L1 Norm of 2 vectors
- Euclidian Distance between 2 points is same as L2 Norm of 2 vectors
- Minkowski Distance between 2 points is same as Lp Norm of 2 vectors
positive
or negative
.- Let's take the data set
D
, which has all the "n" no. reviews. - Now, let's break the data set
D
, into two subgroups,D_train
andD_test
, such that any single review would be part of eitherD_test
orD_train
, but not to both. WhereinD_train
dataset would be used during the training andD_test
would be used during testing - And after we train the k-NN with the
D_train
, we can pick each point fromD_test
, and use this data setD_train
andk-NN
to get its respectiveclass label
(prediction). - Once we get the predicted class label, we can compare the prediction with the actual class label to understand the model accuracy.
k-NN
is said to be a lazy learner. As in, there is no such training phase in real mean, unlike the machine learning algorithm. During the training phase it simply loads all the train data points into the RAM. Apart from that there's no learning. The actual task of prediction begins in the test phase. Hence we directly took the query point and measure the k nearest points in the training data and then do the majority voting
of the class labels for all those k points. Hence, there is no training phase of the algorithm. As the training phase consists of only storing the data points, we say that there's no training in KNN.D_Train
. For example, let's say we have n of data points in D_Train
, and each point is of d dimension. In that case, it would take an order of O(nd)
time to compute the k-nearest neighbour from the query point. This is dangerous if you think intuitively. Imagine, you are at amazon.com and posting a review for a product you bought. Now, lets say I am running a k-NN at the backend at amazon.com, whose task is to categorise each new review into positive or negative, and if the system takes O(nd)
time to come up with the prediction, it might take several mins to categorise the review, and its not at all useful.KD-Tree
, LSH(Locality Sensitive Hashing)
, etc. Although they are not perfect but these techniques can significantly reduce the space/time complexity of k-NN.k
, which would give us the best accuracy for our respective problem we are trying to solve using k-NN?- When
k=1
: This means, for a given query point, theclass
label for that query point would be same as theclass
label of its closest neighbour. So, here class label for the query point Q would be+ve
- When
k=5
: For this following query point the class label would be+ve
as out of 5 closest neighbour, 4 of them have have the class label of+ve
- When
k=n
: This means, let's say we have 1000 data points, and k=1000, no matter where the query point lies, k-NN would always predict the class label which have the majority of the class, e.g. if out of 1000 data points if 400 belongs to+ve
and 600 belongs to-ve
, no matter whats the actual class of the query point, the prediction would be always-ve
.
Decision Surface
is generally used to visually see on plane where our algorithm is classifying the point as in this example as +ve
or -ve
class. As here for different k the points on the plane will be classified into any class depending upon where the point lies on the plane from other points.k=n
, everything belongs to the majority class.Overfit
and Underfit
. If you consider the above plot, when k=1
, we could classify a query point incorrectly due to some noise, in other words, if my decision surface is very rough, that actually indicates that the surface is capturing most of the noisy/outlier data point. And in those cases, we call the model as Overfit
.k=n
, such models are termed as Underfit
.Well-fit
model, like what we have seen when k=5
, although it also makes wrong prediction, but in general that's a good and preferable model to have over an overfit model. It kind of takes the middle path between Overfit
and Underfit
.Cross Validation
briefly before moving towards hands on, as we were discussing about k-NN
.train
and test
datasets, which we have seen above. They just used to fit
the model on the train
data with different set of values of 'k' and used to check the performance on test
data against all the models and then whichever value of 'k' used to give the best results on the test
data, that would have been considered as the optimal value of k. But with this process we are able to fit the model on the train data and compute the hyperparameter(which is nothing but the value of k
) using the test data, but not able to build a model that could generalise well on completely unseen
data(ie., the data which was not used in training and cross validation phase). So far the model built works better on train and the test data as it has already seen both of them in computing the nearest neighbors and 'k' value.Cross Validation(CV)
where the given dataset is divided into 3 parts names train
, cv
and test
where they used to fit
the model on the train data, compute the hyperparameters(right value of k
) on the basis of cv data and then make the model generalise to the test data(ie., checking how well the model performs on the unseen data). This way though the model has already seen both the train and cv data during the computation of Nearest neighbors(NN) and 'k' value, but it hasn't seen the test data and so generalising the model using the test data have accomplished the objective of building an machine learning model. But here again there's a drawback. As we split the data into 3 parts now, Train, CV and Test, lets say in the ratio of 60%, 20% and 20% respectively, so what we are eventually doing is, we are using only 60% of the data to compute our NN(using the training data) and 20% of the data to compute k
(using the CV data) and rest 20% for testing(using the test data, we use only 60% of the data compute NN, which is quite less and we know in general in machine learning more the training data, better is the algorithm).k-fold cross validation
for our rescue. Don't get confused with k-fold and k-nn, both of these k
are different. We can just think of k
as a variable which could be any number. Let's take a quick example and understand this k-fold cross validation
.D(train)
, D(CV)
and D(test)
with 60%, 20% and 20% of data respectively(as we mentioned earlier). But to achieve a good k-NN algorithm function, more data should be use to train instead of just 60%. And since we can not do anything about the D(test)
data, as we should not touch that and keep it as it is as unseen data so that we can use the same to check the accuracy of our model. Therefore we mix the remaining 80% of the data, which comprises of D(train)
and D(CV)
.k=4
for the k-fold cross validation, we can call it as 4-fold
cross validation.And this same process is done for all values of k like 2,3,4.....n. From this a plot is plotted and best k can be observed.
1
2
3
4
%%bash
wget 'https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.data.gz'
mkdir -p /tmp/covtype/raw
mv covtype.data.gz /tmp/covtype/raw/covtype.data.gz
We'll first load the data into numpy arrays, and randomly split it into train and test with a 90/10 split.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
import numpy as np
import os
data_dir = "/tmp/covtype/"
processed_subdir = "standardized"
raw_data_file = os.path.join(data_dir, "raw", "covtype.data.gz")
train_features_file = os.path.join(data_dir, processed_subdir, "train/csv/features.csv")
train_labels_file = os.path.join(data_dir, processed_subdir, "train/csv/labels.csv")
test_features_file = os.path.join(data_dir, processed_subdir, "test/csv/features.csv")
test_labels_file = os.path.join(data_dir, processed_subdir, "test/csv/labels.csv")
# read raw data
print("Reading raw data from {}".format(raw_data_file))
raw = np.loadtxt(raw_data_file, delimiter=',')
# split into train/test with a 90/10 split
np.random.seed(0)
np.random.shuffle(raw)
train_size = int(0.9 * raw.shape[0])
train_features = raw[:train_size, :-1]
train_labels = raw[:train_size, -1]
test_features = raw[train_size:, :-1]
test_labels = raw[train_size:, -1]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
import io
import boto3
import os
import sagemaker
import sagemaker.amazon.common as smac
bucket = "my-ai-bucket-suman"
prefix = "knn"
key = "recordio-pb-data"
print('train_features shape = ', train_features.shape)
print('train_labels shape = ', train_labels.shape)
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, train_features, train_labels)
buf.seek(0)
boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', key)).upload_fileobj(buf)
s3_train_data = 's3://{}/{}/train/{}'.format(bucket, prefix, key)
print('uploaded training data location: {}'.format(s3_train_data))
print('test_features shape = ', test_features.shape)
print('test_labels shape = ', test_labels.shape)
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, test_features, test_labels)
buf.seek(0)
boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'test', key)).upload_fileobj(buf)
s3_test_data = 's3://{}/{}/test/{}'.format(bucket, prefix, key)
print('uploaded test data location: {}'.format(s3_test_data))
1
2
3
4
5
6
train_features shape = (522910, 54)
train_labels shape = (522910,)
uploaded training data location: s3://my-ai-bucket-suman/knn/train/recordio-pb-data
test_features shape = (58102, 54)
test_labels shape = (58102,)
uploaded test data location: s3://my-ai-bucket-suman/knn/test/recordio-pb-data
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
import matplotlib.pyplot as plt
import time
import sagemaker
from sagemaker import get_execution_role
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer
from sagemaker.amazon.amazon_estimator import get_image_uri
def trained_estimator_from_hyperparams(s3_train_data, hyperparams, output_path, s3_test_data=None):
"""
Create an Estimator from the given hyperparams, fit to training data,
and return a deployed predictor
"""
# set up the estimator
knn = sagemaker.estimator.Estimator(get_image_uri(boto3.Session().region_name, "knn"),
get_execution_role(),
train_instance_count=1,
train_instance_type='ml.m5.2xlarge',
output_path=output_path,
sagemaker_session=sagemaker.Session())
knn.set_hyperparameters(**hyperparams)
# train a model. fit_input contains the locations of the train and test data
fit_input = {'train': s3_train_data}
if s3_test_data is not None:
fit_input['test'] = s3_test_data
knn.fit(fit_input)
return knn
1
2
3
4
5
6
7
8
9
hyperparams = {
'feature_dim': 54,
'k': 10,
'sample_size': 200000,
'predictor_type': 'classifier'
}
output_path = 's3://' + bucket + '/' + prefix + '/output'
knn_estimator = trained_estimator_from_hyperparams(s3_train_data, hyperparams, output_path,
s3_test_data=s3_test_data)
1
2
3
4
2023-07-11 11:44:55 Uploading - Uploading generated training model
2023-07-11 11:44:55 Completed - Training job completed
Training seconds: 247
Billable seconds: 247
knn_estimator
object above contains all the information we need for hosting the model. Below is a simple helper function that gives an estimator, sets up an endpoint that hosts the model. Other than the estimator object, we provide it with a name (string) for the estimator, and an instance_type
. The instance_type
is the machine type that will host the model. It is not restricted in any way by the parameter settings of the training job.1
2
3
4
5
6
7
def predictor_from_estimator(knn_estimator, estimator_name, instance_type, endpoint_name=None):
knn_predictor = knn_estimator.deploy(
initial_instance_count=1, instance_type=instance_type, endpoint_name=endpoint_name
)
knn_predictor.serializer = CSVSerializer()
knn_predictor.deserializer = JSONDeserializer()
return knn_predictor
endpoint
:1
2
3
4
5
instance_type = "ml.m4.xlarge"
model_name = "knn_%s" % instance_type
endpoint_name = "knn-ml-m4-xlarge-%s" % (str(time.time()).replace(".", "-"))
print("setting up the endpoint..")
predictor = predictor_from_estimator(knn_estimator, model_name, instance_type, endpoint_name=endpoint_name)
1
2
setting up the endpoint..
---------------!
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
batches = np.array_split(test_features, 100)
print('data split into 100 batches, of size %d.' % batches[0].shape[0])
# obtain an np array with the predictions for the entire test set
start_time = time.time()
predictions = []
for batch in batches:
result = predictor.predict(batch)
cur_predictions = np.array([result['predictions'][i]['predicted_label'] for i in range(len(result['predictions']))])
predictions.append(cur_predictions)
predictions = np.concatenate(predictions)
run_time = time.time() - start_time
test_size = test_labels.shape[0]
num_correct = sum(predictions == test_labels)
accuracy = num_correct / float(test_size)
print('time required for predicting %d data point: %.2f seconds' % (test_size, run_time))
print('accuracy of model: %.1f%%' % (accuracy * 100) )
1
2
3
data split into 100 batches, of size 582.
time required for predicting 58102 data point: 42.34 seconds
accuracy of model: 92.2%
1
2
3
4
5
6
7
8
def delete_endpoint(predictor):
try:
boto3.client('sagemaker').delete_endpoint(EndpointName=predictor.endpoint)
print('Deleted {}'.format(predictor.endpoint))
except:
print('Already deleted: {}'.format(predictor.endpoint))
delete_endpoint(predictor)
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.