KNN machine learning algorithm: Cancer data
The K-Nearest Neighbors (KNN) algorithm is a supervised learning method used for classification and regression. It predicts the label of a new observation from the classes of its K nearest neighbors in the training set, usually measured by Euclidean distance. It is a "lazy" algorithm: it has no explicit training phase, but instead stores the data and performs its calculations at prediction time. Although it is easy to implement and understand, it can be slow on large datasets and is sensitive to noise and to the scale of the variables.
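The core idea can be sketched in a few lines of base R. This is an illustrative toy with made-up points, not the implementation used below: compute the Euclidean distance from a query point to every training point and take a majority vote among the K closest.

```r
# Toy KNN classifier: majority vote among the K nearest training
# points, measured by Euclidean distance.
knn_predict <- function(train_x, train_y, query, k = 3) {
  d <- sqrt(rowSums(sweep(train_x, 2, query)^2))  # distance to each training row
  votes <- train_y[order(d)[1:k]]                 # labels of the K closest rows
  names(which.max(table(votes)))                  # majority vote
}

train_x <- rbind(c(1, 1), c(1, 2), c(2, 1), c(8, 8), c(8, 9), c(9, 8))
train_y <- c("A", "A", "A", "B", "B", "B")
knn_predict(train_x, train_y, query = c(2, 2), k = 3)  # "A"
```

Note that the distances are computed fresh for every query, which is exactly why prediction, not training, is the expensive step.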
In this guide we will review two cases: first, a dataset where KNN fails to classify the observations adequately, and second, a case where, with a larger number of predictor variables, it proves to be a powerful tool.
Menopause in women diagnosed with Cancer
For the first example, we will use the `BrCa` dataset from the `Epi` package and test whether KNN can classify women diagnosed with breast cancer as menopausal or not, based on their progesterone and estrogen hormone levels.
First we will import the packages required to implement the KNN algorithm.
list.of.packages <- c("Epi", "janitor", "tidymodels",
                      "kableExtra", "kknn", "dplyr",
                      "shiny")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
if (length(new.packages)) install.packages(new.packages)
library(janitor)
library(dplyr)
library(tidymodels)
library(kableExtra)
library(kknn)
library(Epi)
Now we'll subset the original data, selecting only the variables relevant to our analysis and standardizing the variable names with `clean_names()`. If you want to learn more about this dataset, run `?BrCa` in your console.
data(BrCa)
BrCa_ <- BrCa |>
  clean_names("upper_camel") |>
  mutate(Meno = as.factor(Meno)) |>
  select(Meno, Pr, Er)
With our variables selected, we perform a 70/30 split on our data to generate the training and testing datasets.
## split
set.seed(876)
Split7030 = initial_split(BrCa_, prop = 0.7, strata = Meno)
DataTrain = training(Split7030)
DataTest = testing(Split7030)
#head(DataTrain)
We then create the recipe, indicating the variable we want to predict, `Meno`, and the two predictors containing the hormone levels of the women under study, `Pr` and `Er`.
## recipe
Recipe = recipe(Meno ~ ., data = DataTrain) |>
  step_naomit(everything()) |>
  step_normalize(all_predictors())
In KNN, the ability to classify a dataset adequately is influenced by K, the number of "neighbors" of an observation considered when computing distances. K is therefore treated as a hyperparameter that can be tuned to find the most appropriate value for the algorithm.
For this reason we declare the model with the function `nearest_neighbor()`. The argument `weight_func = "rectangular"` gives every one of the K neighbors an equal vote (standard, unweighted KNN); the neighbors themselves are found with the Euclidean distance.
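As a quick check of what that distance is, here is the Euclidean distance between two hypothetical observations (made-up `Pr`/`Er` values), computed by hand and with base R's `dist()`; with the rectangular kernel, each of the K neighbors found this way counts equally in the vote.

```r
# Euclidean distance between two (hypothetical) observations,
# the metric KNN uses to rank neighbors.
a <- c(Pr = 10, Er = 5)
b <- c(Pr = 4, Er = 13)
d_manual  <- sqrt(sum((a - b)^2))            # by hand: sqrt(36 + 64)
d_builtin <- as.numeric(dist(rbind(a, b)))   # same value via dist()
d_manual  # 10
```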
Next we join the model and the recipe into a workflow, but instead of fitting it directly we create the `ParGrid` object, which contains a range of K values we will test to find the optimal one for classifying the dataset.
## Model
ModelDesignKNN = nearest_neighbor(neighbors = tune(),
                                  weight_func = "rectangular") |>
  set_engine("kknn") |>
  set_mode("classification")
## workflow
TuneWFModel = workflow() |>
  add_recipe(Recipe) |>
  add_model(ModelDesignKNN)
#print(TuneWFModel)
## hyperparameter grid
ParGrid = data.frame(neighbors = 1:10)
#print(ParGrid)
Now we create folds for cross-validation, reducing the risk of overfitting. With all of this in place we can tune the model and, with the `autoplot()` function, review the metrics for the different values of K.
## cross-validation
set.seed(123)
FoldsForTuning = vfold_cv(DataTrain,
                          v = 4,
                          strata = Meno)
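Conceptually, 4-fold cross-validation just partitions the training rows into four groups and rotates which one is held out. A base-R sketch of that index bookkeeping on 20 hypothetical rows (`vfold_cv()` additionally stratifies by `Meno`, which this toy skips):

```r
# Assign each of n rows to one of 4 folds, then build the
# analysis/assessment index pairs that cross-validation iterates over.
set.seed(123)
n <- 20
fold_id <- sample(rep(1:4, length.out = n))   # random fold membership per row
folds <- lapply(1:4, function(f) list(
  analysis   = which(fold_id != f),   # rows used to fit the model
  assessment = which(fold_id == f)))  # rows held out for evaluation
lengths(folds[[1]])  # analysis 15, assessment 5
```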
## tuning
TuneResults = tune_grid(
  TuneWFModel,
  resamples = FoldsForTuning,
  grid = ParGrid,
  metrics = metric_set(accuracy, sensitivity, specificity)) # set metrics
autoplot(TuneResults)

After interpreting the plot, we select the value of K with the best specificity; `select_best()` returns K = 7, which we will use to fit the final model.
## get the best
BestHyperPar = select_best(TuneResults, metric = "specificity")
print(BestHyperPar)
## # A tibble: 1 × 2
## neighbors .config
## <int> <chr>
## 1 7 Preprocessor1_Model07
## finalize the tuned model
WFModelBest = TuneWFModel |>
finalize_workflow(BestHyperPar) |>
fit(DataTrain)
Finally, with the fitted model we make predictions on the testing data and review the confusion matrix to check the predictive quality of our KNN, that is, the number of correctly classified observations and the number of false positives and false negatives.
## Prediction
DataTestWithPredBestModel = augment(WFModelBest, DataTest)
conf_mat(DataTestWithPredBestModel,
         truth = Meno,
         estimate = .pred_class)
## Truth
## Prediction pre post
## pre 221 106
## post 173 395
If you look at this matrix, 279 of the 895 test observations are misclassified (106 in one direction and 173 in the other), which is far too many; therefore, in this first case KNN fails to classify women as menopausal or not based on hormone levels alone.
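Those rates can be recomputed directly from the printed matrix in base R (the counts below are copied from the output above):

```r
# Confusion matrix from the output above (rows = Prediction, cols = Truth)
cm <- matrix(c(221, 173, 106, 395), nrow = 2,
             dimnames = list(Prediction = c("pre", "post"),
                             Truth = c("pre", "post")))
accuracy      <- sum(diag(cm)) / sum(cm)  # proportion correctly classified
misclassified <- sum(cm) - sum(diag(cm))  # off-diagonal counts
round(accuracy, 3)  # 0.688
misclassified       # 279
```

An accuracy of roughly 0.69 on a two-class problem confirms visually what the matrix already suggested.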
Classification of malignant/benign tumors
For this second case we take a dataset with multiple measurements of tumors, such as the perimeter, area, concavity, and symmetry, among others, used to classify whether a tumor is benign or malignant. Here KNN proves to be a robust tool when there are enough predictor variables that highlight the differences between categories, and it can in turn serve as diagnostic support.
First we will load the `cancer` dataset from the Kaggle platform; you can access the dataset through the following link: Kaggle
cancer <- read.csv("Data/Cancer_Data.csv")
cancer <- cancer |>
  clean_names("upper_camel") |>
  mutate(Diagnosis = as.factor(Diagnosis)) |>
  select(-Id, -X)
Now we follow the same steps as the previous exercise.
## split
set.seed(876)
Split7030 = initial_split(cancer,
                          prop = 0.7,
                          strata = Diagnosis)
DataTrain = training(Split7030)
DataTest = testing(Split7030)
#head(DataTrain)
## create recipe
Recipe = recipe(Diagnosis ~ ., data = DataTrain) |>
  step_naomit(everything()) |>
  step_normalize(all_predictors())
#print(Recipe)
## create model
ModelDesignKNN = nearest_neighbor(neighbors = tune(),
                                  weight_func = "rectangular") |>
  set_engine("kknn") |>
  set_mode("classification")
## workflow
TuneWFModel = workflow() |>
  add_recipe(Recipe) |>
  add_model(ModelDesignKNN)
#print(TuneWFModel)
## hyperparameter grid
ParGrid = data.frame(neighbors = 1:10)
print(ParGrid)
## neighbors
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
## 7 7
## 8 8
## 9 9
## 10 10
## cross-validation
set.seed(123)
FoldsForTuning = vfold_cv(DataTrain,
                          v = 4,
                          strata = Diagnosis)
## tuning
TuneResults = tune_grid(
  TuneWFModel,
  resamples = FoldsForTuning,
  grid = ParGrid,
  metrics = metric_set(accuracy, sensitivity, specificity)) # set metrics
autoplot(TuneResults)

## get the best
BestHyperPar = select_best(TuneResults, metric = "specificity")
print(BestHyperPar)
## # A tibble: 1 × 2
## neighbors .config
## <int> <chr>
## 1 3 Preprocessor1_Model03
## final model
WFModelBest = TuneWFModel |>
finalize_workflow(BestHyperPar) |>
fit(DataTrain)
Now we can classify the tumors stored in the testing dataset and review the confusion matrix. This time, unlike the previous exercise, our model is much more sensitive and classifies the tumors adequately, since the number of false positives and false negatives is low.
## Prediction
DataTestWithPredBestModel = augment(WFModelBest, DataTest)
conf_mat(DataTestWithPredBestModel,
         truth = Diagnosis,
         estimate = .pred_class)
## Truth
## Prediction B M
## B 107 2
## M 1 62
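To quantify that claim, sensitivity and specificity can be read straight off the matrix, treating B, the first factor level, as the event (yardstick's default):

```r
# Counts copied from the confusion matrix above (rows = Prediction, cols = Truth)
cm <- matrix(c(107, 1, 2, 62), nrow = 2,
             dimnames = list(Prediction = c("B", "M"),
                             Truth = c("B", "M")))
sensitivity <- cm["B", "B"] / sum(cm[, "B"])  # true B rate: 107/108
specificity <- cm["M", "M"] / sum(cm[, "M"])  # true M rate: 62/64
accuracy    <- sum(diag(cm)) / sum(cm)        # overall: 169/172
round(c(sensitivity, specificity, accuracy), 3)
```

All three figures sit well above 0.95, in sharp contrast with the first example.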
To better appreciate the predictive capacity of our KNN, we will visualize the data using a PCA.
library(ggfortify)
library(cluster)
cancer.pca <- cancer |>
  select(-Diagnosis) |>
  prcomp(scale. = TRUE, center = TRUE)
autoplot(cancer.pca, data = cancer, colour = "Diagnosis") +
  scale_color_manual(values = c("M" = "#f1150e", "B" = "#6dca11"))
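The 2-D plot only shows the first two principal components, so it is worth checking how much of the total variance those components capture; `prcomp()` exposes this through the standard deviations. A self-contained sketch with simulated stand-in data (the same computation applies to `cancer.pca`):

```r
# Proportion of variance explained per principal component.
set.seed(42)
toy <- matrix(rnorm(200), ncol = 4)                    # simulated stand-in data
toy_pca <- prcomp(toy, scale. = TRUE, center = TRUE)
var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)  # one share per PC
cumsum(var_explained)[2]  # variance captured by the first two PCs
```

If the first two components capture most of the variance, the separation (or overlap) seen in the plot is a faithful summary of the full predictor space.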
