Why Does the Testing Error Rate Increase at High Values of K in the KNN Algorithm?
Solution 1:
The parameter K in KNN controls the complexity of the model. You don't give details of your specific problem, but what you are likely seeing is the bias/variance trade-off. This post is a good read about it.
Usually you try different values of the model's hyperparameters (the value of K in KNN) on a validation set and keep the best one. Note that this validation set is not the same as the test set.
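A minimal sketch of that tuning loop, using scikit-learn on synthetic data (the dataset, split sizes, and range of K values are illustrative assumptions, not details from the question):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hold out a test set first, then carve a validation set out of the rest.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

best_k, best_acc = None, 0.0
for k in range(1, 51, 2):  # odd values of K to avoid voting ties
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val)
    if acc > best_acc:
        best_k, best_acc = k, acc

# Only now touch the test set, refitting with the chosen K.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_trainval, y_trainval)
print(f"best K = {best_k}, validation acc = {best_acc:.3f}, "
      f"test acc = {final_model.score(X_test, y_test):.3f}")
```

The point of the three-way split is that the test score stays an honest estimate: K is chosen on the validation set, so the test set is never used to pick a hyperparameter.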
Solution 2:
K in KNN stands for the number of closest neighbours that are taken into account. The more neighbours are considered, the more influence distant ones have on the final outcome. It also follows that with more neighbours taken, more elements of a different category are included in the vote. This can lead to misclassification, especially for elements on the boundaries of clusters.
Another example to consider is two imbalanced clusters: say one cluster has 5 elements and the second has 20. With K=10, a point from the first cluster can never win a majority vote, so all of its elements will be categorized as belonging to the second cluster. On the other hand, K=3 will yield better results if the clusters are nicely separated.
The exact reason for your results will depend on the number of clusters you have, and on their placement, density, and cardinality; the sketch below illustrates the imbalanced case.
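A hedged illustration of that imbalanced-cluster effect, with the 5-element and 20-element sizes taken from the example above (the coordinates and the probe point are invented for illustration; K=11 is used instead of K=10 so the 5-versus-5 vote does not end in a tie):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
small = rng.normal(loc=(0, 0), scale=0.3, size=(5, 2))    # class 0: 5 points
large = rng.normal(loc=(5, 5), scale=0.3, size=(20, 2))   # class 1: 20 points
X = np.vstack([small, large])
y = np.array([0] * 5 + [1] * 20)

probe = np.array([[0.0, 0.0]])  # a point in the middle of the small cluster
for k in (3, 11):
    pred = KNeighborsClassifier(n_neighbors=k).fit(X, y).predict(probe)
    print(f"K={k}: predicted class {pred[0]}")
# With K=3 all three nearest neighbours come from the small cluster, so the
# probe is classified as class 0. With K=11, at most 5 of the 11 neighbours
# can be class 0, so the large cluster always wins the vote.
```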
Solution 3:
What happens with a higher value of K is that the majority class in the dataset has a bigger say in the outcome, so the error rate increases.
Let's say there are 100 data points, of which 80 belong to class label "0" and 20 belong to class label "1".
Now, if I choose any value of k > 40, every data point will be assigned to the majority class: at most 20 of the k neighbours can be class "1", while the remaining k - 20 > 20 must be class "0", so class "0" always wins the vote.
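A quick numeric check of that claim on synthetic data (the two Gaussian blobs are an assumption made for illustration; only the 80/20 class counts come from the example):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X0 = rng.normal(loc=0.0, scale=1.0, size=(80, 2))  # class "0": 80 points
X1 = rng.normal(loc=3.0, scale=1.0, size=(20, 2))  # class "1": 20 points
X = np.vstack([X0, X1])
y = np.array([0] * 80 + [1] * 20)

for k in (5, 41, 61):
    preds = KNeighborsClassifier(n_neighbors=k).fit(X, y).predict(X)
    print(f"K={k}: classes predicted = {sorted(set(preds))}")
# For K > 40 every prediction is 0, so all 20 class-"1" points are
# misclassified; with K=5 both classes still appear in the predictions.
```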
Generally, a large value of K leads to underfitting, while a very small value of K (though this is problem-specific) leads to overfitting.
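A minimal sketch of that bias/variance picture: training error typically rises with K while test error is U-shaped, with overfitting at very small K and underfitting at very large K. The dataset, label-noise level, and chosen K values are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=5, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

for k in (1, 5, 25, 101, 299):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"K={k:3d}  train error={1 - model.score(X_train, y_train):.3f}  "
          f"test error={1 - model.score(X_test, y_test):.3f}")
# K=1 memorizes the training set (near-zero train error but a higher test
# error), while very large K averages over most of the data and both errors
# rise toward the majority-class error rate.
```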