k-Nearest Neighbor Classification in KNIME

In this post, we shall see how to solve a classification problem using the k-Nearest Neighbor (kNN) algorithm in KNIME. We shall use the Teaching Assistant Evaluation dataset from the UCI repository.

http://archive.ics.uci.edu/ml/datasets/Teaching+Assistant+Evaluation

Data Set Information

The data consist of evaluations of teaching performance over three regular semesters and two summer semesters of 151 teaching assistant (TA) assignments at the Statistics Department of the University of Wisconsin-Madison. The scores were divided into 3 roughly equal-sized categories (“low”, “medium”, and “high”) to form the class variable.

Attribute Information

1. Whether or not the TA is a native English speaker (binary); 1=English speaker, 2=non-English speaker
2. Course instructor (categorical, 25 categories)
3. Course (categorical, 26 categories)
4. Summer or regular semester (binary); 1=Summer, 2=Regular
5. Class size (numerical)
6. Class attribute (categorical); 1=Low, 2=Medium, 3=High

Objective

The Class attribute, which contains the three score categories, is our target column. Our aim is to build a kNN model that will try to predict the correct score category based on the other explanatory attributes.

Reading tae.data from the UCI repository

Step-1: Add the File Reader node from Node Repository: IO > Read > File Reader

Step-2: Right click on the node and select ‘Configure’

Step-3: In the Settings tab, enter the URL of the tae.data file on the UCI archive

Step-4: In the Basic Settings section, choose “,” as the column delimiter and select the ignore spaces and tabs option. A preview of the data will be displayed.

Step-5: Click Apply and OK and then Execute the node

Now our dataset is loaded and we can view the table by right clicking on the node and selecting ‘File Table’. We can see that our table has 151 rows and 6 columns.
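For readers who want to cross-check this step outside KNIME, here is a minimal Python sketch that reads the same file with pandas. The direct file URL and the pandas/scikit-learn stack used in this and the later sketches are assumptions for illustration and are not part of the KNIME workflow.

import pandas as pd

# Assumed direct link to the raw file, following the usual UCI archive layout;
# adjust the path if the archive is organized differently.
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/tae/tae.data"

# tae.data is comma-delimited and has no header row.
df = pd.read_csv(url, header=None)
print(df.shape)  # should report 151 rows and 6 columns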

Assigning appropriate column name and type

We will now use the Column Rename node in KNIME to assign appropriate column names and column types to our dataset.

Step-1: Add the Column Rename node from Node Repository: Manipulation > Column > Convert & Replace > Column Rename

Step-2: Connect it to the File Reader node

Step-3: Right click on the node and select Configure

Step-4: In the Change columns tab, add all the columns from Col0 to Col5 and enter an appropriate column name and column type for each of them (IntValue for Class_Size and StringValue for the rest of the columns)

Step-5: Click Apply and OK and then Execute the node
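The equivalent step in the Python sketch is to assign names and types directly on the data frame. Only Score_Category is a name taken from the workflow above; the remaining column names are hypothetical.

# Hypothetical names for Col0-Col5; Score_Category matches the workflow above.
df.columns = ["Native_English_Speaker", "Course_Instructor", "Course",
              "Semester", "Class_Size", "Score_Category"]

# Mirror the Column Rename types: IntValue for Class_Size, StringValue for the rest.
for col in df.columns:
    if col != "Class_Size":
        df[col] = df[col].astype(str)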

Partitioning data using stratified sampling

We will use the Partitioning node with stratified sampling to split the data into two parts, the first containing 70% of the rows and the second containing the remaining 30%.

Step-1: Add the Partitioning node from Node Repository: Manipulation > Row > Transform > Partitioning

Step-2: Connect it to the Column Rename node

Step-3: Right click on the node and select Configure

Step-4: In the First partition tab select Relative[%] and enter 70 as the percentage

Step-5: Select Stratified sampling, choose the Score_Category column from the dropdown, and select the Use random seed option

Step-6: Click Apply and OK and then Execute the node

Step-7: After execution, right click on the node and select First partition or Second partition to view the two partitions of the data
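In the Python sketch, a stratified 70/30 split can be done with scikit-learn's train_test_split; the seed value below is an arbitrary stand-in for KNIME's Use random seed option.

from sklearn.model_selection import train_test_split

# Stratify on the class column so both partitions keep the class proportions.
train_df, test_df = train_test_split(
    df, train_size=0.70, stratify=df["Score_Category"], random_state=42)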

K-Nearest Neighbor Classification

We shall use the K Nearest Neighbor node in KNIME to perform the classification and prediction.

Step-1: Add the K Nearest Neighbor node from Node Repository: Analytics > Mining > Misc Classifiers > K Nearest Neighbor

Step-2: Connect it to the Partitioning node

Step-3: Right click on the node and select Configure

Step-4: In the Standard settings tab select Score_Category as the column with class labels. Enter 1 as the k value and select Weight neighbors by distance

Step-5: Click Apply and OK and then Execute the node

Note: The K Nearest Neighbor node considers only numeric columns; all other columns are ignored. In this dataset, Class_Size is therefore the only predictor actually used.
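A rough scikit-learn equivalent of this node is sketched below. Since only numeric columns are used, the single predictor is Class_Size, and KNeighborsClassifier with n_neighbors=1 and weights="distance" stands in for the node's settings; this is an illustrative approximation, not the node's exact implementation.

from sklearn.neighbors import KNeighborsClassifier

# Only numeric columns are used as predictors; here that is just Class_Size.
X_train = train_df[["Class_Size"]]
X_test = test_df[["Class_Size"]]

knn = KNeighborsClassifier(n_neighbors=1, weights="distance")
knn.fit(X_train, train_df["Score_Category"])
pred = knn.predict(X_test)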

Measuring Performance and Accuracy

Now that we have built the classifier, we shall proceed to measure its performance in terms of classification accuracy. We shall use the Scorer node in KNIME to view the confusion matrix and accuracy statistics.

Step-1: Add the Scorer node from Node Repository: Analytics > Mining > Scoring > Scorer

Step-2: Connect it to the K Nearest Neighbor node

Step-3: Right click on the node and select Configure

Step-4: In the Scorer tab select Score_Category as the First Column and Class [kNN] (which is the model output) as the Second Column

Step-5: Click Apply and OK and then Execute the node

Step-6: After execution, right click on the node and select View: Confusion Matrix to view the confusion matrix and accuracy statistics
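In the Python sketch, the confusion matrix and accuracy can be obtained with scikit-learn's metrics; the exact numbers depend on the random split and need not match the KNIME run.

from sklearn.metrics import accuracy_score, confusion_matrix

print(confusion_matrix(test_df["Score_Category"], pred))
print("accuracy:", accuracy_score(test_df["Score_Category"], pred))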

With an accuracy of 45.652% and an error rate of 54.348%, the classifier we built is not performing well. However, please bear in mind that this exercise was only meant to show how to perform kNN classification in KNIME and was not particularly focused on classification accuracy.
