Home / KNIME / Handling Missing Value in KNIME

Handling Missing Value in KNIME

Often datasets come with varying levels of missing values. Therefore, it becomes important to handle those missing values before getting into any kind of analysis. In this post, we shall cover three basic ways of handling missing value using KNIME.

Reading auto_mpg_missing.csv file

Step-1: Add the CSV Reader node from Node Repository: IO > Read > CSV Reader

Step-2: Right click on the node and select ‘Configure’

Step-3: In the Settings tab browse and choose the data from where it is located. Select the appropriate reader options as applicable.

Step-4: Click Apply and OK and then Execute the node

Now our dataset is loaded and we can view the table by right clicking on the node and selecting ‘File Table’. We can see that our table has 24 rows and 9 variables. We observe that four columns have missing values denoted by ‘?’

The output of the Statistics node also confirms this.

Handling Missing Values

In KNIME there are dedicated nodes like ‘Missing Value’ node and ‘Missing Value Column Filter’ node for handling missing values.

Default Missing Treatment

Out of the four columns with missing values car_name column is of string column type, acceleration is numeric (double) column type and displacement and horsepower are of numeric (integer) column type.

We will see how to use the ‘Missing Value’ node to assign default treatment for each of the missing value columns based on the column type they belong.

Step-1: Add the Missing Value node from Node Repository: Manipulation > Column > Transform > Missing Value

Step-2: Connect it to the CSV Reader node

Step-3: Right click on the node and select ‘Configure’

Step-4: In the Default tab select

  • String Type: In the dropdown against String select Fix Value and type ‘Unknown’
  • Number (double): In the dropdown against Number (double) select Median
  • Number (integer): In the dropdown against Number (integer) select Rounded Mean

 

Step-5: Click Apply and OK and then Execute the node

Step-6: Upon Execution, right click on the node and select Output table

We observe that the missing values in the four columns have now been updated with the default values that we had assigned for each of the column types.

Column Specific Missing Treatment

Assigning default missing treatment based on column type may not be most appropriate in some cases. In such cases, it may be more suitable to assign custom missing treatment that is specific to each of the columns.

We will see how to use the ‘Missing Value’ node to assign specific treatment for each of the missing value columns independently.

Step-1: Add the Missing Value node from Node Repository: Manipulation > Column > Transform > Missing Value

Step-2: Connect it to the CSV Reader node

Step-3: Right click on the node and select ‘Configure’

Step-4: In the Default tab select leave the dropdown to Do nothing for all the column types

Step-5: In the Column Settings tab double click on the columns (from the left corner) that are to be treated for missing value

  • car_name: In the dropdown select Fix Value and type ‘NotAvailable’
  • acceleration: In the dropdown select Mean
  • displacement: In the dropdown select Median
  • horsepower: In the dropdown select Rounded Mean

Step-6: Click Apply and OK and then Execute the node

Step-7: Upon Execution, right click on the node and select Output table

We observe that the missing values in the four columns have now been updated with the column specific values that we had assigned for each of the columns.

Removing Column using Missing Value Threshold

We might sometimes want to selectively filter out or drop columns which have more than certain percentage of missing values in them. We will use the Missing Value Column Filter to perform such filtering.

Step-1: Add the Missing Value Column Filter node from Node Repository: Manipulation > Column > Filter > Missing Value

Step-2: Connect it to the CSV Reader node

Step-3: Right click on the node and select ‘Configure’

Step-4: In the Configuration tab, select those columns that needs to be included for applying the threshold rule

Step-5: Assign Missing value threshold (in %) to 10.0

Step-6: Click Apply and OK and then Execute the node

Step-7: Upon Execution, right click on the node and select Filtered table

We observe that the filtered table now has only 8 columns. Because displacement column which had more than 10% of missing values has been dropped from the table. All other columns with missing values continue to remain in the filtered table as they are below the threshold percentage.

About V2K

Check Also

Creating Dummy Variables with KNIME

Dummy variables are an effective way of utilizing categorical variables in data mining methods like …

Leave a Reply

Your email address will not be published. Required fields are marked *