Partitioning Data in KNIME

In a typical data mining project, it is good practice to evaluate a model's performance by applying it to a hold-out sample, so the available dataset needs to be partitioned. In this post, we look at several ways of partitioning data in KNIME using different sampling techniques.

Reading the auto_mpg.csv file

Step-1: Add the CSV Reader node from Node Repository: IO > Read > CSV Reader

Step-2: Right click on the node and select ‘Configure’

Step-3: In the Settings tab, browse to the location of the data file and select it. Adjust the reader options as applicable.

Step-4: Click Apply and OK and then Execute the node

Now our dataset is loaded, and we can view the table by right-clicking the node and selecting ‘File Table’. The table has 398 rows and 9 columns.
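For readers who want to see the same loading step outside the KNIME interface, here is a minimal Python sketch using only the standard library. It is not what the CSV Reader node runs internally. The three inline rows and the reduced column set are made-up stand-ins for auto_mpg.csv, and `read_table` is a hypothetical helper name.

```python
import csv
import io

# Hypothetical stand-in for auto_mpg.csv: three made-up rows, three columns.
SAMPLE_CSV = """mpg,cylinders,origin
18.0,8,1
24.0,4,3
26.0,4,2
"""


def read_table(handle):
    """Return the rows of a CSV file as a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


rows = read_table(io.StringIO(SAMPLE_CSV))
n_rows = len(rows)     # number of data rows (header excluded)
n_cols = len(rows[0])  # number of columns
print(n_rows, n_cols)
```

To read the real file instead, the same helper could be called as `read_table(open("auto_mpg.csv", newline=""))`, assuming the file sits in the working directory. Note that `csv.DictReader` returns every value as a string, whereas KNIME infers column types.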

Number to String

We will first use the Number To String node in KNIME to convert the origin column to the string type. This column is used later as the stratification column.

Step-1: Add the Number To String node from Node Repository: Manipulation > Column > Convert & Replace > Number To String

Step-2: Connect it to the CSV Reader node

Step-3: Right click on the node and select Configure

Step-4: In the Options tab, include the origin column

Step-5: Click Apply and OK and then Execute the node
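The effect of this conversion can be sketched in a few lines of Python. This is only an illustration of the idea, not KNIME's implementation. The rows are made-up examples and `number_to_string` is a hypothetical helper name.

```python
def number_to_string(rows, column):
    """Return copies of the rows with the given column's values converted to str."""
    return [{**row, column: str(row[column])} for row in rows]


# Hypothetical rows in which origin is stored as a number.
rows = [
    {"mpg": 18.0, "origin": 1},
    {"mpg": 24.0, "origin": 3},
]

converted = number_to_string(rows, "origin")
print(converted)
```

Only the selected column changes type. The other columns and the original rows are left untouched.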

Partitioning Data based on different sampling techniques

We will use the Partitioning node with different sampling techniques to split the data into two parts.

Technique-1: Absolute Sampling

We will use absolute sampling to partition the data into two parts, with the first part containing the first 300 rows of the table and the second part containing the remaining rows.

Step-1: Add the Partitioning node from Node Repository: Manipulation > Row > Transform > Partitioning

Step-2: Connect it to the Number to String node

Step-3: Right click on the node and select Configure

Step-4: In the First partition tab select Absolute and enter 300 as the size

Step-5: Select Take from top

Step-6: Click Apply and OK and then Execute the node

Step-7: Upon execution right click on the node and select First partition or Second partition to view the two parts/samples of the data
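The logic of an absolute split taken from the top can be sketched in Python as follows. The list of integers is a stand-in for the 398 table rows, and `partition_from_top` is a hypothetical helper name, not a KNIME function.

```python
def partition_from_top(rows, size):
    """Split rows into the first `size` rows and all remaining rows."""
    return rows[:size], rows[size:]


rows = list(range(398))  # stand-in for the 398 rows of the table
first, second = partition_from_top(rows, 300)
print(len(first), len(second))  # 300 98
```

Because the rows are taken in table order, this split is deterministic. It is also sensitive to any ordering already present in the data.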

Technique-2: Stratified Sampling

We will use stratified sampling to partition the data into two parts, with the first part containing 60% of the rows and the second part containing the remaining 40%, while keeping the distribution of origin values approximately the same in both parts.

Step-1: Add the Partitioning node from Node Repository: Manipulation > Row > Transform > Partitioning

Step-2: Connect it to the Number to String node

Step-3: Right click on the node and select Configure

Step-4: In the First partition tab select Relative[%] and enter 60 as the percentage

Step-5: Select Stratified sampling and choose origin column from the dropdown

Step-6: Click Apply and OK and then Execute the node

Step-7: Upon execution right click on the node and select First partition or Second partition to view the two parts/samples of the data
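To make the idea of stratification concrete, here is a minimal Python sketch that draws 60% of the rows separately within each origin group. It is one simple way to stratify and is not claimed to match KNIME's internal algorithm. The data, the rounding rule, the seed, and the name `stratified_partition` are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_partition(rows, column, fraction, seed=None):
    """Split rows so that each value of `column` contributes about `fraction` of its rows."""
    rng = random.Random(seed)

    # Collect the row positions belonging to each distinct value of the column.
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        groups[row[column]].append(index)

    # Draw the requested share from every group independently.
    chosen = set()
    for indices in groups.values():
        k = round(len(indices) * fraction)
        chosen.update(rng.sample(indices, k))

    first = [row for i, row in enumerate(rows) if i in chosen]
    second = [row for i, row in enumerate(rows) if i not in chosen]
    return first, second


# Hypothetical data: 50 rows with origin "1", 30 with "2", 20 with "3".
rows = [{"origin": o} for o, n in (("1", 50), ("2", 30), ("3", 20)) for _ in range(n)]
first, second = stratified_partition(rows, "origin", 0.6, seed=0)
counts = {o: sum(r["origin"] == o for r in first) for o in ("1", "2", "3")}
print(len(first), len(second), counts)
```

In this made-up example, each origin value contributes exactly 60% of its rows to the first partition, so both partitions keep the 50/30/20 proportions.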

Technique-3: Random Sampling

We will use random sampling to partition the data into two parts, with the first part containing 60% of the rows and the second part containing the remaining 40%.

Step-1: Add the Partitioning node from Node Repository: Manipulation > Row > Transform > Partitioning

Step-2: Connect it to the Number to String node

Step-3: Right click on the node and select Configure

Step-4: In the First partition tab select Relative[%] and enter 60 as the percentage

Step-5: Select Draw randomly, check Use random seed, and enter 400 as the seed value so that the split is reproducible

Step-6: Click Apply and OK and then Execute the node

Step-7: Upon execution right click on the node and select First partition or Second partition to view the two parts/samples of the data
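A seeded random split can be sketched in Python as shown below. The seed value 400 is taken from the step above, but Python's random generator will not reproduce the rows KNIME draws with the same seed. The rounding rule and the name `random_partition` are assumptions made for this sketch.

```python
import random


def random_partition(rows, fraction, seed):
    """Randomly assign about `fraction` of the rows to the first partition."""
    rng = random.Random(seed)  # fixed seed makes the draw repeatable
    k = round(len(rows) * fraction)
    chosen = set(rng.sample(range(len(rows)), k))
    first = [row for i, row in enumerate(rows) if i in chosen]
    second = [row for i, row in enumerate(rows) if i not in chosen]
    return first, second


rows = list(range(398))  # stand-in for the 398 rows of the table
first, second = random_partition(rows, 0.6, seed=400)
print(len(first), len(second))  # 239 159
```

Calling the function again with the same seed returns the same two partitions, which is the purpose of fixing the seed. Changing the seed gives a different but equally sized split.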
