Analysing Numeric Bivariate Data in KNIME

In this post, we will see how to analyse bivariate data in KNIME. We will use the mammals.csv dataset, which contains the body and brain size for 62 species of land mammals.

Download the dataset from the Rdatasets index (look for the mammals entry from the MASS package): https://vincentarelbundock.github.io/Rdatasets/datasets.html

Reading mammals.csv file

Step-1: Add the CSV Reader node from Node Repository: IO > Read > CSV Reader

Step-2: Right click on the node and select ‘Configure’

Step-3: In the Settings tab, browse to the location of mammals.csv and select it. Adjust the reader options (for example, column delimiter and column headers) as needed.

Step-4: Click Apply and OK and then Execute the node

Now our mammals.csv dataset is loaded and we can view the table by right clicking on the node and selecting ‘File Table’. We can see that our table has 62 rows and 3 variables.
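For readers who want to cross-check what the CSV Reader node does outside KNIME, here is a minimal Python sketch using only the standard library. The rows and the header names (species, body, brain) are made-up stand-ins, not the real mammals.csv contents; check the header of the file you downloaded before adapting it.

```python
import csv
import io

# Hypothetical stand-in for mammals.csv: made-up rows, NOT the real data.
# The header names are an assumption; check the file you downloaded.
SAMPLE_CSV = """species,body,brain
species_a,0.1,1.0
species_b,1.0,10.0
species_c,10.0,50.0
species_d,1000.0,1000.0
"""


def read_table(text):
    """Parse CSV text into a list of dicts, converting numeric columns to float."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append({
            "species": record["species"],
            "body": float(record["body"]),
            "brain": float(record["brain"]),
        })
    return rows


table = read_table(SAMPLE_CSV)
print(len(table), "rows,", len(table[0]), "columns")
```

To read the real file, the same function could be fed the text of mammals.csv, with the column names adjusted to match its header.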

Summary Statistics and Visualization

We will now perform some basic summary statistics and visualization on the dataset.

Viewing Summary Statistics

Step-1: Add the Statistics node from Node Repository: Analytics > Statistics > Statistics

Step-2: Connect it to the CSV Reader node

Step-3: Right click on the node and select ‘Configure’

Step-4: In the Options tab include all the variables for which you want the summary statistics to be computed.

Step-5: Click Apply and OK and then Execute the node

This will provide us with the summary statistics for all the variables included.
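As a rough illustration of the kind of figures such a summary contains, the sketch below computes a few common measures with Python's statistics module. The values are made up, and the selection of measures is our own choice; it is not a list of everything the Statistics node reports.

```python
import statistics

# Made-up values standing in for the body column (NOT the real data).
body = [0.1, 1.0, 10.0, 1000.0]


def summarize(values):
    """Return a few basic summary statistics for a list of numbers."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std_dev": statistics.stdev(values),  # sample standard deviation
    }


for name, value in summarize(body).items():
    print(name, value)
```

Note how far the mean sits from the median in this toy example: a single very large value is enough to pull it up.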

Exploring and Visualizing Data

Let us now see how to generate box plots for the two numeric variables body and brain.

Step-1: Add the Box Plot node from Node Repository: Views > Box Plot

Step-2: Connect it to the CSV Reader node

Step-3: Upon Execution, go to Column Selection tab in the Box Plot output window and include ‘body’ and ‘brain’ variables

We can observe that the box plots of the two variables are not very informative. This is most likely because of the scale: a few very large values stretch the axis and squash the rest of the distribution. Therefore, we will perform a log transformation on these two variables.
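To see why scale can make a box plot hard to read, here is a small Python sketch that computes the quartiles a box plot is built from. The numbers are made up to mimic a variable with one very large value, and statistics.quantiles is used with its default method, which may differ slightly from the quartile rule KNIME applies.

```python
import statistics

# Made-up, strongly right-skewed values (NOT the real data).
values = [0.1, 0.5, 1.0, 5.0, 10.0, 1000.0]

# Quartiles: the box of a box plot spans q1..q3, with a line at the median q2.
q1, q2, q3 = statistics.quantiles(values, n=4)

lower_span = q3 - min(values)   # distance from the smallest value to the top of the box
upper_span = max(values) - q3   # distance from the top of the box to the largest value

print("quartiles:", q1, q2, q3)
print("lower span:", lower_span, "upper span:", upper_span)
```

When the upper span dwarfs everything else, the box is squeezed against the axis, which is the effect the log transformation is meant to remove.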

Step-1: Add the Math Formula node from Node Repository: Manipulation > Column > Convert & Replace > Math Formula

Step-2: Connect it to the CSV Reader node

Step-3: Right click on the node and select ‘Configure’

Step-4: In the Math Expression tab, enter the expression log($body$), select Append Column and name the new column log_body

Step-5: Click Apply and OK and then Execute the node

Our mammals table now has an additional column, log_body, which is the log transformation of the original body column. We repeat the same transformation for the brain variable to create the log_brain column, using a second Math Formula node connected to the first so that both new columns end up in one table. We then generate the box plot again, this time on the log-transformed columns. The box plots of the log-scaled variables are much more informative.
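The transformation itself is simple enough to sketch in a few lines of Python. This example assumes a base-10 logarithm and uses made-up rows; if your tool uses the natural logarithm instead, the values differ only by a constant factor, so the shape of the box plots and the correlation are unaffected.

```python
import math

# Made-up rows standing in for the mammals table (NOT the real data).
table = [
    {"body": 0.1, "brain": 1.0},
    {"body": 1.0, "brain": 10.0},
    {"body": 10.0, "brain": 50.0},
    {"body": 1000.0, "brain": 1000.0},
]


def append_log_column(rows, source, target):
    """Append a base-10 log of `source` as a new column `target` to each row.

    math.log10 raises ValueError for zero or negative inputs, so the
    source column must be strictly positive.
    """
    for row in rows:
        row[target] = math.log10(row[source])
    return rows


append_log_column(table, "body", "log_body")
append_log_column(table, "brain", "log_brain")

for row in table:
    print(row)
```

On the log scale, values that span several orders of magnitude are spread out evenly instead of being dominated by the largest one.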

While the box plots show the distribution of each variable individually, they give no insight into a potential relationship between the variables. We will therefore generate a scatter plot matrix.

Step-1: Add the Scatter Matrix node from Node Repository: Views > Scatter Matrix

Step-2: Connect it to the Math Formula node

Step-3: Upon Execution, go to Column Selection tab in the Scatter Matrix output window and include all the required variables

We can observe a linear trend; log_brain and log_body are positively correlated. So, let us analyse further to understand this linear correlation.

Generating Correlation Matrix

Step-1: Add the Linear Correlation node from Node Repository: Analytics > Statistics > Linear Correlation

Step-2: Connect it to the Math Formula node

Step-3: Right click on the node and select ‘Configure’

Step-4: In the Options tab include log_body and log_brain columns

Step-5: Click Apply and OK and then Execute the node

Step-6: Upon execution right click on the node and select ‘Correlation Matrix’ and ‘Correlation Measure’ to view the results

The correlation coefficient of 0.96 shows that log_brain and log_body are strongly positively correlated: on the log scale, brain size rises almost linearly with body size.
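For reference, a correlation coefficient of this kind can be computed by hand. The sketch below implements the Pearson correlation in plain Python, assuming that is the measure reported for numeric columns; the input values are made up and will not reproduce the 0.96 obtained on the real data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations, and sums of squared deviations.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up log-scale values (NOT the real data).
log_body = [-1.0, 0.0, 1.0, 3.0]
log_brain = [0.0, 1.0, 1.7, 3.0]

r = pearson(log_body, log_brain)
print("correlation:", r)
```

A value close to 1 indicates a strong positive linear relationship, a value close to -1 a strong negative one, and a value near 0 no linear relationship.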

Fitted Linear Regression Line

Let us now use the Linear Regression Learner node to construct a fitted line for the scatter plot between these two variables.

Step-1: Add the Linear Regression Learner node from Node Repository: Analytics > Statistics > Regression > Linear Regression Learner

Step-2: Connect it to the Math Formula node

Step-3: Right click on the node and select ‘Configure’

Step-4: In the Settings tab select log_brain as target column and include log_body column

Step-5: Click Apply and OK and then Execute the node

Step-6: Upon execution right click on the node and select ‘Linear Regression Scatter Plot View’ to view the fitted linear regression line
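The fitted line for a single predictor comes from ordinary least squares, which can be sketched in a few lines of Python. This is a minimal illustration of the technique on made-up values, not a description of how the Linear Regression Learner node is implemented; the function name fit_line is our own.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up log-scale values (NOT the real data).
log_body = [-1.0, 0.0, 1.0, 3.0]
log_brain = [0.0, 1.0, 1.7, 3.0]

slope, intercept = fit_line(log_body, log_brain)
print("log_brain =", intercept, "+", slope, "* log_body")
```

Because both variables are log-transformed, the slope describes how log_brain changes per unit change in log_body, i.e. a multiplicative relationship on the original scale.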

 
