Analysing Numeric Bivariate Data by Group in KNIME

In this post, we will see how to analyse bivariate data by group in KNIME. We will use the TwinIQ.csv dataset, which contains IQ data on identical twins that were separated after birth.

Reading the TwinIQ.csv file

Step-1: Add the CSV Reader node from Node Repository: IO > Read > CSV Reader

Step-2: Right click on the node and select ‘Configure’

Step-3: In the Settings tab, browse to the location of TwinIQ.csv and select it. Adjust the reader options (for example, the column delimiter and whether the file has a column header) as applicable.

Step-4: Click Apply and OK and then Execute the node

Now our TwinIQ.csv dataset is loaded and we can view the table by right clicking on the node and selecting ‘File Table’. We can see that our table has 27 rows and 3 variables.
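For readers who want to cross-check the row and column counts outside KNIME, the same read step can be sketched in Python using only the standard library. This is an illustration, not part of the KNIME workflow: the handful of rows below are made up to stand in for the real file, and the column order in the header is an assumption (only the names IQbio, IQfoster and class come from this post).

```python
import csv
import io

# Made-up rows standing in for TwinIQ.csv (the real file has 27 rows).
# The column order is assumed; only the column names come from the post.
SAMPLE_CSV = """IQbio,IQfoster,class
82,82,high
90,80,high
94,95,medium
105,106,medium
88,86,low
100,104,low
"""


def read_twin_iq(source):
    """Read rows from a file-like object, converting the IQ columns to int."""
    rows = []
    for record in csv.DictReader(source):
        rows.append({
            "IQbio": int(record["IQbio"]),
            "IQfoster": int(record["IQfoster"]),
            "class": record["class"],
        })
    return rows


rows = read_twin_iq(io.StringIO(SAMPLE_CSV))
print(len(rows), "rows,", len(rows[0]), "variables")
```

To run it against the real data, pass an opened file instead of the in-memory sample, for example `read_twin_iq(open("TwinIQ.csv", newline=""))`.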

Summary Statistics and Visualization

We will now perform some basic summary statistics and visualization on the dataset.

Viewing Summary Statistics

Step-1: Add the Statistics node from Node Repository: Analytics > Statistics > Statistics

Step-2: Connect it to the CSV Reader node

Step-3: Right click on the node and select ‘Configure’

Step-4: In the Options tab include all the variables for which you want the summary statistics to be computed.

Step-5: Click Apply and OK and then Execute the node

This will provide us with the summary statistics for all the variables included.

Step-6: Upon execution right click on the node and select ‘Occurrences Table’ to view the results

The number of occurrences for each of the three different classes (low, medium and high) is displayed.
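As a rough equivalent of what the Statistics node reports, the sketch below computes a few summary measures for the two numeric columns and counts the occurrences of each class. It uses made-up sample rows rather than the real dataset, and it covers only a subset of the measures the node provides.

```python
import csv
import io
import statistics
from collections import Counter

# Made-up rows standing in for TwinIQ.csv; not the real data.
SAMPLE_CSV = """IQbio,IQfoster,class
82,82,high
90,80,high
94,95,medium
105,106,medium
88,86,low
100,104,low
"""


def summarize(values):
    """Return a small set of summary statistics for a list of numbers."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std_dev": statistics.stdev(values),  # sample standard deviation
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Summary statistics for each numeric column.
stats = {
    column: summarize([int(row[column]) for row in rows])
    for column in ("IQbio", "IQfoster")
}

# Occurrence counts for the nominal column.
class_counts = Counter(row["class"] for row in rows)

for column, summary in stats.items():
    print(column, summary)
print(dict(class_counts))
```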

Generating Conditional Box Plot

We shall now generate a conditional box plot to compare the distribution of IQ across the different classes, first for the twins raised by biological parents and then for the twins raised by foster parents.

Step-1: Add the Conditional Box Plot node from Node Repository: Views > Conditional Box Plot

Step-2: Connect it to the CSV Reader node

Step-3: Right click on the node and select ‘Configure’

Step-4: In the Standard Settings tab select class as the Nominal column and IQbio as the Numeric column

Step-5: Click Apply and OK and then Execute the node

Now we repeat the same steps to generate a conditional box plot for the IQfoster variable.

The conditional box plots suggest there could be differences in IQ between twins raised by biological parents and twins raised by foster parents. Therefore, let us look more closely using a scatter plot. To help us spot the differences, we will use color and shape to distinguish the categories of the class variable.
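The numbers behind a conditional box plot are a five-number summary (minimum, lower quartile, median, upper quartile, maximum) computed separately for each class. The sketch below shows one way to compute them in Python on made-up sample rows. The quartile method used here (`method="inclusive"`) is an assumption; KNIME may compute quartiles differently, so values on real data could differ slightly from what the node draws.

```python
import csv
import io
import statistics
from collections import defaultdict

# Made-up rows standing in for TwinIQ.csv; not the real data.
SAMPLE_CSV = """IQbio,IQfoster,class
82,82,high
90,80,high
94,95,medium
105,106,medium
88,86,low
100,104,low
"""


def five_number_summary(values):
    """Return min, quartiles and max; the quartile method is an assumption."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}


def summary_by_class(rows, numeric_column):
    """Group one numeric column by class and summarize each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["class"]].append(float(row[numeric_column]))
    return {label: five_number_summary(vals) for label, vals in groups.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
bio_summary = summary_by_class(rows, "IQbio")
foster_summary = summary_by_class(rows, "IQfoster")

for label in bio_summary:
    print(label, "IQbio:", bio_summary[label])
    print(label, "IQfoster:", foster_summary[label])
```

Computing both summaries with the same function mirrors repeating the box plot for each numeric column while keeping class as the nominal column.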

Assigning shape and color to indicate different categories of class variable

Step-1: Add the Shape Manager node from Node Repository: Views > Property > Shape Manager

Step-2: Connect it to the CSV Reader node

Step-3: Right click on the node and select ‘Configure’

Step-4: In the Shape Settings tab select Class as the nominal column and assign shape mapping for each of the categories of the class column

Step-5: Click Apply and OK and then Execute the node

Step-1: Add the Color Manager node from Node Repository: Views > Property > Color Manager

Step-2: Connect it to the Shape Manager node

Step-3: Right click on the node and select ‘Configure’

Step-4: In the Color Settings tab select Class as the nominal column and assign color mapping for each of the categories of the class column

Step-5: Click Apply and OK and then Execute the node

Generating Scatter Plot

Step-1: Add the Scatter Plot node from Node Repository: Views > Scatter Plot

Step-2: Connect it to the Color Manager node

Step-3: Upon Execution, go to Column Selection tab in the Scatter Plot output window and select IQbio as X-column and IQfoster as Y-column
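Conceptually, the Shape Manager and Color Manager attach a display style to each row based on its class, and the scatter plot then draws IQbio against IQfoster using that style. A minimal sketch of that idea in Python follows. The rows are made up, and the particular color and shape pairs in `CLASS_STYLE` are just one possible assignment chosen for illustration.

```python
import csv
import io

# Made-up rows standing in for TwinIQ.csv; not the real data.
SAMPLE_CSV = """IQbio,IQfoster,class
82,82,high
90,80,high
94,95,medium
105,106,medium
88,86,low
100,104,low
"""

# One possible class-to-style assignment (illustrative choice).
CLASS_STYLE = {
    "high": ("red", "square"),
    "medium": ("green", "circle"),
    "low": ("blue", "triangle"),
}


def styled_points(rows):
    """Turn each row into a plot-ready point: x, y, color and shape."""
    points = []
    for row in rows:
        color, shape = CLASS_STYLE[row["class"]]
        points.append({
            "x": int(row["IQbio"]),      # X-column
            "y": int(row["IQfoster"]),   # Y-column
            "color": color,
            "shape": shape,
        })
    return points


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
points = styled_points(rows)
for point in points:
    print(point)
```

The resulting list can be handed to any plotting library; keeping the style lookup in one dictionary means the color and shape for a class always change together.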

In the resulting scatter plot we observe that red squares represent high, green circles represent medium and blue triangles represent low. We can even go ahead and use the Linear Regression Learner node to construct a fitted line for the scatter plot between these two variables.

Step-1: Add the Linear Regression Learner node from Node Repository: Analytics > Statistics > Regression > Linear Regression Learner

Step-2: Connect it to the Color Manager node

Step-3: Right click on the node and select ‘Configure’

Step-4: In the Settings tab select IQfoster as the target column and include the IQbio column

Step-5: Click Apply and OK and then Execute the node

Step-6: Upon execution right click on the node and select ‘Linear Regression Scatter Plot View’ to view the fitted linear regression line
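The fitted line in that view comes from a simple linear regression of IQfoster on IQbio. As a sketch of the underlying calculation, the code below fits a line by ordinary least squares using the textbook formulas. It runs on made-up sample rows, so the printed coefficients say nothing about the real dataset, and the function name `fit_line` is our own.

```python
import csv
import io

# Made-up rows standing in for TwinIQ.csv; not the real data.
SAMPLE_CSV = """IQbio,IQfoster,class
82,82,high
90,80,high
94,95,medium
105,106,medium
88,86,low
100,104,low
"""


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations in x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
iq_bio = [float(row["IQbio"]) for row in rows]        # predictor
iq_foster = [float(row["IQfoster"]) for row in rows]  # target

slope, intercept = fit_line(iq_bio, iq_foster)
print(f"IQfoster = {intercept:.2f} + {slope:.2f} * IQbio")
```

The choice of IQfoster as the target and IQbio as the predictor matches the configuration in Step-4.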
