
Detecting similarities in a multidimensional dataset

Real life example #1

Level: Beginner

Most of data science is oriented towards the Train, Test, Predict paradigm. Who doesn't want to guess the future? But some cases call for other approaches, such as unsupervised classification or discovering patterns in existing data. I think this aspect is somewhat neglected, and in my own quest I had to search for a while before finding the right way to do it. Hence this little contribution.

Here's the story:
One of our clients needed a way to find items similar to a given entity (its neighbors), according to a fixed number of parameters.
In practice, the dataset is composed of votes from Human Resources professionals, who could attribute up to 5 skills to an arbitrary number of universities worldwide. This means that Edouard from HR could vote for MIT as a good institution for Digitalization, Oxford for Internationality and La Sorbonne for Soft Skills.

I prepared the data and output a spiderweb chart where the client could choose any institution and compare it with the others. Here is an example for three random universities:

[Figure: spiderweb chart comparing three random universities]

At that point, it seemed interesting to search for and display universities that had been voted for in the same way, perhaps to study their actions and compare what they were doing well and what they were doing wrong.
The data came in an SPSS file, with one row per vote. The process had to be fast, as it was meant to be used from a backend service with real-time results.

I thought that the best structure for this would be a KD Tree, given its multi-dimensional nature and how relatively easy and fast it is to build and query.
I won't explain in detail what KD Trees are, but you can refer to the Wikipedia article.

It is available directly in the sklearn module, and very easy to use, as we'll see below.
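To give a first feel for the API before touching the real data, here is a minimal sketch of a KDTree query. The array below is made up for illustration (6 hypothetical institutions scored on 5 skills); it is not the client's dataset, and the `leaf_size` value is an arbitrary choice.

```python
import numpy as np
from sklearn.neighbors import KDTree

# Hypothetical data: 6 institutions (rows) scored on 5 skills (columns).
# These numbers are invented for the example.
points = np.array([
    [10, 2, 3, 0, 1],
    [9, 3, 3, 1, 1],
    [0, 8, 1, 7, 2],
    [1, 7, 2, 8, 2],
    [5, 5, 5, 5, 5],
    [10, 2, 4, 0, 0],
], dtype=float)

# Build the tree once; queries afterwards are fast.
tree = KDTree(points, leaf_size=2)

# Ask for the 3 nearest rows to row 0 (Euclidean distance by default).
# The query point is part of the tree, so the first hit is row 0 itself.
dist, ind = tree.query(points[:1], k=3)

print(ind[0])   # indices of the nearest rows, closest first
print(dist[0])  # matching distances
```

Because the query point belongs to the tree, it comes back as its own nearest neighbor with distance 0, so in practice one would ask for `k + 1` results and drop the first.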

Let's do some processing!

Data Preparation

Our dataset, being the property of the client, has been anonymized. The names of the universities have been removed, but the values are real.

We'll start by importing the libraries:

import pandas as pd
from sklearn.neighbors import KDTree
  • The pandas DataFrame will be our main data type; it is very fast and useful for all types of database-like operations
  • sklearn stands for scikit-learn, one of the most famous libraries for data analysis. It is used for classification, clustering, regression and more. We'll just import KDTree from its Nearest Neighbors module (sklearn.neighbors)

We already converted the SPSS file to a CSV file, so we just have to import it using the pandas read_csv function and examine it.
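As a rough illustration of that loading and inspection step, the sketch below reads a tiny in-memory CSV. The column names (`voter_id`, `university`, `skill`), the values and the file name `votes.csv` are all assumptions made for the example; the real file's layout may differ.

```python
import io
import pandas as pd

# Stand-in for the real file: hypothetical columns, one row per vote.
csv_text = """voter_id,university,skill
1,univ_a,Digitalization
1,univ_b,Internationality
2,univ_a,Soft Skills
"""

# With a file on disk this would be: pd.read_csv("votes.csv")
votes = pd.read_csv(io.StringIO(csv_text))

# Quick checks before any processing.
print(votes.shape)   # (number of rows, number of columns)
print(votes.head())  # first rows
print(votes.dtypes)  # type inferred for each column
```

Checking `shape`, `head()` and `dtypes` right after loading is a cheap way to catch a wrong separator or a column that was read as text instead of numbers.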