Most of data science is oriented towards the Train, Test, Predict paradigm. Who doesn't want to guess the future! But there are some cases where we need other implementations like unsupervised classification or discovering patterns in existing data, I think this aspect is a little bit disregarded and in my personal quest, I had to search a while before to find the right way to do it. Hence the reason of this little contribution.
Here's the story: A client of us needed a way to find similar items (neighbors) of a given entity, according to a fixed number of parameters. Practically, the dataset is composed by votes from Human Resources Professionals who could attribute up to 5 skills to an arbitrary amount of World universities. It means the Edouard from HR could vote for MIT as a good institution for Digitalization, Oxford for Internationality and La Sorbonne for Soft Skills.
I prepared the data, output a Spiderweb chart where the client could choose any Institution and compare it with the others, here is an example for three random universities:
At that point, it seemed interesting to search and display Universities that would have been voted the same way, maybe to study their actions and compare what they were doing good and what wrong. The data came in a spss file, with one row by vote, the process had to be fast, as it was meant to be used from a Backend service, with real time results.
I thought that the best processing format for that would be a KD Tree for its multi-dimensional nature and its relatively easy and fast processing possibilities. I won't explain in detail what KD Trees are but you can refer to the wikipedia article
It is fully integrated into the sklearn module, and very easy to use as we´ll see below.
Let's do some processing!
Our dataset, as property of the client, has been anonymized. The names of the universities have been taken away, but the values are real.
We´ll start by importing the libraries:
import pandas as pd from sklearn.neighbors import KDTree
We already converted the spss file to csv file, so we just have to import it using pandas read_csv method, and examine it
df = pd.read_csv("https://bitbucket.org/zeerobug/material/raw/12301e73580ce6b623a7a6fd601c08549eed1c45/datasets/votes_skills_anon.csv", index_col=0) df.head()
Each row corresponds to a vote where:
So for example that means the user #538 voted "Internationality" as an important skill for University #5c9e7a6834cda4f3677e662b.
Our next step consists in grouping by institution and skill (response). We do it with the excellent groupby function that generates a SeriesGroupBy object that we can use to count the number of similar pairs of (univid, response) in the dataset.
By using reset_index, we generate a DataFrame back from the serie which is ouput by the count() function, and we create the "value" column that contains that count.
skills = df.groupby(["univid", "response"])["response"].count().reset_index(name='value') skills.head(6)
Even if a lot more readable, this format is useless for our goal. It is difficult to distinguish between institutions as the number of rows is arbitrary (some skills may have not been voted), and lots of tools work best with row values instead of columns values.
Luckily, Pandas offers a very powerful tool for that matter, pivot_table, which allows us to convert rows to columns. Pivot_table arguments are self explanatory so I won't enter into details.
univSkills = skills.pivot_table(values="value", columns="response", index=["univid"]) univSkills.head()
Et voilà! Our data is almost ready for processing, but we still have an issue: To be comparable, each row must be in the same range and if we calculate the sum of values in a row, the total is far from being similar from one row to another:
univid 5c9e345f34cda4f3677e1047 69.0 5c9e349434cda4f3677e109f 51.0 5c9e34a834cda4f3677e10bd 40.0 5c9e34ae34cda4f3677e10c7 66.0 5c9e34d534cda4f3677e1107 70.0 dtype: float64
This is because universities like Harvard have had a lot more votes that some remote unknown university. It could be interesting to use that parameter in some other calculation but for the present problem, we need to get rid of that difference. It can be done by using percentages instead of absolute values.
So we have to sum each line and divide each value by that sum. This is done in a one-liner, and then we get rid of some Nan values and display the result.
univSkills = univSkills.div(univSkills.sum(axis=1), axis=0) univSkills.fillna(0, inplace=True ) univSkills.head()
Our data is now clean, ready, the values are in the same range, we can start playing with some more interesting processing.
So our algorithm has to detect amongst all the universities in our dataset, which ones have the closest values to our 5 variables at the same time. We can immediately think of a brute force algorithm with intricated loops that would compare value by value until it finds the 5 closest values for each variable but that would be far from fast enough for our appplication!
The KD Tree algorithm is way more efective, it consists of a geometrical approach of the data, where by subsequent divisions of a n-dimensional space it generates a tree that re-organizes the data in a way that allows complex queries to run very fast. The KDTree function from sklearn does exactly that.
tree = KDTree(univSkills)
And now we can start searching our tree, we will give it and initial row to start our comparison, for example the row of index 9 (
univSkills[9:10]), we want a result set of the 5 closest universities (
k=5) and the "query" function applied to our tree will return a tuple of 2 numpy arrays (
dist, index), which will be respectively the "distance" and the index of the result, sorted from the closest to the furthest.
Let's see it in action.
dist, ind = tree.query(univSkills[9:10], k=5)
And then we can display the values of our neighbors:
We notice right away that the values are very close, but we can cofirm that with a new Radar chart, and the result is impressive!
Here is the chart for the values shown above:
You can try with different starting row, in most of the cases, the radar charts will remain very similar.
You can also play with the other KDTree method variables that you will find in the documentation. Let me know if your experiments lead you to better results.
I think there are a lot of other applications of this algorithm. It can be used in recommender systems, dating sites, and generally any processing that relies on proximity between multidimensional matrices
Relationship, for example: by determining a maximum distance and apply the query function to each row of the whole dataset, we could spot unseen relations and generate a GraphQL-like database.
Due to its speed, simplicity and efectivity, the KDTree can also be employed in some simple cases as a replacement for far more complicated libraries like TensorFlow. We'll look into that in my next article.
Et voilà, I hope that this article will be of use to someone. Don't hesitate to send me email, or messages for any inquiry.