
Analysify

Have you ever analysed your Spotify Playlist? No? Then try it out!

But first you will need to create credentials on the Spotify Developers Website.

Go to the Spotify Developers Dashboard, log in with your own Spotify Account and create a new "App".

Then go to the top right corner, click on "Edit Settings" and add http://localhost under "Redirect URIs".

On the left side you will find the "Client ID" and the "Client Secret", which you will need for the code shown below.
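By the way, you don't have to paste the credentials directly into the notebook. A minimal sketch, assuming you exported them as environment variables before starting Jupyter (SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET are also the variable names spotipy itself looks for):

import os

#Assumes you ran e.g. 'export SPOTIPY_CLIENT_ID=...' and 'export SPOTIPY_CLIENT_SECRET=...' in your shell
client_id = os.environ['SPOTIPY_CLIENT_ID']
client_secret = os.environ['SPOTIPY_CLIENT_SECRET']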

In this project, we will compare two different Spotify Playlists with each other.

The first Playlist is mostly Synthwave music, the second Playlist pure classical music.

Let's start with the code!

In [1]:
import json
import time
import spotipy
from spotipy.oauth2 import SpotifyOAuth
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Connect to the Spotify Web API

Please insert your own credentials for client_id= and client_secret=, otherwise we won't be able to fetch the Playlist details.

In [2]:
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(client_id="",
                                               client_secret="",
                                               redirect_uri='http://localhost',
                                               scope="user-library-read"))

Definition of our functions to get all details from the Playlists

To get the data we need, we will create 3 functions:

  1. Get the Track-IDs from every Track of the Playlist
  2. Get the Artist-IDs for the Tracks. We need these to get the 'genres'.
  3. Get the detailed Track data, which we can save into a file.
In [3]:
def get_track_ids(playlist_id):
    music_id_list = [] #Save all Track-IDs in a list
    playlist = sp.playlist(playlist_id)
    for item in playlist['tracks']['items']: #Walk through 'tracks' -> 'items' -> 'track' to get each 'id'
        music_track = item['track']
        music_id_list.append(music_track['id'])
    return music_id_list
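Note: sp.playlist() only returns the first page of Tracks (at most 100). Our Playlists are small enough, but for longer ones you would have to page through the results. A minimal sketch of a paginated variant using spotipy's sp.next() (not used further below):

def get_all_track_ids(playlist_id):
    results = sp.playlist_tracks(playlist_id) #First page of tracks
    ids = [item['track']['id'] for item in results['items']]
    while results['next']: #Spotify sets 'next' to None on the last page
        results = sp.next(results)
        ids.extend(item['track']['id'] for item in results['items'])
    return ids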
In [4]:
def get_artist_ids(playlist_id):
    artist_id_list = [] #Save all Artist-IDs in a list
    playlist = sp.playlist(playlist_id)
    for item in playlist['tracks']['items']: #Walk through 'tracks' -> 'items' -> 'track' -> 'artists' to get each Artist 'id'
        artist_id = item['track']['artists']
        artist_id_list.append(artist_id[0]['id']) #'artists' is a list, so [0] picks the first (main) artist of the track
    return artist_id_list
In [5]:
def get_track_data(track_id, artist_id):
    meta_track = sp.track(track_id) #Connects to Spotify-API to get data from the Tracks
    meta_artist = sp.artist(artist_id) #Connects to Spotify-API to get data from the Artist
    meta_features = sp.audio_features(track_id) #Connects to Spotify-API to get data from the Track-Features
    track_details = {"name": meta_track['name'], "album": meta_track['album']['name'],
                     "artist": meta_track['album']['artists'][0]['name'],
                     "release_date": meta_track['album']['release_date'],
                     "duration_in_min": round((meta_track['duration_ms'] * 0.001) / 60.0, 2),
                     "track_id": meta_track['id'],
                     "genres": meta_artist['genres'], "popularity": meta_artist['popularity'],
                     "danceability": meta_features[0]['danceability'],
                     "energy": meta_features[0]['energy'],
                     "loudness": meta_features[0]['loudness'],
                     "speechiness": meta_features[0]['speechiness'],
                     "acousticness": meta_features[0]['acousticness'],
                     "instrumentalness": meta_features[0]['instrumentalness'],
                     "liveness": meta_features[0]['liveness'],
                     "valence": meta_features[0]['valence'],
                     "tempo": meta_features[0]['tempo'],
                     }
    return track_details
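Calling the API once per Track is fine for small Playlists, but sp.audio_features() also accepts a list of up to 100 Track-IDs, so you could fetch all Audio-Features in far fewer requests. A sketch of a batched helper (not used further below):

def get_audio_features_batch(track_ids):
    all_features = []
    for start in range(0, len(track_ids), 100): #sp.audio_features accepts at most 100 IDs per call
        all_features.extend(sp.audio_features(track_ids[start:start + 100]))
    return all_features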

Input your first Playlist-ID from Spotify

The first Playlist ID used in this Project is "spotify:playlist:4UjGSynmzZu0agnX1DRW2H", but we will use only the ID itself: 4UjGSynmzZu0agnX1DRW2H.

On the very first run you will also have to authorize the app via your "Redirect URI", and you will see this message:

Couldn't read cache at: .cache

Using localhost as redirect URI without a port. Specify a port (e.g. localhost:8080) to allow automatic retrieval of authentication code instead of having to copy and paste the URL your browser is redirected to.

Your default web browser will open a new tab.

After you input your Playlist ID, copy the full URL you see in that tab and paste it back here. It will look like this:

http://localhost/?code=AQAQuO1NbnIj6kcKjBqQZbZMER6ONn7ygPSf6pLgHl0G_IlqxXHLoBVUUab5sgILJNPTgSX46-gl3mQ35Et6YVNI_DqAVBWQ9pH
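As the message above suggests, you can skip the copy-and-paste step by registering a Redirect URI with a port (e.g. http://localhost:8080) in the Dashboard and passing it to SpotifyOAuth; spotipy then retrieves the authentication code automatically. A sketch:

sp = spotipy.Spotify(auth_manager=SpotifyOAuth(client_id="",
                                               client_secret="",
                                               redirect_uri='http://localhost:8080', #The port enables automatic code retrieval
                                               scope="user-library-read"))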

In [6]:
playlist_id = input('Paste your Spotify Playlist ID here: ')
track_ids = get_track_ids(playlist_id)
artist_ids = get_artist_ids(playlist_id)
print(len(track_ids))
print(track_ids)
19
['4R1P3FN4vQBCpVESEZBxIP', '4QnTtQJnAxK61zOibTJpYT', '5zZ0g9HOZqK0xtemfV82nI', '3v2jc9yGbSKKgVEn3k51AZ', '5IDLG8VXGFRHlwOqKCChfA', '24ylIO48nRsdaONlM8l2HF', '4o6Ufgnf7pT55tI4j78RkT', '49ErwcBYfYRPNBdRuPvpYA', '4wSmqFg31t6LsQWtzYAJob', '5UFXAE1QXIGnmALcrQ4DgZ', '2bHpNAMEsB3Wc00y87JTdn', '0U0ldCRmgCqhVvD6ksG63j', '4N8LAvjQRG1HTNTSwF6Deq', '5TxotO7jpRFwG1dR1suT7G', '10qbHF920zH5K8C8IcE5AL', '3BcEpBfEx2mOyCSJWIHSvu', '4V0rrbFdfzLbcV3WOYjXXa', '5Y7CCUdF87MwbmLllXplou', '72sZpOwollMmZxoRB5hEGd']

Save all data in a list and create a JSON file

Of course it's possible to work directly with the "tracks" list, but here we prefer a JSON file, because most data you will find on the internet comes as files.

In [7]:
tracks = []
for i in range(len(track_ids)):
    time.sleep(.3) #Short pause between requests, otherwise Spotify may block us for sending too many API requests
    track = get_track_data(track_ids[i], artist_ids[i])
    tracks.append(track)
In [8]:
with open('playlist_coding.json', 'w') as outfile:
    json.dump(tracks, outfile, indent=4)
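If you want to check the file without pandas, the json module reads it straight back into the same list of dictionaries. A quick sketch:

with open('playlist_coding.json') as infile:
    tracks_from_file = json.load(infile) #Same list of track dictionaries as 'tracks' above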
In [9]:
import pandas as pd
import numpy as np
In [10]:
pl_coding = pd.read_json('playlist_coding.json')

Full Data from your first Playlist

In [11]:
pl_coding.head()
Out[11]:

Create our second Playlist for comparison

Here's the pure classical music Playlist: 6WR9WVxC8w1cSgAfDcezJD

In [12]:
playlist_id = input('Your playlist ID here: ')
track_ids = get_track_ids(playlist_id)
artist_ids = get_artist_ids(playlist_id)
print(len(track_ids))
print(track_ids)
40
['3ZanucVESbJHuUtJD4iAIF', '0ifarBEWNchN37jbwBpoii', '5i6nEL431cmxuhxDeU4rYu', '0PgvxQlcMgarpMNAcUgNpZ', '5BKrnXY1Dr7PMAk0LsTQGw', '2cn6yQgxB0JFejLYjTz0Qw', '2YmctpU6kgcpp5MO9ZtFGi', '4qZqFQ2KGXio3NtDtWTy3G', '7woSMo1G2Xa6wIhikr4Ps1', '0NOiSayyUFYnLllkTdFa1k', '2xizRhme7pYeITbH1NLLGt', '7qr2R4Wc72iixUDfMfEzBT', '3C4JNyv2NAT72xm0cDKl0v', '2oLjhx7w8Hyd3gry9cCXr7', '5zWCfVmj9JHlVLNnBpIKt0', '7E1ErYYCn0lYjHODZ1qGuB', '67TCAXIe154ZGDNaWceqxC', '17i5jLpzndlQhbS4SrTd0B', '5bu9A6uphPWg39RC3ZKeku', '7h6GoPvGHC9uzZJ8bNvfIq', '2Tz7fLm0pWasWCCfJiHPlJ', '4GRDiNU8fkizXz4rDX9gS5', '7n92QzQomRCLlciO14X0kd', '0pjCsB0XNSyqM9UazlTODC', '1upQiytDIEZfl9ItruoXuC', '2e8MxBgVWMSQmxb2zcuCoq', '3sAYxq1986j3ydqLv6jwUJ', '6N7JzrteJv8lsr1GWYyu0b', '4rjnWmrSRqXVkFWdKMG3pV', '3DNRdudZ2SstnDCVKFdXxG', '2kyEgPaAW8wdpvevPnkf0Z', '6Z34YgqCJkdrliDmbcaJgy', '41ujv4mhxlqR8nlnieDpDp', '7HSs4srn1qnZhh7WRWBVOk', '6JV3m7TDJ9gsJNHp0e4MWM', '1ntwYN2nT0Dl3c8lne15ii', '3oHSL6pt9LpNrQZuQGu9wL', '2r1FiNXh5mDNEP8K07YRVp', '25ZYvZ2qFw7XCu8nzUxxmU', '7B4HbpZCSfLzKGapKzlUPD']
In [13]:
tracks = []
for i in range(len(track_ids)):
    time.sleep(.3)
    track = get_track_data(track_ids[i], artist_ids[i])
    tracks.append(track)
In [14]:
with open('playlist_classic.json', 'w') as outfile:
    json.dump(tracks, outfile, indent=4)
In [15]:
pl_classic = pd.read_json('playlist_classic.json')
In [16]:
pl_classic.head()
Out[16]:

Add an additional column to identify the two Playlists by a number

We will name our first Playlist "Coding" and add a new column called playlist.

For the first Playlist we will use the number 0 for identification.

We will do the same for the Playlist "Classic", using the number 1 for it.

In [17]:
pl_coding['playlist'] = 0
pl_classic['playlist'] = 1

Merging both lists into one

We will merge both DataFrames into one, keeping our identifier in the new playlist column.

In [18]:
pl_merged = pd.concat([pl_coding, pl_classic]) #DataFrame.append is removed in pandas 2.0; pd.concat does the same job
pl_merged.reset_index(drop=True, inplace=True) #drop=True discards the old per-playlist index
pl_merged.head()
Out[18]:

Data for the Analysis

Here we will set up a new function that selects the Audio-Feature columns from our DataFrame.

For reference, you can find the documentation in the Spotify Web API reference.

To show you what kind of Audio-Features Spotify provides, here is a brief overview:

  1. Acousticness:

    A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

  2. Danceability:

    Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

  3. Energy:

    Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.

  4. Instrumentalness:

    Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.

  5. Liveness:

    Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.

  6. Speechiness:

    Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.

  7. Valence:

    A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

In [19]:
def features(pl, playlist):
    audio_cols = ['acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness', 'speechiness', 'valence']
    if playlist == 'both':
        features = pl.loc[:, audio_cols + ['playlist']] #Keep the identifier column when both Playlists are requested
    elif playlist == 0 or playlist == 1:
        features = pl.loc[pl.playlist == playlist, audio_cols] #Only the rows of the requested Playlist
    else:
        return 'Error'
    return features

Create two Data Frames, one with the Audio-Features of each Playlist

These two Data Frames contain only the Audio-Features.

We will need them later for the Analysis.

In [20]:
features_coding = features(pl_merged, 0)
features_classic = features(pl_merged, 1)
features_classic.head()
Out[20]:

Check which and how many different Artists we have in each Playlist

In [21]:
pl_coding.artist.value_counts()
Out[21]:
Dance With the Dead    4
Mega Drive             2
Various Artists        1
Quantic                1
Magic Sword            1
Justice                1
Lorn                   1
Kavinsky               1
Simian                 1
Hyper                  1
M|O|O|N                1
Thomas Barrandon       1
Peace Orchestra        1
Gunship                1
Carpenter Brut         1
Name: artist, dtype: int64
In [22]:
pl_classic.artist.value_counts()
Out[22]:
Ludwig van Beethoven        7
Wolfgang Amadeus Mozart     3
Pyotr Ilyich Tchaikovsky    3
Johann Sebastian Bach       3
Various Artists             2
Georges Bizet               2
Johannes Brahms             2
Antonio Vivaldi             2
Gustavo Dudamel             1
Berliner Philharmoniker     1
George Frideric Handel      1
Luciano Pavarotti           1
Franz Joseph Haydn          1
Frédéric Chopin             1
Maurice Ravel               1
Modest Mussorgsky           1
Giuseppe Verdi              1
Antonín Dvořák              1
Richard Wagner              1
Yo-Yo Ma                    1
Claude Debussy              1
Dmitri Shostakovich         1
Sir Neville Marriner        1
Sergei Prokofiev            1
Name: artist, dtype: int64

Get the total number of Tracks in each Playlist

In [23]:
coding_total_tracks = pl_coding.artist.count()
classic_total_tracks = pl_classic.artist.count()
print(f'Total of tracks:\nCoding: {coding_total_tracks}\nClassic: {classic_total_tracks}')
Total of tracks:
Coding: 19
Classic: 40

Data Analysis

Compare the means of each Playlist

I will start by plotting a Bar Chart and a Radar Chart showing the means of the Audio-Features, in order to compare the two Playlists.

The plots show that the predominant feature in my "Coding" Playlist is energy. Looking at the "Classic" features, on the other hand, we notice that acousticness and instrumentalness are the prevalent Audio-Features of the list.

In [31]:
N = len(features_coding.mean())
ind = np.arange(N)

width = 0.35
plt.barh(ind,features_coding.mean(), width, label='Coding', color = 'magenta')
plt.barh(ind + width, features_classic.mean(), width, label='Classic', color = 'lime')

plt.xlabel('Mean', fontsize = 12)
plt.title('Mean values of the audio features')
plt.yticks(ind + width / 2, (list(features_classic)[:]), fontsize = 12)
plt.legend(loc='best')
plt.rcParams['figure.figsize'] =(5,5)

plt.show()
In [33]:
labels = list(features_coding)[:]
stats = features_coding.mean().tolist()
stats2 = features_classic.mean().tolist()

angles = np.linspace(0, 2*np.pi, len(labels), endpoint=False)

stats = np.concatenate((stats,[stats[0]]))
stats2 = np.concatenate((stats2,[stats2[0]]))
angles = np.concatenate((angles,[angles[0]]))

fig = plt.figure(figsize = (20,20))

ax = fig.add_subplot(221, polar=True)
ax.plot(angles, stats, 's-', linewidth=2, label = 'Coding', color = 'magenta')
ax.fill(angles, stats, alpha = 0.25, facecolor = 'magenta')
ax.set_thetagrids(angles[:-1] * 180/np.pi, labels=labels, fontsize = 10, color='black') #Use the actual column names; the last angle duplicates the first and needs no label

ax.set_rlabel_position(250)
plt.yticks([0.2, 0.4, 0.6, 0.8, 1.0], ['0.2', '0.4', '0.6', '0.8', '1.0'], color='gray', size = 12)
plt.ylim(0,1)

ax.plot(angles, stats2, 'D-', linewidth = 2, label = 'Classic', color = 'lime')
ax.fill(angles, stats2, alpha = 0.25, color = 'lime')
ax.set_title('Mean Values of the audio features')
ax.set_facecolor('whitesmoke')
ax.grid(True)

plt.legend(loc = 'best', bbox_to_anchor = (0.1, 0.1));

Tempo and Loudness

Tempo

The tempo is an important feature in terms of music analysis.

It can be as significant as melody, harmony or rhythm, because it represents the speed of a song and the mood it evokes.

For instance, the higher the BPM of a song, the faster it is and, consequently, the more inspiring and joyful it tends to be. On the other hand, a low BPM means that the song is slower, which can indicate sadness, romance or drama.
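Before plotting, a quick sanity check of the mean Tempo (and, for the next section, Loudness) of both Playlists, sketched on the merged DataFrame (0 = Coding, 1 = Classic):

print(pl_merged.groupby('playlist')[['tempo', 'loudness']].mean()) #Mean BPM and dB per playlist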

I never thought that Classic has such a high BPM. But now we can see it.

Loudness

The overall mean loudness of music is somewhere between -5 dB and -15 dB.

I always thought that my Classic Playlist is louder than the Coding Playlist.

But if you think about it a little more: yes! Nowadays music is mostly produced in a studio and optimized to be loud.

So you can see that the average loudness of the Coding Playlist is about -6 dB, which is loud.

The Classic Playlist has an average of less than -20 dB.

In a Classic Track you have long quiet passages, which is why the mean loudness is much lower.

In [36]:
tempo_coding = pl_merged.loc[pl_merged.playlist == 0, ['tempo']] #Select from the merged DataFrame, whose index matches the boolean mask
tempo_classic = pl_merged.loc[pl_merged.playlist == 1, ['tempo']]

N = len(tempo_coding.mean())

ind = np.arange(N)

plt.subplot(221)
width = 0.35
plt.bar(ind, tempo_coding.mean(), width, label = 'Coding', color = 'magenta')
plt.bar(ind + 1.1*width, tempo_classic.mean(), width, label = 'Classic', color = 'lime')

plt.ylabel('Mean [BPM]', fontsize = 12)
plt.title('Tempo Means')

plt.xticks(ind + width / 2, (list(tempo_coding)[:]), fontsize = 12)
plt.legend(loc = 'best')
plt.style.use('ggplot')

plt.subplot(222)

loud_coding = pl_merged.loc[pl_merged.playlist == 0, ['loudness']]
loud_classic = pl_merged.loc[pl_merged.playlist == 1, ['loudness']]

N = len(loud_classic.mean())

ind = np.arange(N)

width = 0.35
plt.bar(ind, loud_coding.mean(), width, label='Coding', color = 'magenta')
plt.bar(ind + 1.1*width, loud_classic.mean(), width, label = 'Classic', color = 'lime')

plt.ylabel('Mean [db]', fontsize = 12)
plt.title('Loudness Means')

plt.xticks(ind + width / 2, (list(loud_coding)[:]), fontsize = 12)
plt.legend(loc = 'best')
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (13,13)
plt.tight_layout()
plt.show()

How diversified are the lists?

The variety of the lists can be investigated by checking how much the songs differ from each other. If most of the songs sound alike, we would say the list has low variety.

The question is: how would we analyze this? Well, the answer is simple: let's check the standard deviation of each audio variable and examine them.

Although the standard deviations of the individual Audio-Features do not give us much information (as we can see in the plots below), we can average them to obtain a single score per list. By doing that we get the values represented in the plot "Variety of Audio Features" (see the sketch after the plots below).

How would we interpret that? A high standard deviation means that a list contains songs with a high value of a specific feature, such as energy, alongside other songs with a really low value for the same attribute.

In [37]:
plt.subplot(221)

features_coding.std().sort_values(ascending = False).plot(kind = 'bar', color = 'magenta')

plt.xlabel('Features', fontsize = 14)
plt.ylabel('Standard Deviation', fontsize = 14)
plt.title('Standard Deviation of "Coding" Audio Features')

plt.subplot(222)
features_classic.std().sort_values(ascending= False).plot(kind = 'bar', color = 'lime')

plt.xlabel('Features', fontsize = 14)
plt.ylabel('Standard Deviation', fontsize = 14)
plt.title('Standard Deviation of "Classic" Audio Features')
plt.rcParams['figure.figsize'] =(20,20)
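The plot "Variety of Audio Features" mentioned above is not shown in this notebook; here is a minimal sketch of how the mean of the standard deviations could be computed and plotted:

variety = pd.Series({'Coding': features_coding.std().mean(),
                     'Classic': features_classic.std().mean()}) #One variety score per playlist
variety.plot(kind = 'bar', color = ['magenta', 'lime'], title = 'Variety of Audio Features')
plt.ylabel('Mean of Standard Deviations')
plt.show()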
In [38]:
features_merged = features(pl_merged, playlist = 'both')
features_merged.head()
Out[38]:

Correlation Between Variables

We can also build correlation plots, such as scatter plots, to show the relationship between variables.

In our case, we will correlate the feature valence which describes the musical positiveness with danceability and energy.

In order to interpret the plots below, keep in mind that zeros (red dots) represent the Coding tracks and ones (violet dots) the Classic tracks. That said, let's check the scatter plots.
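If you want the numbers behind the plots, a quick sketch of the pairwise Pearson correlations between the three features:

print(pl_merged[['valence', 'energy', 'danceability']].corr()) #Pairwise Pearson correlation matrix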

Valence and Energy

The correlation between valence and energy shows us that there is a conglomeration of songs with high energy and a low level of valence. This means that many of the Coding tracks sound more negative with feelings of sadness, anger and depression.

When we look at the violet dots, we can see that the Classic tracks stay at low energy levels across the whole range of valence - the positive feeling.

In [41]:
fig, ax = plt.subplots()
plt.style.use('bmh')
pl_merged.plot(kind = 'scatter', x = 'valence', y = 'energy', ax = ax, c = 'playlist', s = 100, colormap = 'rainbow_r', title = 'Valence x Energy')
ax.set_xlabel('Valence')
plt.show()

Valence and Danceability

Now, looking at the relationship between valence and danceability, we can see that the Classic tracks have low values of danceability.

The Coding tracks, on the other hand, sit mostly either in the lower-left or the upper-right region, showing some variety in terms of these two features.

In [42]:
fig, ax = plt.subplots()
pl_merged.plot(kind = 'scatter', x = 'valence', y = 'danceability', c = 'playlist', ax = ax, s = 100, colormap = 'rainbow_r', title = 'Valence x Danceability')
plt.show()

Conclusion

It's amazing to "see" the differences between the two Playlists, even without hearing them.

I think it would be great to compare much bigger Playlists with each other.

By analyzing the music I learned a lot about the different kinds of plots, which I am very proud of.

In [44]:
import jovian
In [50]:
jovian.commit(project='Analysify-Course-Project', filename='spotify_project.ipynb', environment=None)
[jovian] Attempting to save notebook..
[jovian] Creating a new project "morphzeus83/Analysify-Course-Project"
[jovian] Uploading notebook..
[jovian] Committed successfully! https://jovian.ml/morphzeus83/analysify-course-project