Clustering#

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Basic Clustering Problem#

from sklearn.datasets import make_blobs
X, _ = make_blobs(random_state=11)
plt.scatter(X[:, 0], X[:, 1])
plt.title('Do you notice any groups?');

There are many clustering algorithms in sklearn – let us use the KMeans and DBSCAN approaches to cluster this data.

from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

Set up a pipeline to fit the KMeans clustering model, fit it to the data, and plot the resulting clusters.
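One way to do this, as a sketch (the three-cluster choice and the step names in the pipeline are assumptions; DBSCAN could be swapped in for KMeans, minus the n_clusters argument):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt

X, _ = make_blobs(random_state=11)

#scale the features, then fit KMeans with 3 clusters
pipe = Pipeline([('scale', StandardScaler()),
                 ('kmeans', KMeans(n_clusters=3, random_state=11))])
labels = pipe.fit_predict(X)

#color each point by its assigned cluster label
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.title('KMeans cluster assignments');
```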

Evaluating Clusters#

Inertia

Sum of squared differences between each point in a cluster and that cluster’s centroid.

How dense is each cluster?

low inertia = dense cluster; ranges from 0 to very high values

\(\sum_{j=0}^{n} (x_j - \mu_i)^2\)

where \(\mu_i\) is a cluster centroid

.inertia_ is an attribute of a fitted sklearn’s kmeans object
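As a sketch of how .inertia_ might be used on the blob data above, fitting KMeans for several values of k (the range of k here is an arbitrary choice):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(random_state=11)
X_scaled = StandardScaler().fit_transform(X)

#inertia for k = 1..6; look for the "elbow" where it stops dropping quickly
inertias = [KMeans(n_clusters=k, random_state=11).fit(X_scaled).inertia_
            for k in range(1, 7)]
```

Plotting inertias against k gives the usual elbow plot.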

Silhouette Score

Tells you how much closer data points are to their own clusters than to the nearest neighbor cluster.

How far apart are the clusters?

ranges from -1 to 1; a high silhouette score means the clusters are well separated

\(s_i = \frac{b_i - a_i}{\max\{a_i, b_i\}}\)

Where:

\(a_i\) = Cohesion: Mean distance from point \(x_i\) to the other points in its own cluster.

\(b_i\) = Separation: Mean distance from point \(x_i\) to all points in the next nearest cluster. Use scikit-learn: metrics.silhouette_score(X_scaled, labels).

Higher silhouette score is better!
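Putting this together on the blob data, a sketch (the choice of 3 clusters is an assumption):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

X, _ = make_blobs(random_state=11)
X_scaled = StandardScaler().fit_transform(X)

#fit KMeans and score the resulting labels; closer to 1 is better
labels = KMeans(n_clusters=3, random_state=11).fit_predict(X_scaled)
score = silhouette_score(X_scaled, labels)
```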

Hidden Markov Models#

from IPython.display import Audio
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import os
import warnings
warnings.filterwarnings('ignore')
import yfinance as yf
btcn = yf.Ticker('BTC-USD')
btcn = btcn.history()
btcn.head()
Open High Low Close Volume Dividends Stock Splits
Date
2024-11-12 00:00:00+00:00 88705.562500 89956.882812 85155.109375 87955.812500 133673285375 0.0 0.0
2024-11-13 00:00:00+00:00 87929.968750 93434.351562 86256.929688 90584.164062 123559027869 0.0 0.0
2024-11-14 00:00:00+00:00 90574.882812 91765.218750 86682.812500 87250.429688 87616705248 0.0 0.0
2024-11-15 00:00:00+00:00 87284.179688 91868.742188 87124.898438 91066.007812 78243109518 0.0 0.0
2024-11-16 00:00:00+00:00 91064.367188 91763.945312 90094.226562 90558.476562 44333192814 0.0 0.0
#plot it
btcn['Close'].plot()
<Axes: xlabel='Date'>

HMMLearn#

We will use the hmmlearn library to implement our hidden Markov model. Here, we use the GaussianHMM class. Depending on the nature of your data, you may be interested in a different probability distribution.

#!pip install hmmlearn
from hmmlearn import hmm
#instantiate 
model = hmm.GaussianHMM(n_components=3)
#fit
X = btcn['2021':][['Close']]
X
Close
Date
2024-11-12 00:00:00+00:00 87955.812500
2024-11-13 00:00:00+00:00 90584.164062
2024-11-14 00:00:00+00:00 87250.429688
2024-11-15 00:00:00+00:00 91066.007812
2024-11-16 00:00:00+00:00 90558.476562
2024-11-17 00:00:00+00:00 89845.851562
2024-11-18 00:00:00+00:00 90542.640625
2024-11-19 00:00:00+00:00 92343.789062
2024-11-20 00:00:00+00:00 94339.492188
2024-11-21 00:00:00+00:00 98504.726562
2024-11-22 00:00:00+00:00 98997.664062
2024-11-23 00:00:00+00:00 97777.281250
2024-11-24 00:00:00+00:00 98013.820312
2024-11-25 00:00:00+00:00 93102.296875
2024-11-26 00:00:00+00:00 91985.320312
2024-11-27 00:00:00+00:00 95962.531250
2024-11-28 00:00:00+00:00 95652.468750
2024-11-29 00:00:00+00:00 97461.523438
2024-11-30 00:00:00+00:00 96449.054688
2024-12-01 00:00:00+00:00 97279.789062
2024-12-02 00:00:00+00:00 95865.304688
2024-12-03 00:00:00+00:00 96002.164062
2024-12-04 00:00:00+00:00 98768.531250
2024-12-05 00:00:00+00:00 96593.570312
2024-12-06 00:00:00+00:00 99920.710938
2024-12-07 00:00:00+00:00 99923.335938
2024-12-08 00:00:00+00:00 101236.015625
2024-12-09 00:00:00+00:00 97432.718750
2024-12-10 00:00:00+00:00 96675.429688
2024-12-11 00:00:00+00:00 101173.031250
2024-12-12 00:00:00+00:00 101396.296875
model.fit(X)
GaussianHMM(n_components=3)
#predict
model.predict(X)
array([1, 1, 1, 1, 1, 1, 1, 1, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0,
       2, 0, 2, 0, 2, 0, 2, 0, 2])
#look at our predictions
plt.plot(model.predict(X))
[<matplotlib.lines.Line2D at 0x13196e9c0>]

Looking at Speech Files#

For a deeper dive into HMMs for speech recognition, please see Rabiner’s article A tutorial on hidden Markov models and selected applications in speech recognition here.

from scipy.io import wavfile
!ls sounds/apple
apple01.wav apple04.wav apple07.wav apple10.wav apple13.wav
apple02.wav apple05.wav apple08.wav apple11.wav apple14.wav
apple03.wav apple06.wav apple09.wav apple12.wav apple15.wav
#read in the data and structure
rate, apple = wavfile.read('sounds/apple/apple01.wav')
#plot the sound
plt.plot(apple)
[<matplotlib.lines.Line2D at 0x131a9e180>]
#look at another sample
rate, kiwi = wavfile.read('sounds/kiwi/kiwi01.wav')
#kiwis perhaps
plt.plot(kiwi)
[<matplotlib.lines.Line2D at 0x131b5ab10>]
from IPython.display import Audio
#take a listen to a banana
Audio('sounds/banana/banana02.wav')

Generating Features from Audio: Mel Frequency Cepstral Coefficient#

Big idea here is to extract the important elements that allow us to identify speech. For more info on the MFCC, see here.

!pip install python_speech_features
Requirement already satisfied: python_speech_features in /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages (0.6)
import python_speech_features as features

#extract the mfcc features
mfcc_features = features.mfcc(kiwi)

#plot them
plt.plot(mfcc_features);

#determine our x and y
X = features.mfcc(kiwi)
y = ['kiwi']
import os
#make a custom markov class to return scores
class MakeMarkov:

    def __init__(self, n_components=3):
        self.components = n_components
        self.model = hmm.GaussianHMM(n_components=self.components)

    def fit(self, X):
        self.fit_model = self.model.fit(X)
        return self.fit_model

    def score(self, X):
        #return the log-likelihood directly; assigning it to self.score
        #would overwrite this method after the first call
        return self.fit_model.score(X)
    
kiwi_model = MakeMarkov()
kiwi_model.fit(X)
kiwi_model.score(X)
-731.8623250842647
hmm_models = []
labels = []
for file in os.listdir('sounds'):
    sounds = os.listdir(f'sounds/{file}')
    sound_files = [f'sounds/{file}/{sound}' for sound in sounds]
    for sound in sound_files[:-1]:
        rate, data = wavfile.read(sound)
        X = features.mfcc(data)
        mmodel = MakeMarkov()
        mmodel.fit(X)
        hmm_models.append(mmodel)
        labels.append(file)
Model is not converging.  Current: -747.3672269421642 is not greater than -747.3672268616228. Delta is -8.054132649704115e-08
Model is not converging.  Current: -767.2134107023999 is not greater than -767.2134106283881. Delta is -7.401172297250014e-08
Model is not converging.  Current: -737.257806389045 is not greater than -737.2578063691759. Delta is -1.9869048628606834e-08
Model is not converging.  Current: -1530.590252835315 is not greater than -1530.590252515925. Delta is -3.1938998290570453e-07
#write a loop that bops over the files and prints the label based on
#highest score

Making Predictions#

Now that we have our models, given a new sound we want to score these based on what we’ve learned and select the most likely example.

in_files = ['sounds/pineapple/pineapple15.wav',
           'sounds/orange/orange15.wav',
           'sounds/apple/apple15.wav',
           'sounds/kiwi/kiwi15.wav']

Further Reading#

  • Textbook: Marsland’s Machine Learning: An Algorithmic Perspective has a great overview of HMMs.

  • Time Series Examples: Check out Aileen Nielsen’s tutorial from SciPy 2019 and her book Practical Time Series Analysis

  • Speech Recognition: Rabiner’s A tutorial on hidden Markov models and selected applications in speech recognition

  • HMMs and Dynamic Programming: Avik Das’ PyData Talk Dynamic Programming for Machine Learning: Hidden Markov Models