Grid Searching and Time Series Forecasting

Grid Searching and Time Series Forecasting#

Use the data below to set up a Pipeline that one hot encodes all categorical features and builds a RandomForestClassifier model. Grid search the model for an appropriate n_estimators and max_depth parameter optimizing precision. What were the parameters of the best model?

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

from sklearn.datasets import fetch_openml

insurance = fetch_openml(data_id=45064)

insurance.frame.head()

	Upper_Age	Lower_Age	Reco_Policy_Premium	City_Code	Accomodation_Type	Reco_Insurance_Type	Is_Spouse	Health Indicator	Holding_Policy_Duration	Holding_Policy_Type	class
0	52	52	16200.0	C2	Owned	Individual	No	X4	6.0	4.0	0
1	67	67	16900.0	C17	Rented	Individual	No	X1	7.0	3.0	1
2	75	75	25668.0	C10	Owned	Individual	No	X3	3.0	1.0	0
3	60	57	17586.8	C26	Owned	Joint	Yes	X1	14+	1.0	0
4	35	35	12762.0	C12	Rented	Individual	No	X1	3.0	2.0	0

insurance.frame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23548 entries, 0 to 23547
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype   
---  ------                   --------------  -----   
 0   Upper_Age                23548 non-null  int64   
 1   Lower_Age                23548 non-null  int64   
 2   Reco_Policy_Premium      23548 non-null  float64 
 3   City_Code                23548 non-null  category
 4   Accomodation_Type        23548 non-null  category
 5   Reco_Insurance_Type      23548 non-null  category
 6   Is_Spouse                23548 non-null  category
 7   Health Indicator         23548 non-null  category
 8   Holding_Policy_Duration  23548 non-null  category
 9   Holding_Policy_Type      23548 non-null  category
 10  class                    23548 non-null  int64   
dtypes: category(7), float64(1), int64(3)
memory usage: 899.9 KB

insurance.frame.describe()

	Upper_Age	Lower_Age	Reco_Policy_Premium	class
count	23548.000000	23548.000000	23548.000000	23548.000000
mean	48.864192	46.365381	15409.000161	0.242059
std	16.021466	16.578403	6416.327319	0.428339
min	21.000000	16.000000	3216.000000	0.000000
25%	35.000000	32.000000	10704.000000	0.000000
50%	49.000000	46.000000	14580.000000	0.000000
75%	62.000000	60.000000	19140.000000	0.000000
max	75.000000	75.000000	43350.400000	1.000000

insurance.frame['class'].value_counts(normalize = True).plot(kind = 'bar', grid = True, title = 'Target Class Distribution');

../_images/ae8291b22228b98c6b488052679a79f0c32559fe69649f998b2b10c643f09b05.png

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer

Time Series#

For these problems I will reference Hyndman’s Forecasting: Principles and Practice. At a minimum, skim chapter 8.1 - 8.4 on Exponential Smoothing methods and 9.1 - 9.5 and 9.9 on ARIMA models. We will replicate some examples and problems from the text using sktime. Reference the documentation here when needed.

!pip install sktime

Requirement already satisfied: sktime in /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages (0.34.0)

Requirement already satisfied: joblib<1.5,>=1.2.0 in /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages (from sktime) (1.3.2)
Requirement already satisfied: numpy<2.2,>=1.21 in /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages (from sktime) (1.26.4)
Requirement already satisfied: packaging in /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages (from sktime) (23.2)
Requirement already satisfied: pandas<2.3.0,>=1.1 in /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages (from sktime) (2.2.3)
Requirement already satisfied: scikit-base<0.12.0,>=0.6.1 in /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages (from sktime) (0.6.1)
Requirement already satisfied: scikit-learn<1.6.0,>=0.24 in /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages (from sktime) (1.5.2)
Requirement already satisfied: scipy<2.0.0,>=1.2 in /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages (from sktime) (1.11.4)
Requirement already satisfied: python-dateutil>=2.8.2 in /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages (from pandas<2.3.0,>=1.1->sktime) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages (from pandas<2.3.0,>=1.1->sktime) (2023.3.post1)
Requirement already satisfied: tzdata>=2022.7 in /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages (from pandas<2.3.0,>=1.1->sktime) (2023.3)
Requirement already satisfied: threadpoolctl>=3.1.0 in /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages (from scikit-learn<1.6.0,>=0.24->sktime) (3.2.0)

Requirement already satisfied: six>=1.5 in /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas<2.3.0,>=1.1->sktime) (1.16.0)

import sktime as skt
from sktime.utils.plotting import plot_correlations, plot_series

PROBLEM

In 8.1, a simple exponential smoothing model is applied to the algerian export data, and a forecast is made for 5 time steps. Use sktime and the global_economy data below to replicate this and evaluate the mean absolute percent error.

from sktime.forecasting.exp_smoothing import ExponentialSmoothing
from sktime.performance_metrics.forecasting import MeanAbsolutePercentageError

global_economy = pd.read_csv('https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa24/refs/heads/main/data/global_economy.csv', index_col = 0)

algeria = global_economy.loc[global_economy['Country'] == 'Algeria']

algeria.head(3)

	Country	Code	Year	GDP	Growth	CPI	Imports	Exports	Population
117	Algeria	DZA	1960	2.723649e+09	NaN	NaN	67.143632	39.043173	11124888.0
118	Algeria	DZA	1961	2.434777e+09	-13.605441	NaN	67.503771	46.244557	11404859.0
119	Algeria	DZA	1962	2.001469e+09	-19.685042	NaN	20.818647	19.793873	11690153.0

PROBLEM

Use the data on the Australian population to replicate the exponential smoothing model with a trend from 8.2 here.

aus_economy = global_economy.loc[global_economy['Country'] == 'Australia']

PROBLEM

Use the data below on Australian tourism to fit a Holt Winters model with additive and multiplicative seasonality. Compare the performance using mape and plot the results with plot_series.

aus_tourism = pd.read_csv('https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa24/refs/heads/main/data/aus_holidays.csv', index_col = 0)
aus_tourism.head()

	Quarter	Trips
1	1998 Q1	11.806038
2	1998 Q2	9.275662
3	1998 Q3	8.642489
4	1998 Q4	9.299524
5	1999 Q1	11.172027

PROBLEM

An example of non-stationary data are stock prices. Use the stock dataset below to plot the daily closing price for Amazon. Use differencing to make the series stationary and compare the resulting autocorrelation plots.

stocks = pd.read_csv('https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa24/refs/heads/main/data/gafa_stock.csv', index_col = 0)
stocks.head()

	Symbol	Date	Open	High	Low	Close	Adj_Close	Volume
1	AAPL	2014-01-02	79.382858	79.575714	78.860001	79.018570	66.964325	58671200.0
2	AAPL	2014-01-03	78.980003	79.099998	77.204285	77.282860	65.493416	98116900.0
3	AAPL	2014-01-06	76.778572	78.114288	76.228569	77.704285	65.850533	103152700.0
4	AAPL	2014-01-07	77.760002	77.994286	76.845711	77.148575	65.379593	79302300.0
5	AAPL	2014-01-08	76.972855	77.937141	76.955711	77.637146	65.793633	64632400.0

PROBLEM

Use the data on australian air passengers below to fit an AutoARIMA model with sktime. What parameters were chosen? Plot the model and evaluate its predictions on 10 time steps.

aus_air = pd.read_csv('https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa24/refs/heads/main/data/aus_air.csv', index_col = 0)
aus_air.head()

	Year	Passengers
1	1970	7.3187
2	1971	7.3266
3	1972	7.7956
4	1973	9.3846
5	1974	10.6647

Grid Searching and Time Series Forecasting

Contents

Grid Searching and Time Series Forecasting#

Time Series#