Appliances Energy Prediction data
import warnings
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import numpy as np
import pandas as pd
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use("ggplot")
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import (LinearRegression,
                                  Ridge,
                                  Lasso)
from sklearn.metrics import (r2_score,
                             mean_absolute_error,
                             mean_squared_error)
The dataset is the Appliances Energy Prediction data, recorded at 10-minute intervals over about 4.5 months. The house temperature and humidity conditions were monitored with a ZigBee wireless sensor network. Each wireless node transmitted the temperature and humidity conditions roughly every 3.3 minutes, and the readings were then averaged over 10-minute periods. The energy data was logged every 10 minutes with m-bus energy meters. Weather data from the nearest airport weather station (Chievres Airport, Belgium) was downloaded from a public data set from Reliable Prognosis (rp5.ru) and merged with the experimental data using the date and time column. Two random variables are included in the data set for testing the regression models and for filtering out non-predictive attributes (parameters).
energy = pd.read_csv("datasets/energydata_complete.csv")
energy.head()
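As a quick sanity check on the 10-minute sampling described above, the timestamps can be parsed and the gaps between consecutive rows counted. The sketch below is only an illustration and not part of the original analysis: `sampling_summary` is a hypothetical helper, and the three timestamps are made-up stand-ins for the `date` column of `energy`.

```python
import pandas as pd


def sampling_summary(dates: pd.Series) -> dict:
    """Summarise the time span and the step sizes of a timestamp column."""
    ts = pd.to_datetime(dates)
    steps = ts.diff().dropna()  # gap between each row and the previous one
    return {
        "start": ts.min(),
        "end": ts.max(),
        "span_days": (ts.max() - ts.min()) / pd.Timedelta(days=1),
        "step_counts": steps.value_counts(),  # how often each gap size occurs
    }


# Hypothetical timestamps standing in for energy["date"]
toy_dates = pd.Series([
    "2016-01-11 17:00:00",
    "2016-01-11 17:10:00",
    "2016-01-11 17:20:00",
])
summary = sampling_summary(toy_dates)
print(summary["step_counts"])
```

On the real data this would be called as `sampling_summary(energy["date"])`; a single 10-minute entry in `step_counts` would indicate that no readings are missing or duplicated.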
The attribute information can be seen below.
Attribute Information:
Attribute | Description | Units |
---|---|---|
date | Date and time of the reading | year-month-day hour:minute:second |
Appliances | energy use | in Wh |
lights | energy use of light fixtures in the house | in Wh |
T1 | Temperature in kitchen area | in Celsius |
RH_1 | Humidity in kitchen area | in % |
T2 | Temperature in living room area | in Celsius |
RH_2 | Humidity in living room area | in % |
T3 | Temperature in laundry room area | in Celsius |
RH_3 | Humidity in laundry room area | in % |
T4 | Temperature in office room | in Celsius |
RH_4 | Humidity in office room | in % |
T5 | Temperature in bathroom | in Celsius |
RH_5 | Humidity in bathroom | in % |
T6 | Temperature outside the building (north side) | in Celsius |
RH_6 | Humidity outside the building (north side) | in % |
T7 | Temperature in ironing room | in Celsius |
RH_7 | Humidity in ironing room | in % |
T8 | Temperature in teenager room 2 | in Celsius |
RH_8 | Humidity in teenager room 2 | in % |
T9 | Temperature in parents room | in Celsius |
RH_9 | Humidity in parents room | in % |
To | Temperature outside (from Chievres weather station) | in Celsius |
Pressure | (from Chievres weather station) | in mm Hg |
RH_out | Humidity outside (from Chievres weather station) | in % |
Wind speed | (from Chievres weather station) | in m/s |
Visibility | (from Chievres weather station) | in km |
Tdewpoint | (from Chievres weather station) | °C |
rv1 | Random variable 1 | nondimensional |
rv2 | Random variable 2 | nondimensional |
energy.describe()
energy.info()
There are no missing values in the dataset.
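The `info()` output shows non-null counts per column; an explicit count makes the same check easier to read. A minimal sketch, assuming a small made-up frame in place of `energy` (the helper name `count_missing` and the toy values are hypothetical):

```python
import numpy as np
import pandas as pd


def count_missing(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing values per column, largest first."""
    return df.isna().sum().sort_values(ascending=False)


# Hypothetical toy frame with one deliberately missing value
toy = pd.DataFrame({
    "T1": [19.9, np.nan, 20.1],
    "RH_1": [47.6, 46.7, 46.3],
})
missing = count_missing(toy)
print(missing)
print("total missing:", int(missing.sum()))
```

Applied to the real data, `count_missing(energy).sum()` would be expected to be zero if the statement above holds.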
scaler = MinMaxScaler()
normalised_df = pd.DataFrame(scaler.fit_transform(energy.drop(columns=['date', 'lights'])),
                             columns=energy.drop(columns=['date', 'lights']).columns)
features_df = normalised_df.drop(columns=['Appliances'])
energy_target = normalised_df.Appliances
X_train, X_test, y_train, y_test = train_test_split(features_df, energy_target, test_size=.3, random_state=42)
From the dataset, fit a linear model on the relationship between the temperature in the living room in Celsius (x = T2) and the temperature outside the building (y = T6).
lin_reg = LinearRegression()
lin_reg.fit(X_train[['T2']], X_train.T6)
T6_pred = lin_reg.predict(X_test[['T2']])
print(f'r^2 score: {round(r2_score(X_test.T6, T6_pred), 2)}')
print(f'MAE: {round(mean_absolute_error(X_test.T6, T6_pred), 2)}')
print(f'Residual Sum of Squares: {round(np.sum(np.square(X_test.T6 - T6_pred)), 2)}')
print(f'Root Mean Squared Error: {round(np.sqrt(mean_squared_error(X_test.T6, T6_pred)), 3)}')
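The metrics printed above are closely related: the RMSE is the square root of the residual sum of squares divided by the number of samples, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks both identities on small made-up arrays (`y_true` and `y_hat` are hypothetical values, not taken from the dataset).

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical targets and predictions, for illustration only
y_true = np.array([0.2, 0.4, 0.6, 0.8])
y_hat = np.array([0.25, 0.35, 0.65, 0.70])

rss = np.sum(np.square(y_true - y_hat))          # residual sum of squares
tss = np.sum(np.square(y_true - y_true.mean()))  # total sum of squares
rmse = np.sqrt(mean_squared_error(y_true, y_hat))
r2 = r2_score(y_true, y_hat)

# RMSE = sqrt(RSS / n) and R^2 = 1 - RSS / TSS
print(np.isclose(rmse, np.sqrt(rss / len(y_true))))
print(np.isclose(r2, 1 - rss / tss))
```

Because of these identities, RSS and RMSE always rank models in the same order on a fixed test set; RMSE is simply easier to compare across test sets of different sizes.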
energy.drop(columns=['date', 'lights']).max().sort_values()
energy.drop(columns=['date', 'lights']).min().sort_values()
def get_weights_df(model, feat, col_name):
    """Return a DataFrame of each feature's fitted weight, sorted ascending."""
    weights = pd.Series(model.coef_, feat.columns).sort_values()
    weights_df = pd.DataFrame(weights).reset_index()
    weights_df.columns = ['Features', col_name]
    # round() returns a new Series, so the result must be assigned back
    weights_df[col_name] = weights_df[col_name].round(3)
    return weights_df
ridge_reg = Ridge(alpha=0.4)
ridge_reg.fit(X_train, y_train)
lasso_reg = Lasso(alpha=0.001)
lasso_reg.fit(X_train, y_train)
model = LinearRegression()
model.fit(X_train, y_train)
linear_model_weights = get_weights_df(model, X_train, 'Linear_Model_Weight')
ridge_weights_df = get_weights_df(ridge_reg, X_train, 'Ridge_Weight')
lasso_weights_df = get_weights_df(lasso_reg, X_train, 'Lasso_weight')
final_weights = pd.merge(linear_model_weights, ridge_weights_df, on='Features')
final_weights = pd.merge(final_weights, lasso_weights_df, on='Features')
final_weights.sort_values("Linear_Model_Weight", ascending=False)
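When reading the weights table, it helps to recall that the L1 penalty in Lasso can set a weight exactly to zero, whereas ordinary least squares (and ridge) generally leave a small non-zero weight on an uninformative feature. The sketch below illustrates this on synthetic data, with `noise_feature` playing a role similar to `rv1` and `rv2`. All names, the seed, and `alpha=0.1` are assumptions chosen for the illustration, not values from the analysis above.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)          # informative feature
noise_feature = rng.normal(size=n)   # uninformative, like rv1/rv2
y = 3.0 * signal + 0.1 * rng.normal(size=n)
X = np.column_stack([signal, noise_feature])

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("OLS weights:  ", ols.coef_)
print("Lasso weights:", lasso.coef_)
print("features kept by Lasso:", int(np.count_nonzero(lasso.coef_)))
```

In the table above, features whose `Lasso_weight` is 0 can be read the same way: at the chosen `alpha`, Lasso found them not worth the penalty.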
y_pred_lg = model.predict(X_test)
y_pred_r = ridge_reg.predict(X_test)
y_pred_l = lasso_reg.predict(X_test)
print(f'Linear Regression RMSE: {round(np.sqrt(mean_squared_error(y_test, y_pred_lg)), 3)}')
print(f'Ridge RMSE: {round(np.sqrt(mean_squared_error(y_test, y_pred_r)), 3)}')
print(f'Lasso RMSE: {round(np.sqrt(mean_squared_error(y_test, y_pred_l)), 3)}')