EazyML Data Quality Template
Define Imports
In [ ]:
!pip install --upgrade eazyml-data-quality
!pip install --upgrade eazyml-automl
!pip install gdown python-dotenv
In [1]:
import os
from eazyml_data_quality import (
    ez_init,
    ez_data_quality,
)
from eazyml import ez_display_df, ez_display_json
import gdown
import pandas as pd
from dotenv import load_dotenv
load_dotenv()
Out[1]:
True
1. Initialize EazyML
The cell below reads the access key from the EAZYML_ACCESS_KEY environment variable and passes it to ez_init for authentication. If the variable is not set, ez_init defaults to a trial license.
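Because the key comes from the environment, an unset variable is easy to miss. The following is a minimal sketch of making that case explicit before initialization. The helper name `resolve_access_key` is hypothetical and not part of the EazyML API; only the variable name comes from this notebook.

```python
import os


def resolve_access_key(env=None, var_name="EAZYML_ACCESS_KEY"):
    """Return (key, note); key is None when the variable is unset or blank."""
    env = os.environ if env is None else env
    key = (env.get(var_name) or "").strip() or None
    if key is None:
        note = f"{var_name} is not set; a trial license is expected to be used."
    else:
        note = f"{var_name} found; it can be passed to ez_init."
    return key, note


# Demonstration with an explicit mapping instead of the real environment.
key, note = resolve_access_key({"EAZYML_ACCESS_KEY": "demo-key"})
print(note)
```

In the notebook itself, the returned key could be passed as `access_key` in place of the direct `os.getenv` call.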
In [2]:
ez_init(access_key=os.getenv('EAZYML_ACCESS_KEY'))
Out[2]:
{'success': True,
'message': 'Initialized successfully. You may revoke your consent to sharing usage stats anytime. You have exclusive paid access.'}
2. Define Dataset Files and Outcome Variable
In [ ]:
gdown.download_folder(id='16LfwRMjchrPgdbsgPHr79AHvNCHsL5Is')
In [3]:
# Names of the files that will be used by EazyML APIs
train_file_path = os.path.join('data', "walmart_train.csv")
test_file_path = os.path.join('data', "walmart_test.csv")
# The column name for outcome of interest
outcome = "Weekly_Sales"
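The paths above only exist once the download has finished, so it can help to confirm them before calling any API. A minimal sketch, using a temporary directory as a stand-in for the `data` folder; the helper `missing_files` is hypothetical.

```python
import os
import tempfile


def missing_files(*paths):
    """Return the paths that do not point to an existing file."""
    return [p for p in paths if not os.path.isfile(p)]


# A temporary directory stands in for the downloaded 'data' folder.
with tempfile.TemporaryDirectory() as data_dir:
    train_path = os.path.join(data_dir, "walmart_train.csv")
    test_path = os.path.join(data_dir, "walmart_test.csv")
    # Only the training file is created, so the test file is reported.
    with open(train_path, "w") as fh:
        fh.write("Weekly_Sales,IsHoliday\n")
    absent = missing_files(train_path, test_path)
    print([os.path.basename(p) for p in absent])
```

With the real notebook variables, `missing_files(train_file_path, test_file_path)` would return an empty list when both files are in place.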
3. Dataset Information
The dataset used in this notebook is the Walmart Dataset, which contains sales data from Walmart stores. It records weekly sales alongside features such as the holiday flag, temperature, fuel price, CPI, and unemployment rate over a period of time.
More details and the dataset download are available on Kaggle.
Columns in the Dataset:
- Store: The store number.
- Weekly_Sales: Weekly sales for the given store.
- IsHoliday: Whether the week is a special holiday week (1 = holiday week, 0 = non-holiday week).
- Temperature: Temperature on the day of sale.
- Fuel_Price: Cost of fuel in the region.
- CPI: Prevailing consumer price index.
- Unemployment: Prevailing unemployment rate.
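A documented column list can be compared against the actual file header before running any checks. The sketch below uses only the standard library so that it runs without the dataset. The sample header is typed in by hand from the preview in section 3.1, and `compare_columns` is a hypothetical helper.

```python
import csv
import io

# Column names as documented in the list above.
DOCUMENTED_COLUMNS = [
    "Store", "Weekly_Sales", "IsHoliday", "Temperature",
    "Fuel_Price", "CPI", "Unemployment",
]


def compare_columns(csv_text, documented):
    """Return (missing, extra) column names relative to the documented list."""
    header = next(csv.reader(io.StringIO(csv_text)))
    missing = [c for c in documented if c not in header]
    extra = [c for c in header if c not in documented]
    return missing, extra


# Header and first row copied from the preview in section 3.1.
sample = (
    "Weekly_Sales,IsHoliday,Temperature,Fuel_Price,CPI,Unemployment\n"
    "22516.313699,0.0,42.31,2.572,211.096358,8.106\n"
)
missing, extra = compare_columns(sample, DOCUMENTED_COLUMNS)
print("missing:", missing, "extra:", extra)
```

For this sample the check reports `Store` as missing, which matches the six-column preview and the reported dataset shape.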
3.1 Display the Dataset
Below is a preview of the dataset:
In [4]:
# Load the dataset from the provided file
train = pd.read_csv(train_file_path)
# Display the first few rows of the dataset
ez_display_df(train.head())
| | Weekly_Sales | IsHoliday | Temperature | Fuel_Price | CPI | Unemployment |
|---|---|---|---|---|---|---|
| 0 | 22516.313699 | 0.000000 | 42.310000 | 2.572000 | 211.096358 | 8.106000 |
| 1 | 22804.964444 | 1.000000 | 38.510000 | 2.548000 | 211.242170 | 8.106000 |
| 2 | 22081.755753 | 0.000000 | 39.930000 | 2.514000 | 211.289143 | 8.106000 |
| 3 | 19579.549861 | 0.000000 | 46.630000 | 2.561000 | 211.319643 | 8.106000 |
| 4 | 21298.721644 | 0.000000 | 46.500000 | 2.625000 | 211.350143 | 8.106000 |
4. EazyML Data Quality Assessment
4.1 Call ez_data_quality API, Perform All Checks
In [5]:
options = {
    "data_shape": "yes",
    "data_emptiness": "yes",
    "impute": "yes",
    "data_outliers": "yes",
    "remove_outliers": "yes",
    "outcome_correlation": "yes",
    "data_drift": "yes",
    "model_drift": "yes",
    "prediction_data": test_file_path,
    "data_completeness": "yes",
    "data_correctness": "yes"
}
res = ez_data_quality(train_file_path, outcome, options)
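When only some checks are needed, the options dictionary can be assembled from a list of check names. This is a sketch under two assumptions that this notebook does not confirm: that omitted keys leave a check disabled, and that only the drift checks need `prediction_data`. The helper `build_options` is hypothetical; the key names are taken from the cell above.

```python
def build_options(checks, prediction_data=None):
    """Build an options dict that enables the named checks with "yes"."""
    options = {name: "yes" for name in checks}
    # Assumption: the drift checks compare against a second dataset.
    if {"data_drift", "model_drift"} & set(checks):
        if prediction_data is None:
            raise ValueError("drift checks need a prediction_data path")
        options["prediction_data"] = prediction_data
    return options


basic = build_options(["data_shape", "data_emptiness"])
drift = build_options(["data_drift"], prediction_data="data/walmart_test.csv")
print(basic)
print(drift)
```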
4.2 Data Quality Assessment Results
4.2.1 Data Quality Alerts: Check if Any Alerts Are True
In [6]:
alerts = res['data_bad_quality_alerts']
ez_display_json(alerts)
{ 'data_shape_alert': 'false',
'data_emptiness_alert': 'false',
'data_outliers_alert': 'true',
'data_drift_alert': 'false',
'model_drift_alert': 'false',
'data_correctness_alert': 'Please infer the correctness through logical '
'inspection of the insights',
'data_completeness_alert': 'false',
'data_correlation_alert': 'true'}
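The alert values are strings rather than booleans, and one of them is a free-text message. A minimal sketch of sorting them into raised alerts and alerts that need manual review, using a dictionary copied from the output above; both helper names are hypothetical.

```python
# Copied from the output above.
alerts = {
    "data_shape_alert": "false",
    "data_emptiness_alert": "false",
    "data_outliers_alert": "true",
    "data_drift_alert": "false",
    "model_drift_alert": "false",
    "data_correctness_alert": "Please infer the correctness through logical "
                              "inspection of the insights",
    "data_completeness_alert": "false",
    "data_correlation_alert": "true",
}


def raised_alerts(alerts):
    """Names of alerts whose value is the string 'true'."""
    return sorted(k for k, v in alerts.items() if str(v).strip().lower() == "true")


def manual_review(alerts):
    """Names of alerts whose value is neither 'true' nor 'false'."""
    return sorted(
        k for k, v in alerts.items()
        if str(v).strip().lower() not in ("true", "false")
    )


print(raised_alerts(alerts))
print(manual_review(alerts))
```

For this run, the outlier and correlation alerts are raised, and the correctness alert is left for manual inspection.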
4.2.2 Data Completeness?
In [7]:
ez_display_json(res['data_completeness_quality'])
{ 'completeness': True,
'decision_threshold': 0.6,
'insight_information': 'The uploaded dataset is complete at confidence '
'level of 0.6',
'top_insight': [ '0.8756',
'Unemployment in ( 9.01, 10.23 ),\n'
'CPI is greater than 191.02,\n'
'Fuel_Price is less than equal to 3.56,\n'
'Temperature is greater than 39.23'],
'top_score': '0.8756'}
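The top insight arrives as a score string followed by a single rule string with embedded newlines. A small sketch of splitting it into individual conditions for easier reading, based only on the format visible in the output above; `split_rule` is a hypothetical helper.

```python
# Copied from the 'top_insight' entry in the output above.
top_insight = [
    "0.8756",
    "Unemployment in ( 9.01, 10.23 ),\n"
    "CPI is greater than 191.02,\n"
    "Fuel_Price is less than equal to 3.56,\n"
    "Temperature is greater than 39.23",
]


def split_rule(rule_text):
    """Split a rule on ',' followed by a newline, one condition per item."""
    return [part.strip() for part in rule_text.split(",\n") if part.strip()]


score = float(top_insight[0])
conditions = split_rule(top_insight[1])
print(score)
for condition in conditions:
    print("-", condition)
```

Splitting on the comma-newline pair keeps the interval `( 9.01, 10.23 )` intact, since its comma is followed by a space.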
4.2.3 Data Correctness?
In [8]:
ez_display_json(res['data_correctness_quality'])
{ 'insights': {'0': [], '1': []},
'message': 'Please infer the correctness through logical inspection of the '
'insights',
'quality_alert': 'Please verify that the above rules are making sense or '
'not. In case there are one or more rules which appear '
'incorrect from an expert perspective, please double '
'check your files for the variables, their correct '
'values, in the offending rules'}
4.2.4 Data Correlations? Look for Strongly Correlated Features
In [9]:
# Pairwise correlations returned by the API, keyed by feature name
corr_data = res['data_correlation_quality']['data_correlation']
feat_list = list(corr_data.keys())
df_corr = pd.DataFrame(columns=feat_list)
corr_dict = dict()
for feat in corr_data:
    # One row of the correlation matrix; the diagonal entry is 1.0
    corr_list = [0.0 for _ in range(len(df_corr.columns))]
    corr_val = dict()
    corr_list[feat_list.index(feat)] = 1.0
    for another_feat, value in corr_data[feat].items():
        corr_list[feat_list.index(another_feat)] = value
        # Keep only strong correlations (above 0.90)
        if value > 0.90:
            corr_val[another_feat] = value
    df_corr.loc[feat] = corr_list
    if len(corr_val) != 0:
        corr_dict[feat] = corr_val
In [10]:
ez_display_json(corr_dict)
{ 'CPI': {'CPI': 1.0},
'Fuel_Price': {'Fuel_Price': 1.0},
'IsHoliday': {'IsHoliday': 1.0},
'Temperature': {'Temperature': 1.0},
'Unemployment': {'Unemployment': 1.0},
'Weekly_Sales': {'Weekly_Sales': 1.0}}
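Every entry in the output above is a feature paired with itself, which is always 1.0. A minimal sketch of filtering those out so that only pairs of distinct features remain; `strong_pairs` is a hypothetical helper and the dictionary is copied from the output.

```python
# Copied from the output above.
corr_dict = {
    "CPI": {"CPI": 1.0},
    "Fuel_Price": {"Fuel_Price": 1.0},
    "IsHoliday": {"IsHoliday": 1.0},
    "Temperature": {"Temperature": 1.0},
    "Unemployment": {"Unemployment": 1.0},
    "Weekly_Sales": {"Weekly_Sales": 1.0},
}


def strong_pairs(corr_dict):
    """Return {(feature_a, feature_b): value} for pairs of distinct features."""
    pairs = {}
    for feat, related in corr_dict.items():
        for other, value in related.items():
            if other != feat:
                # Sort the names so (a, b) and (b, a) share one key.
                pairs[tuple(sorted((feat, other)))] = value
    return pairs


print(strong_pairs(corr_dict))
```

For this run the result is an empty dictionary, meaning no two distinct features exceed the 0.90 cutoff.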
4.2.5 Data Emptiness?
In [11]:
ez_display_json(res['data_emptiness_quality'])
{ 'message': 'There are no missing values present in the training data that '
'was uploaded. Hence no records were imputed.',
'success': True}
4.2.6 Data Dimension? Is it Adequate?
In [12]:
ez_display_json(res['data_shape_quality'])
{ 'alert': 'false',
'dataset_shape': [6210, 6],
'message': 'Dataset dimension is adequate for further processing',
'success': True}
4.2.7 Data Outliers?
In [13]:
try:
    # Rebuild the outlier table from the data, column names, and row indices
    outliers = res['data_outliers_quality']['outliers']
    outlier_df = pd.DataFrame(data=outliers['data'],
                              columns=outliers['columns'],
                              index=outliers['indices'])
    ez_display_df(outlier_df.head())
except Exception:
    # No outlier table could be built from the result
    print("no outlier")
| | Weekly_Sales | IsHoliday | Temperature | Fuel_Price | CPI | Unemployment |
|---|---|---|---|---|---|---|
| 4096 | 8695.460192307692 | 1.0 | 56.43 | 3.236 | 218.1130269 | 7.441000000000001 |
| 1 | 22804.96444444444 | 1.0 | 38.51 | 2.548 | 211.2421698 | 8.106 |
| 4101 | 7387.793137254902 | 1.0 | 45.16 | 3.129 | 219.1773063 | 7.441 |
| 3175 | 19109.55540540541 | 1.0 | 25.94 | 2.94 | 131.5866129 | 8.326 |
| 5700 | 11354.276304347826 | 1.0 | 55.33 | 3.162 | 126.6692667 | 9.003 |
4.2.8 Data Drift (Between Train and Test Datasets)
In [14]:
ez_display_json(res['drift_quality']['data_drift_analysis'])
{ 'ks_data_drift_analysis': { 'analysis': 'no significant drift',
'data_drift': False,
'decision threshold': 0.05,
'feature : p_value': { 'CPI': 1.0,
'Fuel_Price': 1.0,
'IsHoliday': 1.0,
'Temperature': 1.0,
'Unemployment': 1.0}}}
In [15]:
ks_drift = res['drift_quality']['data_drift_analysis']['ks_data_drift_analysis']['feature : p_value']
drift_columns = []
for feature in ks_drift:
    # A p-value below the 0.05 decision threshold marks the feature as drifted
    if ks_drift[feature] < 0.05:
        drift_columns.append(feature)
        print(feature, ks_drift[feature])
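The key name `ks_data_drift_analysis` suggests a two-sample Kolmogorov-Smirnov test, although this notebook does not show how EazyML computes it. As an illustration of the underlying idea only, the sketch below computes the KS statistic (the largest gap between two empirical distribution functions) with the standard library. It does not compute a p-value, and the shifted sample is made up for the demonstration.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    # The gap can only change at observed values, so checking those suffices.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


# Temperature values from the preview in section 3.1.
train_temps = [42.31, 38.51, 39.93, 46.63, 46.50]
same_temps = list(train_temps)                   # identical distribution
shifted_temps = [t + 10 for t in train_temps]    # hypothetical shifted data

print(ks_statistic(train_temps, same_temps))     # 0.0: no gap at all
print(ks_statistic(train_temps, shifted_temps))  # 1.0: samples do not overlap
```

A statistic near 0 corresponds to the high p-values reported above, while a statistic near 1 would correspond to a p-value below the 0.05 threshold.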
4.2.9 Model Drift (Between Train and Test Datasets)
In [16]:
ez_display_json(res['drift_quality']['model_drift_analysis'])
{ 'distributional_model_drift_analysis': { 'OF': 1.0,
'OF_min': 1.0,
'OF_prod': 1.0,
'decision threshold': 0.5,
'feature : OF_I': { 'CPI': 1.0,
'Fuel_Price': 1.0,
'IsHoliday': 1.0,
'Temperature': 1.0,
'Unemployment': 1.0},
'model_drift': False},
'interval_model_drift_analysis': { 'OF': 1.0,
'OF_min': 1.0,
'OF_prod': 1.0,
'decision threshold': 0.5,
'feature : OF_I': { 'CPI': 1.0,
'Fuel_Price': 1.0,
'IsHoliday': 1.0,
'Temperature': 1.0,
'Unemployment': 1.0},
'model_drift': False}}
In [17]:
interval_drift = res['drift_quality']['model_drift_analysis']['interval_model_drift_analysis']['feature : OF_I']
model_drift_columns = []
for feature in interval_drift:
    # Collect features whose OF_I score falls below the cutoff
    if interval_drift[feature] < 0.05:
        model_drift_columns.append(feature)
        print(feature, interval_drift[feature])
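The API output reports a decision threshold of 0.5 for model drift, while the cell above compares each score against 0.05. This notebook does not state which cutoff is meant for the per-feature `OF_I` scores, so the sketch below takes the threshold as a parameter. Treating a lower score as more drift is also an assumption, inferred from scores of 1.0 appearing together with `model_drift: False`. The helper `features_below` is hypothetical.

```python
# Per-feature scores copied from the output above.
of_scores = {
    "CPI": 1.0,
    "Fuel_Price": 1.0,
    "IsHoliday": 1.0,
    "Temperature": 1.0,
    "Unemployment": 1.0,
}


def features_below(scores, threshold):
    """Return the features whose score is strictly below the threshold."""
    return [feature for feature, value in scores.items() if value < threshold]


# Try both the threshold reported by the API and the one used in the cell above.
for threshold in (0.5, 0.05):
    print(threshold, features_below(of_scores, threshold))
```

With every score at 1.0, both cutoffs return an empty list, so the choice does not affect this particular run.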