EazyML Data Quality Template¶
Define Imports¶
In [ ]:
!pip install --upgrade eazyml-data-quality
!pip install --upgrade eazyml-automl
!pip install gdown python-dotenv
In [1]:
import os
from eazyml_data_quality import (
    ez_init,
    ez_data_quality
)
from eazyml import ez_display_df, ez_display_json
import gdown
import pandas as pd
from dotenv import load_dotenv
load_dotenv()
Out[1]:
True
1. Initialize EazyML¶
The ez_init function uses the EAZYML_ACCESS_KEY environment variable for authentication. If the variable is not set, it defaults to a trial license.
In [2]:
ez_init(access_key=os.getenv('EAZYML_ACCESS_KEY'))
Out[2]:
{'success': True,
'message': 'Initialized successfully. You may revoke your consent to sharing usage stats anytime. You have exclusive paid access.'}
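Because a missing key falls back to a trial license without raising an error, it can help to check the environment before calling ez_init. The following is a minimal sketch: license_mode is a hypothetical helper, not part of the EazyML API, and treating an empty value the same as an unset variable is an assumption.

```python
import os

def license_mode(env=None):
    """Return 'access_key' if EAZYML_ACCESS_KEY is set and non-empty, else 'trial'.

    Hypothetical helper for illustration; not part of the EazyML package.
    """
    env = os.environ if env is None else env
    key = env.get('EAZYML_ACCESS_KEY')
    # Assumption: an empty string is treated like an unset variable
    return 'access_key' if key else 'trial'

print(license_mode())
```

Calling license_mode() right after load_dotenv() shows which of the two modes ez_init is expected to use.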
2. Define Dataset Files and Outcome Variable¶
In [ ]:
gdown.download_folder(id='1EobxYR3pg_Z3Sd4sETfe4aJLAsT98fL2')
In [3]:
# Names of the files that will be used by EazyML APIs
train_file_path = os.path.join('data', "Heart_Attack_traindata.csv")
test_file_path = os.path.join('data', "Heart_Attack_testdata.csv")
# The column name for outcome of interest
outcome = "class"
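Before passing the paths to the API, it may be worth confirming that each file exists and that the outcome column appears in its header. The sketch below uses only the standard library; missing_inputs is a hypothetical helper introduced here for illustration.

```python
import csv
import os

def missing_inputs(csv_path, outcome):
    """Return a list of problems found with the CSV path and outcome column.

    Hypothetical helper; an empty list means both checks passed.
    """
    if not os.path.isfile(csv_path):
        return [f"file not found: {csv_path}"]
    # Read only the header row, not the whole file
    with open(csv_path, newline='') as handle:
        header = next(csv.reader(handle), [])
    if outcome not in header:
        return [f"column {outcome!r} not in header: {header}"]
    return []
```

For example, missing_inputs(train_file_path, outcome) should return an empty list once the download has finished and the files are in the data folder.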
3. Dataset Information¶
The dataset used in this notebook is the Heart Attack dataset, a commonly used example dataset for binary classification. Each row describes one patient through features such as age, gender, pulse rate, blood pressure, and blood test results (glucose, kcm, troponin), and the goal is to predict whether the patient had a heart attack.
Columns in the Dataset:¶
- age: The age of the patient, measured in years.
- gender: The gender of the patient, represented as a categorical variable (e.g., 1 = male, 0 = female).
- impluse: The patient's pulse rate, measured in beats per minute (bpm). The column name is spelled this way in the dataset.
- pressurehight: Systolic blood pressure, the higher number in a blood pressure reading (e.g., 120 in 120/80 mmHg).
- pressurelow: Diastolic blood pressure, the lower number in a blood pressure reading (e.g., 80 in 120/80 mmHg).
- glucose: The patient's blood glucose (blood sugar) level.
- kcm: Most likely CK-MB (creatine kinase-MB), an enzyme released into the blood when heart muscle is damaged.
- troponin: A protein found in the heart muscle, measured to assess heart damage (especially after a heart attack).
- class: The target variable, indicating whether the patient had a heart attack. In this dataset it takes the string values positive and negative.
3.1 Display the Dataset¶
Below is a preview of the dataset:
In [4]:
# Load the dataset from the provided file
train = pd.read_csv(train_file_path)
# Display the first few rows of the dataset
ez_display_df(train.head())
| | age | gender | impluse | pressurehight | pressurelow | glucose | kcm | troponin | class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 64 | 1 | 66 | 160 | 83 | 160.000000 | 1.800000 | 0.012000 | negative |
| 1 | 21 | 1 | 94 | 98 | 46 | 296.000000 | 6.750000 | 1.060000 | positive |
| 2 | 55 | 1 | 64 | 160 | 77 | 270.000000 | 1.990000 | 0.003000 | negative |
| 3 | 64 | 1 | 70 | 120 | 55 | 270.000000 | 13.870000 | 0.122000 | positive |
| 4 | 55 | 1 | 64 | 112 | 65 | 300.000000 | 1.080000 | 0.003000 | negative |
4. EazyML Data Quality Assessment¶
4.1 Call ez_data_quality API, Perform All Checks¶
In [5]:
options = {
    "data_shape": "yes",
    "data_emptiness": "yes",
    "data_balance": "yes",
    "impute": "yes",
    "data_outliers": "yes",
    "remove_outliers": "yes",
    "outcome_correlation": "yes",
    "data_drift": "yes",
    "model_drift": "yes",
    "prediction_data": test_file_path,
    "data_completeness": "yes",
    "data_correctness": "yes"
}
res = ez_data_quality(train_file_path, outcome, options)
4.2 Data Quality Assessment Results¶
4.2.1 Data Quality Alerts: Check if Any Alerts Are True¶
In [6]:
alerts = res['data_bad_quality_alerts']
ez_display_json(alerts)
{ 'data_shape_alert': 'false',
'data_balance_alert': 'false',
'data_emptiness_alert': 'false',
'data_outliers_alert': 'true',
'data_drift_alert': 'true',
'model_drift_alert': 'false',
'data_correctness_alert': 'Please infer the correctness through logical '
'inspection of the insights',
'data_completeness_alert': 'false',
'data_correlation_alert': 'true'}
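The alert values are strings ('true' or 'false'), and the correctness alert carries a free-text message instead, so a plain truthiness test would flag every entry. A small sketch of filtering for the alerts that were actually raised follows; raised_alerts is a hypothetical helper, and sample_alerts copies the output shown above.

```python
def raised_alerts(alerts):
    """Return the names of alerts whose value is the string 'true'."""
    return [name for name, value in alerts.items()
            if str(value).strip().lower() == 'true']

# Copied from the output above
sample_alerts = {
    'data_shape_alert': 'false',
    'data_balance_alert': 'false',
    'data_emptiness_alert': 'false',
    'data_outliers_alert': 'true',
    'data_drift_alert': 'true',
    'model_drift_alert': 'false',
    'data_correctness_alert': 'Please infer the correctness through logical '
                              'inspection of the insights',
    'data_completeness_alert': 'false',
    'data_correlation_alert': 'true',
}

print(raised_alerts(sample_alerts))
# → ['data_outliers_alert', 'data_drift_alert', 'data_correlation_alert']
```

The correctness alert is skipped by design, since its message asks for manual inspection rather than reporting a pass or fail.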
4.2.2 Data Completeness?¶
In [7]:
ez_display_json(res['data_completeness_quality'])
{ 'completeness': True,
'decision_threshold': 0.6,
'insight_information': 'The uploaded dataset is complete at confidence '
'level of 0.6',
'top_insight': [ '0.9623',
'pressurelow is less than equal to 80.5,\n'
'troponin is less than equal to 0.01,\n'
'kcm is less than equal to 6.29'],
'top_score': '0.9623'}
4.2.3 Data Balanced?¶
In [8]:
ez_display_json(res['data_balance_quality'])
{ 'data_balance': { 'data_balance_analysis': { 'balance_score': 0.962,
'data_balance': True,
'decision_threshold': 0.5,
'quality_message': 'Uploaded data is balanced because the balance score is greater than 0.5'}},
'message': 'Data balance has been checked successfully',
'success': True}
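The notebook does not state how balance_score is computed. One common way to express class balance as a number between 0 and 1 is normalized Shannon entropy of the class frequencies, sketched below. This is an assumption chosen for illustration, not EazyML's documented formula, and balance_score here is a hypothetical function.

```python
import math
from collections import Counter

def balance_score(labels):
    """Normalized Shannon entropy of the class frequencies.

    Returns 1.0 for perfectly balanced classes and 0.0 when only one
    class is present. Illustrative only; not EazyML's documented formula.
    """
    counts = Counter(labels)
    n_classes = len(counts)
    if n_classes < 2:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    # Divide by the maximum possible entropy for this number of classes
    return entropy / math.log(n_classes)
```

Under this definition, a score above a threshold such as 0.5 means that no single class dominates the outcome column.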
4.2.4 Data Correctness?¶
In [9]:
ez_display_json(res['data_correctness_quality'])
{ 'insights': {'0': [], '1': []},
'message': 'Please infer the correctness through logical inspection of the '
'insights',
'quality_alert': 'Please verify that the above rules are making sense or '
'not. In case there are one or more rules which appear '
'incorrect from an expert perspective, please double '
'check your files for the variables, their correct '
'values, in the offending rules'}
4.2.5 Data Correlations? Look for Strongly Correlated Features¶
In [10]:
# Nested dict: feature -> {other feature -> correlation}
corr_data = res['data_correlation_quality']['data_correlation']
feat_list = list(corr_data.keys())
df_corr = pd.DataFrame(columns=feat_list)
corr_dict = dict()
for feat in corr_data:
    # One row of the correlation matrix, with 1.0 on the diagonal
    corr_list = [0.0 for _ in feat_list]
    corr_list[feat_list.index(feat)] = 1.0
    corr_val = dict()
    for another_feat, value in corr_data[feat].items():
        corr_list[feat_list.index(another_feat)] = value
        # Keep only strong correlations (above 0.90)
        if value > 0.90:
            corr_val[another_feat] = value
    df_corr.loc[feat] = corr_list
    if len(corr_val) != 0:
        corr_dict[feat] = corr_val
In [11]:
ez_display_json(corr_dict)
{ 'age': {'age': 1.0},
'gender': {'gender': 1.0},
'glucose': {'glucose': 1.0},
'impluse': {'impluse': 1.0},
'kcm': {'kcm': 1.0},
'pressurehight': {'pressurehight': 1.0},
'pressurelow': {'pressurelow': 1.0},
'troponin': {'troponin': 1.0}}
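Every entry in the output above is a feature paired with itself, which is always 1.0 and says nothing about redundancy between features. A sketch of a filter that skips the diagonal and also catches strong negative correlations follows. strong_pairs is a hypothetical helper, and the values in sample_corr are made up for illustration; only the nested-dict shape is taken from the notebook.

```python
def strong_pairs(corr, threshold=0.90):
    """Return {(feat_a, feat_b): value} for distinct features with |value| > threshold."""
    pairs = {}
    for feat, others in corr.items():
        for other, value in others.items():
            if feat == other:
                continue  # skip the diagonal
            if abs(value) > threshold:
                # Sort the names so (a, b) and (b, a) collapse to one key
                pairs[tuple(sorted((feat, other)))] = value
    return pairs

# Made-up values, for illustration only
sample_corr = {
    'a': {'a': 1.0, 'b': 0.95, 'c': 0.10},
    'b': {'a': 0.95, 'b': 1.0, 'c': -0.93},
    'c': {'a': 0.10, 'b': -0.93, 'c': 1.0},
}
print(strong_pairs(sample_corr))
# → {('a', 'b'): 0.95, ('b', 'c'): -0.93}
```

Applied to the result in this notebook, such a filter would return an empty dict, because no pair of distinct features exceeds 0.90.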
4.2.6 Data Emptiness?¶
In [12]:
ez_display_json(res['data_emptiness_quality'])
{ 'message': 'There are no missing values present in the training data that '
'was uploaded. Hence no records were imputed.',
'success': True}
4.2.7 Data Dimension? Is it Adequate?¶
In [13]:
ez_display_json(res['data_shape_quality'])
{ 'alert': 'false',
'dataset_shape': [1319, 9],
'message': 'Dataset dimension is adequate for further processing',
'success': True}
4.2.8 Data Outliers?¶
In [14]:
try:
    outliers = res['data_outliers_quality']['outliers']
    outlier_df = pd.DataFrame(data=outliers['data'],
                              columns=outliers['columns'],
                              index=outliers['indices'])
    ez_display_df(outlier_df.head())
except Exception:
    # Fall back when the result carries no outlier table
    print("no outlier")
| | age | gender | impluse | pressurehight | pressurelow | glucose | kcm | troponin | class |
|---|---|---|---|---|---|---|---|---|---|
| 1028 | 68 | 1 | 89 | 145 | 68 | 134.0 | 0.706 | 10.0 | positive |
| 7 | 63 | 1 | 60 | 214 | 82 | 87.0 | 300.0 | 2.37 | positive |
| 12 | 64 | 1 | 60 | 199 | 99 | 92.0 | 3.43 | 5.37 | positive |
| 530 | 31 | 0 | 64 | 130 | 70 | 263.0 | 142.6 | 0.003 | positive |
| 1047 | 55 | 0 | 96 | 105 | 70 | 66.0 | 300.0 | 0.003 | positive |
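The notebook does not say which rule EazyML uses to flag these rows. For intuition, the sketch below applies the widely used interquartile range (IQR) rule to a single numeric column. It is a swapped-in textbook technique, not EazyML's method, and iqr_outliers is a hypothetical helper.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the positions of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]

# Made-up column with one extreme value
print(iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 300]))
# → [9]
```

A per-column rule like this one ignores interactions between features, so it can disagree with a multivariate detector on which rows are unusual.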
4.2.9 Data Drift (Between Train and Test Datasets)¶
In [15]:
ez_display_json(res['drift_quality']['data_drift_analysis'])
{ 'ks_data_drift_analysis': { 'analysis': 'significant drift',
'data_drift': True,
'decision threshold': 0.05,
'feature : p_value': { 'age': 0.472,
'gender': 1.0,
'glucose': 0.087,
'impluse': 0.009,
'kcm': 0.172,
'pressurehight': 0.141,
'pressurelow': 0.371,
'troponin': 0.241}}}
In [16]:
ks_drift = res['drift_quality']['data_drift_analysis']['ks_data_drift_analysis']['feature : p_value']
drift_columns = []
for feature in ks_drift:
    # A p-value below 0.05 marks a significant distribution shift
    if ks_drift[feature] < 0.05:
        drift_columns.append(feature)
        print(feature, ks_drift[feature])
impluse 0.009
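The key ks_data_drift_analysis suggests a two-sample Kolmogorov-Smirnov (KS) test per feature, comparing the train and test distributions. The sketch below is a textbook version in pure Python that uses the standard asymptotic approximation for the p-value. EazyML's actual implementation is not shown in this notebook, so treat the function names and numerical details as assumptions.

```python
import math
from bisect import bisect_right

def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    n, m = len(a_sorted), len(b_sorted)
    d = 0.0
    for x in set(a_sorted) | set(b_sorted):
        cdf_a = bisect_right(a_sorted, x) / n
        cdf_b = bisect_right(b_sorted, x) / m
        d = max(d, abs(cdf_a - cdf_b))
    return d

def ks_p_value(d, n, m):
    """Asymptotic two-sided p-value for KS statistic d with sample sizes n and m."""
    ne = n * m / (n + m)  # effective sample size
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 0.2:
        return 1.0  # the series below equals 1 to working precision here
    total = 0.0
    for j in range(1, 101):
        term = 2 * (-1) ** (j - 1) * math.exp(-2 * j * j * lam * lam)
        total += term
        if abs(term) < 1e-12:
            break
    return min(max(total, 0.0), 1.0)

def has_drift(train_values, test_values, alpha=0.05):
    """True when the KS p-value falls below alpha."""
    d = ks_statistic(train_values, test_values)
    return ks_p_value(d, len(train_values), len(test_values)) < alpha
```

With alpha set to 0.05, has_drift mirrors the decision threshold reported in the output above: a feature is flagged when its p-value falls below the threshold.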
4.2.10 Model Drift (Between Train and Test Datasets)¶
In [17]:
ez_display_json(res['drift_quality']['model_drift_analysis'])
{ 'distributional_model_drift_analysis': { 'OF': 1.0,
'OF_min': 1.0,
'OF_prod': 1.0,
'decision threshold': 0.5,
'feature : OF_I': { 'age': 1.0,
'gender': 1.0,
'glucose': 1.0,
'impluse': 1.0,
'kcm': 1.0,
'pressurehight': 1.0,
'pressurelow': 1.0,
'troponin': 1.0},
'model_drift': False},
'interval_model_drift_analysis': { 'OF': 1.0,
'OF_min': 1.0,
'OF_prod': 1.0,
'decision threshold': 0.5,
'feature : OF_I': { 'age': 1.0,
'gender': 1.0,
'glucose': 1.0,
'impluse': 1.0,
'kcm': 1.0,
'pressurehight': 1.0,
'pressurelow': 1.0,
'troponin': 1.0},
'model_drift': False}}
In [18]:
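interval_analysis = res['drift_quality']['model_drift_analysis']['interval_model_drift_analysis']
interval_drift = interval_analysis['feature : OF_I']
# Use the decision threshold reported in the result (0.5 in this run), not the KS p-value cutoff
threshold = interval_analysis['decision threshold']
model_drift_columns = []
for feature in interval_drift:
    if interval_drift[feature] < threshold:
        model_drift_columns.append(feature)
        print(feature, interval_drift[feature])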