EazyML Data Quality Template¶

Define Imports¶

In [ ]:
!pip install --upgrade eazyml-data-quality
!pip install --upgrade eazyml-automl
!pip install gdown python-dotenv
In [1]:
import os
from eazyml_data_quality import (
    ez_init,
    ez_data_quality
)

from eazyml import ez_display_df, ez_display_json
import gdown
import pandas as pd


from dotenv import load_dotenv
load_dotenv()
Out[1]:
True

1. Initialize EazyML¶

The ez_init function authenticates with the key passed as access_key, which this notebook reads from the EAZYML_ACCESS_KEY environment variable. If the variable is not set, ez_init defaults to a trial license.
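Because an unset variable falls back to the trial license without raising an error, it can be worth checking for the key before initializing. The following is a minimal sketch; resolve_access_key is a hypothetical helper and not part of the EazyML package.

```python
import os

def resolve_access_key(env=os.environ, var_name="EAZYML_ACCESS_KEY"):
    """Return the access key, or None if the variable is unset or blank.

    Hypothetical helper for illustration; not an EazyML function.
    """
    key = env.get(var_name, "").strip()
    return key or None

key = resolve_access_key()
if key is None:
    print("EAZYML_ACCESS_KEY is not set; ez_init will default to the trial license.")
else:
    print("EAZYML_ACCESS_KEY found; it will be passed to ez_init as access_key.")
```

Passing a plain dict as env makes the helper easy to exercise without touching the real environment.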

In [2]:
ez_init(access_key=os.getenv('EAZYML_ACCESS_KEY'))
Out[2]:
{'success': True,
 'message': 'Initialized successfully. You may revoke your consent to sharing usage stats anytime. You have exclusive paid access.'}

2. Define Dataset Files and Outcome Variable¶

In [ ]:
gdown.download_folder(id='1EobxYR3pg_Z3Sd4sETfe4aJLAsT98fL2')
In [3]:
# Names of the files that will be used by EazyML APIs
train_file_path = os.path.join('data', "Heart_Attack_traindata.csv")
test_file_path = os.path.join('data', "Heart_Attack_testdata.csv")

# The column name for outcome of interest
outcome = "class"
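The paths above assume that the download placed both CSV files in a local data directory. A quick existence check before calling the API can surface a failed download early. This is a sketch only; missing_files is a hypothetical helper.

```python
import os

def missing_files(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [p for p in paths if not os.path.isfile(p)]

# Same paths as defined in the cell above.
train_file_path = os.path.join("data", "Heart_Attack_traindata.csv")
test_file_path = os.path.join("data", "Heart_Attack_testdata.csv")

missing = missing_files([train_file_path, test_file_path])
if missing:
    print("Missing dataset files:", missing)
else:
    print("Both dataset files are present.")
```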

3. Dataset Information¶

This notebook uses the Heart Attack dataset. Each row describes one patient through features such as age, gender, blood pressure, and cardiac biomarkers, and the task is to predict whether the patient had a heart attack.

Columns in the Dataset:¶

  • age: The age of the patient, measured in years.
  • gender: The gender of the patient, represented as a categorical variable (e.g., 1 = male, 0 = female).
  • impluse: The patient's pulse rate, measured in beats per minute (bpm). The column name is misspelled in the dataset itself, so code must refer to it as impluse.
  • pressurehight: Systolic blood pressure, the higher number in a blood pressure reading (e.g., 120 in 120/80 mmHg).
  • pressurelow: Diastolic blood pressure, the lower number in a blood pressure reading (e.g., 80 in 120/80 mmHg).
  • glucose: The patient's blood glucose (blood sugar) level.
  • kcm: Most likely CK-MB (creatine kinase-MB), an enzyme released into the blood when heart muscle is damaged.
  • troponin: A protein found in the heart muscle, measured to assess heart damage (especially after a heart attack).
  • class: The target variable, indicating whether the patient had a heart attack (positive) or not (negative).
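The column list above can be turned into a small schema check, which also catches the impluse spelling if someone "corrects" it in the file. The sketch below uses only the standard library and two rows copied from the dataset preview in section 3.1; the helper names and the 1/0 label encoding are illustrative assumptions.

```python
import csv
import io

EXPECTED_COLUMNS = ["age", "gender", "impluse", "pressurehight", "pressurelow",
                    "glucose", "kcm", "troponin", "class"]

# Two rows copied from the notebook's dataset preview.
SAMPLE_CSV = """age,gender,impluse,pressurehight,pressurelow,glucose,kcm,troponin,class
64,1,66,160,83,160.0,1.8,0.012,negative
21,1,94,98,46,296.0,6.75,1.06,positive
"""

def check_columns(header, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the header."""
    return [name for name in expected if name not in header]

def encode_label(label):
    """Map 'positive'/'negative' to 1/0 (an illustrative encoding)."""
    return {"positive": 1, "negative": 0}[label]

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
missing_columns = check_columns(rows[0].keys())
labels = [encode_label(row["class"]) for row in rows]
print(missing_columns, labels)
```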

3.1 Display the Dataset¶

Below is a preview of the dataset:

In [4]:
# Load the dataset from the provided file
train = pd.read_csv(train_file_path)

# Display the first few rows of the dataset
ez_display_df(train.head())
   age  gender  impluse  pressurehight  pressurelow     glucose        kcm  troponin     class
0   64       1       66            160           83  160.000000   1.800000  0.012000  negative
1   21       1       94             98           46  296.000000   6.750000  1.060000  positive
2   55       1       64            160           77  270.000000   1.990000  0.003000  negative
3   64       1       70            120           55  270.000000  13.870000  0.122000  positive
4   55       1       64            112           65  300.000000   1.080000  0.003000  negative

4. EazyML Data Quality Assessment¶

4.1 Call ez_data_quality API, Perform All Checks¶

In [5]:
options = {
    "data_shape": "yes",
    "data_emptiness": "yes",
    "data_balance": "yes",
    "impute": "yes",
    "data_outliers": "yes",
    "remove_outliers": "yes",
    "outcome_correlation": "yes",
    "data_drift": "yes",
    "model_drift": "yes",
    "prediction_data": test_file_path,
    "data_completeness": "yes",
    "data_correctness": "yes"
}

res = ez_data_quality(train_file_path, outcome, options)

4.2 Data Quality Assessment Results¶

4.2.1 Data Quality Alerts: Check if Any Alerts Are True¶

In [6]:
alerts = res['data_bad_quality_alerts']
ez_display_json(alerts)
{   'data_shape_alert': 'false',
    'data_balance_alert': 'false',
    'data_emptiness_alert': 'false',
    'data_outliers_alert': 'true',
    'data_drift_alert': 'true',
    'model_drift_alert': 'false',
    'data_correctness_alert': 'Please infer the correctness through logical '
                              'inspection of the insights',
    'data_completeness_alert': 'false',
    'data_correlation_alert': 'true'}
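Note that the alert flags are the strings 'true' and 'false' rather than booleans, so a plain truthiness test would treat every alert as raised. The sketch below filters the dictionary printed above; raised_alerts and needs_review are hypothetical helper names.

```python
# Alerts exactly as printed above; the flags are strings, not booleans.
alerts = {
    "data_shape_alert": "false",
    "data_balance_alert": "false",
    "data_emptiness_alert": "false",
    "data_outliers_alert": "true",
    "data_drift_alert": "true",
    "model_drift_alert": "false",
    "data_correctness_alert": "Please infer the correctness through logical "
                              "inspection of the insights",
    "data_completeness_alert": "false",
    "data_correlation_alert": "true",
}

def raised_alerts(alerts):
    """Return the names of alerts whose value is the string 'true'."""
    return sorted(name for name, value in alerts.items()
                  if str(value).strip().lower() == "true")

def needs_review(alerts):
    """Return alerts holding free-text guidance instead of 'true'/'false'."""
    return sorted(name for name, value in alerts.items()
                  if str(value).strip().lower() not in ("true", "false"))

print("Raised:", raised_alerts(alerts))
print("Manual review:", needs_review(alerts))
```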

4.2.2 Data Completeness?¶

In [7]:
ez_display_json(res['data_completeness_quality'])
{   'completeness': True,
    'decision_threshold': 0.6,
    'insight_information': 'The uploaded dataset is complete at confidence '
                           'level of 0.6',
    'top_insight': [   '0.9623',
                       'pressurelow is less than equal to 80.5,\n'
                       'troponin is less than equal to 0.01,\n'
                       'kcm is less than equal to 6.29'],
    'top_score': '0.9623'}

4.2.3 Data Balanced?¶

In [8]:
ez_display_json(res['data_balance_quality'])
{   'data_balance': {   'data_balance_analysis': {   'balance_score': 0.962,
                                                     'data_balance': True,
                                                     'decision_threshold': 0.5,
                                                     'quality_message': 'Uploaded '
                                                                        'data '
                                                                        'is '
                                                                        'balanced '
                                                                        'because '
                                                                        'the '
                                                                        'balance '
                                                                        'score '
                                                                        'is '
                                                                        'greater '
                                                                        'than '
                                                                        '0.5'}},
    'message': 'Data balance has been checked successfully',
    'success': True}
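The notebook does not state how balance_score is computed. One common way to score class balance on a 0 to 1 scale is the normalized Shannon entropy of the label distribution, sketched below as an illustration only; it should not be read as EazyML's formula. The label lists are hypothetical.

```python
import math
from collections import Counter

def entropy_balance(labels):
    """Normalized Shannon entropy of the label distribution, in [0, 1].

    1.0 means all classes are equally frequent; values near 0 mean one
    class dominates. Returns 0.0 when fewer than two classes are present.
    """
    counts = Counter(labels)
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))

# Hypothetical label lists, for illustration only.
even = ["positive"] * 50 + ["negative"] * 50
skewed = ["positive"] * 95 + ["negative"] * 5
print(round(entropy_balance(even), 3), round(entropy_balance(skewed), 3))
```

A score of this kind can be compared against a threshold such as the 0.5 reported above.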

4.2.4 Data Correctness?¶

In [9]:
ez_display_json(res['data_correctness_quality'])
{   'insights': {'0': [], '1': []},
    'message': 'Please infer the correctness through logical inspection of the '
               'insights',
    'quality_alert': 'Please verify that the above rules are making sense or '
                     'not. In case there are one or more rules which appear '
                     'incorrect from an expert perspective, please double '
                     'check your files for the variables, their correct '
                     'values, in the offending rules'}

4.2.5 Data Correlations? Look for Strongly Correlated Features¶

In [10]:
# Nested dict: feature -> {other feature -> correlation coefficient}
corr = res['data_correlation_quality']['data_correlation']
feat_list = list(corr.keys())
df_corr = pd.DataFrame(columns=feat_list)
corr_dict = dict()

for feat in corr:
    # Start from a row of zeros with 1.0 on the diagonal
    corr_list = [0.0 for _ in feat_list]
    corr_list[feat_list.index(feat)] = 1.0
    corr_val = dict()
    for another_feat, value in corr[feat].items():
        corr_list[feat_list.index(another_feat)] = value
        # Keep strongly correlated entries (coefficient above 0.90)
        if value > 0.90:
            corr_val[another_feat] = value
    df_corr.loc[feat] = corr_list
    if corr_val:
        corr_dict[feat] = corr_val
In [11]:
ez_display_json(corr_dict)
{   'age': {'age': 1.0},
    'gender': {'gender': 1.0},
    'glucose': {'glucose': 1.0},
    'impluse': {'impluse': 1.0},
    'kcm': {'kcm': 1.0},
    'pressurehight': {'pressurehight': 1.0},
    'pressurelow': {'pressurelow': 1.0},
    'troponin': {'troponin': 1.0}}
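The output above contains only each feature paired with itself, which is 1.0 by definition and says nothing about redundancy between features. A variant that skips self-pairs and compares absolute values, so strong negative correlations are also reported, could look like the sketch below. It assumes the same nested feature-to-feature dictionary shape that the loop above iterates over; strong_pairs and the example values are hypothetical.

```python
def strong_pairs(correlations, threshold=0.90):
    """Return {feature: {other: r}} for |r| > threshold, excluding self-pairs."""
    result = {}
    for feat, others in correlations.items():
        strong = {other: r for other, r in others.items()
                  if other != feat and abs(r) > threshold}
        if strong:
            result[feat] = strong
    return result

# Hypothetical correlation values in the same nested-dict shape.
example = {
    "a": {"a": 1.0, "b": 0.95, "c": -0.1},
    "b": {"a": 0.95, "b": 1.0, "c": 0.2},
    "c": {"a": -0.1, "b": 0.2, "c": 1.0},
}
print(strong_pairs(example))
```

With this filter, an empty result means no two distinct features exceed the threshold.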

4.2.6 Data Emptiness?¶

In [12]:
ez_display_json(res['data_emptiness_quality'])
{   'message': 'There are no missing values present in the training data that '
               'was uploaded. Hence no records were imputed.',
    'success': True}

4.2.7 Data Dimension? Is it Adequate?¶

In [13]:
ez_display_json(res['data_shape_quality'])
{   'alert': 'false',
    'dataset_shape': [1319, 9],
    'message': 'Dataset dimension is adequate for further processing',
    'success': True}

4.2.8 Data Outliers?¶

In [14]:
try:
    outliers = res['data_outliers_quality']['outliers']
    outlier_df = pd.DataFrame(data=outliers['data'],
                              columns=outliers['columns'],
                              index=outliers['indices'])
    ez_display_df(outlier_df.head())
except (KeyError, TypeError, ValueError):
    # Assumed: the 'outliers' entry is absent or empty when nothing was flagged
    print("No outliers found")
      age  gender  impluse  pressurehight  pressurelow  glucose    kcm  troponin     class
1028   68       1       89            145           68    134.0  0.706      10.0  positive
   7   63       1       60            214           82     87.0  300.0      2.37  positive
  12   64       1       60            199           99     92.0   3.43      5.37  positive
 530   31       0       64            130           70    263.0  142.6     0.003  positive
1047   55       0       96            105           70     66.0  300.0     0.003  positive
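The notebook does not say which rule EazyML uses to flag these rows. For intuition, the sketch below applies Tukey's interquartile-range (IQR) rule, a common univariate outlier test, to a handful of kcm values taken from the previews in this notebook. The rule and the iqr_outliers helper are stand-ins for illustration, not a description of the library's method.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# kcm values from the five previewed training rows, plus the kcm value
# of row 7 from the outlier table above.
kcm_values = [1.8, 6.75, 1.99, 13.87, 1.08, 300.0]
print(iqr_outliers(kcm_values))
```

On a sample this small the quartiles are rough estimates, so the result is only indicative.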

4.2.9 Data Drift (Between Train and Test Datasets)¶

In [15]:
ez_display_json(res['drift_quality']['data_drift_analysis'])
{   'ks_data_drift_analysis': {   'analysis': 'significant drift',
                                  'data_drift': True,
                                  'decision threshold': 0.05,
                                  'feature : p_value': {   'age': 0.472,
                                                           'gender': 1.0,
                                                           'glucose': 0.087,
                                                           'impluse': 0.009,
                                                           'kcm': 0.172,
                                                           'pressurehight': 0.141,
                                                           'pressurelow': 0.371,
                                                           'troponin': 0.241}}}
In [16]:
ks_drift = res['drift_quality']['data_drift_analysis']['ks_data_drift_analysis']['feature : p_value']
drift_columns = []
for feature in ks_drift:
    if ks_drift[feature] < 0.05:
        drift_columns.append(feature)
        print(feature, ks_drift[feature])
impluse 0.009
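The key name ks_data_drift_analysis suggests a two-sample Kolmogorov-Smirnov (KS) test per feature, with a p-value below 0.05 treated as drift. The KS statistic itself is the largest gap between the empirical distribution functions of the two samples, which the sketch below computes in plain Python. The p-value step is omitted; scipy.stats.ks_2samp returns both the statistic and the p-value when SciPy is available. The sample values are hypothetical, and this is not necessarily how EazyML implements the check.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # The supremum is attained at one of the observed data points.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

# Hypothetical pulse-rate samples, for illustration only.
train_sample = [60, 64, 66, 70, 94]
test_sample = [60, 64, 66, 70, 94]
shifted_sample = [100, 105, 110, 120, 130]
print(ks_statistic(train_sample, test_sample))
print(ks_statistic(train_sample, shifted_sample))
```

Identical samples give a statistic of 0, and samples with no overlap give 1.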

4.2.10 Model Drift (Between Train and Test Datasets)¶

In [17]:
ez_display_json(res['drift_quality']['model_drift_analysis'])
{   'distributional_model_drift_analysis': {   'OF': 1.0,
                                               'OF_min': 1.0,
                                               'OF_prod': 1.0,
                                               'decision threshold': 0.5,
                                               'feature : OF_I': {   'age': 1.0,
                                                                     'gender': 1.0,
                                                                     'glucose': 1.0,
                                                                     'impluse': 1.0,
                                                                     'kcm': 1.0,
                                                                     'pressurehight': 1.0,
                                                                     'pressurelow': 1.0,
                                                                     'troponin': 1.0},
                                               'model_drift': False},
    'interval_model_drift_analysis': {   'OF': 1.0,
                                         'OF_min': 1.0,
                                         'OF_prod': 1.0,
                                         'decision threshold': 0.5,
                                         'feature : OF_I': {   'age': 1.0,
                                                               'gender': 1.0,
                                                               'glucose': 1.0,
                                                               'impluse': 1.0,
                                                               'kcm': 1.0,
                                                               'pressurehight': 1.0,
                                                               'pressurelow': 1.0,
                                                               'troponin': 1.0},
                                         'model_drift': False}}
In [18]:
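interval_analysis = res['drift_quality']['model_drift_analysis']['interval_model_drift_analysis']
interval_drift = interval_analysis['feature : OF_I']
# Use the threshold reported for this analysis (0.5), not the 0.05 p-value cutoff from the KS check.
# Assumed: a per-feature OF_I score below the threshold indicates drift.
threshold = interval_analysis['decision threshold']
model_drift_columns = []
for feature in interval_drift:
    if interval_drift[feature] < threshold:
        model_drift_columns.append(feature)
        print(feature, interval_drift[feature])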