EazyML Counterfactual Template

Define Imports

In [ ]:
!pip install --upgrade eazyml-counterfactual
!pip install gdown python-dotenv
In [1]:
import os
import numpy as np
import pandas as pd
import eazyml as ez
from eazyml_counterfactual import (
    ez_cf_inference,
    ez_init,
)
import gdown

from dotenv import load_dotenv
load_dotenv()

# Scikit-learn libraries for model building
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

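1. Initialize EazyML

The cell below reads the access key from the EAZYML_ACCESS_KEY environment variable (loaded from a .env file by load_dotenv, if one is present) and passes it to ez_init for authentication. If the variable is not set, ez_init defaults to a trial license.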

In [2]:
ez_init(os.getenv('EAZYML_ACCESS_KEY'))
Out[2]:
{'success': True,
 'message': 'Initialized successfully. You may revoke your consent to sharing usage stats anytime. You have exclusive paid access.'}
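Before calling ez_init, it can help to confirm whether the key was actually picked up from the environment. The following is a minimal sketch using only the standard library; `resolve_access_key` is a hypothetical helper written for this illustration and is not part of EazyML.

```python
import os


def resolve_access_key(env=os.environ, var_name="EAZYML_ACCESS_KEY"):
    """Return the access key from the environment, or None if it is unset or blank.

    Hypothetical helper for illustration; not part of the EazyML package.
    """
    key = env.get(var_name)
    if key is None or not key.strip():
        return None
    return key.strip()


key = resolve_access_key()
if key is None:
    print("EAZYML_ACCESS_KEY is not set; os.getenv would return None.")
else:
    print("EAZYML_ACCESS_KEY found.")

# In the notebook, the resolved value would then be passed on: ez_init(key)
```

Checking up front makes it obvious whether a trial license message in the ez_init response comes from a missing key or from some other cause.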

2. Define Dataset Files and Outcome Variable

In [ ]:
gdown.download_folder(id='1p7Udh2MjKyJPxI47FS89VowAz9ZEq_hG')
In [3]:
# Defining file paths for training and test datasets and specifying the outcome variable
train_file = os.path.join('data', "House Price Prediction - Train Data.xlsx")
test_file = os.path.join('data', "House Price Prediction - Test Data.xlsx")
outcome = "House_Price"

# Loading the training dataset and the test dataset
train_df = pd.read_excel(train_file)
test_df = pd.read_excel(test_file)
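If the download step was skipped or saved the files elsewhere, pd.read_excel fails with a file-not-found error. A small pre-check, sketched here under the assumption that the files live in a local `data` directory as the paths above suggest, reports which files are missing; `missing_files` is a hypothetical helper.

```python
import os


def missing_files(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [p for p in paths if not os.path.isfile(p)]


train_file = os.path.join("data", "House Price Prediction - Train Data.xlsx")
test_file = os.path.join("data", "House Price Prediction - Test Data.xlsx")

missing = missing_files([train_file, test_file])
if missing:
    print("Missing dataset files:", missing)
else:
    print("Both dataset files are present.")
```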

3. Dataset Information

This notebook uses a house price prediction dataset, split into a training file and a test file. Each row describes one house by seven numeric features together with its sale price. The goal is to predict the sale price of a house from its features.

Columns in the Dataset:

  • Square_Footage: Total area of the house in square feet; larger homes typically have higher prices.
  • Num_Bedrooms: Number of bedrooms in the house; more bedrooms usually increase the value.
  • Num_Bathrooms: Number of bathrooms in the house; more bathrooms often correlate with higher prices.
  • Year_Built: The year the house was built; newer homes may have higher prices due to modern features.
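  • Lot_Size: Size of the lot. The values in the rows shown in this notebook (roughly 0.7 to 4.5) suggest a unit such as acres rather than square feet. Larger lots can increase the property's value.
  • Garage_Size: Size of the garage (e.g., number of cars it can hold); larger garages may increase value.
  • Neighborhood_Quality: Numeric rating of the neighborhood (values from 1 to 10 appear in the rows shown in this notebook); higher ratings usually mean higher prices.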
  • House_Price: The selling price of the house; this is the target variable for prediction models.
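The column list above can double as a lightweight schema check. The sketch below compares a list of column names against the expected names; `check_columns` and the extra `Notes` column are hypothetical, introduced only for this example.

```python
EXPECTED_COLUMNS = [
    "Square_Footage", "Num_Bedrooms", "Num_Bathrooms", "Year_Built",
    "Lot_Size", "Garage_Size", "Neighborhood_Quality", "House_Price",
]


def check_columns(columns, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names relative to the expected list."""
    present = set(columns)
    wanted = set(expected)
    missing = [c for c in expected if c not in present]
    unexpected = [c for c in columns if c not in wanted]
    return missing, unexpected


# Example: a header that lacks the target column and carries an extra one.
missing, unexpected = check_columns(EXPECTED_COLUMNS[:-1] + ["Notes"])
print("missing:", missing)
print("unexpected:", unexpected)
```

In the notebook, the same function could be called as `check_columns(list(train_df.columns))` right after loading the data.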

3.1 Display the Dataset

Below is a preview of the dataset:

In [4]:
# Display the first few rows of the training DataFrame for inspection
ez.ez_display_df(train_df.head())
   Square_Footage  Num_Bedrooms  Num_Bathrooms  Year_Built  Lot_Size  Garage_Size  Neighborhood_Quality    House_Price
0            4235             3              3        2000  1.911679            1                     8  917235.410532
1            4006             4              2        2003  1.092441            2                     4  871566.562740
2             785             5              3        1995  3.823276            2                     3  262707.278933
3            2827             3              1        1977  3.213678            2                     4  605143.959115
4            2219             4              1        1965  0.725965            0                     4  470083.290367

4. Custom Modeling with Scikit-learn

4.1 Unified Preprocessing Class for Regression

In [5]:
class UnifiedRegressorPreprocessor:
    """Preprocessor for handling numerical and categorical features, 
    including scaling, encoding, and missing value imputation."""

    def __init__(self):
        self.numerical_imputer = SimpleImputer(strategy="mean")
        self.scaler = StandardScaler()
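        # `sparse_output` replaced `sparse` in scikit-learn 1.2; on older versions pass sparse=False instead.
        self.categorical_encoder = OneHotEncoder(drop="first", sparse_output=False)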
        self.target_scaler = StandardScaler()
        self.fitted = False

    def fit(self, X, y=None):
        """Fits preprocessing transformations on numerical & categorical features and target variable (if provided)."""
        self.numerical_columns = X.select_dtypes(include=[np.number]).columns
        self.categorical_columns = X.select_dtypes(include=[object]).columns

        self.numerical_imputer.fit(X[self.numerical_columns])
        self.scaler.fit(X[self.numerical_columns])
        self.categorical_encoder.fit(X[self.categorical_columns])

        if y is not None:
            self.target_scaler.fit(np.array(y).reshape(-1, 1))

        self.fitted = True
        return self  # follow the scikit-learn convention so calls can be chained

    def transform(self, X, y=None):
        """Applies fitted transformations to the dataset."""
        if not self.fitted:
            raise ValueError("Preprocessor not fitted. Call 'fit' first.")

        X_num = self.scaler.transform(self.numerical_imputer.transform(X[self.numerical_columns]))
        X_cat = self.categorical_encoder.transform(X[self.categorical_columns])
        feature_names = list(self.numerical_columns) + list(self.categorical_encoder.get_feature_names_out())

        X_transformed_df = pd.DataFrame(np.hstack((X_num, X_cat)), columns=feature_names, index=X.index)

        if y is not None:
            y_transformed = self.target_scaler.transform(np.array(y).reshape(-1, 1)).flatten()
            return X_transformed_df, y_transformed

        return X_transformed_df

    def inverse_transform_outcome(self, y):
        """Reverts the target variable to its original scale."""
        return self.target_scaler.inverse_transform(np.array(y).reshape(-1, 1)).flatten()

    def fit_transform(self, X, y=None):
        """Combines fit and transform steps."""
        self.fit(X, y)
        return self.transform(X, y)
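The target scaling in the class above standardizes House_Price before training, and inverse_transform_outcome maps predictions back to the original units. StandardScaler uses the population standard deviation and falls back to a scale of 1 for constant data, so the round trip can be sketched with the standard library alone. The helper names below are hypothetical, and the prices are the five House_Price values from the preview, rounded to two decimals.

```python
from statistics import fmean, pstdev


def fit_scaler(values):
    """Return (mean, std) using the population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    return mean, (std if std > 0 else 1.0)  # avoid dividing by zero for constant data


def scale(values, mean, std):
    """Standardize: z = (y - mean) / std."""
    return [(v - mean) / std for v in values]


def unscale(z_values, mean, std):
    """Invert the standardization: y = z * std + mean."""
    return [z * std + mean for z in z_values]


prices = [917235.41, 871566.56, 262707.28, 605143.96, 470083.29]
mean, std = fit_scaler(prices)
z = scale(prices, mean, std)
restored = unscale(z, mean, std)

print("scaled:", [round(v, 3) for v in z])
print("restored:", [round(v, 2) for v in restored])
```

Because the model is trained on scaled targets, its raw predictions are in standard-deviation units; the inverse step is what makes the reported RMSE and MAE readable in price units.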

4.2 Train and Evaluate Linear Regression Model

In [6]:
# Prepare training and test datasets
X_train, y_train = train_df.drop(columns=[outcome]), train_df[outcome]
X_test, y_test = test_df.drop(columns=[outcome]), test_df[outcome]

# Initialize and apply preprocessing
preprocessor = UnifiedRegressorPreprocessor()
X_train_transformed, y_train_transformed = preprocessor.fit_transform(X_train, y_train)
X_test_transformed, y_test_transformed = preprocessor.transform(X_test, y_test)

# Train Linear Regression model
model = LinearRegression()
model.fit(X_train_transformed, y_train_transformed)

# Generate predictions and revert scaling
y_pred_transformed = model.predict(X_test_transformed)
y_pred = preprocessor.inverse_transform_outcome(y_pred_transformed)

# Add predictions to test DataFrame
predicted_df = test_df.copy()
predicted_df[f"Predicted {outcome}"] = y_pred

# Display sample predictions
print("\nTest DataFrame with Predictions:")
display(predicted_df.head(10))

# Evaluate model performance
metrics = {
    "RMSE": np.sqrt(mean_squared_error(y_test, y_pred)),
    "MAE": mean_absolute_error(y_test, y_pred),
    "R2 Score": r2_score(y_test, y_pred),
}

print("\nModel Performance Metrics:")
for metric, value in metrics.items():
    print(f"{metric}: {value:.2f}")
    
Test DataFrame with Predictions:
   Square_Footage  Num_Bedrooms  Num_Bathrooms  Year_Built  Lot_Size  Garage_Size  Neighborhood_Quality   House_Price  Predicted House_Price
0            4012             3              1        2016  2.098092            1                     5  9.010005e+05           8.684582e+05
1            2310             3              1        1988  1.369622            1                     4  4.945375e+05           4.901319e+05
2            4708             1              3        1962  1.792970            1                     8  9.494042e+05           9.456196e+05
3            4932             2              1        1972  4.479598            1                     2  1.040389e+06           1.033595e+06
4            3646             1              1        1994  3.980987            0                     9  7.940100e+05           7.764987e+05
5            3586             2              2        1964  2.568429            0                    10  7.240336e+05           7.323173e+05
6            4638             4              3        2000  1.490399            1                     3  9.984392e+05           9.951187e+05
7            4127             5              2        1992  1.026156            2                     1  9.097134e+05           8.852059e+05
8            3781             2              1        1989  3.164076            0                     9  7.926815e+05           7.965207e+05
9            4243             2              1        2002  4.498088            2                     7  9.474908e+05           9.316157e+05

Model Performance Metrics:
RMSE: 10367.90
MAE: 8185.60
R2 Score: 1.00
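The three metrics above follow directly from the prediction errors. As a sketch of the underlying formulas (not the scikit-learn implementation), the function below computes them in plain Python. The sample values are the first three rows of the displayed predictions, read off at the precision shown, so the results will differ from the full-test-set figures.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE and R2 from paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "R2 Score": 1.0 - ss_res / ss_tot,
    }


y_true = [901000.5, 494537.5, 949404.2]
y_pred = [868458.2, 490131.9, 945619.6]

for name, value in regression_metrics(y_true, y_pred).items():
    print(f"{name}: {value:.4f}")
```

Since R2 is one minus the ratio of squared error to the variance of the prices, an RMSE near 10,000 against prices in the hundreds of thousands leaves R2 very close to 1, which is presumably why it prints as 1.00 at two decimals.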

5. EazyML Counterfactual Inference

5.1 Define Counterfactual Inference Configuration

In [7]:
# Define the selected features for prediction
selected_features = ['Square_Footage', 'Num_Bedrooms', 'Num_Bathrooms', 'Year_Built', 
                     'Lot_Size', 'Garage_Size', 'Neighborhood_Quality']

# Define invariant (fixed) features; every other selected feature is a variant (modifiable)
invariants = ['Year_Built']
variants = [feature for feature in selected_features if feature not in invariants]

# Define configurable parameters for counterfactual inference
cf_options = {   
    "variants": variants,  
    "outcome_ordinality": "maximize",  # Desired direction for the predicted outcome
    "train_data": train_file,
    "preprocessor": preprocessor,
}
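Because the variants list is derived by filtering, a misspelled name in invariants silently leaves that feature modifiable. The sketch below adds a consistency check on the two lists; `check_partition` is a hypothetical helper, and the check reflects only the notebook's own description of variants as modifiable features, not any documented EazyML validation.

```python
selected_features = ["Square_Footage", "Num_Bedrooms", "Num_Bathrooms", "Year_Built",
                     "Lot_Size", "Garage_Size", "Neighborhood_Quality"]
invariants = ["Year_Built"]
variants = [f for f in selected_features if f not in invariants]


def check_partition(selected, variants, invariants):
    """Raise ValueError unless variants and invariants split `selected` exactly."""
    unknown = [f for f in invariants if f not in selected]
    if unknown:
        raise ValueError(f"Invariants not in selected features: {unknown}")
    overlap = set(variants) & set(invariants)
    if overlap:
        raise ValueError(f"Features listed as both variant and invariant: {sorted(overlap)}")
    if set(variants) | set(invariants) != set(selected):
        raise ValueError("Variants and invariants do not cover all selected features")
    return True


check_partition(selected_features, variants, invariants)
print("variants:", variants)
```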

5.2 Perform Counterfactual Inference

In [8]:
# Specify the index of the test record for counterfactual inference
test_index_no = 0  
test_data = predicted_df.loc[[test_index_no]]  

# Perform Inference 
result, optimal_transition_df = ez_cf_inference(
    test_data=test_data,  
    outcome=outcome,  
    selected_features=selected_features,  
    model_info=model,  
    options=cf_options  
)

5.3 Display Results

In [9]:
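# Summarizes whether an optimal transition was found.
# 'Actual Outcome' in this output matches the model's prediction for the record, not its recorded House_Price.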
ez.ez_display_json(result)
{   'success': True,
    'message': 'Optimal transition found',
    'summary': {'Actual Outcome': 868458.17, 'Optimal Outcome': 1058596.79}}
In [10]:
# Details the feature changes needed to achieve the optimal outcome.
ez.ez_display_df(optimal_transition_df)
                Feature       Actual      Optimal  Percentage Change  Absolute Change
0        Square_Footage  4012.000000  4814.400000          20.000000       802.400000
1          Num_Bedrooms     3.000000     4.000000          33.300000         1.000000
2         Num_Bathrooms     1.000000     2.000000         100.000000         1.000000
3            Year_Built  2016.000000  2016.000000           0.000000         0.000000
4              Lot_Size     2.100000     2.520000          20.000000         0.420000
5           Garage_Size     1.000000     2.000000         100.000000         1.000000
6  Neighborhood_Quality     5.000000     6.000000          20.000000         1.000000
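The two change columns can be reproduced from the Actual and Optimal values. The sketch below assumes the percentage is computed relative to the actual value and rounded to one decimal, which is inferred from the 33.3 shown for Num_Bedrooms rather than taken from EazyML documentation; `describe_change` is a hypothetical helper.

```python
def describe_change(actual, optimal):
    """Return (percentage change, absolute change) from actual to optimal."""
    absolute = optimal - actual
    # Percentage is relative to the actual value; undefined when actual is zero.
    percentage = absolute / actual * 100.0 if actual != 0 else float("nan")
    return round(percentage, 1), round(absolute, 2)


# (actual, optimal) pairs copied from the transition table above
transitions = {
    "Square_Footage": (4012.0, 4814.4),
    "Num_Bedrooms": (3.0, 4.0),
    "Year_Built": (2016.0, 2016.0),
    "Lot_Size": (2.1, 2.52),
}

for feature, (actual, optimal) in transitions.items():
    pct, absolute = describe_change(actual, optimal)
    print(f"{feature}: {pct}% ({absolute:+})")
```

Year_Built shows no change because it was declared invariant, while every variant feature moves in the direction that raises the predicted price.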