Prince Grover
Apr 5, 2021

Santander — Product Recommendation

The dataset for this product recommendation problem is available on Kaggle.

Problem Statement: Santander has provided 1.5 years of customer behaviour data from Santander bank, and the task is to predict which new products customers will purchase. The data starts at 2015–01–28 and has monthly records of the products a customer holds, such as “credit card”, “savings account”, etc.

The details of the dataset can be found in:

https://www.kaggle.com/c/santander-product-recommendation/data

Points to be considered :

  1. The training dataset covers 1.5 years and is approximately 2.5 GB of CSV files
  2. It has 24 features which contribute to approximately 24 products
  3. Overall, the training dataset has 48 columns, i.e., 24 features and 24 product columns
  4. The column/feature names in the provided dataset are not in a standard form and need to be renamed as per the description given on Kaggle

Design/ Approach:

  1. The major problem with the dataset is its size (~2.5 GB), so the loading code needs to be optimized to use minimal memory without losing any data/features, so that pre-processing and modelling can be done with multiple algorithms
  2. An optimized loading function was defined, with which memory consumption was reduced to 68%
  3. Basic EDA was done on the training and test datasets, analysing the NULL values, the features, and the product columns
  4. After the data was imported, the last 24 product columns were consolidated into one column of comma-separated values (column names): wherever a column had the value 1 it was replaced with the column name, all other values were replaced with Null, and then a single new column was created by joining the 24 columns as comma-separated values
  5. All the columns were renamed as per the reference given on Kaggle
  6. It was also analysed how the different features correlate with the new product recommendation column, i.e., which features are most strongly correlated with the product recommendation
  7. The NULL values were analysed and replaced with meaningful values; some assumptions were made here (for example, gross income was replaced with the mean value, which may not be the correct assumption since the correct values should come from business stakeholders, and similar assumptions were made for 2–3 other features such as age)

Modularization of the code :

The practice of modularization was adopted to minimize CPU/memory utilization. The following steps were taken:

  1. The training dataset is consumed and pre-processed in one Jupyter notebook, and at the end the pre-processed file is saved locally
  2. In the next step, in a new Jupyter notebook, the pre-processed training dataset is consumed (again, to lower memory consumption, the optimized loading function created as part of the design is called)
  3. The pre-processed training dataset is converted to the float64 type so that it can be consumed by the algorithms without issues
  4. The following algorithms were tried:
  5. Linear Regression: RMSE was ~525 and R-squared was ~0.19
  6. Decision Tree Regression: score of ~78%
  7. Gradient Boosting: score of ~26%
  8. Light Gradient Boosting: RMSE was ~456 and R-squared was ~0.28
  9. Random Forest was also tried (only 15% of the total training dataset was fed in and it reached approximately a 40% score, but the model file was 9.5 GB, so it did not make sense to continue)

The best model was the Decision Tree Regressor, with a score of ~78%.

All the above models were saved using pickle.

Life Cycle flow diagram of the Model Creation :

Testing Dataset:

Using the modularization technique described above, the saved models were consumed in a different Jupyter notebook, where the testing dataset was loaded and pre-processed with the same patterns and dictionaries created from the training dataset.

Once pre-processed, predictions were generated for the testing dataset and saved in CSV format.
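The prediction notebook itself appears only as screenshots; the sketch below illustrates that flow under my own assumptions (file names such as decision_tree_model.pkl, encoders.pkl and the prediction CSV are illustrative, not the exact names used in the repo):

```python
import pickle
import pandas as pd

# Load the artefacts saved by the training notebooks (illustrative file names)
with open("decision_tree_model.pkl", "rb") as f:
    model = pickle.load(f)
with open("encoders.pkl", "rb") as f:
    encoders = pickle.load(f)

test = pd.read_csv("test_ver2.csv")
# ... apply the same renaming / NULL handling as in the training notebook ...
for col, mapping in encoders.items():
    test[col] = test[col].astype(str).map(mapping)

# Align the test features with the columns the model was trained on
features = test[model.feature_names_in_].astype("float64")
test["prediction"] = model.predict(features)

# ncodpers is the customer id in the raw Kaggle files
test[["ncodpers", "prediction"]].to_csv("predictions.csv", index=False)
```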

Code Snippets Elaboration:

  1. Memory Usage Reduction for Loading Dataset:
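The memory-reduction routine was shared as a screenshot in the original post; the sketch below shows the typical pandas downcasting approach such a routine uses (the exact thresholds and printout are my own illustration, not the repo's code):

```python
import pandas as pd

def reduce_mem_usage(df):
    """Downcast numeric columns and convert low-cardinality strings to category."""
    start_mem = df.memory_usage(deep=True).sum() / 1024 ** 2
    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="integer")
        elif pd.api.types.is_float_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="float")
        elif df[col].dtype == object and df[col].nunique() < 0.5 * len(df):
            df[col] = df[col].astype("category")
    end_mem = df.memory_usage(deep=True).sum() / 1024 ** 2
    print(f"Memory usage reduced from {start_mem:.1f} MB to {end_mem:.1f} MB "
          f"({100 * (start_mem - end_mem) / start_mem:.0f}% reduction)")
    return df

train = reduce_mem_usage(pd.read_csv("train_ver2.csv"))
```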
  2. Basic EDA:
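A rough idea of the basic checks, using the raw Kaggle column names such as age and sexo:

```python
# Shape, dtypes and missing values
print(train.shape)
print(train.dtypes)
print(train.isnull().sum().sort_values(ascending=False).head(20))

# A couple of feature distributions
print(train["age"].describe())
print(train["sexo"].value_counts(dropna=False))
```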
  3. Renaming the columns as per the reference from Kaggle:
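The renaming itself is a dictionary passed to DataFrame.rename; the English names below are my illustrative choices, and only a few of the 48 columns are shown:

```python
# Map the raw column names to readable names per the Kaggle data description
rename_map = {
    "fecha_dato": "report_date",
    "ncodpers": "customer_id",
    "ind_empleado": "employee_index",
    "sexo": "gender",
    "renta": "gross_income",
    "ind_cco_fin_ult1": "current_account",
    "ind_tjcr_fin_ult1": "credit_card",
    # ... remaining columns ...
}
train = train.rename(columns=rename_map)
```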
  4. Consolidating the last 24 columns into one for product recommendation (to be used as the target in the algorithms):
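A sketch of the consolidation step described in the design section, where 1s become column names, everything else becomes Null, and each row is joined into one comma-separated string (the column name products_owned is mine):

```python
import numpy as np

product_cols = train.columns[-24:]          # the 24 product flag columns

# Replace 1 with the column name and everything else with NaN
flags = train[product_cols].apply(lambda col: np.where(col == 1, col.name, np.nan))

# Join the remaining names per row into a single comma-separated target column
train["products_owned"] = flags.apply(lambda row: ",".join(row.dropna()), axis=1)
```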
  5. Preprocessing the data to identify NULL values and replace them with meaningful values:
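A sketch of the NULL handling with the assumptions mentioned earlier (mean for gross income; the median for age and the UNKNOWN bucket are my illustrations); column names follow the illustrative renaming above:

```python
import pandas as pd

# Gross income: fill with the mean (an assumption; business input would refine this)
train["gross_income"] = train["gross_income"].fillna(train["gross_income"].mean())

# Age arrives as strings in the raw file; coerce to numeric, then fill with the median
train["age"] = pd.to_numeric(train["age"], errors="coerce")
train["age"] = train["age"].fillna(train["age"].median())

# Categorical gaps get an explicit 'unknown' bucket
train["gender"] = train["gender"].fillna("UNKNOWN")
```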
  6. Create dictionaries:
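The “dictionaries” referred to here are, I assume, value-to-code mappings for the categorical columns, pickled so the test set can be encoded identically; a sketch under that assumption:

```python
import pickle

categorical_cols = ["gender", "employee_index"]   # extend with the other categorical columns

encoders = {}
for col in categorical_cols:
    mapping = {val: idx for idx, val in enumerate(train[col].astype(str).unique())}
    encoders[col] = mapping
    train[col] = train[col].astype(str).map(mapping)

# Persist the dictionaries for reuse in the test/prediction notebook
with open("encoders.pkl", "wb") as f:
    pickle.dump(encoders, f)
```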

Check Correlation :
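To check correlation against the consolidated product column, the string target first has to be encoded numerically; a minimal sketch:

```python
# Encode the consolidated target, then rank numeric features by correlation with it
target_map = {val: idx for idx, val in enumerate(train["products_owned"].unique())}
train["target"] = train["products_owned"].map(target_map)

corr = train.corr(numeric_only=True)["target"].sort_values(ascending=False)
print(corr.head(15))
```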

Save the pre-processed dataset:
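Saving the pre-processed frame locally (the file name is illustrative) so the modelling notebook can pick it up:

```python
# Persist the pre-processed training data for the modelling notebook
train.to_csv("train_preprocessed.csv", index=False)
```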

Apply Algorithms :

Random Forest : The data was split into training and testing sets and the model was trained iteratively, retraining on approximately 500,000 records at a time; however, the model created after approximately 3,000,000 records was 9.5 GB and started throwing out-of-memory errors.
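Vanilla RandomForestRegressor does not support chunked retraining out of the box; one way to approximate what is described above is scikit-learn's warm_start, which is what the sketch below assumes (chunk size matches the article, tree counts are illustrative):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X = train.select_dtypes("number").drop(columns=["target"])
y = train["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestRegressor(n_estimators=0, warm_start=True, n_jobs=-1)
chunk = 500_000
for start in range(0, len(X_train), chunk):
    rf.n_estimators += 20                          # add trees for each new slice
    rf.fit(X_train.iloc[start:start + chunk],      # the new trees see only this slice
           y_train.iloc[start:start + chunk])

print(rf.score(X_test, y_test))
```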

Light Gradient Boosting Model :
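A sketch of the LightGBM run and the metrics reported earlier (hyperparameters are illustrative; X_train/X_test reuse the 80/20 split from the Random Forest sketch above):

```python
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_squared_error, r2_score

lgbm = LGBMRegressor(n_estimators=500, learning_rate=0.05)
lgbm.fit(X_train, y_train)

pred = lgbm.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, pred))
print(f"RMSE: {rmse:.0f}  R-squared: {r2_score(y_test, pred):.2f}")  # article reports ~456 / ~0.28
```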

Gradient Boosting Model : This model only achieved a ~26% score.

Linear Regression :

Decision Tree Regressor :
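The Decision Tree Regressor gave the best score, and the model was pickled as mentioned earlier; a sketch using the same train/test split as above (hyperparameters and file name are illustrative):

```python
import pickle
from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor(random_state=42)
dt.fit(X_train, y_train)
print(f"Score: {dt.score(X_test, y_test):.2f}")   # article reports ~0.78

# Persist the best model for the prediction notebook
with open("decision_tree_model.pkl", "wb") as f:
    pickle.dump(dt, f)
```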

The reference code can be seen in the Git repo below:

https://github.com/PRINCEMAIN/santander-starter-Dataset-ProductRecommendation

AWS Deployment:

AWS Sagemaker was used for deployment of the above model.

In AWS Sagemaker, the above notebooks were brought in by cloning the Git repo where the code was uploaded. In the image below, the instance created is ‘MysantanderstarterNotebook’.

Please note that I used an ‘ml.t3.medium’ instance (free tier) for my notebook, but for training on the dataset and creating the model I used ‘ml.m5.12xlarge’ and reverted after training was completed.

You will also have to define the Git repo and the IAM role so that the notebook has access to the S3 bucket where you will upload the training, test, and pre-processed datasets, the dictionaries, the models, and the predictions.

See below the structure created in the S3 Bucket

The models created are saved as below:

Predictions are saved as below (I have saved these with a timestamp):

Machine Learning Pipeline :

As a next step, we have to make sure the created model is auto-triggered in a pipeline, where predictions are uploaded to the S3 bucket or an API is exposed to serve the predictions and show them to users in a UI.

There are multiple ways to do this, as given below, each with its own pros and cons.

i) Sagemaker Studio : The option highly recommended by AWS. It involves creating an ML pipeline with the help of a pre-filled template or a BYO (bring-your-own) template.

Using this option, an automatic pipeline is created in which highly available EC2 instances are launched, with a defined segregation of the training and testing datasets in S3, model package creation, and endpoint configuration.

Please refer below for more details on the same:

https://aws.amazon.com/blogs/aws/amazon-sagemaker-studio-the-first-fully-integrated-development-environment-for-machine-learning/

https://www.youtube.com/watch?v=KFuc2KWrTHs

Pros : Automatic pipeline creation without worrying about the basic setup of servers (in AWS) or the pipeline creation process.

Cons:

a.) Cost can be high, as the minimal configuration launched is ml.m5.large for training and inference

b.) Knowledge of Sagemaker-specific code and configuration in Jupyter is needed

c.) Not Suitable for training (because of unknowns)

d.) Bringing your own trained model also requires Sagemaker-specific coding skills for deployment

In our product recommendation scenario, this is not the right fit, as we do not have a major training requirement and the cost can also be high.

ii) Sagemaker End Point Configurations:

Using this option, an AMI needs to be created for the AWS instance.

The AMI created above then needs to be registered in ECR, which is then used for the model registered in Sagemaker.

Endpoint configurations are then used to create the endpoint.

This endpoint can be used by different applications to predict recommendations using the deployed models.

This option was tried, but it requires basic knowledge of EC2 AMI selection and container creation.

During container creation, it failed multiple times, seemingly because the image recipe created was not compatible with the chosen AMI.

This option can also lead to cost and performance issues, as an EC2 instance will be spun up every time product recommendations are predicted.

iii) Use AWS Lambda

AWS Lambda is the serverless option, where we can write code to call the Sagemaker notebook instance.

This is the cost-effective option for the given scenario, as an S3 trigger can be created via a Lambda trigger event to call the Lambda function, which in turn calls the notebook instance.

This process needs to be set up in two steps for our use case:

a.) Preprocess the raw user data : The user-provided data is uploaded in the given format (as in Kaggle) to a folder, and the notebook instance for preprocessing is called by the Lambda function. As output, the pre-processed data is uploaded to an identified folder of the S3 bucket.

b.) Predict using the pre-processed testing data : Using a similar approach as in the step above, a new Lambda function is triggered to call a different notebook instance, which is used for prediction. The prediction output is uploaded to the identified S3 bucket folder.

Below is the code for triggering the Notebook Instance :
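The trigger code was shown as a screenshot in the original post; below is a minimal sketch of such a Lambda handler using boto3 (the bucket/key handling is illustrative; the notebook instance name is the one created earlier):

```python
import boto3

sagemaker = boto3.client("sagemaker")

def lambda_handler(event, context):
    # Fired by the S3 PutObject trigger; start the pre-processing/prediction notebook
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = event["Records"][0]["s3"]["object"]["key"]
    print(f"New object s3://{bucket}/{key} - starting notebook instance")

    sagemaker.start_notebook_instance(
        NotebookInstanceName="MysantanderstarterNotebook"
    )
    return {"statusCode": 200, "body": "Notebook instance start requested"}
```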

A bash script is used in the lifecycle configuration of the notebook instance to start and stop the notebook automatically.

The reference for the same can be found below:

https://towardsdatascience.com/automating-aws-sagemaker-notebooks-2dec62bc2c84#:~:text=Use%20CloudWatch%20to%20trigger%20the,shuts%20down%20the%20notebook%20instance.