Why Do We Build an ML Model on Top of Credit Reports?

“Why are you building another machine learning model on top of the credit reports?

Don’t the credit reports already tell you everything about your customers?”

Someone from a device financing background

Someone asked me that yesterday. Well, my short answer is “segmentation”.

Some background first. We are in the micro-loan financing business. For each applicant, we check his or her credit report from the credit bureau in real time. Even when the applicant has a good credit score, we still feed all the information to a machine learning model to reach a final decision.

Yes, the credit reports tell a lot about your customers. But this information is too general to advise you on the next action: it does not tell you who and where your ideal customers are. How much interest should you charge? Will they repay on time?


  • Owing PTPTN repayments does not mean someone will not pay the car installment.
  • Paying the car installment does not mean someone will pay for the device financing.

Thus, by learning from historical data, your model further segments the customers to find those who best fit your business objective.

They may not be the ones with strong credit scores. They are the group of folks who are willing to take up the offer and pay on time.

Original Post at https://www.linkedin.com/posts/zan-kai-chong_fintech-creditscore-activity-6689691964625055744-tUN7


Sharing Puchong Property Prices (April 2019) on GitHub

Real estate is one of the favorite investments for working folks in Malaysia. As part of a data-savvy company, the data engineering team has harvested the prices of Puchong properties from one of the well-known property websites, and the extracted data (Puchong area only*) are shared on GitHub at https://github.com/zkchong/puchong-properties-2019-april .

Like you guys, we have a lot of questions in mind:

    1. Do all properties share about the same price per square foot (psf) if they are located in about the same area?
    2. Is the psf slightly higher if they are located near the main road?
    3. Which properties should I invest in? ^o^
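
Question 1 can be explored with a few lines of Python. The sketch below is only an illustration: the records, the field names (`area`, `price_rm`, `built_up_sqft`) and the helper `psf_by_area` are hypothetical and are not taken from the shared dataset, whose actual columns should be checked first.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical records; the real dataset's column names may differ.
listings = [
    {"area": "Area A", "price_rm": 500_000, "built_up_sqft": 1_000},
    {"area": "Area A", "price_rm": 660_000, "built_up_sqft": 1_200},
    {"area": "Area B", "price_rm": 300_000, "built_up_sqft": 1_000},
]

def psf_by_area(rows):
    """Return {area: (mean psf, population standard deviation of psf)}."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["area"]].append(row["price_rm"] / row["built_up_sqft"])
    return {area: (mean(values), pstdev(values)) for area, values in groups.items()}

summary = psf_by_area(listings)
for area, (avg, spread) in summary.items():
    print(f"{area}: mean psf = RM {avg:.0f}, spread = RM {spread:.0f}")
```

A small spread within an area would suggest that neighbouring properties do share a similar psf; a large one would suggest that other factors, such as distance to the main road, matter.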

Feel free to clone the data from GitHub.

May all the mighty data scientists plot their best insights.

We plotted a bar chart of Puchong housing prices (April 2019) here.

Original LinkedIn post at https://www.linkedin.com/feed/update/urn:li:activity:6520490026600624128

Transfer Learning – Making Machine Learning Models Work Even with Insufficient* Labelled Data

1. Introduction

Let us start the story with a data science project that predicts users’ credit scores using telco data in country A. It is a successful machine learning (ML) project because we have sufficiently large and comprehensive labelled data.

Then the business team expects the data science team to quickly replicate the same model for a new market in country B. However, we can’t, as the currency and consumer behaviour of country B are different from those of country A.

I believe this story is common in data science companies. Generally, an ML model is built on the assumption that the training and test data are drawn from the same feature space and the same distribution. In other words, once the distribution shifts, the model fails.

Once the distribution shifts, the model fails.

Researchers have long thought about this problem, and one solution is called transfer learning. In layman’s terms, we have labelled data from the source domain and we would like to build an ML model for a target domain whose task or distribution differs from that of the source domain (Pan and Yang, 2010).

In this article, we experiment with a transfer learning method proposed by Hal Daumé III (2006), named easy adaptation (the name was coined in his later paper). In the following, we briefly explain easy adaptation in Section 2 and the experiment in Section 3. Finally, the conclusion is drawn in Section 4.

2. Transfer Learning with Easy Adaptation

Easy adaptation has a simple construction method, described in Daumé III’s paper “Frustratingly Easy Domain Adaptation”. Say we have labelled data with the same feature space (attributes x0 and x1 with output y) in both the source and target domains, but with different distributions (refer to Figure 1). For instance, the second record of the source domain data is (x0=2, x1=20, y=2), but the output becomes y=1 in the first record of the target domain data.

Figure 1: Easy adaptation on illustrative tables from the source and target domains.
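
The feature augmentation at the heart of easy adaptation, as described in Daumé III’s paper, maps each source record x to <x, x, 0> and each target record x to <x, 0, x>: a shared copy, a source-only copy and a target-only copy. A minimal sketch in Python follows; the function name `augment` and the target values are hypothetical, and only the source record (x0=2, x1=20) comes from the text above.

```python
def augment(x, domain):
    """Easy-adaptation feature map: <shared, source-only, target-only>."""
    zeros = [0] * len(x)
    if domain == "source":
        return list(x) + list(x) + zeros
    if domain == "target":
        return list(x) + zeros + list(x)
    raise ValueError(f"unknown domain: {domain!r}")

# (x0, x1) pairs; the target values are made up for illustration.
source_rows = [(2, 20)]
target_rows = [(3, 30)]

augmented = (
    [augment(x, "source") for x in source_rows]
    + [augment(x, "target") for x in target_rows]
)
print(augmented)
```

Any standard classifier can then be trained on the augmented rows from both domains together. The shared copy lets it learn what the two domains have in common, while the domain-specific copies absorb the differences.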


How powerful is data?

The following map presents a very small sample of the distribution of student accommodation for one of the universities in Malaysia, based on ada mobile ad exchange data. The students and their approximate residence spots are identified by geofencing the day-time and night-time location updates.

Some highlights:

  • Obviously, most students stay near the campus. Some prefer driving, which explains why the car park of that university is always full.
  • The full details of the map (not shown here) can be used to plan the university shuttle bus route and to identify the students’ favourite hangout spots.
  • We do not unlock the identities of the mobile phone owners, a.k.a. Pandora’s box. We know where they are but not who they are.
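
The day-time and night-time idea described before the highlights can be sketched as follows. This is a simplified illustration, not the team’s actual pipeline: the ping records, the grid size, the night-time window (22:00 to 06:00) and the function name `likely_residence` are all assumptions.

```python
from collections import Counter

def likely_residence(pings, cell_deg=0.01):
    """Return the grid cell (as integer indices) with the most night-time pings.

    Each ping is (hour_of_day, latitude, longitude) for one anonymous device.
    Returns None when the device has no night-time pings.
    """
    night_cells = Counter()
    for hour, lat, lon in pings:
        if hour >= 22 or hour < 6:  # assumed night-time window
            cell = (round(lat / cell_deg), round(lon / cell_deg))
            night_cells[cell] += 1
    if not night_cells:
        return None
    return night_cells.most_common(1)[0][0]

# Hypothetical pings for a single anonymous device.
pings = [
    (23, 3.041, 101.652),  # night
    (2, 3.043, 101.651),   # night, same grid cell
    (14, 3.120, 101.700),  # day, ignored here
]
print(likely_residence(pings))
```

Running the same count over the day-time pings would give the device’s usual day-time spot, for example the campus geofence.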

The work is credited to the ada Data Science and Engineering team.

Original post at https://www.linkedin.com/feed/update/urn:li:activity:6412478258041970688


The Things about Job Title

[My original post from LinkedIn: https://www.linkedin.com/pulse/things-job-title-zan-kai-chong/]

Switching my job title from data scientist to machine learning engineer amused a lot of my friends. They wondered since when I had become a lecturer again (note: machine “learning”). Despite my wrongful explanation, I started thinking about what my real job title is, other than the words printed on my name cards.

Analytically, I should list all my job functions, then build a heat map or histogram from all the words in the descriptions, and then identify the common words by applying a max(count) or corr function. Okay, sounds right. Here we go.

First, I work on the AWS platform. As a trustworthy power user, I stress-test the costly computing instances and provide helpful IT support to newcomers (ladies preferred). I also occasionally use EMR (very expensive computer clusters) like a boss for big data stuff. Occasionally, I speak AWS jargon as if I were a real AWS engineer.

Okay. You saw the words “big data”. Of course, I am (acting like) a big data engineer as I work on peta-ful (a new word to describe petabytes of) data. These petaful data are our asset to track you. We may not be as good as Cambridge Analytica, but we know many things about you and what you did last weekend. The more attached you are to your phone, the more we know about you.

In short, it is fun to work in an analytics company. Well, my real job title? I am wondering as well. Perhaps I should just call myself “engineer”.

I was greatly pleased with myself until my civil engineer friends started mocking me with a photo.

Venturing into Data Science

After seven years of academic life at UTAR, I decided to move on to the data science industry to explore the opportunities in big data transformation.

It is a hard but necessary move for me. I will leave the full story to an offline, face-to-face discussion if our frequency and space-time are right.

Here are my observations after six months of working in the data science industry. The majority of Malaysian industries are business-driven entities: business comes first and research second (or last). Usually, R&D or r&D departments hardly survive the evolution (a.k.a. company restructuring / reorganization), considering that their output is always less convincing in board meetings. One common practice is to embed the R element in product development so that some tangible output is there.

Another interesting thing is that the term research varies a lot in industry. It can refer to operational research, product research, applied research, etc. It is definitely not the kind of research that allows you to sit down for a whole month just to derive an equation that is elegant but of little use to them.

After all, I am the latter type of person. I guess it is going to take another few months before my boss realizes that I am working on a niche research topic instead of building the requested machine learning model.