Why Do We Build an ML Model on Top of Credit Reports?


“Why are you building another machine learning model on top of the credit reports?

Didn’t the credit reports already tell you everything about your customers?”

Someone from a device financing background

Someone asked me that yesterday. Well, my short answer is “segmentation”.

Some background to the story: we are in the micro-loan financing business. For each applicant, we check the credit report from the credit bureau in real time. Even if the applicant has a good credit score, we still feed all the information into a machine learning model to reach a final decision.

Yes, the credit reports tell a lot about your customers. But this information is too general to advise you on the next action: it does not tell you who and where your ideal customers are. How much interest should you charge? Will they repay on time?


  • Someone who owes PTPTN repayments will not necessarily fail to pay a car installment.
  • Someone who pays a car installment will not necessarily pay for device financing.

Thus, by learning from historical data, your model further segments the customers into those who best fit your business objective.

They may not be the ones with strong credit scores. They are the group of folks who are willing to take up the offer and pay on time.

Original Post at https://www.linkedin.com/posts/zan-kai-chong_fintech-creditscore-activity-6689691964625055744-tUN7


Sharing Puchong Property Prices (April 2019) on GitHub

Real estate is one of the favorite investments for working folks in Malaysia. As part of a data-savvy company, the data engineering team has harvested the prices of Puchong properties from one of the well-known property websites, and the extracted data (Puchong area only*) is shared on GitHub at https://github.com/zkchong/puchong-properties-2019-april .

Like you guys, we have a lot of questions in mind:

    1. Do all the properties share about the same price per square foot (psf) if they are located in about the same area?
    2. Will the psf be slightly higher if they are located near the main road?
    3. Which properties should I invest in? ^o^

Feel free to clone the data from GitHub.

May all the mighty data scientists plot their best insights.

We plotted a bar chart of the Puchong housing prices (April 2019) here.
Original linkedin post @ https://www.linkedin.com/feed/update/urn:li:activity:6520490026600624128

Transfer Learning – Making Machine Learning Models Work Even with Insufficient* Labelled Data

1. Introduction

Let us start the story with a data science project that predicts users’ credit scores using telco data in country A. It is a successful machine learning (ML) project because we have sufficiently large and comprehensive labelled data.

Then the business team expects the data science team to quickly duplicate the same model for a new market in country B. However, we can’t, as the currency and consumer behaviour of country B differ from those of country A.

I believe this story is common in data science companies. Generally, an ML model is built on the assumption that the training and test data are drawn from the same feature space and the same distribution. In other words, once the distribution shifts, the model fails.


Researchers have long studied this problem, and one solution is called transfer learning. In layman’s terms, we have labelled data from a source domain and would like to build an ML model for a target domain whose task or distribution differs from that of the source domain (Pan and Yang, 2010).

In this article, we experiment with a transfer learning method proposed by Hal Daumé III (2006), named easy adaptation (the name was coined in his later paper). In the following, we briefly explain easy adaptation in Section 2 and describe the experiment in Section 3. Finally, we draw the conclusion in Section 4.

2. Transfer Learning with Easy Adaptation

Easy adaptation, described in Daumé III’s paper “Frustratingly Easy Domain Adaptation”, has a simple construction. Say that we have labelled data with the same feature space (attributes x0 and x1 with output y) in both the source and target domains, but with different distributions (refer to Figure 1). For instance, the second record of the source domain data is (x0=2, x1=20, y=2), but the output becomes y=1 in the first record of the target domain data.

Figure 1: Easy adaptation on hypothetical tables from the source and target domains.
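
As a minimal sketch of the construction (not the code used in this article), the feature augmentation from “Frustratingly Easy Domain Adaptation” can be written in a few lines of plain Python. Each source record x is mapped to (x, x, 0) and each target record to (x, 0, x), so a single model can learn a shared copy of the features alongside a domain-specific copy. The function name `augment` and the sample records are assumptions for illustration; only the values x0=2, x1=20 are taken from the text.

```python
def augment(rows, domain):
    """Easy-adaptation feature map: triple the feature space.

    Output layout per record: [shared copy | source-only copy | target-only copy].
    """
    augmented = []
    for row in rows:
        row = list(row)
        zeros = [0] * len(row)
        if domain == "source":
            augmented.append(row + row + zeros)   # (x, x, 0)
        elif domain == "target":
            augmented.append(row + zeros + row)   # (x, 0, x)
        else:
            raise ValueError("domain must be 'source' or 'target'")
    return augmented


# Hypothetical records with two features (x0, x1); values are illustrative only.
source_rows = [[1, 10], [2, 20], [3, 30]]
target_rows = [[2, 20], [4, 40]]

train_rows = augment(source_rows, "source") + augment(target_rows, "target")
print(train_rows[1])  # source record (2, 20) -> [2, 20, 2, 20, 0, 0]
print(train_rows[3])  # target record (2, 20) -> [2, 20, 0, 0, 2, 20]
```

The augmented rows, stacked with the labels from both domains, can then be passed to any standard supervised learner.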


How powerful is data?


The following map presents a very small sample of the distribution of student accommodation for one of the universities in Malaysia, based on ada mobile ad exchange data. The students and their approximate places of residence are identified by geofencing the daytime and night-time location updates.
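
The actual pipeline is not described here, so the following is only a rough sketch of how such a day and night geofence could work. It assumes a circular geofence around a made-up campus coordinate, treats 9:00 to 18:00 as daytime and 22:00 to 6:00 as night, and takes the median of a device's night-time points as its approximate residence. Every name, threshold and coordinate in it is hypothetical.

```python
from collections import defaultdict
from math import asin, cos, radians, sin, sqrt
from statistics import median

# Hypothetical campus geofence: centre (lat, lon) and radius in km.
CAMPUS = (3.0, 101.0)
RADIUS_KM = 1.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def estimate_residences(pings):
    """pings: iterable of (device_id, lat, lon, hour_of_day)."""
    seen_on_campus = set()            # devices inside the geofence during the day
    night_points = defaultdict(list)  # night-time locations per device
    for device, lat, lon, hour in pings:
        if 9 <= hour < 18 and haversine_km(lat, lon, *CAMPUS) <= RADIUS_KM:
            seen_on_campus.add(device)
        elif hour >= 22 or hour < 6:
            night_points[device].append((lat, lon))
    homes = {}
    for device in seen_on_campus:
        points = night_points.get(device)
        if points:  # the median is robust to an occasional night out
            homes[device] = (median(p[0] for p in points),
                             median(p[1] for p in points))
    return homes


# Illustrative pings: device "a" visits campus by day, device "b" never does.
pings = [
    ("a", 3.001, 101.001, 10),
    ("a", 3.05, 101.02, 23), ("a", 3.06, 101.02, 2), ("a", 3.07, 101.02, 4),
    ("b", 3.5, 101.5, 11), ("b", 3.5, 101.5, 23),
]
homes = estimate_residences(pings)
print(homes)
```

Only anonymous device identifiers are needed for this, which matches the point below that we know where the devices are but not who owns them.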

Some highlights:

  • Obviously, most students stay near campus. Some prefer driving, which explains why the university’s car park is always full.
  • The fully detailed map (not shown here) can be used to plan the university shuttle bus routes and to identify the students’ favourite hangout spots.
  • We do not unlock the identities of the mobile phone owners, a.k.a. Pandora’s box. We know where they are but not who they are.

The work is credited to ada Data Science and Engineering team.

Original post at https://www.linkedin.com/feed/update/urn:li:activity:6412478258041970688


The Things about Job Title

[My original post on LinkedIn: https://www.linkedin.com/pulse/things-job-title-zan-kai-chong/]

Switching my job title from data scientist to machine learning engineer amused a lot of my friends. They wondered since when I had become a lecturer again (note: machine “learning”). After my unconvincing explanation, I started thinking about what my real job title is, other than the words printed on my name cards.

Analytically, I should list all my job functions, build a heat map or histogram from all the words in the descriptions, and then identify the common words by applying max(count) or a corr function. Okay, sounds right. Here we go.

First, I work on the AWS platform. As a trustworthy power user, I stress-test the costly computing instances and provide helpful IT support to newcomers (ladies preferred). I also occasionally use EMR (very expensive computer clusters) like a boss for big data stuff. Occasionally, I speak AWS jargon as if I were a real AWS engineer.

Okay. You saw the words “big data”. Of course, I am (acting like) a big data engineer, as I work on peta-ful (a new word to describe petabytes of) data. This petaful data is our asset to track you. We may not be as good as Cambridge Analytica, but we know many things about you and what you did last weekend. The more attached you are to your phone, the more we know about you.

In short, it is fun to work in an analytics company. Well, my real job title? I am wondering as well. Perhaps I should just call myself “engineer”.

I was greatly pleased with myself until my civil engineer friends started mocking me with a photo.

Venturing into Data Science

After seven years of academic life at UTAR, I decided to move on to the data science industry to explore the opportunities in big data transformation.

It was a hard but necessary move for me. I will leave the full story to an offline, face-to-face discussion if our frequency and space-time are right.

Here are my observations after six months of working in the data science industry. The majority of Malaysian companies are business-driven entities: business comes first and research comes second (or last). Usually, R&D or r&D departments hardly survive the evolution (a.k.a. company restructuring or reorganization), considering that their output is always less convincing in board meetings. One common practice is to embed the R element in product development so that some tangible output is there.

Another interesting thing is that the meaning of the term research varies a lot in industry. It can refer to operational research, product research, applied research, etc. It is definitely not the kind of research that allows you to sit down for a whole month just to derive an equation that is elegant but, to them, less useful.

After all, I am the latter type of person. I guess it is going to take another few months before my boss realizes that I am working on a niche research topic instead of building the requested machine learning model.


Hiring a Research Assistant

We are looking for ONE candidate who is

  • Good at programming, mathematics, microcontroller systems, and the principles of network communication
  • Proficient in English
  • Disciplined and independent

to work with us to improve the performance of the Internet of Things (IoT) with locally decodable codes. The successful candidate will be paid RM 2,500 for 12 months, renewable for another year (1+1 policy). He/She is expected to register for the Master of Engineering Science in the Lee Kong Chian Faculty of Engineering Science (LKC FES) and complete the study in 24 months.

Knowledge of network communication and coding theory is preferable, but not a must. The successful candidate must register for the Master of Engineering Science in the Lee Kong Chian Faculty of Engineering Science (LKC FES).


We consider a mobile wireless sensor network (MWSN) that consists of thousands of static sensor nodes with one or multiple mobile sinks (mobile base stations). Such a dynamic network is commonly found in IoT applications, such as users with wireless wearable devices walking on streets or shopping at outlets. The wearable devices act as the mobile sinks that continuously fetch environmental sensor data in order to provide ubiquitous services to the users.

The candidate will work together with the team to design the communication protocol, implement the testbed on Raspberry Pi, etc. Minimal logistics work may be required.

The team members are Dr. Chong Zan Kai, Prof. Ir. Dr. Goi Bok Min, Prof. Ir. Dr. Ewe Hong Tat, Dr. Lai An Chow, Dr. Goh Hock Guan and Ms. Tan Lyk Yin. This is also a collaboration project with researchers from Kwansei Gakuin University, Japan and Victoria University of Wellington, New Zealand.

Interested candidates should send their resumes to Dr. Chong Zan Kai (chongzk@utar.edu.my).

Note: This call is closed. Thank you.

Call for a Research Assistant at UTAR

We are looking for ONE candidate who is

  • Good at programming and mathematics
  • Willing to learn
  • Proficient in English
  • Disciplined and independent

to work on a research project that improves future network throughput with a computer accelerator. The successful candidate will be paid RM 2,500 for 12 months (renewable for another year). Knowledge of parallel processing and coding theory is preferable, but not a must. The successful candidate is expected to register for the Master of Engineering Science in the Lee Kong Chian Faculty of Engineering Science (LKCFES). LKCFES FYP-2 students are encouraged to apply.

Description of the Project

A rateless erasure code is a kind of error-correction code in which the original message can be reconstructed from a fraction of the encoded message. The emergence of rateless erasure codes promises better network throughput, but this is constrained by the bottleneck in the corresponding encoding and decoding speed.

The candidate needs to improve the encoding and decoding speed of the rateless erasure code using a graphics processing unit (GPU) and to apply it in network communication. Some logistics work may be required.
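
To make the reconstruction property concrete, here is a small CPU-only sketch of one simple rateless scheme, a random linear fountain code over GF(2). It is not the project's code and it is not GPU-accelerated. Each encoded symbol is the XOR of a random subset of the source blocks, the sender can keep generating symbols indefinitely, and the receiver decodes by Gaussian elimination once it holds enough independent symbols, no matter which ones were lost. The function names, loss rate and seed are assumptions for illustration.

```python
import random


def encode_symbol(blocks, rng):
    """Return (mask, value): value is the XOR of the blocks selected by mask."""
    mask = 0
    while mask == 0:                      # skip the useless all-zero combination
        mask = rng.getrandbits(len(blocks))
    value = 0
    for i, block in enumerate(blocks):
        if mask >> i & 1:
            value ^= block
    return mask, value


def decode(symbols, k):
    """Gaussian elimination over GF(2); returns k blocks, or None if rank < k."""
    pivots = {}                           # leading bit -> (mask, value)
    for mask, value in symbols:
        while mask:
            bit = mask.bit_length() - 1   # highest set bit of this row
            if bit not in pivots:
                pivots[bit] = (mask, value)
                break
            pmask, pvalue = pivots[bit]
            mask ^= pmask                 # cancel the leading bit
            value ^= pvalue
    if len(pivots) < k:
        return None
    blocks = [0] * k
    for bit in range(k):                  # back-substitute from the lowest bit
        mask, value = pivots[bit]
        for lower in range(bit):
            if mask >> lower & 1:
                value ^= blocks[lower]
        blocks[bit] = value
    return blocks


def demo(message=b"rateless", loss=0.3, seed=7):
    """Send symbols over a lossy channel until the receiver can decode."""
    rng = random.Random(seed)
    blocks = list(message)                # one byte per source block
    received, sent = [], 0
    while True:
        symbol = encode_symbol(blocks, rng)
        sent += 1
        if rng.random() < loss:           # symbol erased by the channel
            continue
        received.append(symbol)
        decoded = decode(received, len(blocks))
        if decoded is not None:
            return bytes(decoded), sent, len(received)


recovered, sent, received = demo()
print(recovered, sent, received)
```

Re-running the full elimination after every received symbol keeps the sketch short but is wasteful, and the cost of this decoding step is exactly the kind of bottleneck the project description refers to.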

The team members include Chong Zan Kai, Prof. Goi Bok Min, Prof. Ewe Hong Tat, Dr. Lai An Chow and Yap Wun She.

Interested candidates should send their resumes to Chong Zan Kai (chongzk+ra@utar.edu.my).

Download the PDF here: Call for RA in UTARRF (2015).

Note: This call is closed. Thank you.

Tutorial on the Sage Mathematics Software System / Python Programming

Dear UTAR Students,

I am giving a 3-hour tutorial on Sage this coming Friday (23-Jan-2015), 10am-1pm, at the SE203 computer lab.

Sage (http://www.sagemath.org/) is an open-source mathematics software system built on the Python programming language. Unlike C/C++ programming, Sage lets users focus on solving the problem instead of dealing with computer resources and settings (e.g. memory architecture, pointers, variable types, brackets, etc.).

Basically, the tutorial is meant to help the students of UEET2533 Information Theory and Coding kick-start their assignment. Following the common practice in teaching this subject, the tutorial will be open to the public (UTAR). Students from other courses are welcome to join, as the tutorial will be general enough for any student with a little knowledge of C programming (not a must, though).

We will do the programming using the cloud service at http://cloud.sagemath.org and the slides can be found at http://1drv.ms/1ynqb4r . We will spend the first hour learning to program in the cloud, the second hour on syntax and control logic, and the last hour on solving some simple math problems (e.g. 1+1=2).

No registration is required, but do let me know if you are coming, as the computer lab can accommodate only a limited number of students.

Let’s have fun programming!