My colleagues asked me to tell a joke. Okay, here you go.
Many know that porn websites collect their visitors' information and sell it to data science companies. When you receive a call from an agent offering a good package that meets your needs, it could be because of when, where and what you surf on the Internet. And you are ignorant of it.
Let us start the story with a data science project that predicts users' credit scores using telco data in country A. It is a successful machine learning (ML) project, as we have sufficiently large and comprehensive labelled data.
Then, the business team expects the data science team to quickly duplicate the same model for a new market in country B. However, we can't, as the currency and consumer behaviour of country B are different from those of country A.
I believe the story is common in data science companies. Generally, an ML model is built on the assumption that the training and test data are drawn from the same feature space and the same distribution. In other words, once the distribution shifts, the model fails.
Researchers have long studied this problem, and the solution is called transfer learning. In layman's terms, we have labelled data from the source domain and we would like to build an ML model for a target domain whose task or distribution differs from that of the source domain (Pan and Yang, 2010).
In this article, we will experiment with a transfer learning method proposed by Hal Daumé III (2006), named easy adaptation (the name was coined in his later paper). In the following, we briefly explain easy adaptation in Section 2 and the experiment in Section 3. Finally, the conclusion is drawn in Section 4.
2. Transfer Learning with Easy Adaptation
Easy adaptation, described in Daumé III's paper “Frustratingly Easy Domain Adaptation”, is simple to construct. Say that we have labelled data with the same feature space (attributes x0 and x1 with output y) in both the source and target domains, but with different distributions (refer to Figure 1). For instance, the second record of the source domain data is (x0=2, x1=20, y=2), but the output becomes y=1 in the first record of the target domain data.
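To make the construction concrete, here is a minimal sketch of the feature augmentation described in the “Frustratingly Easy Domain Adaptation” paper: every source record x is mapped to <x, x, 0> and every target record to <x, 0, x>, so that a single model can learn a shared weight block plus one block per domain. The function name `augment` is my own, and reusing the attribute values (x0=2, x1=20) for the target record is an assumption made only for illustration.

```python
def augment(x, domain):
    """Return the augmented vector [general | source-specific | target-specific]."""
    x = [float(v) for v in x]
    zeros = [0.0] * len(x)
    if domain == "source":
        return x + x + zeros      # <x, x, 0>
    if domain == "target":
        return x + zeros + x      # <x, 0, x>
    raise ValueError(f"unknown domain: {domain!r}")


# (x0=2, x1=20) is the source record quoted in the text; giving the target
# record the same attribute values is an assumption for illustration only.
source_row = augment([2, 20], "source")
target_row = augment([2, 20], "target")
print(source_row)
print(target_row)
```

The augmented rows from both domains are then stacked into one training set and fed to any ordinary supervised learner. At prediction time, records from the new market are augmented with the target mapping.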
How powerful is data?
The following map presents a very small sample of the distribution of student accommodation for one of the universities in Malaysia, based on ada mobile ad exchange data. The students and their approximate residences are identified by geofencing the daytime and night-time location updates.
- Obviously, most students stay near campus. Some prefer driving, which explains why the car park of that university is always full.
- The full details of the map (not shown here) can be used to plan the university shuttle bus routes and to identify the students' favourite hangout spots.
- We do not unlock the identities of the mobile phone owners, a.k.a. Pandora's box. We know where they are but not who they are.
The work is credited to the ada Data Science and Engineering team.
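For readers curious how geofencing day and night location updates could work, here is a rough sketch under my own assumptions, not the team's actual pipeline. A device counts as a student if it pings inside a campus geofence during the day, and its approximate residence is its most frequent night-time grid cell. The campus coordinates, radius, hour windows, and the `(device_id, hour, lat, lon)` record layout are all hypothetical.

```python
import math
from collections import Counter, defaultdict

CAMPUS = (3.0000, 101.0000)   # hypothetical campus centre (lat, lon)
RADIUS_KM = 1.0               # hypothetical geofence radius


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def approximate_residences(pings):
    """pings: iterable of (anonymous_device_id, hour_of_day, lat, lon)."""
    on_campus_by_day = set()
    night_cells = defaultdict(Counter)
    for device, hour, lat, lon in pings:
        if 9 <= hour < 17:
            # Daytime ping: keep the device only if it is inside the geofence.
            if haversine_km((lat, lon), CAMPUS) <= RADIUS_KM:
                on_campus_by_day.add(device)
        elif hour >= 22 or hour < 6:
            # Night-time ping: count it in a coarse grid cell (about 1 km).
            night_cells[device][(round(lat, 2), round(lon, 2))] += 1
    return {
        device: night_cells[device].most_common(1)[0][0]
        for device in on_campus_by_day
        if night_cells[device]
    }


pings = [
    ("a", 10, 3.0005, 101.0005),   # on campus during the day
    ("a", 23, 3.0312, 101.0421),   # same device at night
    ("a", 2, 3.0309, 101.0419),
    ("b", 23, 3.5000, 101.5000),   # never seen on campus, so ignored
]
residences = approximate_residences(pings)
print(residences)
```

Only anonymous device identifiers and coarse grid cells appear in the result, which matches the idea of knowing where the devices are without knowing who owns them.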
Original post at https://www.linkedin.com/feed/update/urn:li:activity:6412478258041970688
[My original post from LinkedIn: https://www.linkedin.com/pulse/things-job-title-zan-kai-chong/]
Switching my job title from data scientist to machine learning engineer amused a lot of my friends. They wondered since when I had become a lecturer again (note: machine “learning”). Despite my wrongful explanation, I started thinking about what my real job title is, other than the words printed on my name cards.
Analytically, I should list all my job functions, then build a heat map or histogram from all the words in the descriptions, and then identify the common words by applying the max(count) or corr function. Okay, sounds right. Here we go.
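For the analytically minded, the histogram and max(count) step could look like the sketch below. The three job descriptions are hypothetical placeholders, and the helper names are my own.

```python
import re
from collections import Counter


def word_histogram(descriptions):
    """Count every lowercase word across all job descriptions."""
    words = []
    for text in descriptions:
        words.extend(re.findall(r"[a-z]+", text.lower()))
    return Counter(words)


def most_common_word(descriptions):
    """Return (word, count) for the most frequent word, or None if empty."""
    histogram = word_histogram(descriptions)
    return histogram.most_common(1)[0] if histogram else None


# Hypothetical job descriptions, for illustration only.
job_functions = [
    "AWS engineer",
    "big data engineer",
    "machine learning engineer",
]
top_word = most_common_word(job_functions)
print(top_word)
```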
First, I work on the AWS platform. As a trustworthy power user, I stress-test the costly computing instances and provide my helpful IT support to newcomers (ladies preferred). I also occasionally use EMR (very expensive computer clusters) like a boss for big data stuff. Occasionally, I speak AWS jargon as if I were a real AWS engineer.
Okay. You saw the words “big data”. Of course, I am (acting like) a big data engineer, as I work on petaful (a new word to describe petabytes of) data. These petaful data are our asset for tracking you. We may not be as good as Cambridge Analytica. But we know many things about you and what you did last weekend. The more attached you are to your phone, the more we know about you.
In short, it is fun to work in an analytics company. Well, my real job title? I am wondering as well. Perhaps I should just call myself “engineer”.
I was quite pleased with myself until my civil engineer friends started mocking me with a photo.
After seven years of academic life at UTAR, I decided to move on to the data science industry to explore the opportunities in big data transformation.
It is a hard but necessary move for me. I will leave the full story to an offline, face-to-face discussion if our frequency and space-time are right.
Here are my observations after six months of working in the data science industry. The majority of Malaysian industries are business-driven entities: business comes first and research comes second (or last). Usually, R&D or r&D departments hardly survive the evolution (a.k.a. company restructuring / reorganization), considering that their output is always less convincing in board meetings. One common practice is to embed the R element in product development so that some tangible output is there.
Another interesting thing is that the term research varies a lot in industry. It can refer to operational research, product research, applied research, etc. It is definitely not the kind of research that allows you to sit down for a whole month just to derive an equation that is elegant but of little use to them.
After all, I am the latter type of person. I guess it is going to take another few months before my boss realizes that I am working on a niche research topic instead of building the requested machine learning model.