This article was originally published on Miquido.com on Dec. 13, 2019.
They say Data Scientist is the sexiest job of the 21st century (and all the Data Scientists I have met at various conferences know that). But when they talk only about the theoretical part of machine learning, I sometimes wonder if they know why their work is hot. The reason is that a Data Scientist knows how to combine data, technical skills, and statistical knowledge to achieve business goals. So to do Data Science well, you need to think about the business first.
I know of cases in which companies added analytics tools to track every user interaction without any consideration of what they actually wanted to accomplish. They gathered a lot of data which they did not understand and could not use to advance their business.
Do not make such mistakes! Think about your objectives and the specifics of your industry at each step of the Data Science process. The more creative you are, the better your chances of success. To prove it, I will show you some inspiring examples of Data Science in the giants’ applications…
How to Start Your Data Science Adventure
You have heard that many companies use ML to increase their income, but you have no idea how to start? To avoid ending up with expensive infrastructure and data that does not serve your business needs, you should start by answering the following questions:
What are the client’s business goals?
How can we use data to achieve them?
Then you can start planning what data can be tracked and used.
Data Gathering
What data should we gather? The answer to this question might actually surprise you. According to Todd Yellin (Netflix’s VP of Product Innovation), there are two types of data which can be used: explicit and implicit [1]. In the Netflix case, explicit data is what a user provides directly, for example by rating a movie. Implicit data, on the other hand, is behavioral: it is based on user clicks and usage of the app.
Which type is more valuable?
There is no universal answer to this question, but in most cases, the implicit data will be more useful. And that is because… people lie.
Consider the example of a man who says he loves documentaries and rates them 5/5. But, as the data shows, he watches this genre once a year. At the same time, he watches popular series every Friday evening, because he is tired after work and just wants to unwind on the couch. So what data should be used to build a recommendation system for him: his ratings or his behavior?
To answer this question, we need to think about the business goal behind the system. Netflix’s goal is to encourage users to watch more. They started with the popular five-star rating system. When they realized that users like the one above are more likely to watch Friends than a movie about World War II, they developed a recommendation system based on user behavior. They also dropped the five-star rating and replaced it with a simpler, binary thumbs-up, thumbs-down system.
As this example shows, the data you gather should be selected with the specifics of your industry in mind and should carry enough information to understand users’ decisions and needs. But here we encounter another problem: behavioral data, text, and other unstructured data are more difficult to analyze and use in Machine Learning models than structured data. So now it is time to talk about feature engineering.
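The gap between what a user says and what a user does can be made concrete with a minimal sketch. All of the data below is hypothetical and invented for illustration, and the ranking by share of viewing sessions is just one simple way to turn a log into an implicit signal; it is not a description of how Netflix computes anything.

```python
from collections import Counter

# Hypothetical explicit data: star ratings (1-5) one user gave per genre.
explicit_ratings = {"documentary": 5, "sitcom": 3}

# Hypothetical implicit data: one entry per title the same user watched in a year.
watch_log = ["sitcom"] * 48 + ["documentary"] * 1 + ["drama"] * 6


def explicit_preference(ratings):
    """Rank genres by the star rating the user gave them (highest first)."""
    return sorted(ratings, key=ratings.get, reverse=True)


def implicit_preference(log):
    """Rank genres by their share of viewing sessions (highest first)."""
    counts = Counter(log)
    total = sum(counts.values())
    shares = {genre: n / total for genre, n in counts.items()}
    ranking = sorted(shares, key=shares.get, reverse=True)
    return ranking, shares


explicit_rank = explicit_preference(explicit_ratings)
implicit_rank, shares = implicit_preference(watch_log)

print("By stated ratings:", explicit_rank)
print("By actual viewing:", implicit_rank)
```

For this made-up user, the two rankings disagree about the top genre, which is exactly the situation in which the business goal has to decide which signal the recommender should trust.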
Feature Engineering
To show how important feature engineering is in Data Science, I would like to quote Andrew Ng, Google Brain co-founder and founder of deeplearning.ai:
Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering. [2].
An interesting example of a purpose-driven approach to data processing is Booking.com, where users can rate hotels from 0 to 10. But if a party animal rates the hotel highly, is it a good choice for families with children? Not necessarily.
Fortunately, there are also users’ comments, which contain more of the information we need. Booking.com uses sentiment analysis and topic modeling to extract the strengths and weaknesses of the reviewed hotel, as well as users’ preferences regarding accommodation.
Let’s consider this example:
The topic Room facilities has negative sentiment (the user complains about the shower, bed, wifi and air conditioning). At the same time, this user praises the hotel’s Value for the price, staff, and food. The system also analyses what was not mentioned in the comment and is therefore probably not important to the user; in our example, that could be nightlife.
With these insights, the platform can offer hotels better suited to users with a similar profile: in this case, a family with children looking for a peaceful, reasonably priced hotel for their holidays. What’s more, Booking.com sorts comments to show the information most relevant to the viewer at the top.
This leads to a win-win situation: users can find offers tailored for their specific needs quicker and more easily, and the platform makes a profit because these offers are the ones users are more likely to purchase.
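The review analysis described above can be sketched in a few lines. This is a toy stand-in and not Booking.com's method: the topic keywords, the positive and negative word lists, and the sample review are all hypothetical, and a real system would learn topics and sentiment from data instead of using hand-written lists.

```python
import re

# Hypothetical keyword lists standing in for a learned topic model.
TOPIC_KEYWORDS = {
    "room facilities": ["shower", "bed", "wifi", "air conditioning"],
    "value for money": ["price", "value", "cheap"],
    "staff": ["staff", "reception"],
    "food": ["food", "breakfast", "dinner"],
    "nightlife": ["bar", "club", "party"],
}
# Hypothetical sentiment lexicon standing in for a sentiment model.
POSITIVE = {"great", "friendly", "delicious", "good", "excellent"}
NEGATIVE = {"broken", "uncomfortable", "slow", "noisy", "bad"}


def analyse_review(text):
    """Return ({topic: score} for mentioned topics, [topics never mentioned])."""
    scores = {}
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = set(re.findall(r"[a-z]+", sentence))
        # Crude sentence-level sentiment: positive words minus negative words.
        sentiment = len(words & POSITIVE) - len(words & NEGATIVE)
        for topic, keywords in TOPIC_KEYWORDS.items():
            # Whole-word match, so "bar" does not fire on "bargain".
            if any(re.search(r"\b" + re.escape(k) + r"\b", sentence) for k in keywords):
                scores[topic] = scores.get(topic, 0) + sentiment
    unmentioned = [t for t in TOPIC_KEYWORDS if t not in scores]
    return scores, unmentioned


review = (
    "The shower was broken and the bed was uncomfortable. "
    "The wifi was slow and the air conditioning was noisy. "
    "Great value for the price. "
    "The staff were friendly and the food was delicious."
)
scores, unmentioned = analyse_review(review)
print(scores)
print("Not mentioned:", unmentioned)
```

The per-topic scores and the list of unmentioned topics are the kind of engineered features a recommender could use in place of a single 0 to 10 rating.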
Data Product
You have deployed a data product with satisfactory results? This is no time to be complacent. As the Netflix example shows [3], continuous work on improving the system can bring significant gains. Is a proper movie recommendation enough? What more could we do?
One of Netflix’s out-of-the-box approaches is not only to recommend movies but also to illustrate them with the image that would be most appealing to a given user. Let’s say that they recommend Good Will Hunting to you. If you watched a lot of romcoms in the past, you might see an image of a kissing couple, whereas if you are a comedy fan, you will most likely get a shot of a popular American comedian.
With this approach, a user scrolling through a myriad of choices is much more likely to spot a movie that grabs their attention.
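As a rough illustration of the idea, the sketch below picks an image for a title based on the genres a user has watched most. It is a simple frequency heuristic chosen for clarity and not Netflix's actual algorithm; the genre tags, image file names, and the `pick_artwork` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical artwork variants for one title, keyed by the genre each image evokes.
ARTWORK = {
    "Good Will Hunting": {
        "romance": "couple_kissing.jpg",
        "comedy": "comedian_closeup.jpg",
        "drama": "default_poster.jpg",
    }
}
DEFAULT_GENRE = "drama"  # assumed fallback when the history matches no variant


def pick_artwork(title, watch_history):
    """Pick the variant whose genre tag appears most often in the user's history."""
    variants = ARTWORK[title]
    # Only count genres for which this title actually has an image.
    counts = Counter(genre for genre in watch_history if genre in variants)
    if not counts:
        return variants[DEFAULT_GENRE]
    best_genre, _ = counts.most_common(1)[0]
    return variants[best_genre]


romcom_fan = ["romance", "romance", "comedy"]
comedy_fan = ["comedy", "comedy", "thriller"]
newcomer = ["thriller"]

print(pick_artwork("Good Will Hunting", romcom_fan))
print(pick_artwork("Good Will Hunting", comedy_fan))
print(pick_artwork("Good Will Hunting", newcomer))
```

In a production setting, the choice would be evaluated against the business goal, for example by measuring whether a given image actually leads to more plays.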
This and other recommendation strategies have astonishing results: more than 80% of what users watch on the platform comes from algorithmic recommendations. It means that it is hard for a user to run out of things to watch. When one show is over, Netflix is there to suggest the next one.
In their business, that gives a competitive edge because users are much less likely to cancel their subscriptions. This extremely successful application of Data Science was made possible mostly by a good understanding of their business and their app’s users.
The Summary
At one of this year’s Data Science conferences, a speaker working on credit risk prediction said:
When people ask me what is basically my job, I answer: I bring business values basing on data.
For me, this is one of the best definitions of Data Science. It should not be focused only on its theoretical foundations, but above all on business. If you want to create a good Machine Learning application, you need to think about how users behave in your system and what they need. With that in mind, you are far more likely to achieve your business goals.