Forecasting the world through data
Introduction
“Data Scientist: The Sexiest Job of the 21st Century” was the headline of a Harvard Business Review article in 2012 [1]. Now, a little over ten years later, the digital world has developed tremendously, especially with recent advances in machine learning and artificial intelligence. (Big) data is the foundation of all of this, together with the knowledge workers who handle these large data sets: data professionals.
These professionals are big assets to organizations. In short, they analyze large data sets and build models. These models can predict trends and future events, and they provide evidence on which to base strategic decisions. These tools can be applied in virtually any business sector, as long as data is being produced [3]. For example, one could analyze the sentiment of social media posts about a company and use it, together with the time stamps, to model and predict the development of the company’s stock value. Similar cases exist in finance, marketing, real estate, and the modeling of pandemic spread. Large data sets and the models built on them are also the foundation of essentially any application of artificial intelligence and automation; in self-driving cars, for example, such models decide when the car should brake or make a turn.
The application of big data is generating such hype that professional roles are rapidly evolving and new specialized roles keep appearing. Nowadays, the main roles among data professionals are Data Scientists, Data Analysts, Data Engineers, and Machine Learning Specialists [4]. Depending on the organization and the responsibilities involved, these profiles can overlap and be difficult to differentiate; the same person can be a data scientist and a data analyst at the same time. In general, data engineer roles have a stronger technical component: they oversee data management and data infrastructure and potentially the data (pre-)processing pipeline. Data analysts tend to focus more on the business side, with visualizations and storytelling. Machine learning specialists concentrate on the modeling aspect and build the basis for artificial intelligence. Data scientists are high-level positions whose roles can include all of the previously mentioned responsibilities as well as complex model building. In this use case we focus on data scientists, as their role is the most general and comprehensive one.
Writing code and a basic understanding of math and statistics are necessary skills for working as a data scientist. Curiosity and the drive to understand a data set and gain knowledge from it are what make a data scientist successful; this often includes a deep interest in the subject or the willingness to read up on it [1]. However, being able to communicate results effectively to stakeholders and to organize one’s work within a team are equally important competencies, and they are often underestimated. You will most likely be working within a group of data professionals in a corporate environment whose rules and methods you will need to adapt to.
Practice Case
Ready to get a taste of what the work of a Data Scientist could look like?
Get ready for the Data Scientist REBECA Practice case! Remember, this practice case does not prepare you to become a Data Scientist; it only helps you decide if this is the profession for you.
After completing the case, please do the reflection exercises. They will help you clarify what you have experienced and make informed decisions.
Requisites to perform this practice case
You should have (fundamental) Python programming knowledge. Ideally, you know how to use NumPy, Pandas and Matplotlib for data analysis and visualization. Seaborn is optional, but useful for appealing figures.
Preparatory work
We recommend using Google Colab as an online interpreter; this way you do not need to install anything on your computer. As an alternative, you can also use CoCalc.
ADVANCED ALTERNATIVE: Use Visual Studio Code or Jupyter Notebook.
Who are you?
Imagine you are a data scientist working in a team of four data scientists within the marketing department of a company in the film industry. The “Global and Strategy” department needs to understand which factors have the largest effect on the revenue of a film. Your team has been selected to develop this project, and your team’s results could increase the return on investment (ROI) of the movie production firm.
Your team goal: Use data to understand which parameters affect the economic potential of a specific movie. “How can we achieve the largest revenue?”
Very important: you are in a team working for a corporate business, so all the steps you are going to perform need to be well documented on the platform the company has designated for this purpose. This is important for the well-functioning of your team and of the company; not doing so could cost the company a lot of money in the future.
As we are working in the marketing department of a movie production firm, our problem is to maximize the revenue of our film productions. To solve this, your team consequently needs to acquire the necessary data. These are the first two steps in the workflow of a data scientist team [2]: defining the problem and acquiring the data which allows you to solve it.
NOTE: we have streamlined the data acquisition process for you; you only need to import the data.
ACTION: Import the vega_datasets library, a repository of example data sets. Load the movies data from the library and have a first look at the features. Take the initial problem “How can we achieve the largest revenue?” and reformulate it into a more specific problem that you want to solve. Which variable do you choose as the target, and which ones as features? Make an initial guess: which variables have the greatest effect on your target variable?
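One possible way to perform this loading step is sketched below; it assumes the vega_datasets package is available (in Google Colab you may need to install it first with pip install vega_datasets).

```python
# Minimal sketch of the loading step, assuming the vega_datasets package is installed.
from vega_datasets import data

movies = data.movies()   # the movies example data set as a pandas DataFrame

print(movies.shape)      # number of observations and features
print(movies.dtypes)     # feature names and their data types
print(movies.head())     # first few rows for an initial look
```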
TIP FROM THE EXPERT: Acquiring the data can be a long and cumbersome process, and the availability of data often constrains how precisely and specifically the problem can be defined. In this respect, acquiring the data and fine-tuning the problem statement is an iterative process.
After acquiring (here: loading) the data, you need to clean and prepare the data set for further processing. Experience shows that this crucial step in the data analysis workflow can be the most challenging one.
Please note that not all observations are complete: many of them do not contain information for every feature, and sometimes a “None” or “NaN” appears instead. This missing data can affect your analysis and later modeling, so you need to find a strategy to deal with it.
You discuss with your team how to proceed: one simple option would be to exclude observations with missing features from further analysis, but you need to check whether this would delete a large share of the observations if most of them are incomplete. Another option would be to exclude features instead of observations, but few if any features are present in every observation, so again you could end up deleting too much information. This means you must compromise: select fewer entries with more attributes, or more entries with fewer attributes.
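A quick way to gauge how much information each strategy would discard is sketched below, assuming movies is the DataFrame loaded earlier.

```python
# How much is missing, and how many complete rows would remain?
print(movies.isna().sum())      # missing values per feature
print(len(movies))              # total number of observations
print(len(movies.dropna()))     # observations left if every incomplete row were dropped
```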
NOTE: Creating a model later usually requires numerical data as input. This means other data formats, such as strings, would need to be mapped to numeric values. For simplicity, in this exercise we restrict our further analysis to the numeric features of the data set.
ACTION: Clean and prepare the data by selecting the columns you want to analyze and dropping incomplete entries, following the compromise strategy. Limit yourself to numeric columns. Play around with selecting fewer or more (numeric) columns and then removing incomplete observations. Balance using as many features as possible against dropping as few entries as necessary. During all of this, keep in mind the specific problem you are planning to solve.
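A minimal sketch of such a compromise is shown below; the column names are an assumption based on the public movies data set, and your own problem definition may call for a different selection.

```python
# Keep only numeric features, pick a candidate subset, and drop incomplete rows.
numeric = movies.select_dtypes(include="number")
cols = ["Production Budget", "Worldwide Gross", "IMDB Rating", "IMDB Votes"]  # assumed column names
subset = numeric[cols].dropna()

print(f"Kept {len(subset)} of {len(movies)} observations with {len(cols)} features")
```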
Once you are happy with your cleaning and preparation of the data set you can move on to the next task.
TIP FROM THE EXPERT: Alternative methods of dealing with missing data exist. For example, you could infer the missing information, e.g., set it to the mean of the feature. We refrain from these more complex approaches here, but bear in mind that they exist.
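For illustration only (it is not required for this case), a simple mean imputation on the numeric columns could look roughly like this:

```python
# Replace missing numeric values with the column mean instead of dropping rows.
imputed = numeric.fillna(numeric.mean())
```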
NOTE: Remember! Have you documented all of this?
Now that you have cleaned and prepared your data set, and before continuing, take a first look to understand the data better. Get a first feeling for its statistics, i.e., the distributions of individual features and the correlations between distinct features.
ACTION: Calculate the mean and standard deviation of each of the features using the DataFrame.describe() method in Pandas.
- Calculate all correlation coefficients using the DataFrame.corr() method in Pandas.
- Finally, create histograms of the features of interest and scatter plots of their correlations using the seaborn.pairplot() function.
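A sketch of this exploration step, assuming subset is the cleaned numeric DataFrame from the previous step:

```python
import seaborn as sns
import matplotlib.pyplot as plt

print(subset.describe())   # count, mean, standard deviation, and quartiles per feature
print(subset.corr())       # pairwise correlation coefficients

sns.pairplot(subset)       # histograms on the diagonal, scatter plots off the diagonal
plt.show()
```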
To make predictions about the (unknown) target of an observation based on its features, you need to create a model. This model could range from a simple linear fit to a complex deep neural network. The product owner of the project has assigned this task to the new junior Machine Learning Expert in your team.
You have received from your colleague the code snippet used to model the data. He claims that a linear regression is sufficient to model this data. Before you integrate the model into the whole pipeline, you need to review the code and the result it produces.
ACTION: Before you incorporate the model, review the code and the result it produces. Check the code for bugs. If there are any, draft an email explaining to your colleague the error and what you are going to do to solve it.
TIP FROM THE EXPERT: Not only code is reviewed among members of a team, but also code style. Companies have style guidelines that are mandatory to follow, and your team will also review whether the code complies with the company’s style guidelines. This is a regular procedure in companies, where mistakes are part of the norm and teams, not individuals, resolve them.
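Your colleague’s snippet is part of the case material and is not reproduced here; as a point of reference, a working linear regression on this data could look roughly like the hedged sketch below, assuming scikit-learn, the cleaned subset from earlier, and one possible choice of features and target.

```python
from sklearn.linear_model import LinearRegression

features = ["Production Budget", "IMDB Rating", "IMDB Votes"]  # assumed feature choice
target = "Worldwide Gross"

X = subset[features]
y = subset[target]

model = LinearRegression().fit(X, y)
print("Coefficients:", dict(zip(features, model.coef_)))
print("R^2 on the training data:", model.score(X, y))
```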
Now that your team has created a model, you can make predictions about events that are not contained in the data set or that even lie in the future.
ACTION: Let’s assume a putative movie production of your stakeholder with a production budget of 0.5 billion USD, an IMDB rating of 6.8 and 30,000 IMDB votes (one way to compute the answers is sketched after the questions below).
- What is the expected Worldwide Gross?
- How would a 10% increase in each one of the three variables Production Budget, IMDB Rating, IMDB Votes affect the Worldwide Gross?
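One way this prediction could be computed, reusing model and features from the sketch above; the numbers for the putative movie come from the task description.

```python
import pandas as pd

new_movie = pd.DataFrame([[0.5e9, 6.8, 30_000]], columns=features)
baseline = model.predict(new_movie)[0]
print(f"Expected Worldwide Gross: {baseline:,.0f} USD")

# Increase each variable by 10%, one at a time, and compare against the baseline.
for col in features:
    bumped = new_movie.copy()
    bumped[col] *= 1.1
    print(f"+10% {col}: {model.predict(bumped)[0] - baseline:+,.0f} USD")
```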
Now you need to evaluate and interpret your findings. From the Python code and the numbers, you want to draw conclusions that can be understood by your stakeholders and support their decision making.
ACTION: Look at your model and the prediction. Try to draw conclusions and formulate a take-home message. Make statements like
- “If we increase the production budget by 10%, we expect an increase in revenue of …”,
- “If we can achieve an increase in IMDB rating by 10%, we expect an increase in revenue of …”, or “The return on investment does not depend on variable X, but strongly depends on variable Y”.
Now you just need to sell it to your stakeholders! The Global and Strategy Department likes visuals!
ACTION: Create an appealing figure which your team will use together with your previously formulated key take-home message. Remember, the main goal of the visualization is to support your message.
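One possible figure, sketched under the assumption that the production budget turned out to be the dominant driver; swap in whichever relationship supports your own take-home message.

```python
import seaborn as sns
import matplotlib.pyplot as plt

ax = sns.regplot(data=subset, x="Production Budget", y="Worldwide Gross",
                 scatter_kws={"alpha": 0.4})
ax.set_title("Bigger production budgets tend to earn more worldwide")
ax.set_xlabel("Production budget (USD)")
ax.set_ylabel("Worldwide gross (USD)")
plt.tight_layout()
plt.show()
```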
The “Quality Control Department” wants to check the whole process. Have you documented everything?
TIP FROM THE EXPERT: The standard tool for documentation and version control of (Python) code is Git. GitHub is a well-known provider and platform for Git-based version control.
END
Acknowledgements
This practice case has been created thanks to the input of three data scientists:
- Jose Querales, Data Scientist and AI 2 at Carelon Global Solutions Ireland
- Milan Zdravkovic, Research Lead at ZENPULSAR
- Héctor Diez Nuño, Bioinformatician and data scientist at QGENOMICS
To elaborate this practice case, the organization hired the services of Alexander Britz and Daniel Mertens, founder of Scientists need more!
Guided reflection
After this experience, we suggest you reflect on the following questions:
- Did you find the practice case easy or difficult to accomplish?
- What was the most engaging task for you? Was it difficult or easy?
- What was the most challenging task for you? Did you enjoy performing it? Would you see yourself getting better at it?
- Have you found something new about this profession? What was it? Did it surprise you? Did you like it or dislike it?
- Do you feel like contacting a data scientist in your network and researching the profession a little further? Where would you find one?
Further information
[1] https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
[2] Several formalizations of workflows in data science exist, which are similar in essence. See for example https://www.datascience-pm.com/data-science-workflow/, https://www.knowledgehut.com/blog/data-science/data-science-workflow, and https://www.kdnuggets.com/2020/07/laymans-guide-data-science-workflow.html
[3] https://github.com/altair-viz/vega_datasets. Other online resources with free data sets exist. Under www.kaggle.com you can find various data sets with a large range of complexity and subject fields. Scientific data repositories exist for example under https://archive.ics.uci.edu. You could also search online for public data provided by your local or national governments.
[4] https://www.mastersindatascience.org/careers/data-analyst-vs-data-scientist/
[7] The online documentation of the widely used Python libraries is often fantastic and provides tutorials, detailed explanations of functionality, examples and much more. See for example https://numpy.org, https://matplotlib.org, https://pandas.pydata.org, https://seaborn.pydata.org, https://scikit-learn.org/
[8] For inspiration on different types of visualizations see https://www.data-to-viz.com