Forecasting the world through data
Introduction
“Data Scientist: The Sexiest Job of the 21st Century” was the headline of a Harvard Business Review article in 2012 [1]. Now, a little over ten years later, the digital world has developed tremendously, especially with recent advances in machine learning and artificial intelligence. (Big) data is the foundation of all of this, together with the knowledge workers who handle these large data sets: data professionals.
These professionals are big assets to organizations. In short, they analyze large data sets and build models. These models can predict trends and future events, and they provide evidence on which to base strategic decisions. These tools can be applied in virtually any business sector, as long as data is being produced [3]. For example, one could analyze the sentiment of social media posts about a company and use it, together with the time stamps, to model and predict the development of the company’s stock value. Similar cases exist in finance, marketing, real estate, and the modeling of pandemic spread. Large data sets and the models built on them are also the foundation of essentially any application of artificial intelligence and automation; in self-driving cars, for example, such models decide when the car should brake or make a turn.
The application of big data is generating such hype that professional roles are rapidly evolving and new specialized roles keep appearing. Nowadays, the main roles among data professionals are Data Scientists, Data Analysts, Data Engineers, and Machine Learning Specialists [4]. Depending on the organization and the responsibilities involved, these profiles can overlap and be difficult to differentiate; the same person can be a data scientist and a data analyst at the same time. In general, data engineer roles have a stronger technical component: they oversee data management and data infrastructure and potentially the data (pre-)processing pipeline. Data analysts tend to focus more on the business side, with visualizations and storytelling. Machine learning specialists concentrate on the modeling aspect and build the basis for artificial intelligence. Data scientists are high-level positions whose roles can include all of the previously mentioned responsibilities as well as complex model building. In this use case we focus on data scientists, as their role is the most general and comprehensive one.
Writing code and a basic understanding of math and statistics are necessary skills for working as a data scientist. Curiosity and the drive to understand a data set and gain knowledge from it are what make a data scientist successful; this often includes a deep interest in the subject or the willingness to read up on it [1]. However, being able to communicate results effectively to stakeholders and to organize one’s work within a team are equally important competencies, and they are often underestimated. You will most likely be working within a group of data professionals in a corporate environment whose rules and methods you will need to adapt to.
Practice Case
Ready to get a taste of what the work of a Data Scientist could look like?
Get ready for the Data Scientist REBECA Practice case! Remember, this practice case does not prepare you to become a Data Scientist; it only helps you decide if this is the profession for you.
After completing the case, please do the reflection exercises. They will help you clarify what you have experienced and make informed decisions.
Requisites to perform this practice case
You should have (fundamental) Python programming knowledge. Ideally, you know how to use NumPy, Pandas and Matplotlib for data analysis and visualization. Seaborn is optional, but useful for appealing figures.
Preparatory work
We recommend using Google Colab as an online interpreter; this way you do not need to install anything on your computer. As an alternative, you can also use CoCalc.
ADVANCED ALTERNATIVE: Use Visual Studio Code or Jupyter Notebook.
Who are you?
Imagine you are a data scientist working in a team of four data scientists within the marketing department of a company in the film industry. The “Global and Strategy” department needs to understand which factors have the largest effect on the revenue of a film. Your team has been selected to develop this project, and your team’s results could increase the return on investment (ROI) of the movie production firm.
Your team goal: Use data to understand which parameters affect the economic potential of a specific movie. “How can we achieve the largest revenue?”
Very important: you are in a team working for a corporate business, so all the steps you are going to perform need to be well documented on the platform the company has designated for this purpose. This is important for the well-functioning of your team and of the company; not doing so could cost the company a lot of money in the future.
As we are working in the marketing department of a movie production firm, our problem is to maximize the revenue of our film productions. To solve this, your team consequently needs to acquire the necessary data. These are the first two steps in the workflow of a data scientist team [2]: defining the problem and acquiring the data which allows you to solve it.
NOTE: we have streamlined the data acquisition process for you; you only need to import the data.
ACTION: Import the vega_datasets library, a repository of example data sets. Load the movies data from the library and have a first look at the features. Take the initial problem “How can we achieve the largest revenue?” and reformulate it into a more specific problem that you want to solve. Which variable do you choose as the target, and which ones as features? Make an initial guess: which variables have the greatest effect on your target variable?
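One possible way to perform this loading step is sketched below; it assumes the vega_datasets package is available (in Google Colab you may need to install it first with pip install vega_datasets).

```python
# Minimal sketch of the loading step, assuming the vega_datasets package is installed.
from vega_datasets import data

movies = data.movies()   # the movies example data set as a pandas DataFrame

print(movies.shape)      # number of observations and features
print(movies.dtypes)     # feature names and their data types
print(movies.head())     # first few rows for an initial look
```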
TIP FROM THE EXPERT: Acquiring the data can be a long and cumbersome process, and the availability of data often constrains how precisely and specifically the problem can be defined. In this respect, acquiring the data and fine-tuning the problem statement is an iterative process.
After acquiring (here: loading) the data, you need to clean and prepare the data set for further processing. Experience shows that this crucial step in the data analysis workflow can be the most challenging one.
Please note that not all observations are complete: many of them do not contain information for every feature, and sometimes a “None” or “NaN” appears instead. This missing data can affect your analysis and later modeling, so you need to find a strategy to deal with it.
You discuss with your team how to proceed: one simple option would be to exclude observations with missing features from further analysis, but you need to check whether this would delete a large share of the observations if most of them are incomplete. Another option would be to exclude features instead of observations, but few if any features are present in every observation, so again you could end up deleting too much information. This means you must compromise: select fewer entries with more attributes, or more entries with fewer attributes.
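A quick way to gauge how much information each strategy would discard is sketched below, assuming movies is the DataFrame loaded earlier.

```python
# How much is missing, and how many complete rows would remain?
print(movies.isna().sum())      # missing values per feature
print(len(movies))              # total number of observations
print(len(movies.dropna()))     # observations left if every incomplete row were dropped
```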
NOTE: Creating a model later usually requires numerical data as input. This means other data formats, such as strings, would need to be mapped to numeric values. For simplicity, in this exercise we restrict our further analysis to the numeric features of the data set.
ACTION: Clean and prepare the data by selecting the columns you want to analyze and dropping incomplete entries, following the compromise strategy. Limit yourself to numeric columns. Play around with selecting fewer or more (numeric) columns and then removing incomplete observations. Balance using as many features as possible against dropping as few entries as necessary. During all of this, keep in mind the specific problem you are planning to solve.
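A minimal sketch of such a compromise is shown below; the column names are an assumption based on the public movies data set, and your own problem definition may call for a different selection.

```python
# Keep only numeric features, pick a candidate subset, and drop incomplete rows.
numeric = movies.select_dtypes(include="number")
cols = ["Production Budget", "Worldwide Gross", "IMDB Rating", "IMDB Votes"]  # assumed column names
subset = numeric[cols].dropna()

print(f"Kept {len(subset)} of {len(movies)} observations with {len(cols)} features")
```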
Once you are happy with your cleaning and preparation of the data set you can move on to the next task.
TIP FROM THE EXPERT: Alternative methods of dealing with missing data exist. For example, you could infer the missing information, e.g., set it to the mean of the feature. We refrain from these more complex approaches here, but bear in mind that they exist.
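For illustration only (it is not required for this case), a simple mean imputation on the numeric columns could look roughly like this:

```python
# Replace missing numeric values with the column mean instead of dropping rows.
imputed = numeric.fillna(numeric.mean())
```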
NOTE: Remember! Have you documented all of this?
Now that you have cleaned and prepared your data set, and before continuing, take a first look to understand the data better. Get a first feeling for its statistics, i.e., the distributions of individual features and the correlations between distinct features.
ACTION: Calculate the mean and standard deviation of each of the features using the DataFrame.describe() method in Pandas.
- Calculate all correlation coefficients using the DataFrame.corr() method in Pandas.
- Finally, create histograms of the features of interest and scatter plots of their correlations using the seaborn.pairplot() function.
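A sketch of this exploration step, assuming subset is the cleaned numeric DataFrame from the previous step:

```python
import seaborn as sns
import matplotlib.pyplot as plt

print(subset.describe())   # count, mean, standard deviation, and quartiles per feature
print(subset.corr())       # pairwise correlation coefficients

sns.pairplot(subset)       # histograms on the diagonal, scatter plots off the diagonal
plt.show()
```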
To make predictions about the (unknown) target of an observation based on its features, you need to create a model. This model could range from a simple linear fit to a complex deep neural network. The product owner of the project has assigned this task to the new junior Machine Learning Expert in your team.
You have received from your colleague the code snippet used to model the data. He claims that a linear regression is sufficient to model this data. Before you integrate the model into the whole pipeline, you need to review the code and the result it produces.
ACTION: Before you incorporate the model, review the code and the result it produces. Check the code for bugs. If there are any, draft an email explaining to your colleague the error and what you are going to do to solve it.
TIP FROM THE EXPERT: Not only code is reviewed among members of a team, but also code style. Companies have style guidelines that are mandatory to follow, and your team will also review whether the code complies with the company’s style guidelines. This is a regular procedure in companies, where mistakes are part of the norm and teams, not individuals, resolve them.
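Your colleague’s snippet is part of the case material and is not reproduced here; as a point of reference, a working linear regression on this data could look roughly like the hedged sketch below, assuming scikit-learn, the cleaned subset from earlier, and one possible choice of features and target.

```python
from sklearn.linear_model import LinearRegression

features = ["Production Budget", "IMDB Rating", "IMDB Votes"]  # assumed feature choice
target = "Worldwide Gross"

X = subset[features]
y = subset[target]

model = LinearRegression().fit(X, y)
print("Coefficients:", dict(zip(features, model.coef_)))
print("R^2 on the training data:", model.score(X, y))
```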
Now that your team has created a model, you can make predictions about events that are not contained in the data set or that even lie in the future.
ACTION: Let’s assume a putative movie production of your stakeholder with a production budget of 0.5 billion USD, an IMDB rating of 6.8 and 30,000 IMDB votes (one way to compute the answers is sketched after the questions below).
- What is the expected Worldwide Gross?
- How would a 10% increase in each one of the three variables Production Budget, IMDB Rating, IMDB Votes affect the Worldwide Gross?
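One way this prediction could be computed, reusing model and features from the sketch above; the numbers for the putative movie come from the task description.

```python
import pandas as pd

new_movie = pd.DataFrame([[0.5e9, 6.8, 30_000]], columns=features)
baseline = model.predict(new_movie)[0]
print(f"Expected Worldwide Gross: {baseline:,.0f} USD")

# Increase each variable by 10%, one at a time, and compare against the baseline.
for col in features:
    bumped = new_movie.copy()
    bumped[col] *= 1.1
    print(f"+10% {col}: {model.predict(bumped)[0] - baseline:+,.0f} USD")
```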
Now you need to evaluate and interpret your findings. From the Python code and the numbers, you want to draw conclusions that can be understood by your stakeholders and support their decision making.
ACTION: Look at your model and the prediction. Try to draw conclusions and formulate a take-home message. Make statements like
- “If we increase the production budget by 10%, we expect an increase in revenue of …”,
- “If we can achieve an increase in IMDB rating by 10%, we expect an increase in revenue of …”, or “The return on investment does not depend on variable X, but strongly depends on variable Y”.
Now you just need to sell it to your stakeholders! The Global and Strategy Department likes visuals!
ACTION: Create an appealing figure which your team will use together with your previously formulated key take-home message. Remember, the main goal of the visualization is to support your message.
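One possible figure, sketched under the assumption that the production budget turned out to be the dominant driver; swap in whichever relationship supports your own take-home message.

```python
import seaborn as sns
import matplotlib.pyplot as plt

ax = sns.regplot(data=subset, x="Production Budget", y="Worldwide Gross",
                 scatter_kws={"alpha": 0.4})
ax.set_title("Bigger production budgets tend to earn more worldwide")
ax.set_xlabel("Production budget (USD)")
ax.set_ylabel("Worldwide gross (USD)")
plt.tight_layout()
plt.show()
```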
The “Quality Control Department” wants to check the whole process. Have you documented everything?
TIP FROM THE EXPERT: The standard tool for documentation and version control of (Python) code is Git. GitHub is a well-known provider and platform for Git-based version control.
END
Acknowledgements
This practice case has been created thanks to the input of three data scientists:
- Jose Querales, Data Scientist and AI 2 at Carelon Global Solutions Ireland
- Milan Zdravkovic, Research Lead at ZENPULSAR
- Héctor Diez Nuño, Bioinformatician and data scientist at QGENOMICS
To elaborate this practice case, the organization hired the services of Alexander Britz and Daniel Mertens, founder of Scientists need more!
Guided reflection
After this experience, we suggest you reflect on the following questions:
- Did you find the practice case easy or difficult to accomplish?
- What was the most engaging task for you? Was it difficult or easy?
- What was the most challenging task for you? Did you enjoy performing it? Would you see yourself getting better at it?
- Have you found something new about this profession? What was it? Did it surprise you? Did you like it or dislike it?
- Do you feel like contacting a data scientist in your network and researching the profession a little further? Where would you find one?
Further information
[1] https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
[2] Several formalizations of workflows in data science exist, which are similar in essence. See for example https://www.datascience-pm.com/data-science-workflow/, https://www.knowledgehut.com/blog/data-science/data-science-workflow, and https://www.kdnuggets.com/2020/07/laymans-guide-data-science-workflow.html
[3] https://github.com/altair-viz/vega_datasets. Other online resources with free data sets exist. Under www.kaggle.com you can find various data sets with a large range of complexity and subject fields. Scientific data repositories exist for example under https://archive.ics.uci.edu. You could also search online for public data provided by your local or national governments.
[4] https://www.mastersindatascience.org/careers/data-analyst-vs-data-scientist/
[7] The online documentation of the widely used Python libraries is often fantastic and provides tutorials, detailed explanations of functionality, examples and much more. See for example https://numpy.org, https://matplotlib.org, https://pandas.pydata.org, https://seaborn.pydata.org, https://scikit-learn.org/
[8] For inspiration on different types of visualizations see https://www.data-to-viz.com