REBECA Practice: Data Scientist

Forecasting the world through data

Introduction

“Data Scientist: The Sexiest Job of the 21st Century” was the headline of a Harvard Business Review (HBR) article in 2012 [1]. Now, a little over 10 years later, the digital world has developed tremendously, especially with the recent advances in machine learning and artificial intelligence. (Big) data is the foundation of all of this, together with the knowledge workers who handle these large data sets: data professionals.

These professionals are major assets to organizations. In short, they analyze large data sets and build models. These models can predict trends and future events, and they provide evidence on which to base strategic decisions. The range of business sectors in which these tools can be used is endless, as long as data is being produced [3]. For example, one could analyze the sentiment of social media posts about a company and use it, together with the time stamps, to model and predict the development of the company’s stock value. Similar cases exist in finance, marketing, real estate, and the modeling of how pandemics spread. Large data sets and the models built on them are also the foundation of basically any application of artificial intelligence and automation. In self-driving cars, for example, these models are used to decide when the car should brake or make a turn.

Big data is in such demand that professional roles are evolving rapidly and new specialized roles are appearing. Today, the main roles among data professionals are data scientists, data analysts, data engineers, and machine learning specialists [4]. Depending on the organization and its responsibilities, these profiles can overlap and be difficult to tell apart, so the same person can be a data scientist and a data analyst at the same time. In general, data engineer roles have a stronger technical component: they oversee data management and data infrastructure, and potentially the data (pre-)processing pipeline. Data analysts may focus more on the business side, with visualizations and storytelling. Machine learning specialists focus on modeling and build the basis for artificial intelligence. Data scientists hold high-level positions; their roles can include all the previously mentioned responsibilities as well as complex model building. In this practice case we focus on data scientists, as their role is the most general and comprehensive one.

Writing code and a basic understanding of math and statistics are necessary skills for working as a data scientist. Curiosity and the drive to understand a data set and gain knowledge from it make a data scientist successful; this often includes a deep interest in the subject or the willingness to read up on it [1]. However, being able to communicate your results effectively to stakeholders and to organize your work around your team are equally important competencies, and they are often underestimated. You will most likely be working within a group of data professionals in a corporate environment whose rules and methods you will need to adapt to.

Practice Case

Ready to get a taste of what the work of a Data Scientist could look like?
Get ready for the Data Scientist REBECA Practice case! Remember, this practice case does not prepare you to become a Data Scientist; it only helps you decide if this is the profession for you.

After completing the case, please do the reflection exercises. They will help you clarify what you have experienced and make informed decisions.

Prerequisites for this practice case

You should have fundamental Python programming knowledge. Ideally, you know how to use NumPy, pandas and Matplotlib for data analysis and visualization. Seaborn is optional and helps produce more appealing figures.
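As a rough self-check of the expected level, you should be able to read and adapt a snippet like the one below. It is only a minimal sketch on a small made-up table: the film titles, genres, column names and numbers are all invented for illustration and are not part of the practice case data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented purely for illustration (values in million USD).
films = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "genre": ["Drama", "Comedy", "Drama", "Action"],
    "budget_musd": [10.0, 25.0, 5.0, 80.0],
    "revenue_musd": [30.0, 20.0, 15.0, 200.0],
})

# NumPy: element-wise arithmetic on whole arrays, without explicit loops.
budget = films["budget_musd"].to_numpy()
revenue = films["revenue_musd"].to_numpy()
films["roi"] = np.round((revenue - budget) / budget, 2)

# pandas: filter rows with a boolean mask and aggregate per group.
profitable = films[films["roi"] > 0]
mean_revenue_by_genre = films.groupby("genre")["revenue_musd"].mean()

print(films)
print(mean_revenue_by_genre)
```

If Matplotlib is installed, a call such as `films.plot.scatter(x="budget_musd", y="revenue_musd")` would show the same table as a scatter plot.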

Preparatory work

We recommend using Google Colab as an online interpreter. This way you do not need to install anything on your computer. Alternatively, you can use CoCalc.
ADVANCED ALTERNATIVE: Use Visual Studio Code or Jupyter Notebook.

Who are you?

Imagine you are a data scientist working in a team of 4 data scientists within the marketing department of a company in the film industry. The “Global and Strategy” department needs to understand which factors have the largest effect on the revenue of a film. Your team has been selected to carry out this project, and its results could increase the return on investment (ROI) of the movie production firm.

Your team goal: Use data to understand which parameters affect the economic potential of a specific movie. “How can we achieve the largest revenue?”
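One possible first look at this question, sketched here under stated assumptions, is to rank the numeric columns of a film data set by how strongly they correlate with revenue. The table and its column names (`budget_musd`, `runtime_min`, `revenue_musd`) are hypothetical and invented for illustration; this is not the method prescribed by the practice case, and a strong correlation does not by itself show that a factor causes higher revenue.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only (not real film figures).
films = pd.DataFrame({
    "budget_musd": [10, 20, 30, 40, 50],
    "runtime_min": [120, 95, 130, 100, 110],
    "revenue_musd": [25, 45, 70, 85, 110],
})

target = "revenue_musd"

# Pearson correlation of every numeric column with the revenue column.
correlations = films.corr()[target].drop(target)

# Rank candidate factors by the strength (absolute value) of the correlation.
ranking = correlations.abs().sort_values(ascending=False)
print(ranking)
```

In this toy table the budget column ranks first by construction. With real data, such a ranking is only a starting point for further exploration, for example with visualizations or a fitted model.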

Very important: you are in a team working for a corporate business, so all the steps you perform need to be well documented on the platform the company has designated for this purpose. This is important for the smooth functioning of your team and for the company; not doing so could cost the company a lot of money in the future.

Acknowledgements

This practice case has been created thanks to the input of three data scientists:

  • Jose Querales, Data Scientist and AI 2 at Carelon Global Solutions Ireland
  • Milan Zdravkovic, Research Lead at ZENPULSAR
  • Héctor Diez Nuño, Bioinformatician and data scientist at QGENOMICS

To develop this practice case, the organization hired the services of Alexander Britz and Daniel Mertens, founder of Scientists need more!

Guided reflection

After this experience, we suggest you reflect on the following questions:

  • Did you find the practice case easy or difficult to accomplish?
  • What was the most engaging task for you? Was it difficult or easy?
  • What was the most challenging task for you? Did you enjoy performing it? Would you see yourself getting better at it?
  • Have you found something new about this profession? What was it? Did it surprise you? Did you like it or dislike it?
  • Do you feel like contacting a data scientist in your network to find out a little more about the profession? Where would you find one?

Further information

[1] https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

[2] Several formalizations of data science workflows exist, which are similar in essence. See for example https://www.datascience-pm.com/data-science-workflow/, https://www.knowledgehut.com/blog/data-science/data-science-workflow, https://www.kdnuggets.com/2020/07/laymans-guide-data-science-workflow.html

[3] https://github.com/altair-viz/vega_datasets. Other online resources with free data sets exist. On www.kaggle.com you can find various data sets covering a wide range of complexity and subject areas. Scientific data repositories exist as well, for example at https://archive.ics.uci.edu. You could also search online for public data provided by your local or national government.

[4] https://www.mastersindatascience.org/careers/data-analyst-vs-data-scientist/

[7] The online documentation of the widely used Python libraries is often excellent and provides tutorials, detailed explanations of functionality, examples and much more. See for example https://numpy.org, https://matplotlib.org, https://pandas.pydata.org, https://seaborn.pydata.org, https://scikit-learn.org/

[8] For inspiration on different types of visualization, see https://www.data-to-viz.com