notes

My Guest Today

Principal Data Scientist and Jack of all Trades Sabina Stanescu

Data Science Hierarchy of Needs

Data Science Innovation vs Impact

Notes, Links, and Corrections

The image we can’t show in a podcast is just above. It came from: https://towardsdatascience.com/prioritizing-data-science-work-936b3765fd45#093e.
Points Travel is a great example of adding business value without machine learning. After making big initial improvements, Points did further improve the system with machine learning later on. Everyone who worked on the project was really proud of it. Shameless promotion & shout-out to my (and Sabina’s former) employer!
Multi-armed Bandit.
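For anyone new to the term: a multi-armed bandit problem is about repeatedly choosing between options with unknown payoffs, while balancing trying new options against sticking with the best one so far. Below is a minimal sketch of one common strategy, epsilon-greedy. It is not necessarily the approach discussed in the episode, and the function name, conversion rates, and parameters are all made up for illustration.

```python
import random

def epsilon_greedy(true_rates, rounds=5000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy bandit over arms with hidden success rates."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    pulls = [0] * n_arms        # how often each arm was chosen
    estimates = [0.0] * n_arms  # running average reward per arm
    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick a random arm
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = 1 if rng.random() < true_rates[arm] else 0
        pulls[arm] += 1
        # Incremental update of the arm's average reward.
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
    return pulls, estimates

# Hypothetical conversion rates for three page variants.
pulls, estimates = epsilon_greedy([0.02, 0.05, 0.20])
print(pulls, estimates)
```

A higher epsilon explores more and learns faster, but spends more rounds on weaker options.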
NLP is Natural Language Processing. It’s the practice of extracting meaningful data from written or spoken human language.
The coworker Sabina mentions is Maliha Islam, a rising star in Toronto’s tech industry. Sabina and I have both had the pleasure of working with her.
The Data Science Hierarchy of Needs was explained in this video.
You can play with Jupyter Notebooks from your browser for Python, R, Ruby, etc.
Python, R, Scala, and Julia are all programming languages, each with unique benefits and drawbacks.
Pickling is the process of serializing Python objects into bytes, usually to save them to a file. For example, object() cannot be saved as JSON or XML out of the box, but it can be pickled.
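Here is a small sketch of that difference using only the standard library. The payload and file name are hypothetical; the set inside it stands in for any Python object that JSON cannot encode directly.

```python
import json
import pickle
import tempfile
from pathlib import Path

# Hypothetical payload: JSON has no encoding for a Python set.
payload = {"show": "example", "topics": {"nlp", "kaggle", "keras"}}

try:
    json.dumps(payload)
    json_failed = False
except TypeError:
    json_failed = True  # json raises TypeError for the set

# Pickle the payload to a file, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "payload.pkl"
    with path.open("wb") as fh:
        pickle.dump(payload, fh)
    with path.open("rb") as fh:
        restored = pickle.load(fh)

# A bare object() also survives a pickle round trip.
bare = pickle.loads(pickle.dumps(object()))
print(json_failed, restored == payload, type(bare).__name__)
```

One caution: only unpickle data you trust, because loading a pickle can execute arbitrary code.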
Harvard Business Review from 2012, Data Scientist: The Sexiest Job of the 21st Century.
Sabina learned Natural Language Processing (NLP) so that she could QA (quality assurance check) someone else’s work. Imagine learning to assemble a bicycle so you could inspect a bike manufacturer before they started shipping. What a badass!
Data Scientists are people of all backgrounds. A report from 365DataScience found 42% of Data Scientists had previously held a Data Science role, meaning 58% came from other jobs. source.
R programming language, designed and used primarily for statistical computing.
Python programming language, designed to be general-purpose. Many great libraries exist to help data scientists get things done.
SAS (Statistical Analysis System) is a language for data prep and statistical analysis.
PySpark is the Python API for Apache Spark, an analytics engine for large-scale data processing. See Getting Started with PySpark.
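CRISP-DM (Cross-Industry Standard Process for Data Mining) was developed in the late 1990s by an industry consortium that included SPSS (now part of IBM). It is often cited as the most widely used process model for analytics projects.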
Dear Sergio, it’s pronounced statistical significance.
What is a Docker Container.
What is a Python Virtualenv. Pipenv is now a popular alternative that creates and manages the virtual environment for you.
What is a CI/CD pipeline (Continuous Integration, Continuous Deployment).
Kaggle is an online community of data scientists. They regularly host competitions.
The perfect place to start is Kaggle’s Titanic machine learning competition. It has been attempted by 11,800 groups/individuals at the time of writing.
Hyper-Parameter Tuning means adjusting your model’s settings to optimize its performance (without overfitting your training data). A Hyper-Parameter is anything decided before training begins, for example deciding that your model’s decision tree will be no more than 5 levels deep.
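To make the idea concrete, here is a minimal sketch of tuning one hyper-parameter with a held-out validation set. To stay self-contained it swaps the decision tree for a tiny k-nearest-neighbours classifier, so the hyper-parameter is k rather than tree depth. The data is synthetic, and every name and number here is an assumption for illustration only.

```python
import random

def knn_predict(train, x, k):
    """Predict a 0/1 label for x by majority vote of the k nearest points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def accuracy(train, held_out, k):
    """Fraction of held-out points that are labelled correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in held_out)
    return hits / len(held_out)

rng = random.Random(42)
# Synthetic data: label is 1 when x > 0.5, with about 15% of labels flipped.
points = []
for _ in range(300):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.15:
        y = 1 - y
    points.append((x, y))

train, validation, test = points[:180], points[180:240], points[240:]

# k is chosen before "training"; odd values avoid tied votes.
candidate_ks = [1, 3, 5, 9, 15]
scores = {k: accuracy(train, validation, k) for k in candidate_ks}
best_k = max(scores, key=scores.get)

# Report on data that played no part in choosing k.
test_accuracy = accuracy(train, test, best_k)
print(scores, best_k, test_accuracy)
```

Choosing k on the validation set rather than the training set is what guards against overfitting, and the untouched test set gives an honest final score.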
Keras is TensorFlow’s high-level API for building and training deep learning models.
TensorFlow also has a great getting-started tutorial: https://www.tensorflow.org/tutorials/keras/classification.

Closing music: Sun K - High in the City