PyCon Canada 2019

From hot mess to information, or why you should spend more time processing your data

by Serena Peruzzo

Community, Social, Ethics, and Education

Machine Learning & Data Science

Over the last few years, machine learning has drawn a lot of attention from both inside and outside the data science community. The internet is flooded with articles on the latest or coolest algorithms. What these articles often don’t cover is that at the beginning of your project, you’ll be spending a lot of time collecting, cleaning and otherwise pre-processing your data, no matter what type of project or model you’re working on. There’s a tendency to dismiss this first stage as mundane, but this couldn’t be further from the truth. This first, exploratory stage of the analysis is when you’ll learn the most about the information that is available for solving your problem and how to harness it. In this talk, I’ll use practical examples to describe some of the statistical techniques that I’ve found most useful over the years. Some, like box plots, offer a simple way to detect outliers and inconsistencies. Others, like imputation, are more complex and can even leverage machine learning. These methods can be combined in multiple ways to create useful representations of data, making it a whole lot easier to build a good model.

About the Author

Serena is a senior data scientist at the analytics consultancy Bardess, currently based in Toronto, Canada. Before joining Bardess, she worked both in academia as an ML researcher and in industry as a data science consultant in the Australian, British and Canadian markets. Serena is passionate about education, community and tech for good, and she splits her free time between mentoring data science students, organizing meetups and volunteering.

Talk Details

Date: Saturday Nov. 16

Location: Sky Room

Begin time: 15:00

Duration: 25 minutes
