Monday 11 a.m.–11:30 a.m. in PyData
Tidy Data in Python
Aviv Rotman
- Audience level: Intermediate
Abstract
Ask any data scientist to name the most frustrating and time-consuming part of a data science project and, perhaps surprisingly, they won't say visualization, neural network architecture, or feature engineering. They will most probably say cleaning and shaping data. The struggle with messy data can make or break a project, and can sometimes hide the real gems the data has to show us. Many junior data practitioners shrug off this stage as mechanical and boring, and tend to put little thought into it.

It turns out that there is a "right" way to tidy data that allows for easy analysis and visualization down the line. Tidy data has a specific structure, which can be summarized in two statements: each variable is a column; each observation is a row. The simplicity of this strategy makes it easy to understand how to tidy data, and it requires only a small set of tools to deal with a wide range of messy datasets. These tools have been developed in the popular R packages dplyr and tidyr.

Alas, this is not an R conference, and we are but hapless Python developers. Is our fate to be left out in the cold with all our messy data? Not on my watch! In this talk we will learn about "tidy data", a strategy formulated by Hadley Wickham in 2014. We will also go over common cases of messy data and how to tidy them with Python tools, and we will see how this system lets us quickly achieve complex analyses and intuitive visualizations.
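To make the "each variable is a column; each observation is a row" rule concrete, here is a minimal sketch of tidying one common kind of messy table, where column headers are values rather than variable names. It assumes pandas as the Python tool (the abstract does not name a specific library), and the table, its column names, and its numbers are all made up for illustration.

```python
import pandas as pd

# Hypothetical messy table: the year is stored in the column headers,
# so "year" is a variable that has no column of its own.
messy = pd.DataFrame({
    "city": ["Haifa", "Eilat"],
    "2019": [10, 20],
    "2020": [11, 21],
})

# Reshape to tidy form: one row per (city, year) observation,
# one column per variable (city, year, visitors).
tidy = messy.melt(id_vars="city", var_name="year", value_name="visitors")
tidy["year"] = tidy["year"].astype(int)
tidy = tidy.sort_values(["city", "year"]).reset_index(drop=True)

# Once the data is tidy, a grouped summary is a one-liner.
mean_by_city = tidy.groupby("city")["visitors"].mean()
```

In the tidy form, grouping, filtering, and plotting can all refer to `year` by name, which is roughly what dplyr and tidyr make convenient in R.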
Relevant article and blog posts: