CS5304 Data Science in the Wild

Course Description

Massive amounts of data are collected by many companies and other organizations, creating new opportunities for data scientists, but also raising several interesting challenges in extracting meaningful and actionable knowledge from data. Creating efficient and impactful data science processes is not an easy task: forming analysis questions is hard, data is messy, the volume and dimensionality of data are massive, and closing the loop in business and research operations is tough. The course aims to provide a comprehensive set of tools for extracting knowledge from data: forming analysis questions and measures; data manipulation, extraction, and labeling; efficient data analysis; and reporting and visualizing conclusions. This course will focus on the unique challenges that arise from the practical aspects of the field, relying on business and research case studies to highlight the full process of data science.

Prerequisites

CS 5785 or equivalent, plus experience programming in Python; or permission of the instructor.

Room & Time

Mondays and Wednesdays, 4:45PM-6:00PM, 131 Bloomberg Center, Cornell Tech

Class number: 12791

Links: CMS for homework submission, wild-data-science.slack.com for discussions.

Reading (not mandatory)

Jake VanderPlas, Python Data Science Handbook, O'Reilly Media; 1st edition (2016). Free book.
Russell Jurney, Agile Data Science 2.0: Building Full-Stack Data Analytics Applications with Spark, O'Reilly Media; 1st edition (2017).
Foster Provost and Tom Fawcett, Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking, O'Reilly Media; 1st edition (2013).
A. Rajaraman, J. Leskovec and J. Ullman, Mining of Massive Datasets, Cambridge University Press; 3rd edition.