Notes: You can't do data science in a GUI
Challenge of DS: taking a vague question and making it precise enough that you can answer it quantitatively
Import data -> make it tidy -> understand it (understanding the data is the core job of a data scientist)
Understanding is a loop of:
- Transforming: create new variables and summaries
- Visualizing: surprises but doesn’t scale
  - You can start with a vague question and the viz will help you refine it
  - You can see something you don’t expect
  - It doesn’t scale because there is a human in the loop
- Modeling: scales but doesn’t fundamentally surprise
  - A model makes assumptions it cannot question, so it cannot fundamentally surprise you
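A minimal sketch of one pass through that loop, in plain Python rather than R. The trip data and every name below are made up for illustration; nothing here comes from the talk itself.

```python
from statistics import mean

# Hypothetical toy data: (distance_km, duration_min) for a few trips.
trips = [(1.0, 6.0), (2.0, 11.0), (3.0, 16.0), (4.0, 21.0)]

# Transform: create a new variable (speed in km/h) and a summary (its mean).
speeds = [dist / (mins / 60) for dist, mins in trips]
mean_speed = mean(speeds)

# Visualize: a crude text bar chart. A human has to look at it,
# which is why this step can surprise but does not scale.
for dist, mins in trips:
    print(f"{dist:4.1f} km | " + "#" * round(mins))

# Model: least-squares line, duration = intercept + slope * distance.
# It assumes a straight line and cannot question that assumption.
xs = [dist for dist, _ in trips]
ys = [mins for _, mins in trips]
x_bar, y_bar = mean(xs), mean(ys)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
intercept = y_bar - slope * x_bar
print(f"duration = {intercept:.1f} + {slope:.1f} * distance")
```

The model step runs unattended on any amount of data, but it will always report a line, even when the data is not linear.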
After understanding you want to:
- Communicate: share results with a supervisor or anyone else
- Automate: work out how to deploy the analysis
Why program if I want to do DS?
- What you do in a DS problem: think it (decide what you are going to do) -> describe it (precisely, so the computer can understand) -> do it (the computer runs it)
Two extremes:
- Excel: you can see what you can do, but you are constrained
- Code: you don’t see what’s possible (just a blinking cursor), but you are free to do anything
Programming languages are languages: you can express your thoughts in them. Expressing thoughts can be hard (code as text makes it easier).
Code is just text:
- You can copy and paste it. It is readable, diffable, reproducible, open, and shareable (email it, tweet it).
- Thanks to that, tools like Stack Overflow and GitHub exist and are amazing
- Narratives built around code, using notebooks and R Markdown
- Combine pieces to solve bigger problems
As data changes, the code just needs to run again and everything falls into place.
- If you use Word, you have to rerun the analysis, copy-paste the images, and redo everything by hand
In a GUI: you live in fear of clicking the wrong thing and making it permanent, since it's hard to roll back
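A small sketch of the "just run it again" idea, in Python. The function name `build_report` and the numbers are hypothetical; the point is only that the whole output is regenerated from the data, with nothing pasted by hand.

```python
from statistics import mean


def build_report(values):
    """Return the whole report as text, regenerated from the data."""
    lines = [
        f"n = {len(values)}",
        f"mean = {mean(values):.2f}",
        f"max = {max(values)}",
    ]
    return "\n".join(lines)


# First version of the data.
first = build_report([3, 5, 8])

# New data arrives: rerun the same code, nothing is edited by hand.
updated = build_report([3, 5, 8, 13])

print(first)
print("---")
print(updated)
```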
Why use R
R is a vector language
Missing values included
- This logic can be tricky
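A rough emulation in Python of why the missing-value logic can be tricky. It uses `None` as a stand-in for R's `NA`, and the helpers `na_and`, `na_or`, and `na_mean` are hypothetical names written for this sketch, not R or library functions. The behavior mimicked is R's: `NA & FALSE` is `FALSE`, `NA | TRUE` is `TRUE`, and `mean()` returns `NA` unless missing values are dropped with `na.rm = TRUE`.

```python
NA = None  # stand-in for R's NA (an assumption of this sketch)


def na_and(a, b):
    """Three-valued AND: a definite False wins even if the other side is missing."""
    if a is False or b is False:
        return False
    if a is NA or b is NA:
        return NA
    return True


def na_or(a, b):
    """Three-valued OR: a definite True wins even if the other side is missing."""
    if a is True or b is True:
        return True
    if a is NA or b is NA:
        return NA
    return False


def na_mean(values, na_rm=False):
    """Mean that propagates missing values unless na_rm is set."""
    if not na_rm and any(v is NA for v in values):
        return NA
    kept = [v for v in values if v is not NA]
    if not kept:
        return float("nan")  # nothing left to average
    return sum(kept) / len(kept)


print(na_and(NA, False))                 # the unknown value cannot change the answer
print(na_and(NA, True))                  # the answer depends on the unknown value
print(na_mean([1, NA, 3]))               # missing in, missing out
print(na_mean([1, NA, 3], na_rm=True))   # explicit opt-in to dropping missing values
```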
Data Frame (table) included
Functional programming
- You work by chaining functions together and passing functions to other functions
- This is a good fit for DS
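A minimal sketch of that style, in Python rather than R. The helper `pipe` and the two step functions are hypothetical names for this example; `pipe` simply feeds a value through a sequence of functions, one after another.

```python
from functools import reduce


def pipe(value, *steps):
    """Pass value through each function in steps, left to right."""
    return reduce(lambda acc, step: step(acc), steps, value)


def drop_missing(xs):
    """Keep only the entries that are not None."""
    return [x for x in xs if x is not None]


def square_all(xs):
    """Apply a function to every element (a function taking a function)."""
    return list(map(lambda x: x * x, xs))


# Clean, transform, summarize: each step is a small function.
result = pipe([1, None, 2, 3], drop_missing, square_all, sum)
print(result)
```

Each step is small and reusable, and the analysis is the chain that connects them.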
Meta-programming
- Lets you create DSLs (domain-specific languages) that express things in different ways for different tasks (e.g. plotting and cleaning data)
No matter how complex and polished the individual operations are, it is often the quality of the glue that most directly determines the power of the system - Hal Abelson
GUI
Some things are hard (or maybe just painful) to express in code, like changing variable names. UI plugins that are tied into code (or produce code) are useful for this kind of stuff.
Autocomplete is key
Summary
- Huge advantages to code
- R provides a great environment - doesn’t mean Python sucks! :)
- DSLs help express your thoughts
- Code should be the primary artifact (but it might be generated by means other than typing)