Transcript of "What is Data Science?" (srdas.github.io/Presentations/DataScienceLandscape.pdf)
What is Data Science?
Three Economic Stages
A. The Producer Economy
B. The Consumer Economy
C. The Creator Economy:
   i. We produce and consume
   ii. The currency is information
(Paul Saffo, Stanford)
• Data + Algorithms (ML) + Talent
• Talent = Business + Math/Stat + Code + Communication
Big Data – the Human Side
Who is a Data Scientist?
How Do You Become a Data Scientist?
Skills: Programming, Math, Stats
● Python
● R
● Database Management (SQL, noSQL)
● Visualization (Tableau)
● Cloud computing: Hadoop, Spark, AWS, etc.
● Econometrics
● Machine Learning
  ○ Supervised vs Unsupervised learning
  ○ Regression vs Classification techniques
● Business skills
● Advanced skills, e.g., Deep Learning: TensorFlow, Caffe, Torch, Theano.
https://www.youtube.com/watch?v=oZikw5k_2FM&feature=youtu.be
Meetup Groups
● http://www.meetup.com/R-Users/
● http://www.meetup.com/BAyPIGgies/
● http://www.meetup.com/SF-Bay-ACM/
● http://www.meetup.com/SF-Bay-Areas-Big-Data-Think-Tank/
Resources
Beginner's introduction to coding in R: http://www.computerworld.com/article/2497143/business-intelligence/business-intelligence-beginner-s-guide-to-r-introduction.html
My free book on data science may be downloaded here: http://algo.scu.edu/~sanjivdas/WebBook/DSA_Book.pdf
This is a work in progress, so keep the link, as I add more to the book from time to time. There is still plenty of cleanup left to finish, but you can use the code to try out analyses as you go. The book uses the R programming language, which you should find useful and a solid complement to Python.
The best Python book for data science is by Wes McKinney: http://www.cin.ufpe.br/~embat/Python%20for%20Data%20Analysis.pdf
You might also want to take the online course on Data Science offered by Coursera: https://www.coursera.org/specializations/jhu-data-science
Good R tutorial: http://www.cyclismo.org/tutorial/R/
What is Different About Data Science?
Core Ideas
Econometrics                             | Machine Learning
Statistics                               | Algorithms
Regression, Vector AutoRegression, etc.  | K-Means, Deep Learning, Support Vector Machines
Machine Learning = {Representation, Evaluation, Optimization}
Important Concepts
● Training data (in-sample), Test data (out-of-sample).
● Cross-validation
Once these steps have been undertaken and the best ML algorithm is chosen on the training data, we may validate the model on out-of-sample data, or the test data set. One may randomly choose a fraction of the data sample to hold out for validation. Repeating this process by holding out different parts of the data for testing, and training on the remainder, is a process known as cross-validation and is strongly recommended.
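The cross-validation procedure described above can be sketched in plain Python. This is a minimal illustration, not code from the deck; the mean-predictor model at the end is a hypothetical stand-in so the loop has something to fit:

```python
import random

def k_fold_cross_validation(xs, ys, k, fit, predict):
    """Estimate out-of-sample error by k-fold cross-validation.

    `fit(train_xs, train_ys)` returns a model; `predict(model, x)` returns
    a prediction. Each fold is held out once while the rest trains the model.
    """
    indices = list(range(len(xs)))
    random.Random(0).shuffle(indices)          # fixed seed for reproducibility
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        mse = sum((predict(model, xs[i]) - ys[i]) ** 2 for i in fold) / len(fold)
        errors.append(mse)
    return sum(errors) / k                     # average held-out error

# Toy model: always predict the training mean.
fit_mean = lambda xs, ys: sum(ys) / len(ys)
predict_mean = lambda model, x: model

xs = list(range(20))
ys = [2.0 * x for x in xs]
cv_error = k_fold_cross_validation(xs, ys, 5, fit_mean, predict_mean)
```

Each observation is held out exactly once, so the averaged error reflects performance on data the model never saw during fitting.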
● Over-fitting
If repeated cross-validation gives poor results even though in-sample testing does very well, that is possible evidence of over-fitting. Over-fitting usually occurs when the model is over-parameterized in-sample, so that it fits very well but becomes less useful on new data. This is akin to driving by looking in the rear-view mirror, which does not work well when the road does not remain straight going forward. Therefore, simpler, less parameterized models often work better in forecasting and prediction settings. If the model performs about the same in-sample and out-of-sample, it is very unlikely to be overfit. The argument that simpler models overfit less is often made with Occam's Razor in mind, but this intuition is not always accurate, so simpler may not always be better.
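The in-sample/out-of-sample gap can be demonstrated with a deliberately over-parameterized model. This is an illustrative sketch (not from the deck): a "memorizer" that returns the y of the nearest training x fits the training data perfectly, while a simple least-squares line generalizes better on noisy linear data:

```python
import random

rng = random.Random(42)

def simulate(n):
    """Noisy linear data: y = 2x + 1 + noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0, 2.0) for x in xs]
    return xs, ys

train_x, train_y = simulate(100)
test_x, test_y = simulate(100)

# Over-parameterized "memorizer": predict the y of the nearest training x.
def memorizer(x):
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]

# Simple model: ordinary least squares line, fit in closed form.
n = len(train_x)
mx, my = sum(train_x) / n, sum(train_y) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(train_x, train_y))
         / sum((x - mx) ** 2 for x in train_x))
intercept = my - slope * mx
linear = lambda x: slope * x + intercept

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_mse_memo = mse(memorizer, train_x, train_y)  # exactly zero in-sample
test_mse_memo = mse(memorizer, test_x, test_y)     # degrades out-of-sample
test_mse_line = mse(linear, test_x, test_y)        # close to the noise floor
```

The memorizer's in-sample error is zero by construction, yet out-of-sample it loses to the far simpler line, which is the rear-view-mirror effect in miniature.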
Feature Selection
● Wikipedia defines feature selection as – “In machine learning and statistics, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction.”
● Feature selection subsets the variable space. If there are p columns of data, then we choose q << p variates. Feature extraction, on the other hand, refers to transformation of the original variables to create new variables, i.e., functionals of p, such as g(p). We will encounter these topics later on, as we work through various ML techniques.
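A minimal feature-selection sketch, ranking candidate columns by absolute Pearson correlation with the target and keeping the top q (one simple filter criterion among many; the data and column names here are synthetic assumptions, not from the deck):

```python
import random
from math import sqrt

rng = random.Random(7)

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def select_features(columns, target, q):
    """Keep the q of p candidate columns most correlated with the target."""
    scores = {name: abs(pearson(col, target)) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:q]

# Synthetic data: x1 drives y, while x2 and x3 are pure noise.
n = 200
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
x3 = [rng.gauss(0, 1) for _ in range(n)]
y = [3.0 * a + rng.gauss(0, 0.5) for a in x1]

selected = select_features({"x1": x1, "x2": x2, "x3": x3}, y, q=1)
```

Correlation filtering is cheap but only sees one variable at a time; wrapper and embedded methods (covered with the ML techniques later) account for interactions between features.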
Ensemble Learning
● Bagging: calibrate the same model to different subsamples of the training data, delivering multiple similar, but different, models. Each of these models is then used to classify out-of-sample, and the decision is made by voting across models.
● Boosting: After one pass of calibration, training examples are reweighted such that the cases where the ML algorithm made errors (as in a classification problem) are given higher weight in the loss function. By penalizing these observations, the algorithm learns to prevent those mistakes as they are more costly.
● Stacking: Models are chained to each other, so that the output of low-level models becomes the input of another higher-level model. Here models are vertically integrated in contrast to bagging, where models are horizontally integrated.
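The bagging idea above can be sketched end to end. This is an illustrative toy (not from the deck): the base learner is a hypothetical one-feature "decision stump," refit on bootstrap resamples of the training data, with the final classification made by majority vote:

```python
import random

rng = random.Random(0)

def fit_stump(xs, ys):
    """One-feature threshold classifier: try each observed value as a
    cutoff and keep the (threshold, direction) with fewest training errors."""
    best = None
    for t in xs:
        for above in (0, 1):
            preds = [above if x >= t else 1 - above for x in xs]
            err = sum(p != y for p, y in zip(preds, ys))
            if best is None or err < best[0]:
                best = (err, t, above)
    _, t, above = best
    return lambda x: above if x >= t else 1 - above

def bagged_classifier(xs, ys, n_models=25):
    """Bagging: fit the same stump on bootstrap resamples of the training
    data, then classify new points by majority vote across the models."""
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))

    def predict(x):
        votes = sum(m(x) for m in models)
        return 1 if votes * 2 >= n_models else 0

    return predict

# Synthetic labels: class 1 iff x > 5, with a little label noise.
xs = [rng.uniform(0, 10) for _ in range(80)]
ys = [(1 if x > 5 else 0) ^ (rng.random() < 0.05) for x in xs]
predict = bagged_classifier(xs, ys)
```

Voting across bootstrap-trained models averages away some of each stump's sensitivity to the noisy labels, which is exactly why bagging reduces variance. Boosting would instead reweight the misclassified points between passes, and stacking would feed these stumps' outputs into a higher-level model.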
Dark Side 2: Privacy
• Social interaction vs solitude and privacy. Technology makes this trade-off steeper.
• Profiling (e.g., Groupon)
• Price discrimination
• Security vs Efficiency
From Cirrus Shakeri, "From Big Data to Intelligent Applications," post, January 2015
https://www.linkedin.com/pulse/from-big-data-intelligent-applications-cirrus-shakeri
http://www.odbms.org/blog/2016/02/a-grand-tour-of-big-data-interview-with-alan-morrison/