Put Down That Checkbook! - Big Data without the Big Bucks
Charlie Greenbacker Director of Data Science
Altamira Technologies Corporation
Agenda
• What is a Data Scientist?
• Why use Open Source Software (OSS)?
• Survey of OSS Tools for Data Science
About me: @greenbacker
Theories: popular tripe
Methods: sloppy
Conclusions: highly questionable
photo: Columbia Pictures
Best reason for not finishing PhD
@ExploreAltamira
WHAT IS A DATA SCIENTIST?
credit: Drew Conway (http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram)
“A data scientist is someone who understands the domains of programming, machine learning, data mining, statistics, and hacking”
Paul Cooper, ITProPortal.com http://www.itproportal.com/2014/02/11/how-to-pick-a-data-scientist-the-right-way/
Data Science
• Computer Programming
• Mathematics & Analytic Methodology
• Distributed Computing & Big Data
• Domain Knowledge & Communication Skills
• Statistical Analysis, Data Mining, Machine Learning, Natural Language Processing, Social Network Analysis, Data Visualization, etc.
Altamira Technologies Corporation 2014
WHY USE OSS?
What is Open Source Software (OSS)?
The Open Source Definition:
1. Free Redistribution
2. Source Code
3. Derived Works
more: opensource.org
WHY USE OSS?
photo: Karen (https://flic.kr/p/5njby2)
THERE ARE NO SILVER BULLETS.
photo: Paul Inkles (https://flic.kr/p/e2QMS5)
IF YOUR BOSS BUYS SOMETHING, YOU DAMN WELL BETTER USE IT.
photo: Valugi (http://bit.ly/1jrvVBC)
BUDGETS DON’T SCALE.
SURVEY OF OSS TOOLS FOR DATA SCIENCE
Statistical Analysis
Name: R
Creator: Gentleman, Ihaka, et al.
License: GPL Version 2
Website: r-project.org
Source: cran.us.r-project.org/src/base/
Features:
– Language & environment for statistical computing & viz
– Linear and nonlinear modeling, classical statistical tests, time-series analysis, graphical techniques, and more…
– 5000+ packages available in CRAN repository
Data Mining
Name: Pandas
Creator: Wes McKinney, et al.
License: BSD 3-Clause License
Website: pandas.pydata.org
Source: github.com/pydata/pandas
Features:
– Data analysis workflow in Python
– DataFrame object for fast manipulation & indexing
– Tools for reading & writing data between formats
– Label-based slicing, indexing, and subsetting of data
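As a rough illustration of the features listed above, here is a minimal sketch using pandas. The table contents and the column names `language` and `score` are invented for the example and are not from the talk.

```python
import io

import pandas as pd

# Hypothetical sample data, invented purely for illustration.
df = pd.DataFrame(
    {
        "tool": ["R", "pandas", "NLTK", "Gephi"],
        "language": ["R", "Python", "Python", "Java"],
        "score": [10, 25, 15, 5],
    }
).set_index("tool")

# Label-based slicing: rows "pandas" through "NLTK" (inclusive), one column.
subset = df.loc["pandas":"NLTK", "language"]

# Boolean indexing followed by an aggregation.
python_total = df[df["language"] == "Python"]["score"].sum()

# Writing and reading between formats: round-trip through CSV text.
csv_text = df.to_csv()
restored = pd.read_csv(io.StringIO(csv_text), index_col="tool")

print(subset)
print(python_total)
print(restored)
```

Note that `.loc` slices by label and includes both endpoints, unlike ordinary Python position-based slicing.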
Data Mining
Name: Impala
Creator: Cloudera
License: Apache License 2.0
Website: impala.io
Source: github.com/cloudera/impala
Features:
– MPP query engine implemented on Hadoop
– Low latency, high concurrency SQL & BI queries
– Same interfaces as Apache Hive, but ~24x faster
– Written in C++; does not use MapReduce
Machine Learning
Name: Mahout
Creator: ASF
License: Apache License 2.0
Website: mahout.apache.org
Source: svn.apache.org/viewvc/mahout
Features:
– Distributed/scalable ML library for Hadoop
– Classification, Clustering, Collaborative filtering
– Logistic regression, naïve Bayes, random forest, neural networks, HMM, k-means, SVD, PCA, ALS, LDA, etc.
Machine Learning
Name: Scikit-learn
Creator: Cournapeau, et al.
License: BSD 3-Clause License
Website: scikit-learn.org
Source: github.com/scikit-learn/scikit-learn
Features:
– ML library for Python built on NumPy, SciPy, matplotlib
– Support for classification, clustering, dimensionality reduction, regression, model selection, preprocessing
– SVM, k-NN, PCA, NNMF, crossval, feature extraction, ...
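To make the feature list above concrete, the following sketch chains preprocessing, PCA, and a k-NN classifier, then scores the pipeline with cross-validation. It assumes a recent scikit-learn release (module paths such as `sklearn.model_selection` may differ from the version current when the talk was given) and uses the bundled iris dataset only as a convenient stand-in. The choices of 2 components, 5 neighbors, and 5 folds are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small dataset bundled with scikit-learn; used here only as an example.
X, y = load_iris(return_X_y=True)

# Preprocessing -> dimensionality reduction -> classification.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=5),
)

# 5-fold cross-validation; returns one accuracy score per fold.
scores = cross_val_score(model, X, y, cv=5)
mean_accuracy = scores.mean()

print(scores)
print(mean_accuracy)
```

Wrapping the steps in a pipeline means the scaler and PCA are refit on each training fold, so no information from the held-out fold leaks into preprocessing.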
Machine Learning + NLP
Name: Mallet
Creator: UMass (McCallum, et al.)
License: Common Public License 1.0
Website: mallet.cs.umass.edu
Source: hg-iesl.cs.umass.edu/hg/mallet
Features:
– Java-based “Machine Learning for Language Toolkit”
– Document classification, clustering, topic modeling, information extraction & sequence tagging, etc.
– Efficient implementation of LDA for topic modeling
Natural Language Processing
Name: NLTK
Creator: Bird, Loper, et al.
License: Apache License 2.0
Website: nltk.org
Source: github.com/nltk/nltk
Features:
– Natural Language Toolkit for Python
– Built-in support for dozens of corpora & trained models
– Libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning
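A small sketch of the tokenization and stemming features mentioned above. It deliberately sticks to parts of NLTK that need no separate corpus or model download (`wordpunct_tokenize` and `PorterStemmer`). The sample sentence is made up for the example.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Hypothetical input sentence, invented for illustration.
text = "Data scientists are analyzing tweets and parsing documents."

# Regex-based tokenizer: splits on whitespace and separates punctuation.
tokens = wordpunct_tokenize(text)

# Reduce each alphabetic token to its Porter stem.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens if token.isalpha()]

print(tokens)
print(stems)
```

Taggers, parsers, and the bundled corpora generally require fetching data with `nltk.download(...)` first, which is why they are left out of this sketch.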
Natural Language Processing
Name: Stanford CoreNLP
Creator: Stanford NLP Group
License: GPL Version 2
Website: nlp.stanford.edu/software/corenlp.shtml
Source: github.com/stanfordnlp/CoreNLP
Features:
– Suite of high-quality, Java-based NLP tools
– Includes POS tagger, named entity recognizer, parser, coreference resolution, sentiment analysis, SUTime, etc.
– Includes models for English, Chinese, Arabic, German
NLP + Geospatial Analysis
Name: CLAVIN
Creator: Berico Technologies
License: Apache License 2.0
Website: clavin.io
Source: github.com/Berico-Technologies/CLAVIN
Features:
– Extracts location names from text, resolves to gazetteer
– Employs context-based geospatial entity resolution
– ~75% accuracy, processes 1M documents per hour
– Built on Hadoop, CoreNLP, OpenNLP, GeoNames.org
Social Network Analysis
Name: NetworkX
Creator: Los Alamos National Lab
License: BSD 3-Clause License
Website: networkx.github.io
Source: github.com/networkx/networkx
Features:
– Python structures for graphs, digraphs, & multigraphs
– Support for creating, manipulating, & analyzing the structure, dynamics, & functions of complex networks
– Provides standard graph algorithms & analysis metrics
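The graph structures and metrics described above might be used roughly as follows. The five-person network is a toy example invented here, not data from the talk.

```python
import networkx as nx

# Hypothetical toy social network: a triangle (ann, bob, cat) with a tail.
G = nx.Graph()
G.add_edges_from(
    [
        ("ann", "bob"),
        ("bob", "cat"),
        ("cat", "ann"),
        ("cat", "dan"),
        ("dan", "eve"),
    ]
)

# Standard analysis metrics.
betweenness = nx.betweenness_centrality(G)  # normalized by default
clustering = nx.clustering(G)               # local clustering coefficient

# Node that lies on the largest share of shortest paths.
most_central = max(betweenness, key=betweenness.get)

# Standard graph algorithm: unweighted shortest path.
path = nx.shortest_path(G, "ann", "eve")

print(most_central)
print(path)
print(clustering)
```

In this toy graph, "cat" bridges the triangle and the tail, so it has the highest betweenness. Swapping `nx.Graph()` for `nx.DiGraph()` or `nx.MultiGraph()` gives the directed and multi-edge variants named in the feature list.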
Social Network Analysis
Name: Gephi
Creator: UTC France
License: GPL Version 3
Website: gephi.org
Source: github.com/gephi/gephi
Features:
– Network analysis and visualization package for Java
– Dynamic network analysis with temporal filtering
– Metrics include: community detection, betweenness, closeness, clustering coefficient, PageRank, etc.
Data Visualization
Name: D3.js
Creator: Mike Bostock
License: BSD 3-Clause License
Website: d3js.org
Source: github.com/mbostock/d3
Features:
– JavaScript library based on HTML, SVG, and CSS
– Binds data to DOM & enables transformations
– ~200 examples, including: force-directed graphs, choropleths, treemaps, dendrograms, animations, etc.
Fusion, Analysis, and Visualization
Name: Lumify
Creator: Altamira
License: Apache License 2.0
Website: lumify.io
Source: github.com/altamiracorp/lumify
Features:
– Built on Hadoop, Storm, Accumulo, Elasticsearch, etc.
– Integrates structured data, text, images, video
– Cell-level security & access controls
– Live, shared collaborative workspaces
Final Thought…
Save your $$$ for:
– People: salaries, training, etc.
– Resources: hardware, AWS, etc.
– Proprietary software: if no viable OSS alternative exists
photo: Brett Weinstein (http://bit.ly/1dHXvqJ)
FINAL THOUGHT
Springer’s
open source software for data scientists
oss4ds.com
Charlie Greenbacker @greenbacker | oss4ds.com