Data Science and [email protected] Harokopio University of ......Key-value stores (e.g. Redis,...

14
Data Science and Open Source Software Iraklis Varlamis Assistant Professor Harokopio University of Athens [email protected]

Transcript of Data Science and [email protected] Harokopio University of ......Key-value stores (e.g. Redis,...

Page 1: Data Science and varlamis@hua.gr Harokopio University of ......Key-value stores (e.g. Redis, Memcached): from integers or strings to JSON documents Wide column stores (e.g. Cassandra,

Data Science and Open Source Software

Iraklis VarlamisAssistant ProfessorHarokopio University of [email protected]

Page 2: Data Science and varlamis@hua.gr Harokopio University of ......Key-value stores (e.g. Redis, Memcached): from integers or strings to JSON documents Wide column stores (e.g. Cassandra,

OSS 2018, Athens, GreeceJune 8th 2018

What is data science?

2

Page 3: Data Science and varlamis@hua.gr Harokopio University of ......Key-value stores (e.g. Redis, Memcached): from integers or strings to JSON documents Wide column stores (e.g. Cassandra,

OSS 2018, Athens, GreeceJune 8th 2018

Why data science is important?More data (volume, variety,...) ⇒ better analysis ⇒ better decision making

New data (trends, events) ⇒ faster analysis ⇒ new opportunities3

Page 4: Data Science and varlamis@hua.gr Harokopio University of ......Key-value stores (e.g. Redis, Memcached): from integers or strings to JSON documents Wide column stores (e.g. Cassandra,

OSS 2018, Athens, GreeceJune 8th 2018

The skills of a data scientist *

● Education: Computer Science, Mathematics, Statistics

● Domain Knowledge● Intellectual Curiosity● Communication Skills

● Balance between technical, communication and presentation skills

*According to Matthew Mayo @ KDNuggets: https://www.kdnuggets.com/2016/05/10-must-have-skills-data-scientist.html

● Programming: R, Python● Machine Learning/Data

Mining:scikit-learn, KNIME, TensorFlow, Theano, PyTorch

● Big Data Processing: Hadoop, Spark, Flink or Storm

● Data management: SQL, noSQL ● Data Visualization: D3.js, Chart.js,

Tableau, ggplot or lattice in R, matplotlib,plotly or seaborn in Python

Non-technical skills Technical skills

4

Page 5: Data Science and varlamis@hua.gr Harokopio University of ......Key-value stores (e.g. Redis, Memcached): from integers or strings to JSON documents Wide column stores (e.g. Cassandra,

OSS 2018, Athens, GreeceJune 8th 2018

There is no “one ring to rule them all”Statisticians choose R:

● Many statistical tests and models ● Better in ad hoc analysis● New statistical models can be written

in a few lines● Comprehensive R Archive Network

(12576 packages)■ https://cran.r-project.org/

Programmers choose Python:

● Lots of code in stackoverflow● Better data manipulation● Supports scripting website or other

applications● 140,784 projects in Python Package

Index (PyPI)■ https://pypi.org/

Run Python from Rhttps://github.com/rstudio/reticulate

Run R from Pythonhttps://pypi.org/project/rpy2/

5

Page 6: Data Science and varlamis@hua.gr Harokopio University of ......Key-value stores (e.g. Redis, Memcached): from integers or strings to JSON documents Wide column stores (e.g. Cassandra,

OSS 2018, Athens, GreeceJune 8th 2018

● Document databases (e.g. MongoDB, CouchDB): free-form JSON documents

● Key-value stores (e.g. Redis, Memcached): from integers or strings to JSON documents

● Wide column stores (e.g. Cassandra, HBase): data in columns instead of rows

● Graph databases (e.g. Neo4j): networks or graphs of related entities

● Search engines (e.g. Elasticsearch, Solr): allow full-text or faceted search, search by synonyms etc.

Crunching Big Data and Analytics

● Distributed processing and cluster-computing

● Hadoop Common, Hadoop Distributed File System, Hadoop YARN (job scheduling and cluster resource management), Hadoop MapReduce (parallel processing)

● Apache Spark ○ With native bindings for Java, Scala,

Python and R ○ Supports SQL, streaming data,

machine learning, and graph processing

Store Process

6

Page 7: Data Science and varlamis@hua.gr Harokopio University of ......Key-value stores (e.g. Redis, Memcached): from integers or strings to JSON documents Wide column stores (e.g. Cassandra,

OSS 2018, Athens, GreeceJune 8th 2018

Process on the fly - Stream analytics

● Orchestrate (messaging): Kafka

● Stream data (or in mini-batches): Flink,

Storm or Spark streaming

● Process and predict: Spark MLLib

● Store results: MongoDB, Cassandra

● Visualize: D3.js, Plotly

Crunching Big Data and AnalyticsAnalyze and Visualize

● ETL

● Explore: Examine data sets to gain insights

● Learn and predict: Find trends, predict future

activity, Predictive analytics

● Visualize

7

Page 8: Data Science and varlamis@hua.gr Harokopio University of ......Key-value stores (e.g. Redis, Memcached): from integers or strings to JSON documents Wide column stores (e.g. Cassandra,

OSS 2018, Athens, GreeceJune 8th 2018

When it comes to data streams (e.g. IoT)

Source: InfoQ.com

8

Page 9: Data Science and varlamis@hua.gr Harokopio University of ......Key-value stores (e.g. Redis, Memcached): from integers or strings to JSON documents Wide column stores (e.g. Cassandra,

OSS 2018, Athens, GreeceJune 8th 2018

When it comes to social networksNetworks are graphs, and social network analysis requires graph operations at large scale!

Distributed storage and processing is the only solution.

9

TITANOLTP OLAP

ETL as graph traversal

Storage Indexing

Cassandra, HBase ,.. Elasticsearch, Solr

Gephi

Distributed notebook

Page 10: Data Science and varlamis@hua.gr Harokopio University of ......Key-value stores (e.g. Redis, Memcached): from integers or strings to JSON documents Wide column stores (e.g. Cassandra,

OSS 2018, Athens, GreeceJune 8th 2018

When it comes to document streams (e.g. tweets)The ELK stack

Text Mining NLP

N. Tsirakis, V. Poulopoulos, P. Tsantilas, I. Varlamis. (2017). Large scale opinion mining for social, news and blog data. Journal of Systems and Software 127: 237-248

10

Page 11: Data Science and varlamis@hua.gr Harokopio University of ......Key-value stores (e.g. Redis, Memcached): from integers or strings to JSON documents Wide column stores (e.g. Cassandra,

OSS 2018, Athens, GreeceJune 8th 2018

When it comes to Machine Learning and AI

Social Media

Data lake

Event Streaming

content fetched

(re)fetchcontent

archivecontent (ETL)

annotatecontent

Process Content

TranslateNERSentimentImage/Video tagging

annotatedcontent

Searchable Content

indexcontent

VisualizationSelected

data

Visualize

Collect

Store

ProcessSearch

Deliver Deliver

11

Page 12: Data Science and varlamis@hua.gr Harokopio University of ......Key-value stores (e.g. Redis, Memcached): from integers or strings to JSON documents Wide column stores (e.g. Cassandra,

OSS 2018, Athens, GreeceJune 8th 2018

“Open Source Is The New Normal... ...In Data and Analytics”, Scott Gnau, CTO at Hortonworks, @ Forbes, July 2017

● And how Open Source makes money?○ Customization of Code○ Sell Value-Added Enhancements○ Provide Support○ Sell Documentation○ Sell Binaries○ Offer OSS

● Open Source S...

12

Page 13: Data Science and varlamis@hua.gr Harokopio University of ......Key-value stores (e.g. Redis, Memcached): from integers or strings to JSON documents Wide column stores (e.g. Cassandra,

OSS 2018, Athens, GreeceJune 8th 2018

Open Source Services● Accelerated product development by eliminating product dependencies● Agile, Innovative and cost-effective development● Cloud services (Amazon, Microsoft, Google, IBM, Salesforce

13

Page 14: Data Science and varlamis@hua.gr Harokopio University of ......Key-value stores (e.g. Redis, Memcached): from integers or strings to JSON documents Wide column stores (e.g. Cassandra,

OSS 2018, Athens, GreeceJune 8th 2018

Thank you and

questions

14