Exploring Language Classification with Apache Spark and the Spark Notebook

A practical introduction to interactive Data Engineering

Gerard Maas

Lead Engineer @ Kensu

Computer Engineer

Scala Programmer

Early Spark Adopter

Spark Notebook Dev

Cassandra MVP (2015, 2016)

Stack Overflow Top Contributor (Spark, Spark Streaming, Scala)

Wannabe IoT Hacker

Arduino Enthusiast

@maasg

https://github.com/maasg

https://www.linkedin.com/in/gerardmaas/

https://stackoverflow.com/users/764040/maasg

DATA SCIENCE GOVERNANCE

Adalog helps enterprises ensure that data pipelines continually deliver their value by combining the contextual information from when the pipeline was created with the evolving environment where the pipelines execute.

CONNECT - COLLECT - LEARN

Language Classification

Some inspiration...

What is a language? How is it composed?

Letter Frequency

Could we characterize a language by calculating the relative frequency of letters in some text?

Spanish vs English letter frequency
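A minimal sketch of the letter-frequency idea, written in plain Python rather than the Scala and Spark code used in the notebooks so that it stays self-contained. The function name `letter_frequencies` and the two sample sentences are assumptions made for illustration; they do not come from the talk.

```python
from collections import Counter


def letter_frequencies(text: str) -> dict[str, float]:
    """Return the relative frequency of each letter found in `text`."""
    # Keep alphabetic characters only, ignoring case, digits, spaces, punctuation.
    letters = [ch for ch in text.lower() if ch.isalpha()]
    total = len(letters)
    if total == 0:
        return {}
    counts = Counter(letters)
    return {letter: n / total for letter, n in counts.items()}


# Hypothetical sample sentences, only to show the call.
english = letter_frequencies("The quick brown fox jumps over the lazy dog")
spanish = letter_frequencies("El veloz murciélago hindú comía feliz cardillo y kiwi")
```

Each result maps a letter to its share of all letters in the text, so the values of a non-empty result sum to 1 and two languages can be compared letter by letter.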

n-grams

"cavnar and trenkle"

bi-grams: ca,av,vn,na,ar,r_,_a,an,nd,d_,_t,tr,re,en,nk,kl,le,e_

tri-grams: cav,avn,vna,nar,ar_,r_a,_an,and,nd_,d_t,_tr,tre,ren,enk,nkl,kle,le_

quad-grams: cavn,...

http://odur.let.rug.nl/~vannoord/TextCat/textcat.pdf

Could we characterize a language by calculating the relative frequency of sequences of letters in some text?
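The n-gram lists above can be reproduced with a short sketch in plain Python. The padding convention (spaces become `_`, with one trailing `_`) is inferred from the slide's example and is an assumption; the function name `ngrams` is also hypothetical.

```python
def ngrams(text: str, n: int) -> list[str]:
    """Return the character n-grams of `text`, using '_' for word boundaries."""
    # Assumption: collapse whitespace to a single '_' and add one trailing '_',
    # which matches the bi-gram and tri-gram lists shown on the slide.
    padded = "_".join(text.lower().split()) + "_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


bigrams = ngrams("cavnar and trenkle", 2)
trigrams = ngrams("cavnar and trenkle", 3)
```

Counting how often each n-gram occurs in a larger text then gives the sequence-frequency profile that the question above asks about.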

Spark APIs

RDD -> Resilient Distributed Datasets

- Lazy, functional-oriented, low-level API
- Basis for execution of all high-level libraries

DataFrames

- Column-oriented, SQL-inspired DSL
- Many optimizations under the hood (Catalyst, Tungsten)

Dataset

- Best of both worlds (except …)

Spark Notebook

A dynamic and visual web-based notebook for Spark with Scala

Spark Notebook - Open Source Roadmap

Git, Kerberos, Project Generator

Q1 Q2 Q3

Announcements: blog.kensu.io

Notebooks

Notebooks for this presentation are located at:

https://github.com/maasg/spark-notebooks

- have fun!

Notebook 1: Naive Language Classification

https://github.com/maasg/spark-notebooks/languageclassification/language-detection-letter-freq.snb

Implements the idea of using a letter frequency model to classify the language of a document.

Uses the dataset found in https://github.com/maasg/spark-notebooks/languageclassification/data/

It produces a training set of sampled strings that will also be used for the n-gram classifier.

(Note: this notebook is missing a function that’s left as an exercise to the reader. The folder /solutions contains the full working version.)
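One way a letter-frequency classifier could work is sketched below in plain Python. This is not the notebook's code or its exercise solution: the Euclidean distance, the function names, and the two tiny training sentences are all assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter


def letter_frequencies(text: str) -> dict[str, float]:
    """Return the relative frequency of each letter found in `text`."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    total = len(letters)
    if total == 0:
        return {}
    return {letter: n / total for letter, n in Counter(letters).items()}


def distance(p: dict[str, float], q: dict[str, float]) -> float:
    """Euclidean distance between two frequency profiles (an assumed metric)."""
    letters = set(p) | set(q)
    return math.sqrt(sum((p.get(l, 0.0) - q.get(l, 0.0)) ** 2 for l in letters))


def classify(text: str, profiles: dict[str, dict[str, float]]) -> str:
    """Return the language whose profile is closest to the text's profile."""
    observed = letter_frequencies(text)
    return min(profiles, key=lambda lang: distance(observed, profiles[lang]))


# Hypothetical, deliberately tiny training samples; a real model needs far more text.
samples = {
    "en": "the quick brown fox jumps over the lazy dog while the sun is shining",
    "es": "el rápido zorro marrón salta sobre el perro perezoso mientras brilla el sol",
}
profiles = {lang: letter_frequencies(text) for lang, text in samples.items()}
```

Calling `classify(some_text, profiles)` returns the key of the nearest profile. With training samples this small the result on unseen text is unreliable, which is part of why the approach is called naive.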

Notebook 2: n-gram Language Classification

https://github.com/maasg/spark-notebooks/languageclassification/n-gram-language-classification.snb

Implements the n-gram algorithm described in the Cavnar and Trenkle paper (the TextCat link above).

Uses the dataset found in https://github.com/maasg/spark-notebooks/languageclassification/data/

Uses the resulting classifier to implement a custom Spark ML Transformer that can be easily used to classify new texts. Transformers can be combined into Spark ML Pipelines of arbitrary complexity.

(Note: this notebook is missing a function that’s left as an exercise to the reader. The folder /solutions contains the full working version.)
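The rank-based comparison from the Cavnar and Trenkle paper can be sketched in plain Python as follows. This is an illustration and not the notebook's Spark ML Transformer or its exercise solution: the function names, the `max_n=3` and `top=300` defaults, the trailing-only padding, and the toy training sentences are assumptions.

```python
from collections import Counter


def ngrams(text: str, n: int) -> list[str]:
    """Character n-grams with '_' marking word boundaries (assumed padding)."""
    padded = "_".join(text.lower().split()) + "_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def profile(text: str, max_n: int = 3, top: int = 300) -> dict[str, int]:
    """Map each of the `top` most frequent n-grams (n = 1..max_n) to its rank."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(text, n))
    # Sort by descending count; break ties alphabetically so ranks are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top]
    return {gram: rank for rank, (gram, _) in enumerate(ranked)}


def out_of_place(doc: dict[str, int], lang: dict[str, int]) -> int:
    """Sum of rank differences; n-grams missing from `lang` get a maximum penalty."""
    max_penalty = len(lang)
    return sum(
        abs(rank - lang[gram]) if gram in lang else max_penalty
        for gram, rank in doc.items()
    )


def classify(text: str, lang_profiles: dict[str, dict[str, int]]) -> str:
    """Return the language whose profile has the smallest out-of-place distance."""
    doc = profile(text)
    return min(lang_profiles, key=lambda name: out_of_place(doc, lang_profiles[name]))


# Hypothetical toy training text; real profiles are built from much larger samples.
samples = {
    "en": "the quick brown fox jumps over the lazy dog while the sun is shining",
    "es": "el rápido zorro marrón salta sobre el perro perezoso mientras brilla el sol",
}
lang_profiles = {name: profile(text) for name, text in samples.items()}
```

Ranks are compared instead of raw frequencies, so documents of very different lengths can be matched against the same language profiles.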
