Big Data Lab Course (Big Data Praktikum)

Summer Semester 2018

Universität Leipzig, Institute of Computer Science

Database Group

Prof. Dr. E. Rahm

Goal: Design and implementation of an application / an algorithm using existing Big Data frameworks

Schedule

The whole group must attend all review milestones (Testate)

By early May: first meeting with the supervisor (request an appointment by email)

End of May: Testat 1: getting to know the system / data import / solution sketch

Mid/end of July: Testat 2: present implementation and results

Early August: Testat 3: presentation

15 minutes per group

Attendance is mandatory for all lab course participants

Organization

Source code: GitHub repository, group members => collaborators

Repositories are forked to https://github.com/leipzig-bigdata-lab after the lab course

Java: Apache Maven 3 for project management

Test-driven development is encouraged; see the unit test documentation of the respective framework

Source code documentation is mandatory!

Use stable versions (consult your supervisor if in doubt), e.g. Flink 1.4.2

Solutions that run locally can be executed on a dedicated cluster; arrange a date in early July with [email protected]

Data sets

e.g. https://github.com/caesar0301/awesome-public-datasets

Technical details

PPRL with Bloom Filters

Projects:

1.) Analyzing different BitSet Implementations for Bloom-Filter-based PPRL

2.) Analyzing different lengths of Bloom Filters

3.) Analyzing XOR-Folding for Bloom Filters

Privacy-Preserving Record Linkage (PPRL)

Find records in different databases that refer to the same real-world object

No disclosure of sensitive personal information

Privacy technique: Bloom Filter
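To make the Bloom filter technique more concrete, the following is a minimal sketch of how a field value could be encoded and compared. It assumes character bigrams as tokens, a double-hashing scheme built from MD5 and SHA-1, and the Dice coefficient as similarity measure; the class name, method names and these choices are illustrative assumptions and are not prescribed by the slides.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;

public class BloomFilterSketch {

    // Encodes all character bigrams of a value into a Bloom filter with m bits.
    // The k hash positions are derived by double hashing: (h1 + j * h2) mod m.
    static BitSet encode(String value, int m, int k) {
        BitSet filter = new BitSet(m);
        String padded = "_" + value.toLowerCase() + "_"; // pad so first/last letters form bigrams
        for (int i = 0; i + 2 <= padded.length(); i++) {
            String bigram = padded.substring(i, i + 2);
            int h1 = hash(bigram, "MD5");
            int h2 = hash(bigram, "SHA-1");
            for (int j = 0; j < k; j++) {
                filter.set(Math.floorMod(h1 + j * h2, m));
            }
        }
        return filter;
    }

    // Uses the first four digest bytes as an int hash value.
    static int hash(String s, String algorithm) {
        try {
            byte[] d = MessageDigest.getInstance(algorithm)
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            return ((d[0] & 0xff) << 24) | ((d[1] & 0xff) << 16)
                    | ((d[2] & 0xff) << 8) | (d[3] & 0xff);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Dice coefficient of two filters: 2 * |a AND b| / (|a| + |b|).
    static double dice(BitSet a, BitSet b) {
        BitSet common = (BitSet) a.clone();
        common.and(b);
        int total = a.cardinality() + b.cardinality();
        return total == 0 ? 0.0 : 2.0 * common.cardinality() / total;
    }

    public static void main(String[] args) {
        BitSet meyer = encode("meyer", 1000, 10);
        BitSet meier = encode("meier", 1000, 10);
        BitSet schmidt = encode("schmidt", 1000, 10);
        System.out.println("meyer vs meier:   " + dice(meyer, meier));
        System.out.println("meyer vs schmidt: " + dice(meyer, schmidt));
    }
}
```

Similar names share most of their bigrams and therefore most of their set bits, so they can be matched on the encoded filters without exchanging the plain-text values.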

Analyzing different BitSet Implementations for Bloom-Filter-based PPRL

Martin Franke

BitSet Implementations for Bloom Filters

Problems: Different BitSet implementations usable as basis for Bloom Filter

java.util.BitSet

OpenBitSet

boolean[]

No or outdated benchmarks

Task: Development of three Bloom Filter implementations

Performance benchmark (runtime, memory)

Proof of claims, e.g. “OpenBitSet is faster than java.util.BitSet in most operations and *much* faster at calculating cardinality of sets and results of set operations.”

Technologies: Java

Apache Flink
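A runtime comparison of the kind described above could start from a sketch like the following. It only compares java.util.BitSet with boolean[] for the cardinality operation; OpenBitSet is left out because it needs an external library. The filter length, filter count, round count and class name are assumptions for illustration. A naive timing loop like this ignores JIT warm-up effects, so a benchmark harness such as JMH would be the more reliable choice for the actual evaluation.

```java
import java.util.BitSet;
import java.util.Random;

public class CardinalityBenchmarkSketch {

    // Counts the set bits of a boolean[]-based Bloom filter.
    static int cardinality(boolean[] bits) {
        int count = 0;
        for (boolean bit : bits) {
            if (bit) count++;
        }
        return count;
    }

    // Copies a boolean[]-based filter into a java.util.BitSet.
    static BitSet toBitSet(boolean[] bits) {
        BitSet set = new BitSet(bits.length);
        for (int i = 0; i < bits.length; i++) {
            if (bits[i]) set.set(i);
        }
        return set;
    }

    public static void main(String[] args) {
        int m = 1000, filters = 10_000, rounds = 50; // assumed sizes
        Random rnd = new Random(42);
        boolean[][] arrays = new boolean[filters][m];
        BitSet[] bitSets = new BitSet[filters];
        for (int i = 0; i < filters; i++) {
            for (int j = 0; j < m; j++) arrays[i][j] = rnd.nextInt(4) == 0;
            bitSets[i] = toBitSet(arrays[i]);
        }

        long sumBitSet = 0, sumArray = 0;
        long t0 = System.nanoTime();
        for (int r = 0; r < rounds; r++)
            for (BitSet b : bitSets) sumBitSet += b.cardinality();
        long t1 = System.nanoTime();
        for (int r = 0; r < rounds; r++)
            for (boolean[] a : arrays) sumArray += cardinality(a);
        long t2 = System.nanoTime();

        System.out.printf("java.util.BitSet: %d ms%n", (t1 - t0) / 1_000_000);
        System.out.printf("boolean[]:        %d ms%n", (t2 - t1) / 1_000_000);
        // Both variants must count the same bits, otherwise the comparison is invalid.
        System.out.println("checksums equal:  " + (sumBitSet == sumArray));
    }
}
```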

Analyzing different lengths of Bloom Filters

Marcel Gladbach

Lengths of Bloom Filters

Problems: PPRL applications use a given length of Bloom Filters for encoded records (usually 1000)

Better performance is expected with shorter Bloom Filters

But how does the length of the Bloom Filter affect the quality of the results?

Task: Encoding of given data sets with different parameters:

Lengths of Bloom Filters

Number of hash functions

Evaluation of quality (recall, precision) of PPRL processing based on parameters

Exploration of practical boundaries

Technologies: Java

Apache Flink
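The quality measures named above can be computed from the set of predicted matches and the set of true matches. The following minimal sketch assumes that each record pair is represented by a string key such as "a1-b1"; that representation and the class name are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class LinkageQualitySketch {

    // Number of predicted pairs that are also true matches.
    static int truePositives(Set<String> predicted, Set<String> truth) {
        Set<String> tp = new HashSet<>(predicted);
        tp.retainAll(truth);
        return tp.size();
    }

    // Precision: share of predicted matches that are correct.
    static double precision(Set<String> predicted, Set<String> truth) {
        if (predicted.isEmpty()) return 0.0;
        return (double) truePositives(predicted, truth) / predicted.size();
    }

    // Recall: share of true matches that were found.
    static double recall(Set<String> predicted, Set<String> truth) {
        if (truth.isEmpty()) return 0.0;
        return (double) truePositives(predicted, truth) / truth.size();
    }

    public static void main(String[] args) {
        Set<String> predicted = new HashSet<>(Arrays.asList("a1-b1", "a2-b2", "a3-b7"));
        Set<String> truth = new HashSet<>(Arrays.asList("a1-b1", "a2-b2", "a4-b4", "a5-b5"));
        System.out.println("precision = " + precision(predicted, truth)); // 2 of 3 predictions correct
        System.out.println("recall    = " + recall(predicted, truth));    // 2 of 4 true matches found
    }
}
```

Running the same evaluation for each combination of filter length and number of hash functions gives the data points needed to explore the practical boundaries.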

Analyzing XOR-Folding for Bloom Filters

Ziad Sehili

XOR-Folding for Bloom Filters

Problems: The goal of PPRL is to hide personal data in the matching process by encoding the fields in a Bloom Filter. BUT some cryptanalysis methods can disclose the original data

The main weakness of Bloom Filters is the frequency of some tokens (“er” is a frequent bigram, so many Bloom Filters will have the same positions set to 1).

Is it possible to hide or obfuscate these frequencies by XOR-folding the Bloom Filter?

What is the impact of the folding operation on the linkage quality?

Task: Implementation of some folding operations

Evaluation of quality (recall, precision)

Technologies: Java
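One possible folding operation is sketched below, assuming the common variant in which a filter of even length m is split into two halves that are combined with a bitwise XOR, giving a filter of length m/2. The class and method names are hypothetical, and other folding schemes are equally valid candidates for the task.

```java
import java.util.BitSet;

public class XorFoldingSketch {

    // Folds a Bloom filter of even length m into m/2 bits:
    // bit i of the result is bit i XOR bit (i + m/2) of the input.
    static BitSet xorFold(BitSet filter, int m) {
        if (m % 2 != 0) {
            throw new IllegalArgumentException("filter length must be even");
        }
        int half = m / 2;
        BitSet folded = filter.get(0, half); // copy of the first half
        folded.xor(filter.get(half, m));     // second half, shifted to index 0
        return folded;
    }

    public static void main(String[] args) {
        BitSet filter = new BitSet(8);
        filter.set(0);
        filter.set(3);
        filter.set(4);
        // Halves {0, 3} and {0} cancel at position 0 and keep position 3.
        System.out.println(xorFold(filter, 8)); // prints {3}
    }
}
```

Because two set bits at corresponding positions cancel out, the folded filter no longer exposes each frequent token position directly; how much linkage quality is lost in exchange is the subject of the evaluation.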

OSTMap

Open Source Tweet Map

Matthias Kricke & Martin Grimmer

• https://github.com/IIDP/OSTMap

• OSTMap development started as a project at the IT-Ringvorlesung 2016.

• A team of six students (with some help from two big data experts) implemented OSTMap over a period of 6 weeks.

• OSTMap reads geotagged data from the Twitter stream.

• We store tweets in a Hadoop cluster running Apache Accumulo and Apache Flink.

OSTMap - Open Source Tweet Map

Efficient Term Index for Twitter Data and Trend Visualization

• Part 1:

• Currently the term search supports lookups for exactly one term, e.g. “bigdata”

• We want to support fast multi-term queries like: “the” “white” “house”

• Key word: Document-Partitioned Indexing

• Part 2:

• We want to visualize current trends…

• With their geographic distribution and

• Their temporal spread.
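The idea behind the key word "document-partitioned indexing" from Part 1 can be illustrated with a small in-memory sketch: tweets are assigned to partitions, every partition keeps its own complete term index for its tweets, and a multi-term query is answered locally in each partition before the partial results are merged. This is a simplified illustration in plain Java, not the Accumulo table layout used by OSTMap; the class names and the modulo-based partitioning are assumptions.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class DocPartitionedIndexSketch {

    // One partition: an inverted index over the tweets assigned to it.
    static final class Partition {
        private final Map<String, Set<Long>> postings = new HashMap<>();

        void add(long tweetId, String text) {
            for (String term : text.toLowerCase().split("\\W+")) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(term, t -> new TreeSet<>()).add(tweetId);
                }
            }
        }

        // Returns the tweets of this partition that contain ALL given terms.
        Set<Long> search(String... terms) {
            Set<Long> result = null;
            for (String term : terms) {
                Set<Long> ids = postings.getOrDefault(term.toLowerCase(),
                        Collections.<Long>emptySet());
                if (result == null) result = new TreeSet<>(ids);
                else result.retainAll(ids);
            }
            return result == null ? new TreeSet<Long>() : result;
        }
    }

    private final Partition[] partitions;

    DocPartitionedIndexSketch(int numPartitions) {
        partitions = new Partition[numPartitions];
        for (int i = 0; i < numPartitions; i++) partitions[i] = new Partition();
    }

    // A tweet lives in exactly one partition, chosen by its id.
    void add(long tweetId, String text) {
        int p = (int) Math.floorMod(tweetId, (long) partitions.length);
        partitions[p].add(tweetId, text);
    }

    // Each partition intersects its own posting lists; the results are merged.
    Set<Long> search(String... terms) {
        Set<Long> all = new TreeSet<>();
        for (Partition p : partitions) all.addAll(p.search(terms));
        return all;
    }

    public static void main(String[] args) {
        DocPartitionedIndexSketch index = new DocPartitionedIndexSketch(2);
        index.add(1L, "The White House press briefing");
        index.add(2L, "white house");
        index.add(3L, "the house next door");
        System.out.println(index.search("the", "white", "house")); // prints [1]
    }
}
```

Since all terms of a tweet are stored in the same partition, the intersection of posting lists never has to cross partition boundaries, which is what makes multi-term queries fast.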

Sentiment Analysis for Twitter Data

• Part 1:

• Use of Java-based libraries for in-stream sentiment analysis of Twitter data

• Batch-based sentiment analysis, e.g. with Spark MLlib's Naive Bayes classifier

• Write the results of each approach to a table for sentiment analysis results

• Part 2:

• Build a frontend in OSTMap in which users decide the sentiment of randomly drawn tweets

• Use this information for a quality analysis of the sentiment analysis procedures and visualize the results in OSTMap

Polyglot DB

Johannes Zschache

Polyglot DB

• Different applications require different types of databases: relational, key-value, document, graph, …

• In practice: several types are used at the same time

• Advantage: the optimal DB for each use case

• Example:

• Relational: safety, homogeneous data

• Key-value: fast access, simple data structure

• Document: flexible schema, search functions

• Graph: relationships, traversal

• Task: What advantage does a graph database offer over a document database?

Application

• Yelp Dataset

• Document DB: MongoDB

• Information about businesses

• Storage of the reviews

• Search by category

• Geospatial query

• Recommendations: similar restaurants, e.g. which restaurants were rated equally well/badly by the same reviewer?

• Is a graph database (Neo4j) faster than MongoDB?

• Even with synchronization?

Bolt-on causal consistency

Johannes Zschache

Bolt-on causal consistency

• Causal consistency is a compromise between sequential consistency and eventual consistency

• The order of operations is preserved, but only for causally related operations (happened-before relation)

• Less coordination increases availability

• Strongest consistency that still allows availability (especially of write operations) despite network partitions

• Only a few NoSQL DBs support causal consistency

• Bolt-on = client-side implementation

• Paper: Bailis et al. (2013), http://www.bailis.org/papers/bolton-sigmod2013.pdf

• Prototype (GitHub): Java, Cassandra

Task

• Implementation with JavaScript, PouchDB and CouchDB
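The happened-before relation mentioned above can be illustrated with vector clocks, one common way to decide whether two operations are causally related or concurrent. The sketch below is a generic illustration in Java and makes no claim to reproduce the mechanism of the Bailis et al. paper or of its prototype; the class name and the client ids are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class VectorClockSketch {

    // a happened before b if every counter in a is <= the counter in b
    // and at least one counter is strictly smaller. Missing entries count as 0.
    static boolean happenedBefore(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> clients = new HashSet<>(a.keySet());
        clients.addAll(b.keySet());
        boolean strictlySmaller = false;
        for (String client : clients) {
            int x = a.getOrDefault(client, 0);
            int y = b.getOrDefault(client, 0);
            if (x > y) return false;
            if (x < y) strictlySmaller = true;
        }
        return strictlySmaller;
    }

    // Neither operation happened before the other (also true for identical clocks),
    // so a causally consistent store may apply them in any order.
    static boolean concurrent(Map<String, Integer> a, Map<String, Integer> b) {
        return !happenedBefore(a, b) && !happenedBefore(b, a);
    }

    public static void main(String[] args) {
        Map<String, Integer> write1 = new HashMap<>();
        write1.put("client1", 1);

        Map<String, Integer> reply = new HashMap<>(write1); // client2 saw write1 first
        reply.put("client2", 1);

        Map<String, Integer> unrelated = new HashMap<>();
        unrelated.put("client3", 1);

        System.out.println(happenedBefore(write1, reply)); // true: must be ordered
        System.out.println(concurrent(write1, unrelated)); // true: no ordering required
    }
}
```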

Creation and visualization of temporal graphs

Christopher Rost

Creation and visualization of temporal graphs

[1] T. Aynaud, E. Fleury, J.-L. Guillaume, and Q. Wang, "Communities in Evolving Networks: Definitions, Detection, and Analysis Techniques", Modeling and Simulation in Science, Engineering and Technology, Vol. 2 (2013), pp. 159-200. doi:10.1007/978-1-4614-6729-8_9.

“Graphs are everywhere”: friendship networks on Facebook, community interactions on Stack Overflow, video likes and channel subscriptions on YouTube, citation networks

Real-world graphs change over time: additions, deletions and updates of edges, vertices and their properties

Much work has been done to analyze and visualize static graphs

“How do communities evolve over a specific time range?”

“At which time does the number of citations grow rapidly? Did other citations influence that?”

Creation and visualization of temporal graphs

Tasks

Create a temporal EPGM from a network dataset

Query graph data by time range

Visualize the graph in an interactive web application
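Querying graph data by time range can be sketched as follows, assuming that every edge carries a half-open validity interval [validFrom, validTo). The classes and field names are hypothetical and use plain Java; they are not the Gradoop or EPGM API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TemporalGraphSketch {

    // An edge that exists during the half-open interval [validFrom, validTo).
    static final class TemporalEdge {
        final String source;
        final String target;
        final long validFrom;
        final long validTo;

        TemporalEdge(String source, String target, long validFrom, long validTo) {
            this.source = source;
            this.target = target;
            this.validFrom = validFrom;
            this.validTo = validTo;
        }

        // True if the edge is valid at some point within [from, to).
        boolean overlaps(long from, long to) {
            return validFrom < to && from < validTo;
        }
    }

    // Returns all edges that exist at some point within the queried time range.
    static List<TemporalEdge> snapshot(List<TemporalEdge> edges, long from, long to) {
        List<TemporalEdge> result = new ArrayList<>();
        for (TemporalEdge e : edges) {
            if (e.overlaps(from, to)) result.add(e);
        }
        return result;
    }

    public static void main(String[] args) {
        List<TemporalEdge> edges = Arrays.asList(
                new TemporalEdge("a", "b", 0, 10),
                new TemporalEdge("b", "c", 10, 20),
                new TemporalEdge("c", "d", 5, 15));
        for (TemporalEdge e : snapshot(edges, 0, 10)) {
            System.out.println(e.source + " -> " + e.target); // a -> b, then c -> d
        }
    }
}
```

The vertices of a snapshot can be derived from the endpoints of the selected edges, and the resulting subgraph is what the web application would render for the chosen time range.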

[2] A. Beveridge and J. Shan, "Network of Thrones", Math Horizons Magazine, Vol. 23, No. 4 (2016), pp. 18-22.

[3] A. Paranjape, A. R. Benson, and J. Leskovec, "Motifs in Temporal Networks", in Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, 2017.

Stack Overflow temporal network dataset [3]: 2,601,977 nodes, 63,497,050 edges

[Figure: size of graph over time, from 1990 to now]

FastText on Spark

Victor Christen

FastText on Spark

Word2Vec

• Words are represented by a vector

• Trained on a large corpus, considering the context of words

Skip-gram Model

FastText on Spark

Issues

• Unknown words in test corpus: missing fuzzy component

Solution

• FastText

• Using n-gram sequences for representing words

• Utilized to generate embeddings even for words that are not included in the vocabulary

Task

• Understanding FastText

• Representation of words

• Neural Network

• Implementation with DeepLearning4j
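The n-gram representation of words can be sketched as follows. The example assumes the convention of wrapping a word in the boundary markers "<" and ">" before extracting character n-grams; the class name and the chosen n-gram lengths are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

public class CharNgramSketch {

    // Extracts all character n-grams with minN <= n <= maxN from a word
    // that is wrapped in the boundary markers '<' and '>'.
    static List<String> charNgrams(String word, int minN, int maxN) {
        String wrapped = "<" + word + ">";
        List<String> ngrams = new ArrayList<>();
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= wrapped.length(); i++) {
                ngrams.add(wrapped.substring(i, i + n));
            }
        }
        return ngrams;
    }

    public static void main(String[] args) {
        // prints [<wh, whe, her, ere, re>]
        System.out.println(charNgrams("where", 3, 3));
    }
}
```

A word that never occurred in the training corpus still shares many of its n-grams with known words, so an embedding can be composed from the vectors of those shared n-grams.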

Distributed FastText on TensorFlow

Issues

• Unknown words in test corpus: missing fuzzy component

Solution

• FastText

• Using n-gram sequences for representing words

• Utilized to generate embeddings

Task

• Distributed Implementation of FastText on TensorFlow

Color Recognition of Products

Eric Peukert

Color Recognition of Products

• Images can be very helpful for identifying duplicates in product catalogs

• Goal:

• Extraction of the color information from product images

• Segmentation and annotation of foreground and background, possibly other categories

• Technology:

• Convolutional neural networks (usable e.g. via TensorFlow)

• Data

• 90,000 product images from WebDataSolutions GmbH

Analytics of Bitcoin Transaction Data

Eric Peukert

Analytics of Bitcoin Transaction Data

• Parsing of the Bitcoin blockchain

• Processing of updates caused by new transactions

• Creation of a graph in Gradoop

• Analysis using Gradoop

• Max. 2 students

• with good Java programming skills

• Flink experience or the lecture Cloud Data Management as a prerequisite

[Figure: Analytical Workflows]

Webgraph Analysis with GRADOOP

Moritz Wilke

Webgraph Analysis with GRADOOP

• commoncrawl.org: quarterly snapshots of the web graph at host level

• Questions:

• How is {the University of Leipzig, the Bach Digital project} interlinked with other institutions and research projects?

• How did this change over time?

• Are there interesting structures or missing links (e.g. triangle closing)?

• Tasks:

• Data Import to GRADOOP, Preprocessing

• Data exploration

• Development of analytical questions

• Data Analysis with GRADOOP operators

• Visualization / Reporting

Topic | Framework | # Students | Supervisor
PPRL: Analyzing different BitSet Implementations for Bloom-Filter-based PPRL | Java / Apache Flink | 2 | Franke
PPRL: Analyzing different lengths of Bloom Filters | Java / Apache Flink | 2 | Gladbach
PPRL: Analyzing XOR-Folding for Bloom Filters | Java / Apache Flink | 2 | Sehili
OSTMap: Efficient Term Index for Twitter Data and Trend Visualization | Java / Apache Flink / Apache Accumulo / JavaScript | 2 | Grimmer
OSTMap: Sentiment Analysis for Twitter Data | Java / Apache Flink / Apache Accumulo / JavaScript | 2 | Kricke
Creation and visualization of temporal graphs | Java / Apache Flink / Gradoop / JavaScript | 2 | Rost
Polyglot DB | Java, MongoDB, Neo4j | 2 | Zschache
Bolt-on causal consistency | JavaScript, CouchDB, PouchDB | 2 | Zschache
FastText on Spark | Spark, DeepLearning4j | 2 | Christen
Distributed FastText on TensorFlow | TensorFlow | 2 | Alkhouri
Color Recognition of Products | TensorFlow | 2 | Peukert
Analysis of the Bitcoin Blockchain | Java / Apache Flink / Gradoop | 2 | Peukert
Webgraph Analysis | Java / Apache Flink / Gradoop / (JavaScript) | 2 | Wilke