Big Data Analytics for - uni-jena.de · project and its popularity as a big data processing engine...

Prof. Dr. Taysir Hassan A. Soliman Vice Dean for Graduate Studies & Research Faculty of Computers & Information, Assiut University Assiut University BioDialog PI Nov. 16, 2016 Big Data Analytics for BioDiversity




Outline

• About Assiut
• Assiut University
• Faculty of Computers & Information
• Research Interests
• Biodiversity Informatics Previous Activities at Assiut University
• Visits and examples of Biodiversity in Egypt
• Big data research and biodiversity



A Few Pictures From Assiut City


The Dam

A Walk beside the Nile

Assiut University Entrance

The Nile


Assiut University Map


Assiut University

• Assiut University was established in October 1957 as the first university in Upper Egypt, with the aim of preparing highly qualified graduates who have basic specialized academic knowledge and training in the necessary skills.


Faculties & Institutes

• Faculties: 18
• Institutes: 2 (Sugar Industry, Oncology Institute)
• International Students: Yemen, Malaysia, Kuwait, Iraq

http://www.aun.edu.eg/


Faculty of Computers & Information Assiut University


Lab Building Administrative Building

Established in 2001


Faculty of Computers & Information Assiut University (Staff)

• Information Systems (1 professor, 1 assistant professor, 4 teaching assistants, 7 demonstrators)

• Information Technology (1 professor, 3 assistant professors, 2 teaching assistants, 6 demonstrators)

• Computer Science (2 professors, 3 associate professors, 3 associate lecturers, 6 teaching assistants, 10 demonstrators)

• Multimedia Systems (1 associate professor)


Faculty of Computers & Information Assiut University (Facilities)

• Undergraduate labs: 9
• Lecture halls: 9
• Specialized labs: 5 (GIS, Multimedia, HP, Big Data, and Bioinformatics)
• Research labs: 5


Geographic Information System Labs

The GIS Lab consists of three modules:
• GIS Undergraduate Lab
• GIS Research Unit
• GIS Servers Unit


Geographic Information System Labs: Contents

• 20 computers (Dell OptiPlex 380; Intel Core 2 Duo, 2 GB of RAM)
• 1 plotter (HP Designjet T1200) for printing geographical maps
• 1 data show projector and its display board


Asyut Medical & Public Services Application

Clinics Medical Centers pharmacies

Medical Labs

Ambulance

Public Services


Multimedia Lab


Multimedia Production Unit

Multimedia Production Unit


(Voice Recording Unit)

Multimedia Research Unit


Bioinformatics Research Lab & Big Data Labs


Information Systems Dept. Research Directions

Big Data Analytics

BioDiversity Informatics

Database Management

Data Mining

Semantic Data Integration

Recommender Systems

Bioinformatics

GIS

Health Informatics


Computer Science Dept. Research Directions

Software Engineering

Distributed Computing

Computer Vision

Image Processing

High Performance Computing

Cloud Computing

Artificial Intelligence


Information Technology Dept. Research Directions

Ad Hoc Networks

Internet of Things

Mobile Computing

Vision and Robotics

Network Security

Cloud Computing

Broadcasting and media technologies


Biodiversity Informatics Previous Activities at Assiut University


BioDiversity Informatics Workshop at Faculty of Computers and Information, Assiut University

• Number of scientists: 34 (Faculty of Science, Computers and Information, Agriculture, EELU) and 17 (teaching assistants)
• Number of undergraduate students: 156
• Number of employees: 9
• A total of 216 attendees


BioDiversity Informatics Research Group

Prof. Dr. Taysir Hassan, Vice Dean for Graduate Studies & Research, Faculty of Computers & Information, Assiut University (PI)

Prof. Dr. Medhat Moreed, Vice Dean for Societal Services and Environmental Development, Faculty of Science, Assiut University

Prof. Dr. Adel AbuElmagd, Dean of Faculty of Computers & Information, Assiut University

Prof. Dr. Ahmed Moharam, Vice President of Fungi Research Institute, Assiut University


Marwa Hussein Assistant Lecturer Information Systems Department Faculty of Computers and Information Assiut University

Majid Askar Assistant Lecturer Computer Science Department Faculty of Computers and Information Assiut University

Dr. Ahmed Taloba Assistant Professor, IS Department, FCI, Assiut University

Dr. Ahmed Albanhawy Assistant Professor, Botany Department Faculty of Science, Suez Canal


From AinShams Workshop Sept. 2016

Wady El-Hetan


Why Big Data?

• We need big data to understand the distribution of biodiversity

• Once scientific data becomes essential, transparency will be a must (publications and accessibility) … Ecological data access

• Science-driven data
• In global ecology, we go with problems that


Global Environmental Changes

• Habitat loss and species extinction
• Where will animals move to survive?
• Will human development prevent them from getting there?

Solution: conservation strategies are a crucial step toward minimizing biodiversity loss.

• Ocean acidification and land use


Global BioDiversity and Human Health

Fresh Water

Infectious Diseases

Air Quality

Agriculture

Role of Plants Pharmaceuticals

WHO Report


• Measuring traits of individual organisms (nitrogen concentrations)

• Species distribution datasets (flora, fauna, geographic associations with museum data)


Questions?

• Is it a “data-driven” or a “knowledge-driven” science?

• What are examples of research questions we can solve by relating big data to biodiversity informatics?

• In which phases of the big data life cycle can we extract research questions for biodiversity informatics?


Example 1: Identify Biodiversity Hotspots

• It is widely acknowledged that biodiversity is much more than just the number of species in a region, and a conservation strategy cannot be based merely on the number of taxa present in an ecosystem.

• Therefore, the idea that strongly emerges is the need to reconsider conservation priorities and to move toward an interdisciplinary approach through the creation of science-policy partnerships.
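
To make the first point concrete, here is a minimal sketch that compares species richness (the plain species count) with the Shannon diversity index for two regions. The Shannon index is chosen here only as one well-known measure that also accounts for relative abundance; it is an assumption of this example, not the hotspot criterion used on the slide, and the survey data are hypothetical.

```python
import math
from collections import Counter


def species_richness(observations):
    """Number of distinct species among the observed species names."""
    return len(set(observations))


def shannon_index(observations):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over species proportions p_i."""
    counts = Counter(observations)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


# Hypothetical survey data: both regions contain the same four species,
# but individuals are spread evenly in one and dominated by one species in the other.
region_even = ["a"] * 25 + ["b"] * 25 + ["c"] * 25 + ["d"] * 25
region_skewed = ["a"] * 97 + ["b"] + ["c"] + ["d"]

for label, region in (("even", region_even), ("skewed", region_skewed)):
    print(label, species_richness(region), round(shannon_index(region), 3))
```

Both regions have the same richness, yet the evenly distributed region scores a higher Shannon index, which is why a species count alone can be a misleading basis for conservation priorities.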


Is it just point distributions? Have a HYPOTHESIS


Other Examples

IUCN Red Lists?


Other Examples


Biodiversity Data Characteristics

• Voluminous
• Incremental
• Complex
• Scalability
• Heterogeneity
• Has a taxonomy type
• Distribution: the Global Biodiversity Information Facility (GBIF) currently holds over 577 million occurrence records in the areas of climate change, human health, food and security, biofuels, and ecosystem services.
• Genetic/genomic information: environmental genomics, including metagenomics and metabarcoding


Heterogeneous Data Types


Technical & Non-technical Priority Areas for Biodiversity Informatics Research

Technical Priority Areas:
• Deep analysis … to improve data understanding;
• Optimized architectures for analytics of data-at-rest and data-in-motion;
• Mechanisms for managing privacy … to enable the vast amounts of data which are not open data (and never can be open data) to be part of the Data Value Chain;
• Advanced visualization and user experience;
• Data management engineering.

Non-technical Priority Areas:
• Skills development;
• Business models and ecosystems;
• Policy, regulation and standardization;
• Social perceptions.


Big Data Analytics Life Cycle


[Figure: Scientific data management life cycle with the stages Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze, and Publish, plus a metadata label]

Presenter notes: The data life cycle is a conceptual tool that helps to understand the different steps that data follow from data generation to knowledge creation. Different versions of the data life cycle exist, with or without publication and data submission.

Scientist

Visualization

Visualization

E-Bird


Big Data Analytics Life Cycle

How do I assure the quality of my data?

How do I choose my algorithm?

Which type of architecture do I use?


iDigBio


iDigBio


Big Data Challenges for ML and EDA

• Format variation of the raw data
• Noisy and poor quality data
• Fast moving streaming data
• Trustworthiness of the data analysis
• Highly distributed input sources
• High dimensionality
• Scalability of algorithms
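
As an illustration of the "noisy and poor quality data" challenge, the following minimal sketch flags common problems in a single occurrence record before it enters an analysis. The field names follow Darwin Core terms (`scientificName`, `decimalLatitude`, `decimalLongitude`); the specific rules and the function name `check_occurrence` are assumptions made for this example, not a prescribed quality-assurance procedure.

```python
def check_occurrence(record):
    """Return a list of quality problems found in one occurrence record (a dict)."""
    problems = []

    # A record without a usable name cannot be linked to any taxon.
    name = (record.get("scientificName") or "").strip()
    if not name:
        problems.append("missing scientificName")

    # Coordinates often arrive as strings, empty values, or free text.
    try:
        lat = float(record.get("decimalLatitude"))
        lon = float(record.get("decimalLongitude"))
    except (TypeError, ValueError):
        problems.append("coordinates missing or not numeric")
        return problems

    if not -90 <= lat <= 90:
        problems.append("latitude out of range")
    if not -180 <= lon <= 180:
        problems.append("longitude out of range")
    if lat == 0 and lon == 0:
        problems.append("coordinates are 0,0 (likely a placeholder)")
    return problems


# Hypothetical records for demonstration.
good = {"scientificName": "Acacia nilotica", "decimalLatitude": "27.18", "decimalLongitude": "31.18"}
bad = {"scientificName": " ", "decimalLatitude": "95", "decimalLongitude": "31.18"}
print(check_occurrence(good))
print(check_occurrence(bad))
```

Returning a list of problems rather than a single pass/fail flag makes it possible to report which rule failed and how often across a whole dataset.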


Part I: Machine Learning Approaches

• One example is the usage of Deep Learning

• Deep learning algorithms lead to abstract representations because more abstract representations are often constructed based on less abstract ones.

• An important advantage of more abstract representations is that they can be invariant to the local changes in the input data.

• Learning such invariant features is an ongoing major goal in pattern recognition.


Example

An image is composed of different sources of variation such as light, object shapes, and object materials. The abstract representations provided by deep learning algorithms can separate the different sources of variation in data.


Example of A DNN

Learning the parameters in a deep architecture is a difficult optimization task, such as learning the parameters in neural networks with many hidden layers.


• Google’s “word2vec” tool is a technique for automated extraction of semantic representations from Big Data.

• This tool takes a large-scale text corpus as input and produces the word vectors as output.


Deep Learning

• Extracting complex patterns from massive volumes of data
• Semantic indexing
• Data tagging
• Fast information retrieval


Deep Learning in Biodiversity Distribution (Wildlife Monitoring)

• Affordable and effective measures of conservation outcomes.

• Improve the quality of conservation monitoring and scale monitoring programs to meet the global need.

• Extract meaningful information from the torrent of new sensor data, and improve the adaptive management of natural systems.


Case Studies

• Monitoring invasive species
• Detecting rare species
• Monitoring population through time
• Detecting bird vocalization
• Detecting fish underwater

Empower biologists to analyze petabytes of sensor data from a network of remote microphones and cameras.

This system, which is being used to monitor endangered species and ecosystems around the globe, has enabled an order of magnitude improvement in the cost effectiveness of such projects.

This approach can be expanded to encompass a greater variety of sensor sources, such as drones, to monitor animal populations and habitat quality, and to actively deter wildlife from hazardous structures.


Part II: The HOW-TO … Practice


Using Spark for BioDiversity Data

• Processing snapshots of biodiversity data providers’ entire datasets locally is an important capability.

• It allows broad questions to be asked across multiple data providers without waiting for providers to develop integrations or interfaces with each other.

• The providers’ web interfaces and application programming interfaces (APIs) no longer limit the way data is presented.

• Data can be processed at a much higher rate locally than through APIs.


Spark

• In 2014, Spark became an Apache Foundation top-level project and its popularity as a big data processing engine has taken off.

• It is much simpler to install and use this implementation of the map-reduce pattern of data processing than its industry-favorite predecessor, Hadoop.

• With Spark, arbitrary querying, joining, and reducing operations on and between entire biodiversity datasets can be done with very little code on a desktop computer or commonly available cloud computing resources.

• Machine Learning Library (MLlib)
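
The map-reduce pattern mentioned above can be sketched in a few lines of plain Python. This is deliberately not Spark code: it only mimics, on a single machine, the map, shuffle, and reduce stages that an engine such as Spark distributes across workers. The `family` field and the sample records are hypothetical.

```python
from collections import defaultdict
from functools import reduce


def map_phase(records):
    """Map: emit a (key, 1) pair for every occurrence record."""
    for record in records:
        yield record["family"], 1


def shuffle(pairs):
    """Shuffle: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single count."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


# Hypothetical occurrence records; a real snapshot would hold millions of rows.
occurrences = [
    {"family": "Fabaceae", "scientificName": "Acacia nilotica"},
    {"family": "Poaceae", "scientificName": "Phragmites australis"},
    {"family": "Fabaceae", "scientificName": "Vicia faba"},
]

counts = reduce_phase(shuffle(map_phase(occurrences)))
print(counts)
```

Because the map step treats each record independently and the reduce step only combines values that share a key, both stages can be split across many workers, which is what lets the same logic scale to entire datasets.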


iDigBio

• iDigBio: 44 million record datasets.

• Sparkonomy, an iDigBio tool, was developed to join tokenized taxon names from iDigBio to GBIF’s backbone taxonomy in a few minutes on a desktop computer.

• Effechecka from EOL is an early-phase web application that uses Spark jobs to construct checklists for taxon and spatial queries from iDigBio occurrence information.
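
The idea of joining tokenized taxon names to a backbone taxonomy can be illustrated with a small in-memory join. This sketch is not Sparkonomy's code and does not use the GBIF API: the tokenization rule (lowercase, keep the first two tokens), the function names, and the identifier `T1` are all assumptions chosen for illustration.

```python
def tokenize(name):
    """Lowercase a raw name and keep its first two tokens (genus and epithet)."""
    return tuple(name.lower().split()[:2])


def join_to_backbone(raw_names, backbone):
    """Left-join raw names to a backbone {name: taxon_id}; unmatched names map to None."""
    index = {tokenize(name): taxon_id for name, taxon_id in backbone.items()}
    return [(raw, index.get(tokenize(raw))) for raw in raw_names]


# Hypothetical backbone entry and raw names as they might appear in occurrence records.
backbone = {"Acacia nilotica": "T1"}
raw_names = ["Acacia nilotica (L.) Delile", "ACACIA  NILOTICA", "Unknownus plantus"]

for raw, taxon_id in join_to_backbone(raw_names, backbone):
    print(raw, "->", taxon_id)
```

Building the lookup index once and probing it per record is a hash join; a distributed engine applies the same idea across partitions, which is one reason such a join can finish quickly even on a desktop computer.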


Perform interactive analytics on observational scientific data

Grid or Many Task Software, Hadoop, Spark

Data Storage: HDFS, Hbase, File Collection

Streaming data for weather

Science Analysis Code, Mahout, R

Transport batch of data to primary analysis data system

Record Scientific Data in “field”

Local Accumulate and initial computing

Direct Transfer

Examples include Remote Sensing, Astronomy and Bioinformatics


References (1)

[1] J. Salle, K. J. Williams, and C. Moritz, “BioDiversity Analysis in the Digital Era,” Phil. Trans. R. Soc. B 371: 20150337.
[2] M. Collins, J. Poelen, A. Thompson, “Whole-Dataset Analysis using Apache Spark,” Missouri Botanical Garden Open Conference Systems, TDWG 2015 Annual Conference.
[3] C. Marchese, “Biodiversity Hotspots: A Shortcut for A More Complicated Concept,” Global Ecology and Conservation, Vol. 3, pp. 297-309, 2015.
[4] D. Klein, M. McKown, and B. Tershy, “Deep Learning for Large Scale BioDiversity Monitoring,” Bloomberg Data for Good Exchange Conference, 28-Sep-2015, New York City, NY, USA.
[5] M. Najafabadi, F. Villanustre, T. Khoshgoftaar, N. Seliya, R. Wald, and E. Muharemagic, “Deep learning applications and challenges in big data analytics,” Journal of Big Data.


References (2)

• https://bigdatacoursespring2015.appspot.com/preview
• http://bigdataopensourceprojects.soic.indiana.edu/
• http://dx.doi.org/10.1098/rstb.2015.0337
• http://www.gbif.org
