OpenSource Big Data Platform - Flamingo Project
Open Cloud Engine
Introduction and Case Study of Open Source Project Flamingo, the Big Data Platform
Open Cloud Engine Flamingo Project Leader Edward Kim ([email protected])
2014.03.01 v0.8
What is a Big Data Platform?
The roles of the Big Data Platform
• What are the main tasks that can be done on the Big Data platform?
  • Data mining, statistical analysis, log handling (collecting, pre-processing)
• Who does what on the platform?
  • Varies with users.
  • Most operators: come from a development background, so they focus on system management and log handling.
  • Analysts: focus on establishing a better environment for analyzing data.
• How many users are using the Big Data platform?
  • Lots of users → functionality of the platform and accessibility of the infrastructure are important.
  • The Big Data platform handles data → vulnerable. Hadoop is insecure.
• What am I? Operator? Architect? Developer? Data Scientist?
  • Depending on the role, the functions of the platform can be defined differently.
What Big Data Platform Must Provide
• SOFTWARE STACK
• INFRA MANAGEMENT AND MONITORING
• WORKFLOW
• ANALYSIS AND VISUALIZATION
• DASHBOARD
• SECURITY: ACCESS, AUTHENTICATION, AUTHORIZATION, ENCRYPTION, AUDITING, POLICY
• Batch job management and monitoring
• MapReduce-based parallel analysis programs
• Monitoring of user activities
• Policies on accessing resources and systems
• Various functions to improve accessibility to the infrastructure
Flamingo Project In Open Cloud Engine
• Takes advantage of web technology to make big data infrastructure and data convenient to use.
• Users can handle data easily.
• Provides functionality to do various jobs in one workspace.
• Analysis and processing MapReduce jobs can be reused.
• Open source oriented, and all systems are ready to go.
• Designed to be operator friendly.
• Supports the Hadoop ecosystem.
[Figure: Big Data Analysis and Service Platform, numbered steps 1 to 7. Labels: Browser, Designer, Search; Morphology Analysis, Analyze Graph, User Evaluation, Elect a leader; Log; Data Scientist, Service Planner, Data Analyst; Data users, Systems; Opinion Leader Score Board, Open API, Data Visualization Charts; Design a workflow; Collect; Request Service; Mobile Devices; Validation; Log Data; MapReduce Analysis Module. Reuse analyzed results: analyzed results are exposed through an Open API.]
Information Catalogue Search
Information            Security   Batch          Type
User Similarity        1          Daily, 4 PM    XML
Item Recommendation    2          Daily, 2 AM    JSON
Purchase Preference    3          Daily, 8 PM    XML/JSON
Opinion Leader         2          Daily, 7 AM    XML/JSON
Future of Big Data Platform
Flamingo Project
• Functionality matters the most in a Hadoop-based Big Data environment.
• Integrated open source projects are difficult to manage, and not enough UIs exist to handle them.
Flamingo Workbench
• Users can move freely within a workspace while conducting various jobs.
• Each window is dedicated to its own functionality.
• To minimize coding, reusable parts are componentized.
• The system is kept simple, and well-known frameworks are used so that new features are easy to add.
• The development method is standardized (tools, procedures, manuals, environments, …).
Flamingo Architecture
File System Browser
• Managing files is an integral part of Hadoop.
• A familiar Windows file explorer style UI gives users a better UX.
• Converts directories into Hive DBs or tables.
• Hive DBs and tables are marked with different icons in the browser.
FLAMINGO HAS OPTIMIZED FREQUENTLY NEEDED FUNCTIONS
File System Browser Enhancement
• Previewing files and their locations
• Restricting unauthorized users from viewing directories and files (doesn't come with Hadoop)
  • E.g. the /tmp directory is not visible to common users.
• Setting permissions on directories and files
• A home directory for each user (doesn't come with Hadoop)
• Setting a quota on directories
• Regularly dumping file system size info (for monitoring)
Audit Log
• Search all recorded HDFS logs.
Workflow Designer
• Mounts various analytic modules (e.g. Mahout).
• Drag and drop the provided modules onto the canvas.
• Analytic and statistical modules are currently mounted, Mahout and Giraph modules are being mounted, and ETL MapReduce jobs will be mounted soon.
Big Workflow Case
• Supports a workflow composed of multiple nodes.
Apache Access Log To CSV
• Parameters to MapReduce
  • Delimiter
  • An option to print logs that do not match the pattern
• Location of the Apache access log and an output path for the CSV file
• MapReduce JAR file and a driver name
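The conversion this module performs can be sketched in plain Java. This is a minimal illustration rather than Flamingo's MapReduce code: the class and method names are hypothetical, the input is assumed to be in Apache's Common Log Format, and the delimiter and the "print non-matching logs" option are modeled as ordinary method parameters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not Flamingo code: Apache Common Log Format line -> CSV row. */
final class AccessLogToCsv {
    // Fields: host ident user [timestamp] "request" status bytes
    private static final Pattern CLF = Pattern.compile(
            "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)");

    /** Returns the fields joined by the delimiter, or empty if the line does not match. */
    static Optional<String> toCsv(String line, String delimiter) {
        Matcher m = CLF.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        List<String> fields = new ArrayList<>();
        for (int i = 1; i <= m.groupCount(); i++) {
            fields.add(m.group(i)); // note: values are not quoted or escaped
        }
        return Optional.of(String.join(delimiter, fields));
    }

    /** Converts all lines; non-matching lines go to stderr only when printNonMatching is set. */
    static List<String> convert(List<String> lines, String delimiter, boolean printNonMatching) {
        List<String> rows = new ArrayList<>();
        for (String line : lines) {
            Optional<String> row = toCsv(line, delimiter);
            if (row.isPresent()) {
                rows.add(row.get());
            } else if (printNonMatching) {
                System.err.println("NO MATCH: " + line);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] \"GET /apache_pb.gif HTTP/1.0\" 200 2326",
                "not an access log line");
        for (String row : convert(lines, ",", true)) {
            System.out.println(row);
        }
    }
}
```

In a real job the body of `toCsv` would sit inside a mapper, and the delimiter and option would arrive through the job configuration rather than as method arguments.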
Workflow Designer
• A complex workflow is often needed to reach a final output.
• Processing files with MapReduce usually takes several steps, which makes creating a workflow difficult.
• Engineers prefer Apache Hive's SQL-like query language over writing MapReduce jobs, so Workflow Designer comes in handy.
• When handling various types of log files, Workflow Designer and MapReduce are essential.
Workflow Monitoring
• Monitors workflows submitted from Workflow Designer. Accurate logs can be checked.
root@n02:~/flamingo_data/tmp/2014/03/31/90/JOB_20140331_172000_90_157566920/26385942 $> ls -lsa
total 40
 4 drwxr-xr-x  2 root root  4096 2014-03-31 17:23 .
 4 drwxr-xr-x 20 root root  4096 2014-03-31 17:23 ..
16 -rw-r--r--  1 root root 12731 2014-03-31 17:23 action.log → execution log
 4 -rwxrwxrwx  1 root root  1259 2014-03-31 17:23 core-site.xml
 0 -rw-r--r--  1 root root     0 2014-03-31 17:23 hadoop.job_201403300831_0471 → MapReduce Job ID
 4 -rwxrwxrwx  1 root root   852 2014-03-31 17:23 script.sh
root@n02:~/flamingo_data/tmp/2014/03/31/90/JOB_20140331_172000_90_157566920/26385942 $>
NODES IN A WORKFLOW CONTAIN SEVERAL MAPREDUCE JOBS, SO THEY MUST BE TRACKABLE
What users view in the MapReduce execution history
Hadoop Job Monitoring
• The jobs must be trackable in Hadoop Job Monitoring.
Expression Language (EL)
• Dynamically substitutes values into variables.
  • E.g. today's date: dateFormat('yyyyMMdd'), dateFormat('yyyy-MM-dd')
• For example, replace variables with certain dates.
  • E.g. a daily batch: record yesterday's date in a workflow executed today.
• Supported Expression Language
  • dateFormat('DATE FORMAT') → dateFormat('yyyyMMddHHmmss')
  • hostname, escapeString
  • yesterday, tommorow
  • month, day, hour, minute, … → day('yyyyMMdd', -1) :: yesterday's date (20131111)
  • trim, concat, urlEncode, firstNotNull
• The ${EL} format is dynamically replaced with real values.
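How such a substitution might work can be sketched as follows. This is an assumption-laden illustration, not Flamingo's EL engine: the class `ElSketch` is hypothetical, only `dateFormat` and `day` are handled, and the exact `${function('pattern', offset)}` syntax is inferred from the examples on the slide.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of ${...} substitution; not Flamingo's actual EL engine. */
final class ElSketch {
    // Matches ${name('pattern')} or ${name('pattern', offset)}
    private static final Pattern EL = Pattern.compile(
            "\\$\\{(\\w+)\\('([^']*)'(?:\\s*,\\s*(-?\\d+))?\\)\\}");

    /** Replaces supported expressions in text, using the given time as "now". */
    static String evaluate(String text, LocalDateTime now) {
        Matcher m = EL.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String function = m.group(1);
            String value;
            if (function.equals("dateFormat")) {
                // Current date and time in the requested format
                value = now.format(DateTimeFormatter.ofPattern(m.group(2)));
            } else if (function.equals("day")) {
                // Date shifted by the given number of days (0 when omitted)
                long offset = m.group(3) == null ? 0 : Long.parseLong(m.group(3));
                value = now.plusDays(offset).format(DateTimeFormatter.ofPattern(m.group(2)));
            } else {
                value = m.group(); // unsupported function: leave the expression untouched
            }
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String path = "/logs/${day('yyyyMMdd', -1)}/result_${dateFormat('yyyy-MM-dd')}.csv";
        System.out.println(evaluate(path, LocalDateTime.now()));
    }
}
```

Passing the clock in as a parameter keeps the sketch testable: a daily batch can be replayed for any date by supplying a fixed `LocalDateTime`.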
Hadoop Job Tracker Monitoring
• Displays Hadoop's job tracker info on graphs.
• Remote monitoring and tracking of Hadoop jobs are available.
Hive Editor & Hive Metastore Browser
• Search, browse, and download using SQL.
• Hive Metastore is integrated, so databases and tables are easy to manage.
Hive Editor Use Case
• Case 1: Search the user access log with Hive
  – If the log is semi-structured or unstructured, it's problematic.
  – If a column contains an array of maps, it's also problematic.
• Below is an example of a semi-structured log
TYPE="IPINSIDE" TIME="2014-03-20 17:40:37" ID="guest0899349" MAC="AA-BB-01-18-68-68" NAT_IP="10.24.104.104" NAT_IP_NATION="USA" PROXY_USE="Y" VPN_USE="Y" REMOTE_USE="Y" PROXY_IP="192.24.104.104" PROXY_IP_NATION="USA" VPN_IP="192.24.104.104" VPN_IP_NATION="USA" SVC_CODE="SVC_CODE_0899349" HDD_DISK="HDD_DISK_0899349" CPU_INFO="CPU_INFO_0899349" USE_OS_NATION="USA" MESG="mesg..... time[1395284830] rnd[875899349] unq[5000000]"
TYPE="IPINSIDE" TIME="2014-03-20 17:40:37" ID="guest0899349" MAC="AA-BB-01-18-68-68" NAT_IP="10.24.104.104" NAT_IP_NATION="USA" PROXY_USE="Y" VPN_USE="Y" REMOTE_USE="Y" PROXY_IP="192.24.104.104" PROXY_IP_NATION="USA" VPN_IP="192.24.104.104" VPN_IP_NATION="USA" SVC_CODE="SVC_CODE_0899349" HDD_DISK="HDD_DISK_0899349" CPU_INFO="CPU_INFO_0899349" USE_OS_NATION="USA" MESG="mesg..... time[1395284830] rnd[875899349] unq[5000000]"
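Hive Editor Use Case
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.hive.serde2.SerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.typeinfo.StructTypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.io.Writable;

public class MasSerde implements SerDe {
    private StructTypeInfo rowTypeInfo;
    private ObjectInspector rowOI;
    private List<String> colNames;
    private List<Object> row = new ArrayList<Object>();

    // Captures every double-quoted value in a log record, in order of appearance.
    Pattern p = Pattern.compile("\"(.*?)\"");

    @Override
    public Object deserialize(Writable blob) throws SerDeException {
        row.clear();

        // Collect the quoted values of this record.
        Matcher m = p.matcher(blob.toString());
        List<String> values = new ArrayList<String>();
        while (m.find()) {
            values.add(m.group(1));
        }

        // Map the values to the table columns by position.
        int i = 0;
        for (String fieldName : rowTypeInfo.getAllStructFieldNames()) {
            TypeInfo fieldTypeInfo = rowTypeInfo.getStructFieldTypeInfo(fieldName);
            row.add(parseField(values.get(i), fieldTypeInfo));
            i++;
        }
        return row;
    }

    // ... (omitted)
}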
WHEN A LOG FILE IS LOADED, IT'S DESERIALIZED.
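The extraction step can be tried outside Hive with a small standalone sketch. It is a variant of the slide's approach, not the same code: instead of collecting quoted values by position, it keeps each KEY="VALUE" pair in a map. The class name `KeyValueLogParser` is hypothetical, and the sample record is a shortened version of the log shown earlier.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical standalone parser for KEY="VALUE" log records; not part of Flamingo. */
final class KeyValueLogParser {
    // A key made of word characters, followed by a double-quoted value.
    private static final Pattern PAIR = Pattern.compile("(\\w+)=\"(.*?)\"");

    /** Returns the pairs of one record in their original order. */
    static Map<String, String> parse(String record) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(record);
        while (m.find()) {
            fields.put(m.group(1), m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        String record = "TYPE=\"IPINSIDE\" TIME=\"2014-03-20 17:40:37\" ID=\"guest0899349\" "
                + "MESG=\"mesg..... time[1395284830] rnd[875899349] unq[5000000]\"";
        for (Map.Entry<String, String> field : parse(record).entrySet()) {
            System.out.println(field.getKey() + " -> " + field.getValue());
        }
    }
}
```

Keeping the keys makes the parser tolerant of records whose fields appear in a different order, at the cost of a map lookup per column when the values are later bound to a table schema.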
Pig Script Editor
• Edits and saves Pig Latin scripts.
• Executes and manages Pig Latin scripts to expedite data processing.
Dashboard
• Displays batch job history.
Job Management
• Schedules, monitors, and executes batch jobs.
• Cron expressions are fully supported.
Project Details
• Download
  – http://www.sourceforge.net/projects/hadoop-manager
• Wiki (manuals and tech notes)
  – http://wiki.opencloudengine.org/pages/viewpage.action?pageId=819205
• Issues (bugs and new features)
  – http://jira.opencloudengine.org
• Build Server
  – http://build.opencloudengine.org
• Google Groups: [email protected]
• Subscription: [email protected]
The Future of Flamingo Project
• Big Data on Cloud
• Netra (OpenStack based Hadoop Provisioning) + Flamingo (Hadoop based Workspace)
• Open Source based Big Data Platform
• Apache Hadoop EcoSystem
• Big Data Management Using Flamingo
• Apache Hadoop PaaS (Platform as a Service)
• Big Data All In One Package
Workflow Designer
• MapReduce developers use different parameters.
• How will we standardize these various MapReduce jobs?
• Most UI parts are reusable and provided as components.
• The MapReduce module and UI controls are standardized and offered as a framework.
Reuse components
UI Layout
Workflow Designer
• Define module icons through metadata and minimize coding.
• The framework takes care of most of the work, and users only handle metadata.
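One way to picture metadata-driven modules is a small descriptor that a framework turns into a `hadoop jar` invocation. The sketch below is illustrative only: the `ModuleMetadata` class, the JAR and driver names, and the `-name value` parameter convention are assumptions made for the example, not Flamingo's actual metadata format.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical module descriptor; the names and parameter convention are assumptions. */
final class ModuleMetadata {
    private final String jar;
    private final String driver;
    private final Map<String, String> parameters = new LinkedHashMap<>();

    ModuleMetadata(String jar, String driver) {
        this.jar = jar;
        this.driver = driver;
    }

    /** Adds a named parameter and returns this descriptor for chaining. */
    ModuleMetadata with(String name, String value) {
        parameters.put(name, value);
        return this;
    }

    /** Builds "hadoop jar <jar> <driver> -name value ..." as an argument list. */
    List<String> toCommand() {
        List<String> command = new ArrayList<>();
        command.add("hadoop");
        command.add("jar");
        command.add(jar);
        command.add(driver);
        for (Map.Entry<String, String> p : parameters.entrySet()) {
            command.add("-" + p.getKey());
            command.add(p.getValue());
        }
        return command;
    }

    public static void main(String[] args) {
        ModuleMetadata module = new ModuleMetadata("example-mr.jar", "ExampleAccessLogDriver")
                .with("input", "/logs/access")
                .with("output", "/out/csv")
                .with("delimiter", ",");
        System.out.println(String.join(" ", module.toCommand()));
    }
}
```

With a uniform descriptor like this, a designer UI only needs to render one form per module and hand the collected values to the same launcher code, which is the kind of standardization the slides describe.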
Participate and Share with Us!!
www.opencloudengine.org