Post on 15-Apr-2017
HLoader
Data Ingestion from Oracle Databases to Hadoop Clusters: Automatically, On-Demand
8/13/2015 HLoader – A. Bose, D. Stein 2
Problem
– Control and monitor data transfer using Sqoop, a CLI tool for bulk data transfer
– Two in one: two distinct Summer Student task proposals for basically the same job
– Frequent requests: different users with different but similar use cases (ATLAS Job Monitoring, CMS Job Monitoring, CMS data popularity, ACCLOG)
– Manually executed job that can be partially automated
Requirements
– Run jobs…
… incrementally
… communicate with the end user
– Handle failures: retry, notify, prevent
– Be secure, stay safe: authorize and authenticate the users without exchanging passwords
– Use what’s provided: run on the CERN-provided infrastructure
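The "retry, notify" part of the failure-handling requirement could look roughly like the sketch below. It is only an illustration: `run_with_retries`, its parameters and the message wording are hypothetical names, not taken from HLoader.

```python
import time
from typing import Callable


def run_with_retries(job: Callable[[], None],
                     notify: Callable[[str], None],
                     max_attempts: int = 3,
                     delay_seconds: float = 0.0) -> bool:
    """Run `job`, retrying on failure and notifying the user after each attempt.

    Returns True as soon as one attempt succeeds, False if all attempts fail.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            job()
        except Exception as exc:  # a real agent would catch narrower errors
            notify(f"attempt {attempt}/{max_attempts} failed: {exc}")
            if attempt < max_attempts:
                time.sleep(delay_seconds)  # wait before the next attempt
        else:
            notify(f"attempt {attempt}/{max_attempts} succeeded")
            return True
    return False
```

Here `notify` stands in for whatever channel reaches the end user; passing `list.append` is enough to collect the messages while experimenting.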
Solution overview
[Diagram: Client, Meta DB, REST API, Agent, Oracle Databases, FIM, Hadoop Clusters, …]
1. Provided infrastructure: Oracle Databases and Hadoop Clusters
2. Transfer data: the user wants to transfer data, so they create a new job (what, when and where to transfer)
3. Execute the transfer on behalf of the user: schedule and execute the job at the requested time (also inform the user of the status)
4. Update if needed: if the user requested incremental updates, schedule the next one after the given interval
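Steps 3 and 4 amount to computing when a job is due next from its start time and optional interval. A minimal sketch of that calculation follows; the function name `next_run` and the use of `None` for "no incremental updates" are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional


def next_run(start_time: datetime,
             interval: Optional[timedelta],
             now: datetime) -> Optional[datetime]:
    """Return the next time the job should run, or None if nothing is left to do.

    `interval` must be a positive timedelta when given.
    """
    if now <= start_time:
        return start_time          # the initial transfer has not happened yet
    if interval is None:
        return None                # one-off job: no incremental updates requested
    elapsed = now - start_time
    steps = -(-elapsed // interval)   # ceiling division: intervals elapsed, rounded up
    return start_time + steps * interval
```

A job that started at midnight with a one-day interval and is inspected at 06:00 the next day would be due again at the following midnight.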
Solution security
1. CERN SSO authentication: no password exchange
2. Authorization: only available (ownership) and enabled (configured) Oracle servers can be used
3. Kerberos SSH tunneling: a separate user logs in to the clusters, without a password
4. Secure password input: other users cannot see the password as plaintext anywhere
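The authorization rule in item 2 combines two conditions: the user must own the source server, and the server must be enabled in the configuration. A minimal sketch of such a check is shown here; `authorize_source` and the set-based inputs are hypothetical, and how HLoader actually obtains ownership information is not described in these slides.

```python
from typing import Set


def authorize_source(server: str, owned: Set[str], enabled: Set[str]) -> None:
    """Raise PermissionError unless `server` is both owned by the user and enabled."""
    if server not in owned:
        raise PermissionError(f"{server}: not owned by the requesting user")
    if server not in enabled:
        raise PermissionError(f"{server}: not enabled for transfers")
```

The function returns nothing on success, so a caller can invoke it before creating a job and let the exception reject the request.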
Solution modularity
1. DB connector agnostic: SQLAlchemy supports several dialects, and other connectors can also be integrated
2. Interchangeable scheduler: based on the servers and the needed schedule complexity
3. Flexible communication with Hadoop: besides commands through SSH, Oozie could also be used
4. Client communicating using REST API
5. Changeable Sqoop JDBC driver: normal or fast connectors if possible
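One way to keep the communication with Hadoop interchangeable (item 3) is to hide it behind a small interface. The sketch below is an assumption about how that might be structured, not HLoader's actual code: `Runner`, `SshRunner` and `LocalRunner` are hypothetical names, and the classes only build command lines without executing anything. An Oozie-based runner would be one more subclass.

```python
import shlex
from abc import ABC, abstractmethod
from typing import List


class Runner(ABC):
    """Hypothetical interface: turns Sqoop arguments into a command to launch."""

    @abstractmethod
    def plan(self, sqoop_args: List[str]) -> List[str]:
        """Return the local command line that would start the transfer."""


class SshRunner(Runner):
    """Runs the Sqoop command on a cluster node through SSH."""

    def __init__(self, login: str) -> None:
        self.login = login  # "user@host" of the cluster gateway (placeholder)

    def plan(self, sqoop_args: List[str]) -> List[str]:
        # GSSAPIAuthentication is the OpenSSH option that enables Kerberos login.
        remote_command = shlex.join(sqoop_args)
        return ["ssh", "-o", "GSSAPIAuthentication=yes", self.login, remote_command]


class LocalRunner(Runner):
    """Runs the Sqoop command on the machine hosting the agent."""

    def plan(self, sqoop_args: List[str]) -> List[str]:
        return list(sqoop_args)
```

Because both classes accept the same argument list, the agent can pick a runner from configuration without changing how jobs are described.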
Solution infrastructure
1. PostgreSQL On-Demand: with PostgreSQL and SQLAlchemy connector
2. Central WebServices: DFS | Windows > IIS 8.5 > FastCGI > Python 2.7 > Flask
3. Agent running separately: on a DB locally managed server, OpenStack or WebServices (TBD)
4. Client hosted with the REST API: for easy usage and update, could be separate
Solution meta DB
HL_SERVERS: server_id (PK), server_address, server_name
HL_CLUSTERS: cluster_id (PK), cluster_address, cluster_name
HL_JOBS: job_id (PK), source_server_id (FK), source_schema_name, source_object_name, destination_cluster_id (FK), destination_path, owner_username, sqoop_nmap, sqoop_splitting_column, sqoop_incremental_method, sqoop_direct, start_time, interval, job_last_update
HL_TRANSFERS: transfer_id (PK), scheduler_transfer_id, job_id (FK), transfer_status, transfer_start, transfer_last_update, last_modified_value
HL_LOGS: log_id (PK), transfer_id (FK), log_source, log_path, log_content
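To make the relationships between the five tables concrete, here is a sketch of the schema as SQLite DDL driven from Python's standard library. The table and column names come from the slide; the column types, the foreign-key targets and the use of SQLite instead of PostgreSQL are assumptions chosen so the example runs without extra packages.

```python
import sqlite3

SCHEMA = """
CREATE TABLE HL_SERVERS (
    server_id      INTEGER PRIMARY KEY,
    server_address TEXT,
    server_name    TEXT
);
CREATE TABLE HL_CLUSTERS (
    cluster_id      INTEGER PRIMARY KEY,
    cluster_address TEXT,
    cluster_name    TEXT
);
CREATE TABLE HL_JOBS (
    job_id                   INTEGER PRIMARY KEY,
    source_server_id         INTEGER REFERENCES HL_SERVERS(server_id),
    source_schema_name       TEXT,
    source_object_name       TEXT,
    destination_cluster_id   INTEGER REFERENCES HL_CLUSTERS(cluster_id),
    destination_path         TEXT,
    owner_username           TEXT,
    sqoop_nmap               INTEGER,
    sqoop_splitting_column   TEXT,
    sqoop_incremental_method TEXT,
    sqoop_direct             INTEGER,
    start_time               TEXT,
    "interval"               TEXT,
    job_last_update          TEXT
);
CREATE TABLE HL_TRANSFERS (
    transfer_id           INTEGER PRIMARY KEY,
    scheduler_transfer_id TEXT,
    job_id                INTEGER REFERENCES HL_JOBS(job_id),
    transfer_status       TEXT,
    transfer_start        TEXT,
    transfer_last_update  TEXT,
    last_modified_value   TEXT
);
CREATE TABLE HL_LOGS (
    log_id      INTEGER PRIMARY KEY,
    transfer_id INTEGER REFERENCES HL_TRANSFERS(transfer_id),
    log_source  TEXT,
    log_path    TEXT,
    log_content TEXT
);
"""


def create_meta_db(conn: sqlite3.Connection) -> None:
    """Create the five HL_* tables on a fresh connection."""
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.executescript(SCHEMA)
```

Calling `create_meta_db(sqlite3.connect(":memory:"))` gives a throwaway database in which a job cannot reference a server or cluster that does not exist.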
Solution restrictions
– Only allow tables and views to be imported: the DB is responsible for evaluating and checking the queries
– Selected (preconfigured) source databases: gradual introduction for new users
– Preset destination folder structure: with restricted access rights, avoiding collisions and unauthorized access
– Basic Sqoop command logic (for now): e.g., with primary key, only one PK attribute
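The "basic Sqoop command logic" can be pictured as a function that maps a job description to `sqoop import` arguments. The sketch below uses Sqoop 1 options (`--connect`, `--username`, `--table`, `--target-dir`, `--num-mappers`, `--split-by`, `--direct`, `--incremental`, `--check-column`, `--last-value`). The `Job` class is hypothetical: most field names mirror the HL_JOBS columns, while `connect_url`, `db_username`, `check_column` and `last_value` are assumptions added for the example. Password handling is deliberately left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    connect_url: str                       # JDBC URL (assumed field)
    db_username: str                       # database account (assumed field)
    source_schema_name: str
    source_object_name: str                # table or view name
    destination_path: str
    sqoop_nmap: int = 1                    # number of mappers
    sqoop_splitting_column: Optional[str] = None
    sqoop_incremental_method: Optional[str] = None   # "append" or "lastmodified"
    check_column: Optional[str] = None     # assumed field
    last_value: Optional[str] = None       # assumed field
    sqoop_direct: bool = False


def sqoop_import_args(job: Job) -> List[str]:
    """Build the argument list for a Sqoop 1 import (no password option included)."""
    args = [
        "sqoop", "import",
        "--connect", job.connect_url,
        "--username", job.db_username,
        "--table", f"{job.source_schema_name}.{job.source_object_name}",
        "--target-dir", job.destination_path,
        "--num-mappers", str(job.sqoop_nmap),
    ]
    if job.sqoop_splitting_column:
        args += ["--split-by", job.sqoop_splitting_column]
    if job.sqoop_direct:
        args.append("--direct")
    if job.sqoop_incremental_method:
        if not job.check_column:
            raise ValueError("incremental imports need a check column")
        args += ["--incremental", job.sqoop_incremental_method,
                 "--check-column", job.check_column]
        if job.last_value is not None:
            args += ["--last-value", job.last_value]
    return args
```

Returning a list rather than a single string keeps quoting out of the picture until a runner decides how to launch the command.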
Solution current state
1. Client (Kate): in progress; meanwhile the REST interface can be used
2. REST API (Daniel): almost ready, missing the new job processing interface
3. Agent Scheduling (Ani): basically ready, can schedule jobs and update itself after job description modifications
4. Agent Runners (Daniel): working for initial imports, soon to be able to execute incremental updates; partially working SSH and REST monitoring
Solution future work
– Support more database connectors (SQLA/NoSQL)
– Support alternative runners like Oozie
– Prepare for Sqoop 2
– Integrate with Hive
– Resolve restrictions
– Release on GitHub with an Open Source license
Summary
– Easily expandable framework and service for transferring data from Oracle to Hadoop
– Designed with automation in mind: minimal administrator intervention needed
– Service built for easy usage: easy to use for the routine jobs
Workflow tools
– GitLab
– JIRA
– Slack
– Jenkins CI
Contributors
– Anirudha Bose
– Dániel Stein
– Antonio Romero Marin
– Domenico Giordano
– Kacper Surdy
– Katarzyna Maria Dziedziniewicz-Wójcik
– Manuel Martín Márquez
– Zbigniew Baranowski