ETL into Neo4j

Post on 10-May-2015

Learn some of the ways to load data into Neo4j quickly.

Max De Marzi

About Me

• My Blog: http://maxdemarzi.com

• Find me on Twitter: @maxdemarzi

• Email me: maxdemarzi@gmail.com

• GitHub: http://github.com/maxdemarzi

• Built the Neography Gem (Ruby wrapper to the Neo4j REST API)

• Playing with Neo4j since 10/2009

Agenda

• ETL your mind

• ETL with Batch and the REST API

• ETL with Gremlin and Groovy

• ETL with the Batch Importer

• ETL from SQL

ETL your Mind

You have to start there

More Relational than Relational

Stop thinking about how tables are related

Start thinking about relationships

Objects like to mingle

Optimized for “trees” of data

Optimized for seeing the forest and the trees, and the branches, and the trunks

SELECT skills.*, user_skill.*
FROM users
JOIN user_skill ON users.id = user_skill.user_id
JOIN skills ON user_skill.skill_id = skills.id
WHERE users.id = 1

START user = node(1) MATCH user -[user_skill]-> skill RETURN skill, user_skill
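A minimal sketch of the two mindsets side by side, assuming a throwaway in-memory SQLite database. The table and column names follow the SQL query above; the sample rows, the `level` column, and the `graph` dictionary are invented for illustration.

```python
import sqlite3

# Relational side: three tables and two joins to answer "which skills does user 1 have?"
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users      (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE skills     (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE user_skill (user_id INTEGER, skill_id INTEGER, level TEXT);
    INSERT INTO users  VALUES (1, 'max'), (2, 'other');
    INSERT INTO skills VALUES (10, 'ruby'), (20, 'neo4j'), (30, 'cobol');
    INSERT INTO user_skill VALUES (1, 10, 'expert'), (1, 20, 'expert'), (2, 30, 'novice');
""")
rows = conn.execute("""
    SELECT skills.name, user_skill.level
    FROM users
    JOIN user_skill ON users.id = user_skill.user_id
    JOIN skills     ON user_skill.skill_id = skills.id
    WHERE users.id = 1
    ORDER BY skills.id
""").fetchall()

# Graph side (hypothetical in-memory stand-in): start at the user node and
# follow its outgoing relationships, no join table needed.
graph = {
    1: [("expert", "ruby"), ("expert", "neo4j")],
    2: [("novice", "cobol")],
}
traversed = [(skill, rel) for rel, skill in graph[1]]

print(rows)
print(traversed)
```

Both prints show the same two skills; the difference is that the traversal starts at one node and only touches what is connected to it.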

Property Graph

[Diagram: property graph] Language (name, code, word_count) -[IS_SPOKEN_IN {as_primary}]-> Country (name, code, flag_uri)

[Diagram: relational tables] Language (language_code, language_name, word_count); Country (country_code, country_name, flag_uri); LanguageCountry (language_code, country_code, primary)

[Diagram: single node] name: “Canada”, languages_spoken: “[ ‘English’, ‘French’ ]”

[Diagram: separate nodes] name: “Canada”, name: “USA”, name: “France”, language: “English”, language: “French”, connected by four spoken_in relationships

[Diagram: single node] Country (name, flag_uri, language_name, number_of_words, yes_in_language, no_in_language, currency_code, currency_name)

[Diagram: three nodes] Country (name, flag_uri) -[SPEAKS]-> Language (name, number_of_words, yes, no); Country -[USES_CURRENCY]-> Currency (code, name)
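The last two diagrams contrast one Country record that carries language and currency fields with a Country node linked to separate Language and Currency nodes. A rough sketch of that split, assuming plain Python dictionaries as stand-ins for nodes; the function name `split_country` and all sample values are hypothetical.

```python
def split_country(row):
    """Turn one flat country record into three nodes and two relationships."""
    country = {"label": "Country", "name": row["name"], "flag_uri": row["flag_uri"]}
    language = {
        "label": "Language",
        "name": row["language_name"],
        "number_of_words": row["number_of_words"],
        "yes": row["yes_in_language"],
        "no": row["no_in_language"],
    }
    currency = {
        "label": "Currency",
        "code": row["currency_code"],
        "name": row["currency_name"],
    }
    # Relationships as (start, type, end), keyed by the values used in the diagrams.
    rels = [
        (country["name"], "SPEAKS", language["name"]),
        (country["name"], "USES_CURRENCY", currency["code"]),
    ]
    return [country, language, currency], rels


# Placeholder record; none of these values come from the slides.
flat = {
    "name": "Exampleland",
    "flag_uri": "http://example.com/flag.png",
    "language_name": "Examplish",
    "number_of_words": 1000,
    "yes_in_language": "yes",
    "no_in_language": "no",
    "currency_code": "EXD",
    "currency_name": "Example Dollar",
}
nodes, rels = split_country(flat)
print(rels)
```

Once language and currency are their own nodes, a second country that shares them only needs new relationships, not repeated properties.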

ETL with Batch and the REST API

Batch command from REST API

Great for importing Facebook/Twitter friends

Keep each request under 10k commands

Preferably send a request every 2k to 5k commands
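A sketch of the sizing advice, assuming the job format of the legacy Neo4j REST batch endpoint (a JSON array of `method`/`to`/`body`/`id` objects, typically POSTed to `/db/data/batch`). Nothing is sent here; the helper names and the friend data are hypothetical.

```python
import json


def node_job(job_id, properties):
    # One "create node" job; the id lets other jobs in the SAME request refer to it.
    return {"method": "POST", "to": "/node", "body": properties, "id": job_id}


def chunked(jobs, size=2000):
    # Yield consecutive slices of at most `size` jobs.
    for start in range(0, len(jobs), size):
        yield jobs[start:start + size]


friends = [{"name": "friend-%d" % i} for i in range(4500)]
jobs = [node_job(i, props) for i, props in enumerate(friends)]

# One JSON payload per request, each well under the 10k ceiling.
payloads = [json.dumps(chunk) for chunk in chunked(jobs, size=2000)]
print([len(json.loads(p)) for p in payloads])  # [2000, 2000, 500]
```

If jobs reference each other by id (for example a relationship pointing at a node created earlier in the batch), keep them in the same chunk, since those references are only expected to resolve within one request.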

Using Batch from Neography

Why Batch

Transactional: if any command fails, nothing is committed.

Ordered: responses are guaranteed to come back in the same order as the commands were sent.

Continuous: suited to loading or updating nodes and relationships in spurts or as a stream.
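Because responses come back in the order the commands were sent, they can be paired by position. A small sketch under that assumption; `map_locations` is a hypothetical helper, and the `responses` list is a hand-written stand-in for a server reply, not captured output.

```python
def map_locations(jobs, responses):
    """Pair each job with its response by position and return name -> location."""
    if len(jobs) != len(responses):
        raise ValueError("expected exactly one response per job")
    return {
        job["body"]["name"]: response["location"]
        for job, response in zip(jobs, responses)
    }


jobs = [
    {"method": "POST", "to": "/node", "body": {"name": "alice"}, "id": 0},
    {"method": "POST", "to": "/node", "body": {"name": "bob"}, "id": 1},
]
# Assumed response shape: one entry per job, carrying the job id and the new node's URL.
responses = [
    {"id": 0, "location": "http://localhost:7474/db/data/node/1"},
    {"id": 1, "location": "http://localhost:7474/db/data/node/2"},
]
locations = map_locations(jobs, responses)
print(locations["bob"])
```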

ETL with Gremlin and Groovy

Commit every 1000 changes or so, and make sure to stop the transaction at the very end so the last few changes are committed too.
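The commit pattern itself is not specific to Gremlin. A minimal sketch using SQLite as a stand-in for the graph transaction (an assumption made so the example runs anywhere); the `load` function and the `items` table are hypothetical.

```python
import sqlite3


def load(conn, rows, commit_every=1000):
    """Insert rows, committing every `commit_every` changes and once more at the end."""
    commits = 0
    pending = 0
    for row in rows:
        conn.execute("INSERT INTO items (value) VALUES (?)", (row,))
        pending += 1
        if pending == commit_every:
            conn.commit()
            commits += 1
            pending = 0
    if pending:
        # Without this final commit the last partial chunk would be lost.
        conn.commit()
        commits += 1
    return commits


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (value TEXT)")
commits = load(conn, ("row-%d" % i for i in range(2500)))
count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
print(commits, count)  # 3 2500
```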

Look into auto-indexing to make life easier.

Auto-indexing is disabled by default. See the docs for a trick to make it a full-text index instead of an exact one.

http://docs.neo4j.org/chunked/milestone/auto-indexing.html

Crazy format is OK:

Id :: Title :: Genre|Genre|Genre

But it’s preferable to steer clear of separators that need escaping, like “|”

The location of the data file is given as a string, converted to a URL, then processed one line at a time.

For each line, a movie vertex is created, a genre vertex is created unless it already exists (index lookup), and an edge from the movie to each genre is created.
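The per-line logic can be sketched without a graph database at all. This is an illustration in plain Python, not the Groovy script from the talk: a dictionary stands in for the index lookup, and the sample lines are made up.

```python
def load_movies(lines):
    """Parse 'Id :: Title :: Genre|Genre' lines into vertices and edges."""
    movies = []        # one vertex per movie
    genre_index = {}   # stand-in for the index: genre name -> genre vertex
    edges = []         # (movie id, genre name)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumes titles never contain "::" themselves.
        movie_id, title, genres = [part.strip() for part in line.split("::")]
        movies.append({"id": int(movie_id), "title": title})
        # str.split treats "|" literally, so no escaping is needed here.
        for name in genres.split("|"):
            # Create the genre vertex only if the lookup misses.
            genre_index.setdefault(name, {"genre": name})
            edges.append((int(movie_id), name))
    return movies, genre_index, edges


sample = [
    "1 :: Movie A :: Animation|Comedy",
    "2 :: Movie B :: Action|Comedy",
]
movies, genre_index, edges = load_movies(sample)
print(len(movies), sorted(genre_index), len(edges))
```

"Comedy" appears on both lines but ends up as a single genre vertex with two incoming edges, which is the point of the lookup.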

Full walk-through on http://maxdemarzi.com/2012/01/13/neo4j-on-heroku-part-one/

ETL with the Batch Importer

Installation Walk-Through

Testing it

7.5M nodes, 42M relationships in just over 3 minutes on a laptop.

Loading it into Neo4j

Full walk-through on http://maxdemarzi.com/2012/02/28/batch-importer-part-1/

When to use the Batch Importer?

• 1st time loading or periodic reloading

• When you need Speed

• When you don’t mind a little Java

ETL from SQL

Identities who vouched for each other

row_number() and INTO are our friends

The “term” vouched for will serve as our relationship type; status is a relationship property.

Notice there are no node ids. These are automatic: clkao is node 1.
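A sketch of the two files this produces, assuming tab-separated output where a node's id is simply its row position. The column headers (`name`, `start`, `end`, `type`, `status`) are assumptions, and apart from "clkao" every value is a placeholder.

```python
import csv
import io

# (voucher, vouchee, term, status)
vouches = [
    ("clkao", "user_b", "perl", "approved"),
    ("user_b", "user_c", "ruby", "pending"),
    ("user_c", "clkao", "perl", "approved"),
]

# Number identities in order of first appearance, starting at 1,
# which plays the role row_number() has on the SQL side.
node_ids = {}
for voucher, vouchee, _term, _status in vouches:
    for name in (voucher, vouchee):
        node_ids.setdefault(name, len(node_ids) + 1)

# Nodes file: no id column, the row position is the id.
nodes_file = io.StringIO()
writer = csv.writer(nodes_file, delimiter="\t", lineterminator="\n")
writer.writerow(["name"])
for name in node_ids:  # dicts keep insertion order
    writer.writerow([name])

# Relationships file: start id, end id, type (the term), then properties.
rels_file = io.StringIO()
writer = csv.writer(rels_file, delimiter="\t", lineterminator="\n")
writer.writerow(["start", "end", "type", "status"])
for voucher, vouchee, term, status in vouches:
    writer.writerow([node_ids[voucher], node_ids[vouchee], term, status])

print(nodes_file.getvalue())
print(rels_file.getvalue())
```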

No time to get coffee >8-[

What about multiple types of nodes?

No problem, just add the MAX(node_id) from the first table.
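The offset trick in miniature, as a hedged sketch: `number_rows` is a hypothetical helper, and the second node type ("projects") is invented for illustration.

```python
def number_rows(names, offset=0):
    """Assign consecutive node ids starting at offset + 1."""
    return {name: offset + i for i, name in enumerate(names, start=1)}


people = number_rows(["clkao", "user_b", "user_c"])

# The second node type continues after the highest id of the first table,
# so ids stay unique as long as its rows follow the first table's rows.
projects = number_rows(["project_x", "project_y"], offset=max(people.values()))

print(people)
print(projects)
```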

Full walk-through at: http://maxdemarzi.com/2012/02/28/batch-importer-part-2/

Need help? E-mail me, catch me on Google chat or Skype.

Please don’t be shy…. and read my blog:

http://maxdemarzi.com

Thank you!