ETL into Neo4j

ETL into Neo4j Max De Marzi

description

Learn some of the ways to load data into Neo4j quickly.

Transcript of ETL into Neo4j

Page 1: ETL into Neo4j

ETL into Neo4j

Max De Marzi

Page 2: ETL into Neo4j

About Me

• My Blog: http://maxdemarzi.com

• Find me on Twitter: @maxdemarzi

• Email me: [email protected]

• GitHub: http://github.com/maxdemarzi

Built the Neography Gem (Ruby wrapper to the Neo4j REST API)

Playing with Neo4j since 10/2009

Page 3: ETL into Neo4j

Agenda

• ETL your mind

• ETL with Batch and the REST API

• ETL with Gremlin and Groovy

• ETL with the Batch Importer

• ETL from SQL

Page 4: ETL into Neo4j

ETL your Mind

You have to start there

Page 5: ETL into Neo4j

More Relational than Relational

Stop thinking about how tables are related

Start thinking about relationships

Page 6: ETL into Neo4j

Objects like to mingle

Optimized for “trees” of data

Optimized for seeing the forest and the trees, and the branches, and the trunks

Page 7: ETL into Neo4j

SELECT skills.*, user_skill.*
FROM users
JOIN user_skill ON users.id = user_skill.user_id
JOIN skills ON user_skill.skill_id = skills.id
WHERE users.id = 1

Page 8: ETL into Neo4j

START user = node(1)
MATCH user -[user_skill]-> skill
RETURN skill, user_skill

Page 9: ETL into Neo4j

Property Graph

Page 10: ETL into Neo4j

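[Diagram: the same data as a property graph and as relational tables]

Graph: Language (name, code, word_count) -[IS_SPOKEN_IN (as_primary)]-> Country (name, code, flag_uri)

Tables: Language (language_code, language_name, word_count); Country (country_code, country_name, flag_uri); LanguageCountry (language_code, country_code, primary)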

Page 11: ETL into Neo4j

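[Diagram: a list-valued property vs. nodes and relationships]

As a property: name: “Canada”, languages_spoken: “[ ‘English’, ‘French’ ]”

As a graph: language nodes (language: “English”, language: “French”) connected by spoken_in relationships to country nodes (name: “Canada”, name: “USA”, name: “France”)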
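
The remodeling shown on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the input records, the `to_graph` helper and the in-memory node ids are all assumptions made for the example.

```python
# Hypothetical input: country records that carry a list-valued property.
countries = [
    {"name": "Canada", "languages_spoken": ["English", "French"]},
    {"name": "USA", "languages_spoken": ["English"]},
    {"name": "France", "languages_spoken": ["French"]},
]


def to_graph(records):
    """Turn list-valued properties into language nodes and spoken_in edges."""
    nodes = {}  # (label, name) -> node id; doubles as a lookup index
    edges = []  # (language id, "spoken_in", country id)

    def get_or_create(label, name):
        key = (label, name)
        if key not in nodes:
            nodes[key] = len(nodes) + 1  # ids are handed out in creation order
        return nodes[key]

    for record in records:
        country_id = get_or_create("Country", record["name"])
        for language in record["languages_spoken"]:
            language_id = get_or_create("Language", language)
            edges.append((language_id, "spoken_in", country_id))
    return nodes, edges


nodes, edges = to_graph(countries)
```

Each language becomes a single node shared by every country it is spoken in, which is what lets a query start from “French” and reach both Canada and France.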

Page 12: ETL into Neo4j

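[Diagram: one wide Country record vs. a graph]

Single record: Country (name, flag_uri, language_name, number_of_words, yes_in_language, no_in_language, currency_code, currency_name)

Graph: Country (name, flag_uri) -[SPEAKS]-> Language (name, number_of_words, yes, no); Country -[USES_CURRENCY]-> Currency (code, name)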

Page 13: ETL into Neo4j

ETL with Batch and the REST API

Page 14: ETL into Neo4j

Batch command from REST API

Great for importing Facebook/Twitter friends

Keep each request under 10k commands

Preferably send a request every 2k to 5k commands
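
As a rough sketch of the sizing advice above, the following Python builds "create node" jobs and splits them into requests of 2,000 commands each. The job layout (`method`, `to`, `body`, `id`) follows the legacy Neo4j REST batch format as I understand it; the friend names, the helper names and the chunk size are assumptions for illustration, and nothing is actually sent to a server.

```python
import json


def node_commands(people):
    """Build one 'create node' job per person."""
    return [
        {"method": "POST", "to": "/node", "body": {"name": name}, "id": i}
        for i, name in enumerate(people)
    ]


def chunked(commands, size=2000):
    """Yield successive slices of at most `size` commands."""
    for start in range(0, len(commands), size):
        yield commands[start:start + size]


# Hypothetical data: 4,500 friends to import.
friends = ["friend-%d" % n for n in range(4500)]

# One JSON payload per request; each would be POSTed to the server's
# batch endpoint (assumed here to be /db/data/batch).
payloads = [json.dumps(chunk) for chunk in chunked(node_commands(friends))]
```

One thing to watch when chunking: a job that refers to an earlier job by its id should stay in the same request as the job it refers to, since those references are scoped to a single batch.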

Page 15: ETL into Neo4j

Using Batch from Neography

Page 16: ETL into Neo4j

Why Batch

Transactional: if any command fails, nothing is committed.

Ordered: responses guaranteed to be in the same order as sent.

Continuous: load or update nodes and relationships in spurts or as a stream.

Page 17: ETL into Neo4j

ETL with Gremlin and Groovy

Page 18: ETL into Neo4j

Commit every 1,000 changes or so, and make sure to stop the transaction at the very end so the last few changes are committed.
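
The commit pattern can be sketched independently of Gremlin. The Python below uses a stand-in `FakeGraph` class (an assumption made for this example, not a real graph API) that only counts pending changes and commits, so the batching logic is easy to see.

```python
class FakeGraph:
    """Stand-in for a transactional graph; it only counts changes and commits."""

    def __init__(self):
        self.pending = 0  # changes made since the last commit
        self.stored = 0   # changes that have been committed
        self.commits = 0  # number of commits performed

    def add_vertex(self, properties):
        self.pending += 1

    def commit(self):
        self.stored += self.pending
        self.pending = 0
        self.commits += 1


def load(graph, rows, commit_every=1000):
    """Add one vertex per row, committing every `commit_every` changes."""
    for count, row in enumerate(rows, start=1):
        graph.add_vertex(row)
        if count % commit_every == 0:
            graph.commit()
    # Final commit: picks up the last few changes that did not fill a batch.
    if graph.pending:
        graph.commit()


graph = FakeGraph()
load(graph, ({"n": i} for i in range(2500)))
```

Without the final commit, the last 500 of these 2,500 changes would never be stored.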

Look into auto-indexing to make life easier.

Auto-indexing is disabled by default. See the docs for a trick to make it a full-text index instead of an exact one.

http://docs.neo4j.org/chunked/milestone/auto-indexing.html

Page 19: ETL into Neo4j

Crazy format is OK

Id :: Title :: Genre|Genre|Genre

But it’s preferable to steer clear of characters that need escaping, like “|”

The location of the data file is given as a string, converted to a URL, then processed one line at a time. A movie vertex is created, a genre vertex is created unless it already exists (index lookup), and an edge from the movie to the genre is created.
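
The talk does this in Gremlin and Groovy; as a language-neutral illustration, here is a minimal Python sketch of the same steps. The sample lines, the property names, the `HAS_GENRE` edge label and the dictionary standing in for the index are all assumptions made for the example.

```python
def parse_line(line):
    """Split 'Id :: Title :: Genre|Genre' into (id, title, [genres])."""
    movie_id, title, genres = [part.strip() for part in line.split("::")]
    return int(movie_id), title, [g.strip() for g in genres.split("|")]


def load_movies(lines):
    vertices = []     # property dicts; list position + 1 is the vertex id
    genre_index = {}  # genre name -> vertex id, standing in for an index lookup
    edges = []        # (movie vertex id, label, genre vertex id)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        movie_id, title, genres = parse_line(line)
        vertices.append({"type": "Movie", "movieId": movie_id, "title": title})
        movie_vertex = len(vertices)
        for genre in genres:
            # Create the genre vertex only if the index does not know it yet.
            if genre not in genre_index:
                vertices.append({"type": "Genre", "genre": genre})
                genre_index[genre] = len(vertices)
            edges.append((movie_vertex, "HAS_GENRE", genre_index[genre]))
    return vertices, genre_index, edges


# Hypothetical sample data in the slide's format.
sample = [
    "1 :: Toy Story :: Animation|Comedy",
    "2 :: Heat :: Action|Crime|Thriller",
    "3 :: Grumpier Old Men :: Comedy|Romance",
]
vertices, genre_index, edges = load_movies(sample)
```

The index lookup is what keeps “Comedy” from being created twice when the third movie is processed.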

Full walk-through on http://maxdemarzi.com/2012/01/13/neo4j-on-heroku-part-one/

Page 20: ETL into Neo4j

ETL with the Batch Importer

Page 21: ETL into Neo4j

Installation Walk-Through

Page 22: ETL into Neo4j

Testing it

7.5M nodes, 42M relationships in just over 3 minutes on a laptop.

Page 23: ETL into Neo4j

Loading it into Neo4j

Full walk-through on http://maxdemarzi.com/2012/02/28/batch-importer-part-1/

Page 24: ETL into Neo4j

When to use the Batch Importer?

• 1st time loading or periodic reloading

• When you need Speed

• When you don’t mind a little Java

Page 25: ETL into Neo4j

ETL from SQL

Page 26: ETL into Neo4j

Identities who vouched for each other

row_number() and INTO are our friends
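
To show the idea without a database server, here is a small sketch using Python's built-in `sqlite3` module. It is an approximation of the slide, not the original query: the `vouches` table, its columns and its rows are hypothetical, SQLite's `CREATE TABLE ... AS SELECT` stands in for `SELECT ... INTO`, and `row_number()` needs SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical source table: who vouched for whom, for what, and the status.
conn.executescript("""
    CREATE TABLE vouches (voucher TEXT, vouchee TEXT, term TEXT, status TEXT);
    INSERT INTO vouches VALUES
        ('clkao', 'dave',  'perl', 'active'),
        ('dave',  'erin',  'ruby', 'pending'),
        ('erin',  'clkao', 'perl', 'active');
""")

# Number every distinct identity; the row number becomes the node id.
conn.execute("""
    CREATE TABLE nodes AS
    SELECT row_number() OVER (ORDER BY name) AS node_id, name
    FROM (SELECT voucher AS name FROM vouches
          UNION
          SELECT vouchee FROM vouches)
""")

rows = conn.execute(
    "SELECT node_id, name FROM nodes ORDER BY node_id"
).fetchall()
```

Storing the numbered identities in their own table makes it easy to join back later and translate names into node ids for the relationships.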

Page 27: ETL into Neo4j

The “term” vouched for will serve as our relationship type, status is a relationship property.

Page 28: ETL into Neo4j

Notice there are no node ids. These are automatic; clkao is node 1.
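
A minimal sketch of producing the two input files in Python follows. It assumes the batch importer reads tab-separated files with a header row, with `start`, `end` and `type` as the first relationship columns; the names, vouches and in-memory buffers (used instead of real files) are hypothetical.

```python
import csv
import io

# Hypothetical data; node ids are implied by row order, starting at 1.
names = ["clkao", "dave", "erin"]
vouches = [
    ("clkao", "dave", "perl", "active"),
    ("dave", "erin", "ruby", "pending"),
]
node_id = {name: i for i, name in enumerate(names, start=1)}

# Nodes file: a header of property names, then one row per node, no id column.
nodes_file = io.StringIO()
writer = csv.writer(nodes_file, delimiter="\t", lineterminator="\n")
writer.writerow(["name"])
for name in names:
    writer.writerow([name])

# Relationships file: start id, end id, relationship type, then properties.
rels_file = io.StringIO()
writer = csv.writer(rels_file, delimiter="\t", lineterminator="\n")
writer.writerow(["start", "end", "type", "status"])
for voucher, vouchee, term, status in vouches:
    writer.writerow([node_id[voucher], node_id[vouchee], term, status])
```

Here the vouched-for term lands in the `type` column and the status becomes a relationship property, matching the mapping described on the previous slide.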

Page 29: ETL into Neo4j

No time to get coffee >8-[

Page 30: ETL into Neo4j

What about multiple types of nodes?

No problem, just add the MAX(node_id) from the first table.
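
The offset trick can be illustrated in plain Python. The second node type ("projects") and all names below are hypothetical; the point is only that ids for the second table start after the highest id of the first.

```python
def assign_ids(names, offset=0):
    """Number names 1..n in order, shifted by `offset`."""
    return {name: offset + i for i, name in enumerate(names, start=1)}


# First node type: ids 1..3.
people = assign_ids(["clkao", "dave", "erin"])

# Second node type: continue numbering after MAX(node_id) of the first.
max_person_id = max(people.values())
projects = assign_ids(["neography", "batch-import"], offset=max_person_id)
```

Because the two id ranges never overlap, rows from both tables can be written to the same nodes file in order.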

Full walk-through at: http://maxdemarzi.com/2012/02/28/batch-importer-part-2/

Need help? E-mail me, catch me on Google chat or Skype.

Please don’t be shy…. and read my blog:

http://maxdemarzi.com

Page 31: ETL into Neo4j

Thank you!

http://maxdemarzi.com