Importing into Neo4j /No witty subtitle/
Model the Problem
[Diagram: Employee, Event and Expense Report nodes, each with various properties, connected by the relationships below]
• :SUBMITTED (SubmitDate, ReportNbr): a relationship entity has a name and also various properties
• :ATTENDED: a simple relationship has a name, but no properties
• :REFERENCES
From user story to model
Which people claimed expenses for the same event?
person ATTENDED event
person SUBMITTED expense_report
(person)-[:ATTENDED]->(event)<-[:ATTENDED]-(colleague)
MATCH (e:EMPLOYEE)-[:ATTENDED]->(ev:EVENT)<-[:ATTENDED]-(e1:EMPLOYEE)
WITH e, ev, e1
MATCH (ev)-[:EXPENSED_ON]->(er:EXPENSE_REPORT)
RETURN e, ev, e1, er;
Load CSV
ETL Power Tool
• Combines multiple aspects in a single operation
• Supports loading / ingesting CSV data from a URI (file://, http://, https://, ftp://)
• Direct mapping of input data into complex graph/domain structure
• Data conversion
• Supports complex computations
• Create or merge data, relationships and structure
Load CSV
Nodes – Indexes – Relationships
• Do multiple passes to create nodes and relationships instead of large, combined statements

LOAD CSV WITH HEADERS FROM "file:///path/to/file.csv" AS line
MERGE (a:Person {id: line.id})
ON CREATE SET a.name = line.name;

CREATE INDEX ON :Person(id);
CREATE INDEX ON :Movie(id);

LOAD CSV WITH HEADERS FROM "file:///path/to/file.csv" AS line
MATCH (m:Movie {id: line.movieId})
MATCH (a:Person {id: line.personId})
CREATE (a)-[:ACTED_IN {roles: [line.role]}]->(m);
Load CSV
Periodic Commit
• ALWAYS prefix your LOAD CSV with USING PERIODIC COMMIT.
• The number given is the number of import rows after which the imported data is committed.
• Depending on the complexity of your import operation, a batch of 1000 rows might create anywhere from 100 elements (if you have a lot of duplicates) up to 100,000 elements (when complex operations generate up to 100 nodes and relationships per row of input).
• That's why a commit size of 1000 might be a safe bet.
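The arithmetic behind the commit size can be sketched in a few lines of Python. This is only an illustration of the numbers quoted above, not something Neo4j provides: the function name commit_plan and its parameters are made up for this example.

```python
import math


def commit_plan(total_rows, commit_size, elements_per_row):
    """Estimate how a periodic commit splits an import into transactions.

    total_rows       -- number of CSV rows to import
    commit_size      -- rows per commit (the number given to USING PERIODIC COMMIT)
    elements_per_row -- assumed average of nodes + relationships created per row
    """
    if total_rows < 0 or commit_size <= 0 or elements_per_row < 0:
        raise ValueError("rows and elements must be >= 0, commit size must be > 0")
    return {
        # The last transaction may hold fewer rows, hence the ceiling.
        "transactions": math.ceil(total_rows / commit_size),
        # Upper bound on elements written in one full transaction.
        "elements_per_transaction": commit_size * elements_per_row,
    }


# Complex import: 100 elements per row, committed every 1000 rows.
print(commit_plan(10_000, 1000, 100))
```

With a commit size of 1000, the heavy case from the slide (100 elements per row) gives 100,000 elements per transaction, which is the upper figure quoted above.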
Load CSV
More Tips
• Rather than a long MERGE/CREATE statement that attempts to create multiple entities in one pass, favor short, simple statements and do multiple passes over the input as needed.
• If you load your CSV file over the network, make sure the network is fast enough to sustain the ingestion rate you'd like to have. Otherwise, if possible, download the file and use a file:// URL.
• Column names are case sensitive.
• Misspelled column names result in null values.
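Because a misspelled or wrongly cased column name silently yields null, it can help to check the header before running the import. The following is a minimal Python sketch of such a check; the helper name missing_columns and the sample data are hypothetical.

```python
import csv
import io


def missing_columns(csv_text, expected):
    """Return the expected column names absent from the CSV header.

    The comparison is case sensitive, mirroring how LOAD CSV WITH HEADERS
    treats column names.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    return [name for name in expected if name not in header]


# Hypothetical sample file: note the capital "C" in CompanyName.
sample = "CustomerID,CompanyName\nC1,Example Co\n"

# "companyName" differs only in case, so it is reported as missing.
print(missing_columns(sample, ["CustomerID", "companyName"]))  # ['companyName']
```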
Load CSV
Considerations
• Make sure you have sufficient RAM
• Use file:///path/to/file.csv on OSX and Unix, use file:c:/path/to/file.csv on Windows
• Check correct delimiters and columns
• Columns are case sensitive
• Empty columns are treated as null
• Default data type is String. Use toInt or toFloat to convert
• Change the delimiter if needed with … AS line FIELDTERMINATOR ';'
• Create necessary indexes and constraints upfront
• Use the Neo4j-Shell for larger imports
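Three of these points (custom delimiter, empty columns as null, everything a string unless converted) can be previewed outside Neo4j. The Python sketch below only imitates that behaviour for a dry run on a sample file; parse_rows and the column names are assumptions made for the example, not part of any Neo4j API.

```python
import csv
import io


def parse_rows(csv_text, delimiter=";"):
    """Parse delimited text the way the considerations above describe.

    - every value starts out as a string
    - empty columns become None (the analogue of null)
    - UnitPrice is converted explicitly, like toFloat(row.UnitPrice)
    """
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text), delimiter=delimiter):
        row = {key: (value if value != "" else None) for key, value in raw.items()}
        if row.get("UnitPrice") is not None:
            row["UnitPrice"] = float(row["UnitPrice"])
        rows.append(row)
    return rows


# Hypothetical semicolon-delimited sample; the second row has an empty price.
sample = "ProductID;UnitPrice\n1;18.5\n2;\n"
for row in parse_rows(sample):
    print(row)
```

ProductID stays a string because nothing converts it, which is the same surprise you get in Cypher when a numeric ID is compared against an unconverted CSV value.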
LOADing the Data
// Create customers
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/developer-resources/gh-pages/data/northwind/customers.csv" AS row
CREATE (:Customer {companyName: row.CompanyName, customerID: row.CustomerID, fax: row.Fax, phone: row.Phone});

// Create products
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/developer-resources/gh-pages/data/northwind/products.csv" AS row
CREATE (:Product {productName: row.ProductName, productID: row.ProductID, unitPrice: toFloat(row.UnitPrice)});
LOADing the Data
// Create suppliers
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/developer-resources/gh-pages/data/northwind/suppliers.csv" AS row
CREATE (:Supplier {companyName: row.CompanyName, supplierID: row.SupplierID});

// Create employees
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/developer-resources/gh-pages/data/northwind/employees.csv" AS row
CREATE (:Employee {employeeID: row.EmployeeID, firstName: row.FirstName, lastName: row.LastName, title: row.Title});
LOADing the Data
// Create categories
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/developer-resources/gh-pages/data/northwind/categories.csv" AS row
CREATE (:Category {categoryID: row.CategoryID, categoryName: row.CategoryName, description: row.Description});

// Create orders
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/developer-resources/gh-pages/data/northwind/orders.csv" AS row
MERGE (order:Order {orderID: row.OrderID})
ON CREATE SET order.shipName = row.ShipName;
Creating the Indexes
CREATE INDEX ON :Product(productID);
CREATE INDEX ON :Product(productName);
CREATE INDEX ON :Category(categoryID);
CREATE INDEX ON :Employee(employeeID);
CREATE INDEX ON :Supplier(supplierID);
CREATE INDEX ON :Customer(customerID);
CREATE INDEX ON :Customer(companyName);
Creating the Relationships
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/developer-resources/gh-pages/data/northwind/orders.csv" AS row
MATCH (order:Order {orderID: row.OrderID})
MATCH (customer:Customer {customerID: row.CustomerID})
MERGE (customer)-[:PURCHASED]->(order);

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/developer-resources/gh-pages/data/northwind/products.csv" AS row
MATCH (product:Product {productID: row.ProductID})
MATCH (supplier:Supplier {supplierID: row.SupplierID})
MERGE (supplier)-[:SUPPLIES]->(product);
Creating the Relationships
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/developer-resources/gh-pages/data/northwind/orders.csv" AS row
MATCH (order:Order {orderID: row.OrderID})
MATCH (product:Product {productID: row.ProductID})
MERGE (order)-[pu:PRODUCT]->(product)
ON CREATE SET pu.unitPrice = toFloat(row.UnitPrice), pu.quantity = toFloat(row.Quantity);

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/developer-resources/gh-pages/data/northwind/orders.csv" AS row
MATCH (order:Order {orderID: row.OrderID})
MATCH (employee:Employee {employeeID: row.EmployeeID})
MERGE (employee)-[:SOLD]->(order);
Creating the Relationships
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/developer-resources/gh-pages/data/northwind/products.csv" AS row
MATCH (product:Product {productID: row.ProductID})
MATCH (category:Category {categoryID: row.CategoryID})
MERGE (product)-[:PART_OF]->(category);

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/developer-resources/gh-pages/data/northwind/employees.csv" AS row
MATCH (employee:Employee {employeeID: row.EmployeeID})
MATCH (manager:Employee {employeeID: row.ReportsTo})
MERGE (employee)-[:REPORTS_TO]->(manager);
Neo4j-Import
Create your initial database
• Allows you to load large amounts of data into a Neo4j database
• Allows you to specify nodes and relationships in separate files
• Supports loading / ingesting CSV data from your file system
• Does not support data transformations
Neo4j-Import
Notes:
• Fields are comma separated by default but a different delimiter can be specified.
• All files must use the same delimiter.
• Multiple data sources can be used for both nodes and relationships.
• A data source can optionally be provided using multiple files.
• A header which provides information on the data fields must be on the first row of each data source.
• Fields without corresponding information in the header will not be read.
• UTF-8 encoding is used.
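To make the header requirement concrete, here is a small Python sketch that produces node and relationship files with a header row. The header markers (:ID, :LABEL, :START_ID, :END_ID, :TYPE) are given as recalled from the Neo4j import tool documentation and should be checked against the version you run; the column names, IDs and sample rows are invented for illustration.

```python
import csv
import io


def to_csv(header, rows):
    """Render a header plus data rows as comma-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)   # the header must be the first row of the data source
    writer.writerows(rows)
    return buffer.getvalue()


# Node files: one ID column, ordinary property columns, and a label column.
employees = to_csv(
    ["employeeId:ID", "name", ":LABEL"],
    [["e1", "Alice", "Employee"], ["e2", "Bob", "Employee"]],
)
events = to_csv(["eventId:ID", "title", ":LABEL"], [["ev1", "Kickoff", "Event"]])

# Relationship file: start node ID, end node ID, relationship type.
attended = to_csv(
    [":START_ID", ":END_ID", ":TYPE"],
    [["e1", "ev1", "ATTENDED"], ["e2", "ev1", "ATTENDED"]],
)

print(employees)
print(attended)
```

Writing the strings to disk with encoding="utf-8" would match the encoding note above; all files here use the same delimiter, as the tool requires.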
Neo4j-‐Import
Sample Script
./bin/neo4j-import --into /Users/davidfauth/testDB \
  --nodes /Users/davidfauth/neo4j-atlanta-meetup/employee.csv \
  --nodes /Users/davidfauth/neo4j-atlanta-meetup/locations.csv \
  --nodes /Users/davidfauth/neo4j-atlanta-meetup/events.csv \
  --nodes /Users/davidfauth/neo4j-atlanta-meetup/expense_report.csv \
  --relationships /Users/davidfauth/neo4j-atlanta-meetup/events_rels.csv \
  --relationships /Users/davidfauth/neo4j-atlanta-meetup/exp_rep_rels.csv \
  --relationships /Users/davidfauth/neo4j-atlanta-meetup/expense_report_rels.csv \
  --bad-tolerance 10000
Transactional REST Endpoint
Pass Cypher Statements over REST
• The Neo4j transactional HTTP endpoint allows you to execute a series of Cypher statements within the scope of a transaction.
• Can use the same transaction for multiple HTTP requests
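As a rough illustration, the request body for such a call can be assembled in Python. The JSON shape used here (a "statements" list whose entries carry "statement" and "parameters") is how I recall the Neo4j HTTP API documentation describing it, so treat it as an assumption and check it against your server version. The endpoint path also differs between versions, so no URL is hard-coded and nothing is sent; build_transaction_payload is a hypothetical helper.

```python
import json


def build_transaction_payload(statements):
    """Build the JSON body for a batch of Cypher statements.

    statements -- iterable of (cypher_text, parameters_dict) pairs
    """
    return json.dumps(
        {
            "statements": [
                {"statement": text, "parameters": params}
                for text, params in statements
            ]
        }
    )


# Two statements that would run within the scope of one transaction.
payload = build_transaction_payload(
    [
        ("MERGE (e:Employee {employeeID: $id})", {"id": "e1"}),
        ("MERGE (ev:Event {eventID: $id})", {"id": "ev1"}),
    ]
)
print(payload)
```

Passing values as parameters instead of concatenating them into the Cypher text keeps the statements reusable across rows. The $id form is the newer parameter syntax; older releases wrote {id}.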
Additional Resources
• Max DeMarzi (http://maxdemarzi.com)
  • Wikipedia into Neo4j with Graphipedia
  • Scaling concurrent writes in Neo4j
  • Online payment risk management with Neo4j
• Mark Needham (http://markneedham.com/blog)
  • Loading data – REST API vs Batch Import
  • The Batch Inserter and the sunk cost fallacy
• Rik Van Bruggen (http://blog.bruggen.com)
  • Import: summarised
  • Using `LOAD CSV` to import data from a Google Spreadsheet
  • Food networks, countries, diets, health and Load CSV
  • Some Neo4j import tweaks, what and where
  • Spreadsheet method: plenty!
• Michael Hunger (http://www.jexp.de/blog)
  • `LOAD CSV` into Neo4j quickly and successfully
  • Use `LOAD CSV` to import Git history into Neo4j
  • On Importing Data into Neo4j (blog)