Scaling opensimulator inventory using nosql

Scaling OpenSimulator Inventory using NoSQL

DAVID DAESCHLER

INWORLDZ, LLC

OpenSimulator Community Conference 2014

Who are you? Why should I watch? You’re not the boss of me!

Hello! I’m David Daeschler, also known as Tranquillity Dexler. I am a partner and software architect at InWorldz, LLC.

I designed and deployed an LSL compiler, virtual machine, and script runtime named Phlox to eliminate CPU and memory issues caused by user scripts.

We developed PhysX physics integration for stable rigid body dynamics and vehicle functionality.

I’ve designed scale-out asset services that now run across 11 servers (10 TB of data), and an inventory system built on Apache Cassandra that now runs on 8 nodes holding about 250 GB of data.

We routinely handle over 300 concurrent users on the grid and we’ve peaked out just shy of 500 concurrent users without experiencing backend faults or load issues.

We’ve experienced and conquered more than a few scaling problems while running our OpenSim-derived grid over the past 5 years. It’s been a school of hard knocks, and we’d like to share some of our experiences and solutions.

Oh noes! Inventory woes!

You’re running an OpenSimulator grid, and everything is going just great!

... Until concurrency and inventory size pick up, then out of the blue:

Your users are having trouble logging in

You’re starting to see timeouts on MySQL for inventory operations

Your inventory tables are getting really huge. People insist on keeping 100,000+ items containing at least 400 copies of “primitive”. You hit 100 GB of data and realize scaling up won’t be a viable option much longer.

“My Inventory stopped downloading at 75,000 items, but I have 999,999!”

“My pony avatar is missing!”

Federation only helps so much

Even if your grid is part of the Hypergrid, it may still become wildly popular. In that case, manual federation becomes a nonstarter. Trying to predict growth or loss and setting up multiple grids (“your grid-1”, “your grid-2”) isn’t a solution that lets you react quickly enough to keep up with changes in demand.

Manual self-federation as a way to scale out would also require advanced software tools to set up entirely new backend and frontend services. Shard keys would have to be chosen, and users would have to be manually distributed across the new set of servers every time your grid needed to scale again. Essentially, each shard instance that was spun up would require redoing everything you had to do for your grid initially.

MySQL read slaves/scale out

MySQL supports read-only slaves out of the box that can help you when your workload is dominated by reads.

Unfortunately, this would only allow you to scale out until writes became the bottleneck for your master MySQL server, at which point the master and all slaves would have to be scaled UP with better hardware.

We quickly started to see I/O wait numbers climb because of the number of writes that were happening on MySQL. People love to buy stuff and give stuff to each other. Once we got past a certain point, tuning was no longer an option. It was either get a better master server, or replace the MySQL backend with something else.

Apache Cassandra TO THE RESCUE!

Apache Cassandra

Cassandra is in use at Constant Contact, CERN, Comcast, eBay, GitHub, GoDaddy, Hulu, Instagram, Intuit, Netflix, Reddit, The Weather Channel, and over 1500 more companies that have large, active data sets.

It is a distributed, scale-out, fault tolerant database with tunable consistency.

Benefits

Your data is replicated onto multiple servers that can even span different datacenters

You can lose one or more servers in a cluster and still stay up and running with zero downtime and zero data loss. This goes well beyond simple RAID.

Seeing the load on your backend increase beyond your comfort level? Simply add new servers to the cluster with ZERO downtime.

http://www.slideshare.net/daveconnors/cassandra-puppet-scaling-data-at-15-per-month

http://planetcassandra.org/blog/post/cassandra-at-cern-large-hadron-collider/

http://www.slideshare.net/planetcassandra/nyc-tech-day-using-cassandra-for-dvr-scheduling-at-comcast

http://www.slideshare.net/jaykumarpatel/cassandra-at-ebay-13920376

http://planetcassandra.org/blog/post/analytics-at-github-with-apache-cassandra/

http://planetcassandra.org/blog/post/godaddy-worlds-largest-domain-name-registrar-and-web-host-provider-utilizes-cassandra-for-replication-and-scalability/

http://planetcassandra.org/blog/post/cassandra-used-to-build-scalable-and-highly-available-systems-at-hulu-streaming-content-to-over-5-million-subscribers/

http://planetcassandra.org/blog/post/instagram-making-the-switch-to-cassandra-from-redis-75-instasavings/

http://www.slideshare.net/planetcassandra/3-mohit-anchlia

http://www.slideshare.net/adrianco/migrating-netflix-from-oracle-to-global-cassandra

http://planetcassandra.org/blog/post/reddit-upvotes-apache-cassandras-horizontal-scaling-managing-17000000-votes-daily/

http://planetcassandra.org/blog/post/make-it-rain-apache-cassandra-at-the-weather-channel-for-severe-weather-alerts/

http://planetcassandra.org/companies/

But WAIT! Cassandra is eventually consistent! What about ACID?!

I hate to break it to you, but a traditional RDBMS scale-out solution with a single master and one or more slaves is also eventually consistent.

“LIES!” you say. No, seriously. There’s a metric you can query in a MySQL setup called slave lag. This number tells you exactly how far behind its master a slave is. The slave will never be exactly up to date with the master as long as the master is taking constant writes, and reading from the slave may return results from the past. Application designers need to keep this in mind as much as they need to understand Cassandra’s eventual consistency.

It turns out that Cassandra has tunable consistency and can offer better guarantees of a consistent read than traditional scale out on an RDBMS. This is because we can tell Cassandra to write to a set of nodes and not return until a quorum of them have responded that they have written the new value. When we then read at quorum consistency, at least one of the nodes we read from took part in that write, so we are guaranteed to see the most up-to-date value!

It is based on Dynamo

Dynamo was introduced by Amazon in a 2007 paper as a solution to provide a highly available distributed data store. Amazon works at a massive scale and even a few minutes of downtime means they lose a ton of money.

The Dynamo paper has a few important implementation details that Cassandra borrows from:

Data is automatically sharded based on the consistent hash of the primary key and replicated to N hosts in a hash ring.

Hinted handoff helps bring the dataset into convergence during temporary failures.

Storage nodes can be added and removed without interruption of service.

Consistent hashing

Your data is automatically divided up between storage nodes based on the value of the consistent hash of a row’s primary key.

Each of nodes A, B, C, and D would own 25% of your primary keys and 25% of your data at Replication Factor (RF) = 1.
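To make the ring idea concrete, here is a minimal sketch of a four-node hash ring in C#. It is an illustration under stated assumptions, not Cassandra's actual partitioner: the names HashRing, Token, and ReplicasFor are hypothetical, MD5 is used only because it ships with .NET, and the node tokens are spaced evenly by hand so that each node owns roughly a quarter of the ring.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Illustrative hash ring; HashRing, Token and ReplicasFor are hypothetical names.
public sealed class HashRing
{
    // Token (ring position) -> node name, kept sorted by token.
    private readonly SortedDictionary<ulong, string> _ring =
        new SortedDictionary<ulong, string>();

    public HashRing(IList<string> nodes)
    {
        // Space the node tokens evenly so each node owns an equal slice of the ring.
        ulong step = ulong.MaxValue / (ulong)nodes.Count;
        for (int i = 0; i < nodes.Count; i++)
            _ring[step * (ulong)i] = nodes[i];
    }

    // Map a partition key to a ring position: the first 8 bytes of its MD5 digest.
    public static ulong Token(string partitionKey)
    {
        using (var md5 = MD5.Create())
        {
            byte[] digest = md5.ComputeHash(Encoding.UTF8.GetBytes(partitionKey));
            return BitConverter.ToUInt64(digest, 0);
        }
    }

    // The first replica is the first node at or after the key's token (wrapping
    // around the ring); further replicas are the next nodes walking clockwise.
    public List<string> ReplicasFor(string partitionKey, int replicationFactor)
    {
        ulong token = Token(partitionKey);
        return _ring.Where(kv => kv.Key >= token)
                    .Concat(_ring.Where(kv => kv.Key < token))
                    .Select(kv => kv.Value)
                    .Take(replicationFactor)
                    .ToList();
    }
}
```

With `new HashRing(new[] { "A", "B", "C", "D" })`, calling `ReplicasFor(key, 1)` returns the single owner of a key at RF = 1, and `ReplicasFor(key, 3)` returns that owner plus the next two nodes on the ring. The same key always maps to the same nodes, which is what lets any node route a request without a central lookup table.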

Using quorum reads and writes to achieve Consistency and Partition Tolerance

When you write and read to/from a quorum of nodes you will get a consistent view of the data, and you will be able to tolerate a node or network outage. An example quorum is 2 out of 3 nodes that form a majority.

WRITE “HELLO!” TO A AND C

Node A dies. We still read “HELLO!” from C, and we stay running!
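The arithmetic behind the quorum example can be written down in a few lines. This is a small sketch rather than anything from the Cassandra driver: QuorumMath and its method names are made up for illustration. The rule it encodes is that a read is guaranteed to overlap the latest write whenever the number of nodes written plus the number of nodes read exceeds the replication factor.

```csharp
using System;

// Hypothetical helper illustrating quorum arithmetic; not part of any driver API.
public static class QuorumMath
{
    // A quorum is a strict majority of the replicas.
    public static int Quorum(int replicationFactor)
    {
        return replicationFactor / 2 + 1;
    }

    // If the write set and read set together exceed RF, they must share
    // at least one node, so the read sees the latest acknowledged write.
    public static bool ReadSeesLatestWrite(int replicationFactor, int writeNodes, int readNodes)
    {
        return writeNodes + readNodes > replicationFactor;
    }

    // How many replicas can be down while quorum operations still succeed.
    public static int ToleratedFailures(int replicationFactor)
    {
        return replicationFactor - Quorum(replicationFactor);
    }
}
```

For the three-node example above, `Quorum(3)` is 2 and `ToleratedFailures(3)` is 1: the write lands on A and C, node A dies, and a quorum read of B and C still includes C, which holds “HELLO!”. Writing and reading at a single node (`ReadSeesLatestWrite(3, 1, 1)`) gives no such guarantee.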

Simple Cassandra setup with Docker

If you haven’t heard of Docker yet, you need to check it out: https://www.docker.com

Docker lets you package an app and all of its dependencies in a portable container.

We’ll build our demo on the prebuilt Cassandra container from https://github.com/tobert/cassandra-docker. (By the way, Al Tobey is an awesome guy and you should follow him on Twitter @AlTobey)

Once Docker is downloaded and set up, starting a single node Cassandra cluster is super easy:

:2.0.10

Alternatively, if you’re on Windows, grab the latest release from http://cassandra.apache.org/ and run cassandra.bat


CQL: Like SQL but different

Originally, when Cassandra made its debut, the only way to get at the data was to use Thrift calls that pulled and updated columns, very much like working with a hash set.

Cassandra then developed CQL (Cassandra Query Language) which is a familiar cousin of SQL with the following notable exceptions:

No joins. No GROUP BY. Data in Cassandra is expected to be mostly denormalized. Cassandra writes are extremely fast, faster than reads, which mitigates the extra write penalty.

Cassandra supports compound keys, and stores rows grouped together by their partition key (important! more on this later)

You cannot use a WHERE clause to filter on columns that aren’t part of the row key or a secondary index. Partition keys must be queried using the = or IN operators.

These rules and features keep you from shooting yourself in the foot.

CQL Inventory schema design

Things to keep in mind:

SL based viewers don’t request subfolders individually in inventory fetch. The protocol CAN do this, but instead, all folders and subfolders are retrieved as part of the skeleton during login.

All items inside an individual folder are requested at once. We want to optimize reads based on this fact and not turn every item into an individual random IO. We can use a compound key to achieve this.

Items are rezzed into the world based on their UUID. Therefore we need to map item IDs back to their parent folder ID. We’ll do this explicitly and avoid secondary indexes, which, judging by mailing list traffic at the time of writing, seem to have issues with becoming stale.

All folders have version numbers that get incremented when items or subfolders are changed, created, moved, or deleted. We’ll use a special CQL column type called a counter for this.

Our CQL Schema

Compound Primary Keys

A bit more detail about the design

You’ll notice that the design of the schema is geared around how the data will be queried. This is important because it runs contrary to how we’re used to setting up schemas in the relational world, where the entities normally closely follow our class model.

PRIMARY KEY (Partition Key, Clustering Column, Clustering Column, Clus...)

The reason we’re using compound primary keys is due to the way Cassandra stores data. When you use a compound primary key, all the data matching the first component of the compound key, known as the partition key, is grouped together. This means that when we query using this key alone, or this key with a range of clustering columns, Cassandra is able to retrieve the data without seeking out each individual row for the clustering columns.

This allows us to efficiently read the data from all items inside a folder without performing additional seeking for each item.
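The schema slide itself did not survive in this transcript, so the following is only a sketch of what tables with these access patterns could look like, written as CQL strings in C# the way the samples later in the talk are. The table names, the folder_contents column list, and the folder_versions key and counter come from the talk. Everything else is an assumption made for illustration: the column types, the skeletons columns, and the item_parents columns. The real schema is in the GitHub repository linked at the end.

```csharp
// Hypothetical schema sketch; column types and the skeletons/item_parents
// column names are assumptions, not the schema shipped by InWorldz.
public static class InventorySchemaSketch
{
    // One partition per user: the whole folder skeleton is read at login.
    public const string Skeletons = @"
        CREATE TABLE skeletons (
            user_id   uuid,
            folder_id uuid,
            name      text,
            parent_id uuid,
            PRIMARY KEY (user_id, folder_id)
        );";

    // One partition per folder: all items in a folder are stored together.
    public const string FolderContents = @"
        CREATE TABLE folder_contents (
            folder_id     uuid,
            item_id       uuid,
            name          text,
            inv_type      int,
            creation_date timestamp,
            owner_id      uuid,
            PRIMARY KEY (folder_id, item_id)
        );";

    // Counter tables may only hold counters outside the primary key.
    public const string FolderVersions = @"
        CREATE TABLE folder_versions (
            user_id   uuid,
            folder_id uuid,
            version   counter,
            PRIMARY KEY (user_id, folder_id)
        );";

    // Explicit item -> parent folder map, used instead of a secondary index.
    public const string ItemParents = @"
        CREATE TABLE item_parents (
            item_id          uuid PRIMARY KEY,
            parent_folder_id uuid
        );";

    public static readonly string[] All =
        { Skeletons, FolderContents, FolderVersions, ItemParents };
}
```

In each compound key the first column is the partition key, so a query such as `SELECT * FROM folder_contents WHERE folder_id = ?` reads one contiguous partition instead of seeking once per item.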

To the code! .. But first

A few things to remember:

Since we’re maintaining a denormalized dataset, we need to make sure updates to item/folder parentage and versioning are reflected in all related tables. We can issue these updates as batches. As of Cassandra 1.2, batches are atomic by default, which means there is less of a chance of inconsistencies slipping in.

Remember

Moving a folder requires you to alter the skeletons table, and update the folder_versions table.

Renaming a folder requires you to alter skeletons, folder_contents, and folder_versions tables.

Moving or renaming an item requires you to alter folder_contents, folder_versions, and item_parents.

Deleting folders and items requires hits to all associated tables.

OK! NOW to the CODE and questions!

Some CQL samples

    FOLDER_ATTRIB_INSERT_STMT = _session.Prepare(
        "INSERT INTO folder_contents " +
        "(folder_id, item_id, name, inv_type, creation_date, owner_id) " +
        "VALUES (?, ?, ?, ?, ?, ?);");

    FOLDER_ATTRIB_INSERT_STMT.SetConsistencyLevel(ConsistencyLevel.Quorum);

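Example insert with batch

Remember, to insert a folder, we need to insert into the skeletons table, the folder_versions table, and the folder_contents table.

    public void CreateFolder(InventoryFolder folder)
    {
        // skelInsert and contentInsert are the bound INSERT statements for the
        // skeletons and folder_contents tables (their binding is not shown here).
        var batch = new BatchStatement()
            .Add(skelInsert)
            .Add(contentInsert);

        _session.Execute(batch);

        // The folder_versions counter is incremented outside the batch.
        VersionInc(folder.OwnerId, folder.FolderId);
    }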

What up with VersionInc()?

We can’t include a counter table as part of a batch with other non-counter tables. So unfortunately we need to increment the counter separately.

    FOLDER_VERSION_INC_STMT = _session.Prepare(
        "UPDATE folder_versions SET version = version + 1 WHERE " +
        "user_id = ? AND folder_id = ?;");

    FOLDER_VERSION_INC_STMT.SetConsistencyLevel(ConsistencyLevel.Quorum);

    private void VersionInc(Guid ownerId, Guid folderId)
    {
        var versionInc = FOLDER_VERSION_INC_STMT.Bind(ownerId, folderId);
        _session.Execute(versionInc);
    }
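Combining the table list from the “Remember” slide with the counter restriction gives a simple rule for every mutation: batch the non-counter writes, then bump folder_versions on its own. The sketch below only models that split and does not talk to Cassandra. InventoryOp, WritePlan, and WritePlanner are hypothetical names introduced for illustration, and the table lists are copied from the slide.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model of which tables each inventory operation touches.
public enum InventoryOp { MoveFolder, RenameFolder, MoveOrRenameItem }

public sealed class WritePlan
{
    // Tables whose writes can share one atomic batch.
    public List<string> BatchedTables = new List<string>();
    // Counter tables, which must be updated outside that batch.
    public List<string> CounterTables = new List<string>();
}

public static class WritePlanner
{
    private const string CounterTable = "folder_versions";

    // Table lists as given on the "Remember" slide.
    private static readonly Dictionary<InventoryOp, string[]> TablesTouched =
        new Dictionary<InventoryOp, string[]>
        {
            { InventoryOp.MoveFolder,
              new[] { "skeletons", "folder_versions" } },
            { InventoryOp.RenameFolder,
              new[] { "skeletons", "folder_contents", "folder_versions" } },
            { InventoryOp.MoveOrRenameItem,
              new[] { "folder_contents", "folder_versions", "item_parents" } },
        };

    // Split an operation's tables into the batched set and the counter set.
    public static WritePlan For(InventoryOp op)
    {
        var plan = new WritePlan();
        foreach (string table in TablesTouched[op])
        {
            if (table == CounterTable)
                plan.CounterTables.Add(table);
            else
                plan.BatchedTables.Add(table);
        }
        return plan;
    }
}
```

For example, `WritePlanner.For(InventoryOp.MoveOrRenameItem)` puts folder_contents and item_parents in the batch and leaves folder_versions as a separate counter update, mirroring what CreateFolder does with its batch followed by VersionInc.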

Thank you

The full source code with unit test coverage is available on GitHub at:

https://github.com/InWorldz/opensim-cql-inventory

Thank you for stopping by!

David Daeschler (Tranquillity Dexler)
Co-Founder
InWorldz, LLC
