2010 11-02-documents

Exploring Document Databases

zendcon 2010

Documents, Documents, Documents

Matthew Weier O'Phinney
Project Lead, Zend Framework

Writing the typical PHP app

design the schema

(http://musicbrainz.org/)

take input, and shove it in a DB

(http://musicbrainz.org/)

write queries to pull from the DB

$result = mysql_query("SELECT * FROM sometable");
$rows = false;
if (mysql_num_rows($result) > 0) {
    $rows = array();
    while ($row = mysql_fetch_assoc($result)) {
        $rows[] = $row;
    }
}

spit data onto a page

(http://www.irs.gov/)

Profit!

Things That Happen

Things That Go Wrong

SQL Injection

performance issues

Expensive queries

Potentially ORM induced resource issues
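
For the SQL injection item above, a minimal sketch of the usual countermeasure: a prepared statement with a bound parameter. It assumes the pdo_sqlite extension is available and uses an in-memory database; the table contents and the malicious input are made up for illustration.

```php
<?php
// Hypothetical table and data, using an in-memory SQLite database via PDO.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE sometable (id INTEGER PRIMARY KEY, name TEXT)');
$pdo->exec("INSERT INTO sometable (name) VALUES ('matthew')");
$pdo->exec("INSERT INTO sometable (name) VALUES ('admin')");

// Untrusted input that would change a query built by string concatenation.
$input = "matthew' OR '1'='1";

// The placeholder keeps the input as data; it is never parsed as SQL.
$stmt = $pdo->prepare('SELECT * FROM sometable WHERE name = :name');
$stmt->execute(array(':name' => $input));
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
```

Because the input is bound rather than interpolated, it is compared as a literal string and matches no rows.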

Design
Issues

Prague's Dancing House

design issues?

Many 1:1 or 1:N relationships

Non-trivial insert/update operations

Harder to hit indexes on read operations

For trivial stuff like tags, addresses, etc!

design issues?

Worse: Changing requirements

Additional columns needed?

Additional tables needed?

Occasional data needed?

design issues?

Entity-Attribute-Value Anti-Pattern

Often added after the fact, as requirements change or expand

Support arbitrary data for any record of any table

Can't do type enforcement

Leads to complex joins that often cannot hit table indexes

Leads to complex application logic to support retrieval and insertion of such metadata

Bill Karwin has written about this on his blog and in his book SQL Antipatterns.
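
To make the "complex application logic" point concrete, here is a rough sketch of the reassembly step an EAV table forces on the application. The row layout (entity_id, attribute, value) and the function name are assumptions for illustration, not taken from the talk.

```php
<?php
// Hypothetical rows as they might come back from an EAV metadata table.
$rows = array(
    array('entity_id' => 1, 'attribute' => 'reviewed',    'value' => '1'),
    array('entity_id' => 1, 'attribute' => 'reviewed_by', 'value' => 'matthew'),
    array('entity_id' => 2, 'attribute' => 'reviewed',    'value' => '0'),
);

// Fold attribute rows back into one associative array per entity.
function pivotEavRows(array $rows)
{
    $entities = array();
    foreach ($rows as $row) {
        $entities[$row['entity_id']][$row['attribute']] = $row['value'];
    }
    return $entities;
}

$metadata = pivotEavRows($rows);

// Every value arrives as a string: the table cannot enforce per-attribute
// types, so casting is left to the application.
$reviewed = (bool) $metadata[1]['reviewed'];
```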

Design First

Use Domain-Driven Design (DDD) or Behavior-Driven Development (BDD)

Eric Evans wrote the classic text on DDD and regularly teaches DDD immersion classes.

Develop your application logic first, in order to determine what needs to be persisted.

the primary rule

define your application entities

use Plain Old PHP Objects

class User
{
    public function getId() {}
    public function setId($value) {}
    public function getRealname() {}
    public function setRealname($value) {}
    public function getEmail() {}
    public function setEmail($value) {}
}

write tests

class PostTest extends PHPUnit_Framework_TestCase
{
    public function setUp()
    {
        // Fresh Post instance for each test
        $this->post = new Post();
    }

    public function testRaisesExceptionOnInvalidDate()
    {
        $this->setExpectedException('InvalidArgumentException');
        $this->post->setDate('foo bar');
    }
}

implement behaviors

class Post
{
    private $date;
    private $timezone = 'America/New_York';

    public function setDate($date)
    {
        if (false === strtotime($date)) {
            throw new InvalidArgumentException();
        }
        // DateTime expects a DateTimeZone object, not a timezone string
        $this->date = new DateTime($date, new DateTimeZone($this->timezone));
        return $this;
    }
}

NOW determine what data you need to persist.

Define a schema based on the objects you use

(http://musicbrainz.org/)

map entities to data store

public function fromArray(array $data)
{
    $filter = new OptionsFilter();
    foreach ($data as $key => $value) {
        // Call the matching setter, if the entity defines one
        $method = 'set' . $filter($key);
        if (method_exists($this, $method)) {
            $this->$method($value);
        }
    }
}

public function toArray()
{
    return array(
        '_id'       => $this->getId(),
        'timestamp' => $this->getTimestamp(),
        'title'     => $this->getTitle(),
        // ...
    );
}
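
A self-contained sketch of the same round trip, for readers who want something runnable. The Entry class is hypothetical, and ucfirst() with ltrim() stands in for the OptionsFilter used above, whose behavior is not shown in the slides.

```php
<?php
// Hypothetical entity; ucfirst()/ltrim() stand in for OptionsFilter.
class Entry
{
    private $id;
    private $title;

    public function getId() { return $this->id; }
    public function setId($value) { $this->id = $value; return $this; }
    public function getTitle() { return $this->title; }
    public function setTitle($value) { $this->title = $value; return $this; }

    public function fromArray(array $data)
    {
        foreach ($data as $key => $value) {
            // '_id' maps to setId(), 'title' to setTitle()
            $method = 'set' . ucfirst(ltrim($key, '_'));
            if (method_exists($this, $method)) {
                $this->$method($value);
            }
        }
        return $this;
    }

    public function toArray()
    {
        return array('_id' => $this->getId(), 'title' => $this->getTitle());
    }
}

$entry = new Entry();
$entry->fromArray(array('_id' => 'blog-post-stub', 'title' => 'Hello', 'unknown' => 1));
$copy = $entry->toArray();
```

Keys without a matching setter (here, 'unknown') are silently skipped, so extra fields in a stored document do not break hydration.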

approaches

Transaction Scripts

Object Relational Maps (ORM)

Transaction scripts do not need to be strictly procedural; the pattern can also apply to OOP code using such patterns as Strategy, Visitor, etc.

use mappers or transaction scripts to translate objects to data & back

$user = new User();
$user->setId('matthew')
     ->setRealname("Matthew Weier O'Phinney");
$mapper->save($user);

$user = $repository->find('matthew');
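
What a mapper does behind save() and find() can be sketched with an array standing in for the data store. The Member and InMemoryMemberMapper classes below are hypothetical; a real mapper would talk to a database connection instead.

```php
<?php
// Hypothetical entity and mapper; an array stands in for the data store.
class Member
{
    private $id;
    private $realname;

    public function getId() { return $this->id; }
    public function setId($value) { $this->id = $value; return $this; }
    public function getRealname() { return $this->realname; }
    public function setRealname($value) { $this->realname = $value; return $this; }
}

class InMemoryMemberMapper
{
    private $records = array();

    // Translate the object to plain data and store it under its ID.
    public function save(Member $member)
    {
        $this->records[$member->getId()] = array(
            '_id'      => $member->getId(),
            'realname' => $member->getRealname(),
        );
    }

    // Translate stored data back to an object, or return null if absent.
    public function find($id)
    {
        if (!isset($this->records[$id])) {
            return null;
        }
        $member = new Member();
        return $member->setId($this->records[$id]['_id'])
                      ->setRealname($this->records[$id]['realname']);
    }
}

$mapper = new InMemoryMemberMapper();
$member = new Member();
$member->setId('matthew')->setRealname("Matthew Weier O'Phinney");
$mapper->save($member);
$found = $mapper->find('matthew');
```

The entity knows nothing about storage; only the mapper changes if the backing store does.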

additions

Service Layers

Interact with domain entities

Good place for caching, ACLs, etc.

use service objects to manipulate entities

namespace Blog\Service;

class Entries
{
    public function fetchEntry($permalink) {}
    public function fetchCommentCount($permalink) {}
    public function fetchComments($permalink) {}
    public function fetchTrackbacks($permalink) {}
    public function addComment($permalink, array $comment) {}
    public function addTrackback($permalink, array $comment) {}
    public function fetchTagCloud() {}
}
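
As an illustration of why the service layer is a good place for caching, here is a small sketch. CachingEntryService is a hypothetical class, and the closure passed to it stands in for whatever mapper or repository actually loads the entry.

```php
<?php
// Hypothetical service; the callable stands in for a mapper or repository.
class CachingEntryService
{
    private $loader;
    private $cache = array();
    public $loads = 0; // counts trips to the data store, for illustration

    public function __construct($loader)
    {
        $this->loader = $loader;
    }

    public function fetchEntry($permalink)
    {
        // Only hit the data store the first time a permalink is requested.
        if (!array_key_exists($permalink, $this->cache)) {
            $this->loads++;
            $this->cache[$permalink] = call_user_func($this->loader, $permalink);
        }
        return $this->cache[$permalink];
    }
}

$service = new CachingEntryService(function ($permalink) {
    return array('_id' => $permalink, 'title' => 'Stub');
});
$first  = $service->fetchEntry('blog-post-stub');
$second = $service->fetchEntry('blog-post-stub');
```

Callers see one fetchEntry() method and never learn whether the result came from the cache or the store.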

Data
Persistence

you have a choice

Before, relational databases were the only choice

you have a choice

Today, relational databases are only one choice

have your domain dictate storage

Do you have many arbitrary, row-specific fields in the design?

Do you need many pivot tables to describe a single entity?

Is transactional integrity part of your requirements?

Do changes need to be immediately available?

defining by what it isn't?

still defining by what it isn't

types: key/value stores

Each record is a key/value pair (though the value may be non-scalar)


Interesting, but not what we're going to look at today.

types: document databases

Each document can define its own structure

Typically a document consists of many key/value pairs




This is what we'll look at!

{
    _id: "weierophinney",
    realname: "Matthew Weier O'Phinney",
    email: "[email protected]",
    roles: [ "admin", "user" ]
}

document dbs are plentiful

Also mention Azure Tables

document dbs solve web problems

Data can expand and add properties over time without requiring schema changes!

Different content types can co-exist in the same general storage

document dbs solve web problems

Aggregate related content in the document that owns it

Tags

Comments

Addresses

Eventual consistency

Updates often don't need to propagate in real-time

types of problems documents solve

Blog and News Posts

Product Entries

Content Management documents

what don't they solve?

identifiers are king

Most are optimized for fetching via identifier

Provide your own IDs

Fall back on system-generated IDs (usually UUIDs)
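
The "your own ID, else a system ID" rule might look like this on the application side. This is only a sketch: most document stores generate the fallback ID themselves, and both function names here are hypothetical. It assumes PHP 7 or later for random_bytes().

```php
<?php
// Hypothetical helper: build a random (version 4) UUID. Requires PHP 7+.
function generateUuid()
{
    $bytes = random_bytes(16);
    $bytes[6] = chr((ord($bytes[6]) & 0x0f) | 0x40); // set version 4
    $bytes[8] = chr((ord($bytes[8]) & 0x3f) | 0x80); // set RFC 4122 variant
    return vsprintf('%s%s-%s-%s-%s-%s%s%s', str_split(bin2hex($bytes), 4));
}

// Keep an application-supplied ID; otherwise fall back to a generated one.
function ensureId(array $document)
{
    if (!isset($document['_id']) || $document['_id'] === '') {
        $document['_id'] = generateUuid();
    }
    return $document;
}

$named   = ensureId(array('_id' => 'weierophinney'));
$unnamed = ensureId(array('realname' => 'Anonymous'));
```

Meaningful IDs such as a username or permalink pay off because fetch-by-identifier is the operation these stores optimize for.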

mapping documents to objects

Many utilize JSON

If they don't, abstractions let you sling PHP arrays

$result = $cxn->fetch($id);
$user = new User();
$user->fromArray($result);

$cxn->save($user->toArray());

aggregate metadata

Instead of EAV tables, store metadata in the document

{
    "_id" : "blog-post-stub",
    "published" : true,
    "reviewed" : true,
    "reviewed_by" : "matthew"
}

no pivot tables required!

Instead of pivot tables, aggregate data inside values

{
    "_id" : "blog-post-stub",
    "tags" : [
        "zend framework",
        "presentations"
    ]
}
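
With tags embedded in the document, "which posts have this tag?" becomes a check against each document's own list rather than a join. The sketch below does that check in plain PHP over made-up documents; a real document store would run the equivalent query itself, usually against an index.

```php
<?php
// Hypothetical documents; a real store would answer this query itself.
$posts = array(
    array('_id' => 'blog-post-stub', 'tags' => array('zend framework', 'presentations')),
    array('_id' => 'another-post',   'tags' => array('php')),
    array('_id' => 'untagged-post'),
);

// Return the IDs of all documents whose embedded tag list contains $tag.
function findIdsByTag(array $documents, $tag)
{
    $ids = array();
    foreach ($documents as $document) {
        // Documents without a "tags" key are treated as untagged.
        $tags = isset($document['tags']) ? $document['tags'] : array();
        if (in_array($tag, $tags, true)) {
            $ids[] = $document['_id'];
        }
    }
    return $ids;
}

$matches = findIdsByTag($posts, 'presentations');
```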

It's not all walks in the park

increased disk usage

Each document contains its schema

Silver lining: most solutions can cluster and/or provide sharding.

schema differences

How do you keep schemas in sync between documents when requirements change?

If you have multiple schemas for the same document type, what do you query on?

firstName or FIRST_NAME?

Did you remember to create new indexes?

managing schema changes

Handle the differences in your application code

switch ($user->schema_version) {
    case '2010-01-31':
        // ...
        break;
    case '2010-11-02':
        // ...
        break;
}

Meh.

managing schema changes

Do a batch conversion

Copy all records to a new database or collection

Migrate all records to the new schema

Point your application to the new database/collection
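
The copy-and-migrate steps can be sketched with arrays standing in for the old and new collections. The field names reuse the firstName / FIRST_NAME example from the earlier slide; migrateDocument() is a hypothetical helper.

```php
<?php
// Hypothetical collections as arrays keyed by document ID.
$oldCollection = array(
    'matthew' => array('_id' => 'matthew', 'FIRST_NAME' => 'Matthew'),
    'bill'    => array('_id' => 'bill', 'firstName' => 'Bill'),
);

// Convert one document to the new schema (FIRST_NAME becomes firstName).
function migrateDocument(array $document, $version)
{
    if (isset($document['FIRST_NAME'])) {
        $document['firstName'] = $document['FIRST_NAME'];
        unset($document['FIRST_NAME']);
    }
    $document['schema_version'] = $version;
    return $document;
}

// Copy every record into a new collection, migrating as we go.
$newCollection = array();
foreach ($oldCollection as $id => $document) {
    $newCollection[$id] = migrateDocument($document, '2010-11-02');
}
```

The old collection is left untouched, so the application can be pointed back at it if the migration turns out to be wrong.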

Meh.

managing schema changes

Version the document schema


Update when fetched

{
    "_id" : "blog-post-stub",
    "schema_version" : "2010-11-02"
}

if ($post->schema_version != $latest) {
    $post->metadata = $post->METADATA;
    $post->schema_version = $latest;
    unset($post->METADATA);
    $mapper->save($post);
}

Use AOP-like practices such as SignalSlot, Subject/Observer, etc. to help automate this.
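
One way to automate the upgrade-on-fetch step, sketched as a simple Subject/Observer variant: the mapper runs registered listeners on every fetched document. ListeningMapper and its onFetch() method are hypothetical names, and an array stands in for the data store.

```php
<?php
// Hypothetical mapper with post-fetch listeners (a Subject/Observer variant).
class ListeningMapper
{
    private $documents;
    private $listeners = array();

    public function __construct(array $documents)
    {
        $this->documents = $documents;
    }

    public function onFetch($listener)
    {
        $this->listeners[] = $listener;
    }

    public function save(array $document)
    {
        $this->documents[$document['_id']] = $document;
    }

    public function fetch($id)
    {
        $document = $this->documents[$id];
        // Each listener may return a modified copy of the document.
        foreach ($this->listeners as $listener) {
            $document = call_user_func($listener, $document, $this);
        }
        return $document;
    }
}

$latest = '2010-11-02';
$mapper = new ListeningMapper(array(
    'blog-post-stub' => array(
        '_id'            => 'blog-post-stub',
        'schema_version' => '2010-01-31',
        'METADATA'       => array('reviewed' => true),
    ),
));

// Upgrade stale documents as they are read, then persist the new form.
$mapper->onFetch(function (array $document, ListeningMapper $mapper) use ($latest) {
    if ($document['schema_version'] !== $latest) {
        $document['metadata'] = $document['METADATA'];
        unset($document['METADATA']);
        $document['schema_version'] = $latest;
        $mapper->save($document);
    }
    return $document;
});

$post = $mapper->fetch('blog-post-stub');
```

Calling code only ever sees documents in the latest schema, and each document is rewritten at most once per schema change.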

benefits you may enjoy

Easier mapping of document concepts to data persistence

Easier scaling

Most support clustering and sharding natively

Easier migration to cloud-based storage

Closing Notes

Don't start your development from the wrong end. Start with objects.

Be aware of all the options you have for persisting data; choose appropriately.

Consider document data stores when your objects represent content; store metadata in the document.

Thank you

Feedback?
http://joind.in/2233
http://twitter.com/weierophinney
http://framework.zend.com/