Building Large Scale Services - LISA 2013

Building Large Scale Services PRESENTED BY Jennifer Davis November 8, 2013

description

Yahoo! Service Engineers (SE) specialize in bridging the gap between system administration and development. SEs are tasked with delivering a reliable, consistent, high-quality service through the use of best practices. They must understand network, OS, hardware, and customer use cases, and dive deep into the application internals. In this talk, Jennifer will describe her journey with the Sherpa service at Yahoo! and lessons learned about building a reliable, consistent, and high-quality service from scratch. The goal of this talk is to educate practitioners on successful strategies and pitfalls when building out a service.

Transcript of Building Large Scale Services - LISA 2013

Page 1: Building Large Scale Services - LISA 2013

Building Large Scale Services

PRESENTED BY Jennifer Davis | November 8, 2013

Page 2: Building Large Scale Services - LISA 2013

Twitter: @sigje Email: [email protected]

Page 3: Building Large Scale Services - LISA 2013

SysAdmin Controls all the things

11/11/13  3  

Page 4: Building Large Scale Services - LISA 2013

Shared Dependencies

11/11/13  4  

Page 5: Building Large Scale Services - LISA 2013

The Reality…  

11/11/13  5  

Page 6: Building Large Scale Services - LISA 2013

The Dream…

11/11/13  6  

Page 7: Building Large Scale Services - LISA 2013

How?

Page 8: Building Large Scale Services - LISA 2013

Define Core Principles

§ Common
  › Collaboration across teams, companies, industry, define standards
  › Incident, Problem, Change, Config, Release management
§ Distinct
  › Specifics to an application or service
  › Availability, Service, Business Continuity, Capacity

Page 9: Building Large Scale Services - LISA 2013

Kill the Myths

§ Stupid User

Page 10: Building Large Scale Services - LISA 2013

Kill the Myths  

§ Stupid User
§ System Admin == Operator

Page 11: Building Large Scale Services - LISA 2013

[Word cloud: SKILLS. Failing Gracefully, puppet, ruby, perl, nosql, operability, security, mysql, unix, TCP/IP, bash, CHEF]

Page 12: Building Large Scale Services - LISA 2013


Page 13: Building Large Scale Services - LISA 2013

Kill the Myths  

§ Stupid User
§ System Admin == Operator
§ Words have a common universal implicit meaning

Page 14: Building Large Scale Services - LISA 2013


Page 15: Building Large Scale Services - LISA 2013

Learn to Modulate your Message  


Page 16: Building Large Scale Services - LISA 2013


Team

Manager Customer

Page 17: Building Large Scale Services - LISA 2013

Team

§ People working towards common goal.
§ Different roles.
§ Different views.
§ Same objectives.

Page 18: Building Large Scale Services - LISA 2013

Image Credit: Kyle Latino

Page 19: Building Large Scale Services - LISA 2013

Team

Suggestion: Don’t talk about the “devs” request, talk about Elaine’s request.

Page 20: Building Large Scale Services - LISA 2013

Team

Suggestion: Don’t talk about the “devs” request, talk about Elaine’s request.
Suggestion: Verify that your team has the same vision.

Page 21: Building Large Scale Services - LISA 2013

Understand the vision.

§ Are there other options, open source or not, within the company?
§ Are there other options outside the company?
§ Is EVERYONE on the same page about what the service is?

Page 22: Building Large Scale Services - LISA 2013

Vision Statement

§ Clear statement about the problem that the service is solving.
  › Direction
  › Identity management
  › Team cohesion

New product? Be part of creating that vision!

Page 23: Building Large Scale Services - LISA 2013

Sherpa’s Vision

.. Distributed, replicated, eventually consistent key-value store that had a focus on scalability ..

Page 24: Building Large Scale Services - LISA 2013

My Job

§ Examine software
§ Define risk
§ Communicate cost of risks
§ Mitigate risks
§ Identify events
§ Manage events

Page 25: Building Large Scale Services - LISA 2013

Fragile Platforms are Bad.


Page 26: Building Large Scale Services - LISA 2013

Change is inevitable

§ Products pivot based on needs.
§ Requirements change and evolve.
§ Know core issues.

Page 27: Building Large Scale Services - LISA 2013

Know Core Issues

§ Limit the scope of focus.

Page 28: Building Large Scale Services - LISA 2013

Know Core Issues

§ Limit the scope of focus.
§ Focus on the biggest priorities.

Page 29: Building Large Scale Services - LISA 2013

Know Core Issues

§ Limit the scope of focus.
§ Focus on the biggest priorities.
  › Understand Development Methodology: Waterfall, Scrum, ?

Page 30: Building Large Scale Services - LISA 2013

Know Core Issues

§ Limit the scope of focus.
§ Focus on the biggest priorities.
  › Understand Development Methodology: Waterfall, Scrum, ?
  › Identify the key “time” elements.

Page 31: Building Large Scale Services - LISA 2013

Know Core Issues

§ Limit the scope of focus.
§ Focus on the biggest priorities.
  › Understand Development Methodology: Waterfall, Scrum, ?
  › Identify the key “time” elements.
  › Talk to them. Identify their key terms. “Enhancements”, “Defects”

Page 32: Building Large Scale Services - LISA 2013

Know Core Issues

§ Limit the scope of focus.
§ Focus on the biggest priorities.
  › Understand Development Methodology: Waterfall, Scrum, ?
  › Identify the key “time” elements.
  › Talk to them. Identify their key terms. “Enhancements”, “Defects”
  › Establish the “Top” list.

Page 33: Building Large Scale Services - LISA 2013

Create checklists

§ Not because people are dumb.
§ Not only because of automation.
§ When things break, knowing what needs focus.
§ During normal maintenance, can identify “not OK”.
  › Audit checklists for deployment through staging environment.
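
One way to make a checklist auditable is to keep it as data and run every item, recording which ones are "not OK". The sketch below is only an illustration: the `Check` type, the `audit` and `not_ok` helpers, and the three staging items are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Check:
    name: str
    probe: Callable[[], bool]  # returns True when the item is "OK"


def audit(checks: List[Check]) -> List[Tuple[str, bool]]:
    """Run every check, even after a failure, and record pass/fail for each."""
    results = []
    for check in checks:
        try:
            ok = bool(check.probe())
        except Exception:
            ok = False  # a probe that raises is treated as "not OK"
        results.append((check.name, ok))
    return results


def not_ok(results: List[Tuple[str, bool]]) -> List[str]:
    """Names of the items that need focus."""
    return [name for name, ok in results if not ok]


# Hypothetical staging checklist; real probes would query the service.
staging_checklist = [
    Check("package installed", lambda: True),
    Check("config rendered", lambda: True),
    Check("health endpoint answers", lambda: False),
]

print(not_ok(audit(staging_checklist)))
```

Running all items instead of stopping at the first failure gives a complete picture of what needs focus when something breaks.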

Page 34: Building Large Scale Services - LISA 2013

Know Outputs

§ Identify components.
§ Well defined protocols between components.
§ Expected Inputs.
§ Expected Outputs.
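
Expected inputs and outputs between components can be written down as a small contract and checked mechanically. This is a minimal sketch under stated assumptions: the `Contract` type, the `violations` helper, and the field names in `GET_CONTRACT` are invented for illustration and are not Sherpa's actual protocol.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Contract:
    required_inputs: Dict[str, type]
    required_outputs: Dict[str, type]


def violations(message: Dict[str, Any], expected: Dict[str, type]) -> List[str]:
    """List every way a message departs from the expected fields and types."""
    problems = []
    for field, kind in expected.items():
        if field not in message:
            problems.append(f"missing field: {field}")
        elif not isinstance(message[field], kind):
            problems.append(f"wrong type for {field}: expected {kind.__name__}")
    return problems


# Hypothetical contract for a "get" request between two components.
GET_CONTRACT = Contract(
    required_inputs={"table": str, "key": str},
    required_outputs={"status": int, "value": str},
)

print(violations({"table": "users", "key": "k1"}, GET_CONTRACT.required_inputs))
print(violations({"status": "200"}, GET_CONTRACT.required_outputs))
```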

Page 35: Building Large Scale Services - LISA 2013

Page 36: Building Large Scale Services - LISA 2013

Page 37: Building Large Scale Services - LISA 2013

Page 38: Building Large Scale Services - LISA 2013

Page 39: Building Large Scale Services - LISA 2013

Page 40: Building Large Scale Services - LISA 2013

Know State Transitions Explicitly.

§ When component is installed but not ready

Page 41: Building Large Scale Services - LISA 2013

Know State Transitions Explicitly.

§ When component is installed but not ready
§ When the colo is going away
§ Go through What If Scenarios.
  › Document them.
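
Making state transitions explicit can be as simple as a table of allowed moves that rejects everything else. The states and transitions below are assumptions chosen to mirror the two examples on the slide (installed but not ready, colo going away); they are not taken from the actual service.

```python
from enum import Enum


class State(Enum):
    INSTALLED = "installed"  # software present, not yet serving traffic
    READY = "ready"
    DRAINING = "draining"    # e.g. the colo is going away
    RETIRED = "retired"


# Every legal move is listed; anything absent is illegal by definition.
ALLOWED = {
    State.INSTALLED: {State.READY, State.RETIRED},
    State.READY: {State.DRAINING},
    State.DRAINING: {State.READY, State.RETIRED},
    State.RETIRED: set(),
}


def transition(current: State, target: State) -> State:
    """Return the new state, or raise if the move was never documented."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


state = transition(State.INSTALLED, State.READY)
print(state.value)
```

Each "what if" scenario that gets documented becomes either a new entry in the table or an explicit decision that the move is not allowed.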

Page 42: Building Large Scale Services - LISA 2013

Know choke points explicitly.

§ Memory
§ Disk
§ Bandwidth

Now and in 6 months. JIT?
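
The "now and in 6 months" question can be sketched with a simple projection. The example assumes linear growth, which is a simplification, and the function names and figures are hypothetical.

```python
def projected_usage(used: float, growth_per_month: float, months: int = 6) -> float:
    """Usage after `months`, assuming constant (linear) monthly growth."""
    return used + growth_per_month * months


def months_until_full(used: float, capacity: float, growth_per_month: float) -> float:
    """Months of headroom left under the same linear-growth assumption."""
    if used >= capacity:
        return 0.0
    if growth_per_month <= 0:
        return float("inf")  # no growth means the limit is never reached
    return (capacity - used) / growth_per_month


# Hypothetical disk figures in GB.
print(projected_usage(600.0, 50.0))
print(months_until_full(600.0, 1000.0, 50.0))
```

The same two functions apply to memory and bandwidth if each resource is tracked in its own units.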

Page 43: Building Large Scale Services - LISA 2013

Failure will happen.

§ There are no 0 failure systems.
§ “Give me the brain” documentation so that anyone can be the brain.
§ Repeatable/Reliable failure handling.
§ Run fire drills. Really.

Page 44: Building Large Scale Services - LISA 2013


Page 45: Building Large Scale Services - LISA 2013

System Administration is Gardening.

§ No guarantee of resources.
§ Only guarantee is change.

Page 46: Building Large Scale Services - LISA 2013

System Administration is Gardening.

§ Nurture relationships.
  › Be authentic.
  › Be trusting and trustworthy.
  › Have integrity.

Page 47: Building Large Scale Services - LISA 2013

Success At Scale is Collaboration & Cooperation across Teams.

Page 48: Building Large Scale Services - LISA 2013

Decreasing Value


Page 49: Building Large Scale Services - LISA 2013

[Chart: # of Support Engineers by month (Jan, Apr, Jul, Oct); y-axis 0 to 8]

Page 50: Building Large Scale Services - LISA 2013

[Chart: # of Support Engineers by month (Jan, Apr, Jul, Oct); y-axis 0 to 6]

Page 51: Building Large Scale Services - LISA 2013


Page 52: Building Large Scale Services - LISA 2013

Documentation is not the cure.

§ Documentation doesn’t guarantee understanding.
  › Operations Sandbox Environment
§ Don’t spend time at the end documenting.

Page 53: Building Large Scale Services - LISA 2013


Page 54: Building Large Scale Services - LISA 2013

Summary

Page 55: Building Large Scale Services - LISA 2013

Be Expendable. Feed your brain.


Page 56: Building Large Scale Services - LISA 2013

Acknowledgements

• http://www.flickr.com/photos/levork
• http://www.flickr.com/photos/puggles
• http://www.flickr.com/photos/byteorder
• http://www.flickr.com/photos/egoant
• http://www.flickr.com/photos/happymonkey
• Kyle Latino
• Greg Connor

Page 57: Building Large Scale Services - LISA 2013

Thanks!

[email protected]
http://www.slideshare.net/sigje/presentation-lisa