Operationalizing Clojure Confidently
Prasanna Gautam Staples-SparX 02/19/2015
Image: http://www.rohitnair.net/img/design/core.jpg
“We don't want to focus on the trees (or their leaves) at the expense of the forest.”
- Douglas Hofstadter (“I Am a Strange Loop”)
My Clojure Story
Introduced to Clojure with no prior Lisp experience.
Did my senior project on simulating mobile ad-hoc networks using Clojure at Trinity College in 2011.
Started working at ESPN Innovation.
Worked in a variety of other languages: Java, Ruby, Python, JavaScript, C++.
Clojure was my primary interface to the JVM for experimentation.
Decided to use Clojure to deliver ESPN programming to the International Space Station.
Timeline labels: 2009, 2011, 2011-2013, 2013, 2015, SparX
Requirements
Cmdr. Chris Cassidy reached out to request regular ESPN programming.
200 MB file limit
Had to be ready every day at noon Central Time
Obvious choice:
Let's hire people to clip and send videos every day!
But it’s 2013
Why not automate?
Also, let’s remove ads.
Motive: Validating the video services and interfaces we had been working on.
Ok, so why Clojure?
Why Clojure?
Two weeks to deadline
Not all the pieces were clear
No guarantees from upstream services
Human errors abound
Source of data was people pressing buttons
And, systems failing would result in similar behavior
Why Clojure?
Immutability
I could keep the system as a “constant” in an ever-changing world
Idempotency - re-run if failed, resume at any point in pipeline.
Java Interop
Even when I had APIs that weren’t written by my group, they were SOAP and XML based. Yay!
Inherently refactorable if designed correctly
Post-mortem
Still in production since September 2013
Strictly enforced the “naïve” approach that “should” work
Learned a lot of lessons that go beyond Clojure
This talk is about these lessons
“When you're forced to be simple, you're forced to face the real problem.”
- Paul Graham (“Hackers & Painters: Big Ideas from the Computer Age”)
Parts of the stack
Core Assumptions
Operations
Familiar Interfaces
Overrides
State
Logging
Error Handling
Iterative Development
Core: Timestamps
Programs: items that have a name and “start” and “end” times
Program Segments, Breaks: blocks within a program that “start” and “end” at particular times.
It’s just a map and reduce operation now!!
Take only program segments and make them into a video.
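The "map and reduce" idea above can be sketched roughly as follows. The data shape (a :type key, with :start and :end in seconds) and the function names are assumptions made for illustration, not the format the production system used.

```clojure
;; Hypothetical data shape: each block has a :type plus :start and :end in seconds.
(def blocks
  [{:type :segment :start 0    :end 300}
   {:type :break   :start 300  :end 420}
   {:type :segment :start 420  :end 900}
   {:type :break   :start 900  :end 1020}
   {:type :segment :start 1020 :end 1500}])

(defn segments
  "Keep only the program segments, dropping the breaks."
  [blocks]
  (filter #(= :segment (:type %)) blocks))

(defn clip-list
  "Map each segment to the [start end] pair a clipping step would need."
  [blocks]
  (mapv (juxt :start :end) (segments blocks)))

(defn total-duration
  "Reduce the segments to the running time of the final video, in seconds."
  [blocks]
  (reduce + (map #(- (:end %) (:start %)) (segments blocks))))

(clip-list blocks)      ;; => [[0 300] [420 900] [1020 1500]]
(total-duration blocks) ;; => 1260
```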
Why was it a good idea?
A bare set of functionality binds everything together.
Everything else is a useful signal that could make the system “better”, but it is not dependable.
Misaligned timestamps are dead easy to spot once they are laid out in a UI.
TV Programs are events too.
Core: Dependency Graph
Your tasks are dependent on previous tasks
What’s the plan when they fail to execute?
Core: Loose Coupling/Lazy Execution
Separate data gathering and execution
You can expose the data to the user with no side-effects.
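One way to picture this separation is a pure function that builds a plan as plain data and a second function that carries it out. This is only a sketch: plan, execute!, the :action key and the :dry-run? flag are hypothetical names, not the talk's actual code.

```clojure
(defn plan
  "Pure: turn segment data into a vector of step descriptions. No side effects."
  [segments]
  (vec (for [{:keys [start end]} segments]
         {:action :clip :start start :end end})))

(defn execute!
  "Impure: hand each step to handler. With :dry-run? true, only print the step."
  [steps handler {:keys [dry-run?]}]
  (doseq [step steps]
    (if dry-run?
      (println "would run" step)
      (handler step))))

;; The plan is ordinary data, so it can be shown to a user before anything runs.
(plan [{:start 0 :end 300} {:start 420 :end 900}])
;; => [{:action :clip, :start 0, :end 300} {:action :clip, :start 420, :end 900}]
```

Because the plan is just a vector of maps, a dry run, a REST response and a real run can all start from the same value.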
On Operations
Functional programs still need operational expertise
If you’re in a big enough company with an ops team
They don’t care about your FP patterns - they shouldn’t have to.
Make configurations declarative and readable
On Familiar Interfaces
Use standard configuration formats: readable, parseable by anything
I picked YAML
Familiar scheduling
Used cron strings thanks to Quartz
Everything in UTC internally
Timezones treated as side-effects
programs:
  - name: AROUND THE HORN
    short_name: ATH
    start_time: "20:00:00"
  - name: PARDON THE INTERRUPTION
    short_name: PTI
    start_time: "20:30:00"
  - name: SPORTSCENTER
    short_name: SportsCenter
    start_time: "14:00:00"
run:
  cron: 0 0 14 1/1 * ? *

final_tz: America/Anchorage
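"Everything in UTC internally, timezones treated as side-effects" could look something like the sketch below, which uses java.time. The display-time function and the sample instant are assumptions for illustration; only the zone name comes from the final_tz setting in the config above.

```clojure
(import '(java.time Instant ZoneId ZonedDateTime)
        '(java.time.format DateTimeFormatter))

(defn display-time
  "Format a UTC instant for the given zone. The instant itself never changes;
  the zone is applied only at the edge, when rendering."
  [^Instant instant tz]
  (.format (DateTimeFormatter/ofPattern "yyyy-MM-dd HH:mm")
           (ZonedDateTime/ofInstant instant (ZoneId/of tz))))

;; Hypothetical run time, stored and compared in UTC.
(def run-at (Instant/parse "2014-05-20T14:00:00Z"))

(display-time run-at "UTC")               ;; => "2014-05-20 14:00"
(display-time run-at "America/Anchorage") ;; => "2014-05-20 06:00"
```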
On Familiar Interfaces
Started with a solid command line interface.
Took the Config and Options abstractions and exposed as REST API.
Options
 Switches                         Default        Desc
 --------                         -------        ----
 -c, --config                     nasamatic.yml  Use this config file path
 -h, --no-help, --help            false          Show Help
 -f, --no-force, --force          false          Force run now instead of using Cron
 -u, --no-upload, --upload        true           Upload or not
 -t, --no-transcode, --transcode  true           Transcode or not
 -B, --hours-before-now           0              How many hours before now to look at
 -d, --no-dry-run, --dry-run      false          Dry Run mode
On Familiar Interfaces
Also wrote a Web UI in AngularJS for the Operations team to use in cases of failed runs
The system failed rarely enough that I had to retrain people all the time.
Just gave up and used the CLI tool most of the time
UI breakage due to JavaScript issues
Exposing the API to Slack was more popular
On Familiar Interfaces
One-to-one correspondence between CLI and JSON
 Key                    Switch                            Type    Default  Description
 upload                 -u, --[no-]upload                 flag    TRUE     Upload to the FTP server
 transcode              -t, --[no-]transcode              flag    TRUE     Pass the files through transcoder
 qc                     -q, --[no-]qc                     flag    FALSE    Submit file to be QC’d by Pulsar
 hours-before-now       -B, --hours-before-now            int     0        Number of hours before to look
 dry-run                -d, --dry-run                     flag    FALSE    Run without affecting filesystem/uploading
 filter-by-program-tag  -p, --[no-]filter-by-program-tag  flag    TRUE     Select contiguous programTags from Authnet or not
 short-names            -s, --short-names                 string           Programs to select as declared in the configuration file under programs. Default behavior is to run all programs declared in configuration.
On Overrides
Core Abstractions - Config and Options
Config: A static set of parameters that defines the general behavior of the program. Doesn’t change too often.
Options: A dynamic set of parameters that can override config per-run.
Every job gets defined entirely by them.
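In its simplest form, the override could be a plain map merge, as in this sketch. The keys mirror the switches listed earlier in the talk; job-params is a hypothetical name, and the real system may well have done more validation than this.

```clojure
;; Config: static defaults, as they might look after parsing the config file.
(def config
  {:upload true :transcode true :hours-before-now 0 :dry-run false})

(defn job-params
  "Options win over config for a single run; config itself is never mutated."
  [config options]
  (merge config options))

(job-params config {:dry-run true :hours-before-now 6})
;; => {:upload true, :transcode true, :hours-before-now 6, :dry-run true}
```

Since both sides are immutable maps, the same function can serve the CLI and the JSON API.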
On State
Keep the least amount of state possible
The system used no database at all for operations.
Relied on the intermediate files that each step leaves behind as its effects
Only the last-seen state has to be kept for live operation.
Re-running is trivial.
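Treating intermediate files as the only state makes a step idempotent almost for free: if the output already exists, skip the work. The following is a minimal sketch under that assumption; run-step! and its return values are hypothetical.

```clojure
(require '[clojure.java.io :as io])

(defn run-step!
  "Call (f out-path) to produce the file at out-path, unless it already exists.
  Returns :ran or :skipped so the caller can log what happened."
  [out-path f]
  (if (.exists (io/file out-path))
    :skipped
    (do (f out-path)
        :ran)))

;; Re-running the whole pipeline only redoes the steps whose files are missing.
(comment
  (run-step! "/tmp/clip-01.out" #(spit % "clip data"))  ;; first run does the work
  (run-step! "/tmp/clip-01.out" #(spit % "clip data"))) ;; second run is a no-op
```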
On Logging
Timestamp, state, key=value
Parseable by anything! (It was Splunk’s weirdness that led to this)
Can generate metrics from on-going operations without instrumenting further.
Wired to PagerDuty directly
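The key=value part of such a log line can be produced with a few lines of Clojure. This sketch only builds the message string; kv-line is a hypothetical helper, and the timestamp and level would come from whatever logging library is in use.

```clojure
(require '[clojure.string :as str])

(defn kv-line
  "Render alternating keys and values as a comma-separated key=value string."
  [& kvs]
  (str/join ", " (for [[k v] (partition 2 kvs)]
                   (str k "=" v))))

(kv-line "trace-id" "trace-94295" "status" "Started" "job" "sleeps")
;; => "trace-id=trace-94295, status=Started, job=sleeps"
```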
On Error Handling
Find out about the error and try to fix it; if that is not possible, the system should try the whole process again with the next day/job
Parent form generates random trace-id for a job
Passed to all children for that job
Any exceptions are passed via the chain and logged
Back off and Retry — if all else fails, let humans figure it out.
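"Back off and retry" might be sketched like this. The name with-retry, the doubling delay and the attempt limit are all assumptions; the talk does not say which back-off policy was used.

```clojure
(defn with-retry
  "Call f up to max-attempts times. After the n-th failure, sleep
  base-ms * 2^(n-1) milliseconds before trying again. When every attempt
  fails, the last exception is rethrown so a human can take over."
  [max-attempts base-ms f]
  (loop [attempt 1]
    (let [result (try
                   {:value (f)}
                   (catch Exception e
                     (if (< attempt max-attempts)
                       {:error e}
                       (throw e))))]
      (if (contains? result :value)
        (:value result)
        (do (Thread/sleep (long (* base-ms (bit-shift-left 1 (dec attempt)))))
            (recur (inc attempt)))))))

;; Hypothetical flaky upstream call: fails twice, then succeeds.
(def calls (atom 0))
(defn flaky []
  (if (< (swap! calls inc) 3)
    (throw (ex-info "upstream not ready" {}))
    :ok))

(with-retry 5 10 flaky) ;; => :ok, after two failed attempts
```

The try returns a map rather than recurring directly because Clojure does not allow recur from inside a try or catch form.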
(defmacro do-with-log
  "Works functionally like a do block -- more or less, it runs all the given forms in order and returns the output of the last form it ran. It logs when the job started, ended or when it runs into any problems. It logs the error and rethrows the Throwable upstream."
  ([[job-name name & {:keys [trace-id] :or {trace-id (str "trace-" (rand-int 100000))}}] & body]
   (if-not name
     (throw (IllegalArgumentException. "You want to provide a name for the block you want to run.")))
   `(let [out# (atom nil)
          start-time# (System/currentTimeMillis)
          ~job-name (str ~name)
          ~'trace-id (str ~trace-id)]
      (infoAm "job" ~job-name "status" "Started" "trace-id" ~trace-id)
      (reset! out# (try
                     ~@body
                     (catch Throwable e#
                       (errorAm "job=" ~job-name "status" "Error" "trace-id" ~trace-id "message" e#)
                       (throw e#))))
      (infoAm "job" ~job-name "status" "Ended" "trace-id" ~trace-id "time_taken" (str (- (System/currentTimeMillis) start-time#) "ms"))
      @out#)))
2014-05-20 00:28:26 INFO utils-verify:1 - trace-id=trace-94295, status=Started, job=sleeps
2014-05-20 00:28:27 INFO utils-verify:1 - trace-id=trace-94295, status=Started, job=throws-error
2014-05-20 00:28:27 ERROR utils-verify:1 - job==throws-error, trace-id=trace-94295, message=java.lang.Throwable: Boo! I errored Out, status=Error
2014-05-20 00:28:27 ERROR utils-verify:1 - job==sleeps, trace-id=trace-94295, message=java.lang.Throwable: Boo! I errored Out, status=Error
The only macro I needed
Iterative Development
Used “lein ns-deps-graph” to see the inter-relations between namespaces
Operational Clojure
Builds on simple concepts
they’re the units of composition
Sparingly depends on global state, if at all
Leverages existing infrastructure and people
Adapts to changes in scope and requirements
Loosely couples data and execution
Future
I had a great time coming up with some of these patterns
Particularly - config and options for jobs
Thinking about open source re-implementations
More Clojure-y things at SparX coming soon. ;)
Questions/Comments?