Doing data science with Clojure
simon-belak
↳ Design constraints
↳ The environment
↳ notebooks vs. REPL
↳ programmable environments
↳ The tools
↳ design decisions behind Huri (my data science library)
↳ data frame considered harmful
↳ encoding computation into structure
↳ composability
↳ feedback loops
↳ Expanding the ecosystem with mini compilers (to ggplot, scipy, …)
Design constraints
Divide and conquer complexity
Kafka, PostgreSQL, ElasticSearch, s3, Intercom
frontend actions, orderbook changes, monitoring, telemetry, flight changes, Intercom, …
Automatic views
• Event & attribute ontology
• Manual
• Inferred
• Seasonality detection
Data science: the process
(aka it’s about communication, stupid!)
The analytics chasm
< 2min: Ideal. Almost real-time, can be done during brainstorming without disrupting flow
< 20min: squeeze in somewhere in the day
fail
project: roadmap ahoy!
Think in distributions, not numbers
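One way to read this slide: report a few quantiles next to the mean, so a single outlier cannot hide behind one number. The sketch below uses only clojure.core; `quantile` and `summary` are hypothetical helper names (not from Huri), and the nearest-rank rule is an assumption chosen for brevity. Input is assumed to be non-empty.

```clojure
(defn quantile
  "Nearest-rank style quantile of a non-empty collection of numbers."
  [q xs]
  (let [sorted (vec (sort xs))
        n      (count sorted)
        idx    (min (dec n) (int (Math/floor (* q n))))]
    (nth sorted idx)))

(defn summary
  "Mean plus a few quantiles, so the shape of the data stays visible."
  [xs]
  {:mean (/ (reduce + xs) (count xs))
   :p10  (quantile 0.1 xs)
   :p50  (quantile 0.5 xs)
   :p90  (quantile 0.9 xs)})

(summary [1 2 3 4 100])
;; => {:mean 22, :p10 1, :p50 3, :p90 100}
```

Here the mean (22) describes none of the observations, while the median (3) and the 90th percentile (100) show the skew directly.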
No throwaways
Sharing results
• Have one canonical version that is always current.
• Concentrate discussion in one place and make it searchable and persistent.
• Include methodology (=code).
The environment
REPL vs. notebook
REPL vs. notebook + [Ephemeral] [Spatial grouping]
#alderaan #sales #growth
Code hidden, but can be expanded
Questions, comments, & annotations
Shareable
Periodically re-run to keep it fresh
#alderaan #sales #growth
discoverability
Notebooks as dashboards
The power of sharing runtime
Wishlist/TODO
• Better editor (shaunlebron.github.io/parinfer/ ?)
• Embedded REPL
• Better exception reporting
• Browsable data structures
The tools
Data frame considered harmful
• Data frame (=table) conflates representation and abstraction
• Clojure excels in structure manipulation/encoding
github.com/sbelak/huri
• No data structures, just functions over collections
• Composable (even DSLs — no macros!)
• Reasonably fast (transducers <3)
• Do-what-I-mean (auto-sort, liberal with inputs, …)
• Minimal buy-in
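A minimal sketch of what "no data structures, just functions over collections" can look like, written with clojure.core only. It does not use Huri's actual API: `total-by` and the `orders` data are hypothetical, made up for illustration. The "table" is a plain vector of maps, and the filtering and aggregation run in one pass through a transducer.

```clojure
;; The "data frame" is just a vector of maps.
(def orders
  [{:customer "a" :region :eu :amount 120}
   {:customer "b" :region :us :amount 80}
   {:customer "c" :region :eu :amount 45}
   {:customer "d" :region :us :amount 200}])

(defn total-by
  "Sum (value-fn row) per (group-fn row), over rows that satisfy pred.
  Hypothetical helper, not part of Huri."
  [group-fn value-fn pred rows]
  (transduce (filter pred)
             (completing
               (fn [acc row]
                 (update acc (group-fn row) (fnil + 0) (value-fn row))))
             {}
             rows))

(total-by :region :amount #(> (:amount %) 50) orders)
;; => {:eu 120, :us 280}
```

Because rows are ordinary maps, every existing collection function keeps working on them, which is roughly what "minimal buy-in" suggests.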
composable data structure based DSLs
->> and partial friendly
Support reaching into nested structures everywhere
vanilla vector of maps
interoperability
Provide curried versions where possible
Composability is key to quick iterating
• Curried versions where possible
• ->> and partial friendly
• Side benefit: consistent API
• Generalised accessors (reaching into complex structures everywhere via comp)
function
map key
“virtual” structure
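The points above can be sketched with clojure.core alone. The assumption is that every helper takes the collection as its last argument, so it works with both `partial` and `->>`, and that an accessor is any function of a row: a keyword, or a path built with `comp`. The names `keep-where`, `sum-of`, and the `events` data are hypothetical and are not Huri's API.

```clojure
(def events
  [{:user {:plan :pro  :country "SI"} :spend 30}
   {:user {:plan :free :country "DE"} :spend 0}
   {:user {:plan :pro  :country "DE"} :spend 55}])

;; An accessor reaching into a nested structure, built with comp.
(def plan (comp :plan :user))

(defn keep-where
  "Rows for which (pred (accessor row)) is truthy. Collection goes last."
  [accessor pred rows]
  (filter (comp pred accessor) rows))

(defn sum-of
  "Sum of (accessor row) over rows. Collection goes last."
  [accessor rows]
  (transduce (map accessor) + 0 rows))

;; Partial application yields a reusable pipeline step.
(def pro-only (partial keep-where plan #{:pro}))

(->> events
     pro-only
     (sum-of :spend))
;; => 85
```

Keeping the argument order uniform is what makes the steps interchangeable, which is the "consistent API" side benefit mentioned in the list.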
“This is possibly Clojure’s most important property: the syntax expresses the code’s semantic layers. An experienced reader of Clojure can skip over most of the code and have a lossless understanding of its high-level intent.”
— Z. Tellman, Elements of Clojure
On feedback
Catching errors early ⇒ more context ⇒ easier debugging ⇒ faster iterating
clojure.spec
=>
Should have been a keyword->fn map
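A hedged sketch of how clojure.spec can surface this kind of mistake at the call boundary instead of deep inside a computation. It assumes Clojure 1.9 or later (for `clojure.spec.alpha`); the spec name `:sketch/aggregations` and the function `aggregate` are hypothetical and only illustrate the "keyword->fn map" check.

```clojure
(require '[clojure.spec.alpha :as s])

;; The argument must be a map from keywords to things callable as functions.
(s/def :sketch/aggregations (s/map-of keyword? ifn?))

(defn aggregate
  "Apply each aggregation fn to rows. Hypothetical helper; it validates
  its input up front so the error points at the caller's mistake."
  [aggregations rows]
  (when-not (s/valid? :sketch/aggregations aggregations)
    (throw (ex-info (s/explain-str :sketch/aggregations aggregations)
                    {:aggregations aggregations})))
  (into {} (map (fn [[k f]] [k (f rows)])) aggregations))

(aggregate {:n count} [1 2 3])
;; => {:n 3}

;; Passing a vector instead of a map fails immediately with an explanation:
;; (aggregate [:n count] [1 2 3])  ; throws ExceptionInfo
```

The exception carries the offending value in its data map, so the context needed for debugging is available where the error is raised.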
<3 Bret Victor
What about machine learning?
farm it out to sklearn
Mini compilers for DSLs targeting a specific library in another language
huri.plot
• DSL that compiles to ggplot2
• Targets Gorilla REPL
• Follows the rest of Huri’s design philosophy
• bar chart, scatter plot, line chart, box & violin plot, heatmap, histogram
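To make the "mini compiler" idea concrete, here is a toy sketch that turns a Clojure data structure into a string of ggplot2 code. It is not huri.plot's implementation or DSL: the plot description format and the names `r-value`, `r-call`, and `compile-plot` are assumptions for illustration, and running the generated R code is left out.

```clojure
(require '[clojure.string :as str])

(defn r-value
  "Render a Clojure value as R source: keywords become bare names,
  strings are quoted, everything else is printed as-is."
  [v]
  (cond
    (keyword? v) (name v)
    (string? v)  (pr-str v)
    :else        (str v)))

(defn r-call
  "Render an R function call with named arguments (sorted by name
  so the output is deterministic)."
  [f args]
  (str f "("
       (str/join ", " (for [[k v] (sort-by key args)]
                        (str (name k) " = " (r-value v))))
       ")"))

(defn compile-plot
  "Compile a plot description (a plain map) to a ggplot2 expression."
  [{:keys [data aes geoms]}]
  (str/join " + "
            (cons (str "ggplot(" data ", " (r-call "aes" aes) ")")
                  (for [[geom opts] geoms]
                    (r-call (str "geom_" (name geom)) opts)))))

(compile-plot {:data  "sales"
               :aes   {:x :month :y :revenue}
               :geoms [[:point {:size 2}]
                       [:smooth {:method "lm"}]]})
;; => "ggplot(sales, aes(x = month, y = revenue)) + geom_point(size = 2) + geom_smooth(method = \"lm\")"
```

Since the plot description is ordinary data, it can be built and transformed with the same collection functions as everything else before it is compiled.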
Takeaways
• Speed-of-answer matters
• Data science is about communication
• We don’t have to reinvent every wheel in Clojure
• Clojure is fantastic at structure manipulation, play to its strengths
• Blurring the line between environment and work is a powerful idea