Using Anaconda to Build a Custom Data Science Distribution at Bloomberg | AnacondaCON 2017
© 2017 Bloomberg Finance L.P. All rights reserved.
Leveraging the Anaconda Platform to Build a Custom Data Science Distribution at Bloomberg
AnacondaCON 2017
Armin Burgmeier <[email protected]>
Senior Software Engineer
February 9, 2017
Outline
• Motivation
• Overview of the Anaconda Platform
• The Bloomberg Distribution
• Building and Deployment
• Lessons Learned
• Wishlist
Motivation
• Bloomberg is first and foremost a data company
• Many teams augment, ingest and analyze financial data
• Provide a modern data science platform for such teams
• “Replace Excel with Jupyter notebooks”
• Combining the Python scientific stack with Bloomberg data and services
Requirements
• Deployment to the desktop
• Windows only
• Support only a limited combination of packages
• Reproducible runtime environments
• Facilitate sharing of projects

Differences from Anaconda:
• Different set of standard packages (financial domain)
• Automatic management of packages and environments
• Allow updates of stable environments for security fixes and adaptations to backend changes
The Anaconda Platform
• Conda
  • A cross-platform binary package and environment manager
• Anaconda
  • Conda + a set of commonly used Python packages
• Anaconda Cloud
  • Share notebooks, environments, packages, …

A conda package in a nutshell:

matplotlib-1.5.3-np111py35_1.tar.bz2
(name: matplotlib, version: 1.5.3, build string: np111py35_1, tarball: .tar.bz2)

Set of files to install + metadata:
• Name, version, build string
• Dependencies
• Platform
• License
• …
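The name / version / build string layout of a package file name can be illustrated with a small parser. This is only a sketch: `parse_package_filename` and `CondaPackageName` are hypothetical helpers, not part of conda, and the sketch assumes that neither the version nor the build string contains a dash.

```python
from typing import NamedTuple


class CondaPackageName(NamedTuple):
    name: str
    version: str
    build: str


def parse_package_filename(filename: str) -> CondaPackageName:
    """Split '<name>-<version>-<build>.tar.bz2' into its three parts."""
    suffix = ".tar.bz2"
    if not filename.endswith(suffix):
        raise ValueError(f"not a .tar.bz2 conda package: {filename!r}")
    stem = filename[: -len(suffix)]
    # Package names may themselves contain dashes, so split from the right.
    parts = stem.rsplit("-", 2)
    if len(parts) != 3:
        raise ValueError(f"expected name-version-build, got {stem!r}")
    return CondaPackageName(*parts)


pkg = parse_package_filename("matplotlib-1.5.3-np111py35_1.tar.bz2")
print(pkg.name, pkg.version, pkg.build)
```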
Conda logo from http://conda.pydata.org
Building a conda package
• Packages are built from “recipes” with conda-build
• Recipes are meant to describe a reproducible build environment
• Consists of
  • Name, version, build string
  • Dependencies needed to build (such as Python and setuptools for Python packages)
  • Build scripts (might invoke a C/C++ compiler)
  • Tests for the package
• conda-build ensures binary compatibility
conda-forge
• Community-driven effort to provide conda recipes and build infrastructure
• One git repository (“feedstock”) per recipe
  • Each recipe gets built by the infrastructure
• https://conda-forge.github.io/
• Packages available through a separate “channel”
• Some packages are available through both anaconda and conda-forge
  • Others are exclusive to either anaconda or conda-forge
• Mix and match
conda-forge logo from https://github.com/conda-forge
The Package Manifest
• Pin versions of all packages in the manifest
• Keep a separate manifest of “intended” packages
• Both manifests are conda packages

“Locked” manifest:
python 3.5.2 1
numpy 1.11.1 py35_2
pandas 0.19.1 np111_py35_0
…

“Intended” manifest:
python 3.5*
pandas
…

• Same idea as the difference between Cargo.lock and Cargo.toml in Rust
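The relationship between the two manifests can be sketched as a check that every “intended” spec is satisfied by a pin in the “locked” manifest. The dictionary layout and the `unsatisfied_specs` helper are assumptions made for illustration, and glob matching is a simplification of conda's real version matching.

```python
from fnmatch import fnmatchcase

# Hypothetical in-memory form of the two manifests from the slide.
intended = {"python": "3.5*", "pandas": "*"}
locked = {
    "python": ("3.5.2", "1"),
    "numpy": ("1.11.1", "py35_2"),
    "pandas": ("0.19.1", "np111_py35_0"),
}


def unsatisfied_specs(intended, locked):
    """Return a message for each intended spec the locked manifest misses."""
    problems = []
    for name, pattern in intended.items():
        if name not in locked:
            problems.append(f"{name}: not pinned")
            continue
        version, _build = locked[name]
        # Glob matching stands in for conda's version spec matching.
        if not fnmatchcase(version, pattern):
            problems.append(f"{name}: {version} does not match {pattern}")
    return problems


print(unsatisfied_specs(intended, locked))
```

The locked manifest may pin more packages than the intended one lists (numpy here), since it also records the resolved dependencies.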
Platform Versioning
• Assign a semantic version number to each manifest

Distribution 1: Deprecated
Distribution 2: Supported
Distribution 3: Supported
Distribution 4: Preview
Picking a Platform Version
• Every version is installed into its own environment
• Creating a new notebook:
  • Choose current default major version
• Opening an existing notebook:
  • Same major version as the one the notebook was created with
  • Latest minor version
• Deprecating a major version:
  • Refuse to open notebooks with that major version
• Upgrading an existing notebook:
  • Always a conscious action
  • Run the notebook in the new environment
  • Testing and verification by developer

Steps triggered by a user action: Determine major version -> Find latest minor version -> (maybe) Create environment -> (maybe) Launch NB server -> Open file in NB server
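The rule for opening an existing notebook (same major version, latest minor version, refuse deprecated majors) can be sketched as follows. The function name and the `(major, minor)` tuple representation are assumptions for illustration, not Bloomberg's actual implementation.

```python
from typing import Iterable, Set, Tuple

Version = Tuple[int, int]  # (major, minor)


def pick_platform_version(
    notebook_major: int,
    available: Iterable[Version],
    deprecated_majors: Set[int],
) -> Version:
    """Return the latest minor version within the notebook's major version."""
    if notebook_major in deprecated_majors:
        # Deprecated major versions are refused outright.
        raise ValueError(f"major version {notebook_major} is deprecated")
    candidates = [v for v in available if v[0] == notebook_major]
    if not candidates:
        raise LookupError(f"no platform version with major {notebook_major}")
    # Tuples compare element-wise, so max() picks the highest minor version.
    return max(candidates)


available = [(1, 0), (1, 1), (1, 2), (2, 0)]
print(pick_platform_version(1, available, deprecated_majors=set()))
```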
Garbage Collection
• Need a way to remove unused packages and environments
• Observation: we never run a version that has a more recent version in the same stable series (same major version)
• Remove all deprecated versions
• Remove all versions with no longer supported major versions
• Remove all packages no longer installed in any environment
• Beware of concurrent operations!
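The removal rules above can be sketched as a pure planning function. The data layout (a mapping from `(major, minor)` to the set of package file names in that environment) and the function name are assumptions; the sketch only computes what could be removed and does not address concurrent operations, which would need some form of locking.

```python
def plan_garbage_collection(environments, supported_majors):
    """Return (environments to remove, packages to remove).

    environments: dict mapping (major, minor) -> set of package file names.
    supported_majors: set of major versions that are still supported.
    """
    latest_minor = {}
    for major, minor in environments:
        if major in supported_majors:
            latest_minor[major] = max(minor, latest_minor.get(major, minor))
    # Keep only the latest minor version of each supported major version.
    keep = set(latest_minor.items())
    remove_envs = set(environments) - keep
    needed = set().union(*(environments[v] for v in keep))
    all_packages = set().union(*environments.values())
    return remove_envs, all_packages - needed


envs = {
    (0, 3): {"z-0"},
    (1, 0): {"a-1", "b-1"},
    (1, 1): {"a-1", "b-2"},
    (2, 0): {"a-2", "b-2"},
}
print(plan_garbage_collection(envs, supported_majors={1, 2}))
```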
Build System
• Inspired by conda-forge
  • Feedstock repositories separate from upstream code
• Continuous Integration
  • Buildbot builds the recipe on every PR and every push to master
  • Upload to internal Bloomberg channel
• Works great for C# codebases as well

Workflow: PR on feedstock repo -> Automatic build -> Upload to separate channel -> Manual testing if needed -> Merge PR -> Upload to main channel
Buildbot logo from http://buildbot.net/about.html
Customization
Need for customization of an upstream package:
• matplotlib conda package depends on Qt
• Not needed in a Jupyter notebook-based environment
• No notion of “optional” dependencies in conda

• Fork conda-forge matplotlib-feedstock repo
• Make customization and add “noqt” feature to the build
• Created package is matplotlib-1.5.3-np111py35_noqt_0
  • Avoids collision with packages from other channels
• Tracking the “noqt” feature in our environment makes conda prefer our customization over the default package
Deployment
• All builds end up in an internal (“dev”) channel
• When a new platform version is ready for a wider audience, propagate the platform package and all packages it contains into a production channel.

“dev” channel: 1.0, 1.1, 1.2, 2.0
“prod” channel: 1.0, 1.1, 1.2
Lessons Learned: Install Order
• We are using packages from both conda-forge and anaconda
  • Sometimes they don’t play well together
• bqplot needs ipywidgets installed at install time for its post-install script
• Circular dependencies are handled fine by the conda solver
  • But no guarantee about installation order!

[Figure: dependency graph of bqplot, ipywidgets and _nb_ext_conf, each labeled with its source channel (conda-forge or anaconda)]

• Workaround: prefer conda-forge over anaconda
• https://github.com/conda-forge/bqplot-feedstock/issues/11
Lessons Learned: Channel Pinning
• One package that we pinned is mpmath-0.19-py35_1
  • Originally it was available in the anaconda channel
  • “Suddenly” it became available in conda-forge
    • With different dependencies
  • Build fails because the new dependencies are not pinned
• Ideally we could pin the channel as well
  • In addition to version and build string
• Workaround: upload mpmath from anaconda to Bloomberg channel
• Ultimate channel priority: Bloomberg -> conda-forge -> anaconda

[Figure: mpmath-0.19-py35_1 as published in the anaconda and conda-forge channels, with dependency nodes python, mpir, mpfr and gmpy]
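Why the workaround helps can be sketched with a simplified model of channel priority: a pinned package is taken from the first channel in the priority list that carries it. The channel index layout and the `resolve_channel` helper are hypothetical, and real conda channel priority has more nuance than this.

```python
CHANNEL_PRIORITY = ["bloomberg", "conda-forge", "anaconda"]


def resolve_channel(package, channel_index, priority=CHANNEL_PRIORITY):
    """Return the highest-priority channel that carries the package."""
    for channel in priority:
        if package in channel_index.get(channel, set()):
            return channel
    return None


pinned = "mpmath-0.19-py35_1"
index = {
    "anaconda": {pinned},
    "conda-forge": {pinned},  # same name, version and build string
}
# Without the workaround the conda-forge build shadows the anaconda one.
print(resolve_channel(pinned, index))

# Workaround: re-upload the anaconda build to the Bloomberg channel.
index["bloomberg"] = {pinned}
print(resolve_channel(pinned, index))
```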
Lessons Learned: Reproducible Builds
• Build-time dependencies are not pinned
  • Hard to enforce with current conda tools
• A build that works today might no longer work tomorrow
  • e.g. pandas 0.17.1 changed merge behavior, which broke one of our packages
• Possible solution:
  • After a successful build, “freeze” the dependency resolution and add it to the recipe
  • On subsequent builds, use the “frozen” dependency resolution
  • Make it an explicit action to re-resolve dependencies
  • Would need separate resolutions for different features, platforms, py/np versions
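The proposed “freeze” workflow might look roughly like the sketch below. The file name `frozen_deps.json`, the `resolve` callback and the function name are all assumptions for illustration; conda-build offers no such mechanism according to the slide, and a real version would keep one frozen resolution per feature, platform and py/np combination.

```python
import json
import tempfile
from pathlib import Path


def resolve_build_deps(recipe_dir, resolve, re_resolve=False):
    """Reuse the frozen resolution unless re-resolving is requested."""
    frozen = Path(recipe_dir) / "frozen_deps.json"  # hypothetical file name
    if frozen.exists() and not re_resolve:
        return json.loads(frozen.read_text())
    pins = resolve()  # stand-in for a real dependency resolution
    frozen.write_text(json.dumps(pins, indent=2, sort_keys=True))
    return pins


with tempfile.TemporaryDirectory() as recipe_dir:
    first = resolve_build_deps(recipe_dir, lambda: {"pandas": "0.17.0"})
    # A newer pandas is now available, but the frozen pins are reused.
    second = resolve_build_deps(recipe_dir, lambda: {"pandas": "0.17.1"})
    print(first == second)
```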
Wishlist: conda download
• A command that downloads packages but does not install them
• Makes it possible to work around build dependency pinning:
  • Download the build dependencies for a package
  • Add them to a local channel
  • Build the package with dependencies only from that channel
  • Conda is forced to resolve dependencies with the previously downloaded packages
• Makes it possible to ship packages so they can be installed later without connectivity to the original channels
• http://github.com/conda/conda/issues/1150
Wishlist: Parallelize Install Steps
• Optimizing the install time of the first environment is crucial in our scenario
• Creating a conda environment takes three steps
  • Download package tarballs (network I/O-bound)
  • Extract package tarballs (CPU-bound)
  • Install packages into the environment (disk I/O and/or CPU-bound)
• Download size is on the order of 200 MiB
• First two steps could be (easily?) parallelized

[Figure: timeline of the download and extract steps for packages A, B, C and D]
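One way the first two steps could overlap is to give each package its own worker that downloads and then extracts it. This is a sketch only: `download` and `extract` are injected placeholders rather than conda functions, and whether threads speed up the CPU-bound extract step depends on the decompression code releasing the GIL (a process pool would be an alternative).

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_and_extract(packages, download, extract, max_workers=4):
    """Download and extract each package in a worker thread.

    Results are returned in the same order as the input packages.
    """
    def process(package):
        return extract(download(package))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process, packages))


# Stand-ins so the sketch runs without network access.
def fake_download(package):
    return package.encode("utf-8")


def fake_extract(tarball):
    return tarball.decode("utf-8").upper()


print(fetch_and_extract(["a", "b", "c", "d"], fake_download, fake_extract))
```

The install step is left sequential here, matching the slide's suggestion that only the first two steps are easy to parallelize.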
Wishlist: .xz conda packages
• LZMA has a better compression ratio and better decompression speed
• Would significantly improve the time to download packages and create an environment

Test object: win-64/mkl-11.3.3-1.tar.bz2

Method   Size (MiB)   Decompression time   Decompression memory
bz2      110.0        18.6 s               4M
xz       73.3         9.6 s                8M
xz -9    59.0         8.3 s                64M

• Drawbacks:
  • Higher memory requirements at decompression
  • Extra dependency in Python 2.7 (backports.lzma)
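A comparison of this kind can be reproduced on any payload with Python 3's standard `bz2` and `lzma` modules. The sketch below measures compressed size and decompression time only, not memory, and its numbers will differ from the table since they depend on the input and the machine. The `compare_codecs` helper and the sample payload are illustrative.

```python
import bz2
import lzma
import time


def compare_codecs(data: bytes):
    """Return {method: (compressed size in bytes, decompression seconds)}."""
    codecs = [
        ("bz2", bz2.compress, bz2.decompress),
        ("xz", lzma.compress, lzma.decompress),  # default preset
        ("xz -9", lambda d: lzma.compress(d, preset=9), lzma.decompress),
    ]
    results = {}
    for method, compress, decompress in codecs:
        blob = compress(data)
        start = time.perf_counter()
        restored = decompress(blob)
        elapsed = time.perf_counter() - start
        if restored != data:
            raise RuntimeError(f"{method} round trip failed")
        results[method] = (len(blob), elapsed)
    return results


payload = b"example package payload " * 10000
for method, (size, seconds) in compare_codecs(payload).items():
    print(f"{method:6s} {size:8d} bytes  {seconds:.4f} s")
```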
Conclusions
• The Anaconda Platform together with conda-forge is a great ecosystem for creating Python distributions
• Bloomberg builds its own provisioning of environments around it
  • Automatic management of environments
  • Long-term support for existing notebooks
  • Allow minor updates to stable environments for continued maintenance
• Mixing anaconda and conda-forge has some quirks
• Pinning of packages
  • “Intended” set of packages vs. “frozen” set of packages
Thank you!
Lessons Learned: Conflict Hints
• A conflict occurs when dependency constraints cannot be satisfied
  • e.g. ipywidgets=5.2.2 widgetsnbextension=1.2.3
  • ipywidgets depends on widgetsnbextension >= 1.2.6
• conda creates “hints” on how to resolve a conflict:

The following specifications were found to be in conflict:
  - alabaster 0.7.8 py35_0
  - widgetsnbextension 1.2.3 py35_1

• alabaster just happens to be the first entry in the list of dependencies
• http://github.com/conda/conda/issues/1859
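For the example above, a more helpful hint would name the two packages whose constraints actually clash. A minimal sketch of such a check follows; it is not how conda's solver works, the data layout and `find_conflicts` are hypothetical, and it only handles exact pins against ">=" lower bounds on purely numeric versions.

```python
def parse_version(text):
    """Turn '1.2.6' into (1, 2, 6) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def find_conflicts(requested, dependencies):
    """Return (package, dependency, minimum, requested) for each clash.

    requested: name -> exactly pinned version.
    dependencies: name -> list of (dependency name, minimum version).
    """
    conflicts = []
    for package, deps in dependencies.items():
        if package not in requested:
            continue
        for dep, minimum in deps:
            pinned = requested.get(dep)
            if pinned is not None and parse_version(pinned) < parse_version(minimum):
                conflicts.append((package, dep, minimum, pinned))
    return conflicts


requested = {
    "alabaster": "0.7.8",
    "ipywidgets": "5.2.2",
    "widgetsnbextension": "1.2.3",
}
dependencies = {"ipywidgets": [("widgetsnbextension", "1.2.6")]}
for package, dep, minimum, pinned in find_conflicts(requested, dependencies):
    print(f"{package} needs {dep} >= {minimum}, but {dep}={pinned} was requested")
```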
Environment extensions
• What if you need a package not included in the platform?
• Add an IPython extension “%install”
• Runs the equivalent of
  • pip install -t some-directory <name>==<version> --no-deps --only-binary=:all:
  • Adds it to sys.path
• Pros:
  • Reproducible
  • Sources the Python package archive
• Cons:
  • No dependency resolution
  • No installation of data files (such as JavaScript for IPython widgets)
  • Only works for wheels (no custom code execution at install time)
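The core of such an extension could be sketched as below, using the pip options quoted above. The function names are hypothetical, the registration as an IPython magic is omitted, and this is not Bloomberg's actual “%install” implementation.

```python
import subprocess
import sys


def build_install_command(name, version, target_dir):
    """Build the pip invocation for one exactly pinned, wheel-only package."""
    return [
        sys.executable, "-m", "pip", "install",
        "-t", target_dir,
        f"{name}=={version}",
        "--no-deps",
        "--only-binary=:all:",
    ]


def install_extension(name, version, target_dir):
    """Install the package into target_dir and make it importable."""
    subprocess.check_call(build_install_command(name, version, target_dir))
    if target_dir not in sys.path:
        sys.path.append(target_dir)


# Show the command without running it; the package and version are examples.
print(" ".join(build_install_command("example-package", "1.0.0", "some-directory")))
```

Pinning `<name>==<version>` and refusing source distributions is what makes the step reproducible, at the cost of the drawbacks listed above.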
Environment extensions (cont.)
• Alternative: Use conda
• User specifies extra requirements, for example pandas >=0.19.1
• When creating the environment
  • Install packages from platform
  • Then install extra requirements
  • “Freeze” list of additional packages installed (possibly replacing platform packages)
• When re-creating the environment
  • Install the packages from the recorded (“frozen”) package list
  • Might create a conflict when the minor platform version has changed:
    • Conda should be able to solve by downgrading some packages in the platform
• Provide option to re-resolve requirements at a later point