Open Data Innovation: Building on Open Data Sets for Innovative Applications
©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved
Open Data Innovation: Building on Open Data Sets for Innovative
Applications
Jed Sundwall
Open Data Technical Business Development Manager
Agenda
• Open data on AWS overview
– Why open data matters to AWS
• Landsat on AWS
– The newest AWS public dataset
• Frank Warmerdam from Planet Labs
– Open data in the geospatial world
• What’s next for open data on AWS
Open Data on AWS
What is open data?
Open data is data that can be used by anyone for any purpose for free.
Many of our customers, such as Esri, the Weather Company, and the
Climate Corporation, rely on quality open data as much as they rely on our
computing, storage, and other web services.
Open data on AWS
Amazon Web Services provides a comprehensive toolkit for gathering,
storing, analyzing, and working with data at any scale.
Amazon Elastic MapReduce
(Amazon EMR) provides the
Apache Hadoop analytics
framework as an easy-to-use
managed service.
Amazon S3 lets you store
and retrieve any amount of
data, at any time, from
anywhere on the web.
Amazon DynamoDB is a
fully-managed NoSQL
database service that makes
it cost-effective to store and
retrieve any amount of data.
The power of open data on AWS
Making data open on AWS enables more innovation by making data
available for rapid access to our flexible and low-cost computing
resources.
Amazon EC2, Amazon EMR, Amazon Redshift, Amazon DynamoDB, AWS Lambda
The Weather Company saves $1 million per year running its
forecasting application on AWS
The Weather Company provides millions of people
with the world’s best weather forecasts,
content and data, every day.
“Using AWS, TWC can scale as
necessary to handle constantly
changing workloads and maintain
our 11-millisecond response time.”
Bryson Koehler
EVP, CTO, CIO, The Weather Company
• Needed a cost-effective, scalable
alternative to operating 13 data centers
with legacy systems.
• TWC ingests, stores, and analyzes
4 GB of weather data per
second from over 800 sources.
• Designed to handle more than 15 billion
API calls each day, at a rate of 150,000
per second.
• Reduced its on-premises IT
environment from 13 to 6 data centers.
[Figure: Open data as a platform. Stage labels: Data Creation, Data Enrichment, Sensemaking. Layer labels: Data at Rest (Object storage), Basic APIs, Complex APIs, Consumer applications, Algorithmic policy, Data-driven journalism, Data Catalogs, Focused data dashboards, Predictive modeling, Visualizations. Axis label: Efficiency (Lower cost of knowledge).]
[Figure: Open data as a platform, with AWS services. Stage labels: Data Enrichment, Sensemaking. Service labels: Amazon Kinesis, Amazon EC2, AWS Data Pipeline, Amazon S3, Amazon RDS, Amazon EMR, Amazon Redshift, Amazon DynamoDB, AWS Lambda.]
Moovit: Smart Public Transportation
• Mobile app turns bus and
train riders into real-time
sensors for city government
• Integrates with city back-end
systems to improve both
service and rider experience
• Powered worldwide by the
AWS cloud
• First government-wide national intelligent map portal
– Integrated map system for government agencies to deliver location-based
services and information to government agencies and citizens
– Powers over 100 government GIS websites and applications
– Reduced costs by 60%
Singapore government
“AWS has helped my organization
to provide better service availability
and handle higher traffic load at a
lower cost.” —Chan Chin Wai, Chief Information Officer
Singapore Land Authority
Landsat on AWS
Public datasets on AWS
To enable more innovation, AWS hosts a selection of datasets that anyone
can access for free. Data in our public datasets is available for rapid
access to our flexible and low-cost computing resources.
Earth Science
NASA Earth Exchange
(NASA NEX)
Life Sciences
1000 Genomes Project
Internet Science
Common Crawl Corpus
Landsat
The Landsat program is a joint effort
of the U.S. Geological Survey and
NASA. It is the longest-running
program to gather Earth imagery
from space and is considered the
gold standard for natural resources
satellite imagery.
Landsat is big open data
It has traditionally been time-
consuming and expensive to
acquire, store, and analyze Landsat
data.
Landsat on AWS
We have committed to making up to
1 petabyte of Landsat imagery
readily available as objects on
Amazon S3.
Now, anyone can analyze Landsat
data at web scale with no significant
up-front investment of time or capital
expense.
Esri—Unlock Earth’s Secrets
Esri has created a tool to show how
ArcGIS Online can quickly visualize
Landsat data for live visualization and
analysis within the browser.
“These are not pre-generated cache
services limited to just visualization—
they are dynamic, high-performance
image services that perform on-the-
fly processing and dynamic
mosaicking of Landsat’s multi-
spectral and multi-temporal imagery.”
http://www.esri.com/landsatonaws
landsat-util
Landsat on AWS helped
Development Seed make
optimizations that make landsat-util
over 2× faster and allow for more
functionality.
https://developmentseed.org/blog/2015/03/19/aws-landsat-archive/
Landsat-live
Mapbox created Landsat-live, a map
that is constantly refreshed with the
latest satellite imagery from NASA’s
Landsat 8 satellite.
Creating a live Earth imagery
pipeline is possible because Landsat
imagery is available on Amazon S3
within hours of creation.
https://www.mapbox.com/blog/landsat-live-live/
MATLAB—Landsat8 Data Explorer
MathWorks created a freely
downloadable MATLAB-based tool
for accessing, processing, and
visualizing Landsat 8 data.
The tool allows MATLAB users to
find Landsat 8 scenes, analyze
them, and combine them with other
sources of GIS data for new
visualizations.
http://blogs.mathworks.com/steve/2015/03/19/matlab-landsat-8-aws/
Frank Warmerdam
Geospatial Software Developer, Planet Labs
www.planet.com
• Geospatial software developer for 20 years
• PCI, independent consultant, Google, Planet Labs
• OSGeo/open source/open data
• Not really very “sciency”
• Working on the “data pipeline” team
Frank Warmerdam
Why not?
• Costly to collect
• Hard to control
Why?
• Open datasets are an enabler for innovation
• Making data open ensures optimum utilization
• Geodata (images, maps) are common heritage
– More like science than art or literature
Open geo data—why?
• Free by default!
• TIGER/Line
– National roads from US Census Bureau
– Base of many commercial roadmaps (Google, etc.)
• NAIP
– 2 m resolution air photos of continental US
– Base for many commercial image maps
• Landsat
– 30 m images of the world for 30+ years
• NASA/NOAA/USGS
– Science data
– Weather data
– Geological data
Open geo data—USA
We consume:
• Landsat8 PAN
• Landsat8 RGB
• NAIP
• CGIAR DEM
• SRTM90
• SRTM30
• NED
• Open Street Map
• NOAA cloud predictions
Open geo data @ Planet Labs
• Format conversions
• Slow servers (e.g. USGS, lots of 503s)
• Incomplete mirrors (e.g. missing Landsat updates)
• Dynamic datasets require constant monitoring
• Storage is costly
• Such a waste of bandwidth!
Why not share one copy?
Ingesting is a hassle
• Mapbox loaded recent NAIP data on Amazon S3
• Offered to Mark Korver at Amazon Web Services
• Mark put it in an AWS-funded Amazon S3 bucket
• Available with “requester pays” for network egress
Planet Labs attaches to this NAIP data
• Reference from the foreign Amazon S3 bucket
• Need to sign all requests (for requester pays)
• /vsicurl/ works (used to get footprint cheaply)
• Succeeded in building 4.7m NAIP mosaic of CONUS!
One example: NAIP
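The "requester pays" bullets above mean that each read must be signed and must explicitly accept the transfer charges. A minimal sketch of how that might look with boto3, assuming AWS credentials are already configured locally; the bucket and key names are hypothetical placeholders, not the actual NAIP bucket:

```python
def requester_pays_kwargs(bucket, key):
    """Build get_object arguments that opt in to requester-pays charges."""
    return {
        "Bucket": bucket,
        "Key": key,
        # Without this flag, S3 denies reads from a requester-pays bucket.
        "RequestPayer": "requester",
    }


def fetch_object(bucket, key):
    """Download one object; boto3 signs the request with local credentials."""
    import boto3  # third-party; imported here so the helper above has no dependencies

    s3 = boto3.client("s3")
    response = s3.get_object(**requester_pays_kwargs(bucket, key))
    return response["Body"].read()


# Hypothetical names, for illustration only.
kwargs = requester_pays_kwargs("example-naip-bucket", "example/tile.tif")
print(kwargs["RequestPayer"])
```

Calling `fetch_object` bills the data transfer to the caller's AWS account, which is the point of the arrangement: the bucket owner pays for storage, the reader pays for egress.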
• AWS provides up to 1 PB of S3 storage
• AWS provides free network egress
• Mapbox (Charlie and Amit) provides USGS pull library
and expertise
• Planet Labs writes ingestion scripts
• Planet Labs provides Amazon EC2 workers for ingestion
• Updating every two hours
• All scenes from January 2015 on
• Selective backfill from 2014 and earlier
Landsat on AWS
• TAR files split into internally compressed TIF
• External overviews
• Simple HTTP access (no auth)
• /vsicurl/ capable (with caveats)
• _MTL file (soon) also available as .json
• .csv scene list in root of bucket
http://github.com/landsat-pds
https://s3-us-west-2.amazonaws.com/landsat-pds/L8/index.html
Landsat on AWS—organization
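As a sketch of how the scene list mentioned above might be used, the snippet below filters a small inline CSV by WRS path/row and cloud cover. The column names (`entityId`, `cloudCover`, `path`, `row`, `download_url`) and every sample row are assumptions made up for illustration; check the header of the actual file in the bucket before relying on them.

```python
import csv
import io

# Illustrative stand-in for the scene list; column names and rows are assumed.
SAMPLE = """entityId,cloudCover,path,row,download_url
EXAMPLE_SCENE_A,80.0,183,18,https://example.invalid/a/index.html
EXAMPLE_SCENE_B,5.5,183,18,https://example.invalid/b/index.html
EXAMPLE_SCENE_C,1.0,42,34,https://example.invalid/c/index.html
"""


def find_scenes(csv_text, path, row, max_cloud):
    """Return IDs of scenes at a WRS path/row with cloud cover <= max_cloud."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        rec["entityId"]
        for rec in reader
        if int(rec["path"]) == path
        and int(rec["row"]) == row
        and float(rec["cloudCover"]) <= max_cloud
    ]


clear_scenes = find_scenes(SAMPLE, path=183, row=18, max_cloud=10.0)
print(clear_scenes)
```

Against the real file, the same function would take the downloaded text instead of `SAMPLE`.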
• Easy access to desired bands
• Tiling and overviews potentially support
mapping/viewing applications efficiently
• HTTP/VSI Curl support for the win
• Reduce load on USGS
• Mount Amazon S3 bucket via file system on Amazon
EC2 instance
• Open to collaboration and layered tools
Landsat on AWS—advantages
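Because each band is a separate object under a predictable key, the URL for a single band can be built directly from a scene ID. The sketch below assumes the Landsat 8 scene ID layout used at the time (sensor and satellite, three-digit WRS path and row, year, day of year) and the `L8/<path>/<row>/<scene>/` key pattern visible in the sample record in this deck. The function names are illustrative.

```python
from datetime import date, timedelta

BASE_URL = "https://s3-us-west-2.amazonaws.com/landsat-pds"


def parse_scene_id(scene_id):
    """Split a scene ID such as LC81830182014347LGN00 into its fields."""
    year = int(scene_id[9:13])
    day_of_year = int(scene_id[13:16])
    return {
        "path": scene_id[3:6],
        "row": scene_id[6:9],
        "acquired": date(year, 1, 1) + timedelta(days=day_of_year - 1),
    }


def band_url(scene_id, band):
    """HTTP URL of one band's GeoTIFF for the given scene."""
    f = parse_scene_id(scene_id)
    return f"{BASE_URL}/L8/{f['path']}/{f['row']}/{scene_id}/{scene_id}_B{band}.TIF"


def vsicurl_path(scene_id, band):
    """GDAL virtual file path that reads the band over HTTP."""
    return "/vsicurl/" + band_url(scene_id, band)


scene = "LC81830182014347LGN00"
print(parse_scene_id(scene)["acquired"])  # 2014-12-13
print(vsicurl_path(scene, 11))
```

A GDAL-based tool given the `/vsicurl/` path can read only the parts of the file it needs, which is why internally tiled TIFs and external overviews matter for viewing applications.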
• Ingest into our system “via reference to Amazon S3”
• Successfully used for mosaicking etc.
• We now track L8 PDS hourly
Landsat on AWS—Planet Labs
{
  "DATA_TYPE": "L1GT",
  "MTL_link": "https://s3-us-west-2.amazonaws.com/landsat-pds/L8/183/018/LC81830182014347LGN00/LC81830182014347LGN00_MTL.txt",
  "cloud_cover": {
    "cloud_mask_link": "https://storage.planet-labs.com/v0/scenes/landsat8_qa/LC81830182014347LGN00_BQA.TIF",
    "estimated": "0.92"
  },
  "derived_from": {
    "input_params": {
      "ARGS": "--next-run"
    },
    "job_url": "https://jobs.planet-labs.com/v0/programs/l8_aws_process/jobs/26996483",
    "process": "l8_aws_process"
  },
  "footprint": {...},
  "index_link": "https://s3-us-west-2.amazonaws.com/landsat-pds/L8/183/018/LC81830182014347LGN00/index.html",
  "pass_at": "2014-12-13 00:00:00",
  "remote_info": {
    "backend": "s3_remote",
    "s3_bucket": "landsat-pds",
    "s3_path": "L8/183/018/LC81830182014347LGN00/LC81830182014347LGN00_B11.TIF"
  }
}
Landsat on AWS—Planet Labs
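The record above stores a reference to the pixels rather than a copy. A consumer could resolve that reference as sketched here; this is a guess at the idea, not Planet Labs' actual code. The `resolve_remote` name and the default endpoint are assumptions (the endpoint is taken from the record's own links), and the record is trimmed to the fields used.

```python
import json

# Trimmed copy of the sample record; only the fields used below are kept.
RECORD = json.loads("""
{
  "DATA_TYPE": "L1GT",
  "cloud_cover": {"estimated": "0.92"},
  "pass_at": "2014-12-13 00:00:00",
  "remote_info": {
    "backend": "s3_remote",
    "s3_bucket": "landsat-pds",
    "s3_path": "L8/183/018/LC81830182014347LGN00/LC81830182014347LGN00_B11.TIF"
  }
}
""")


def resolve_remote(record, endpoint="https://s3-us-west-2.amazonaws.com"):
    """Turn a record's remote_info into a plain HTTP URL for the object."""
    info = record["remote_info"]
    if info["backend"] != "s3_remote":
        raise ValueError(f"unsupported backend: {info['backend']}")
    return f"{endpoint}/{info['s3_bucket']}/{info['s3_path']}"


url = resolve_remote(RECORD)
# The estimate is stored as a string, so convert it before comparing.
cloud_fraction = float(RECORD["cloud_cover"]["estimated"])
print(url)
print(cloud_fraction)
```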
• Promote use in the community
• Divert existing USGS pullers to this
• Promote integrations
– Development Seed’s landsat-util and libra viewer
– Additional catalog interfaces
– Web map view onto data
• Showcase derivative works (mosaics, etc.)
• More “operators”
Landsat on AWS—future steps
• This is the future!
• Public access (HTTP)
• Preserve source data (pixels and metadata)
• Organize for efficient use
• Keep up to date
• Amazon S3 -> anyone can spin up Amazon EC2 nearby
Other datasets:
• Elevation (Stamen project)
• Planet Labs public datasets (more soonish)
• …
Cloud hosted raw geodata
Landsat on AWS as a platform
What’s Next…
What’s next
• More open data
– If you rely on open data for your work, we want to hear from you
• More services and features
– AWS JavaScript S3 Explorer: a simple JavaScript application for
displaying the contents of an Amazon S3 bucket in the browser.
– https://github.com/awslabs/aws-js-s3-explorer
– Roughly 90–95% of our roadmap is driven by what our
customers tell us matters, so tell us at [email protected]