Fedora Service Framework Simple Queue Services For fulfillment of the Mellon Grant June 29, 2009.
Fedora Service Framework Simple Queue Services
For fulfillment of the Mellon Grant
June 29, 2009
Simple Queue Services
• Provide a simple, reliable way to connect content-related infrastructure services to:
– Enable moving notifications and content between services and repositories
– Perform tasks using decoupled, reusable services
– Enable easy reuse and repurposing of services as programmable flows
• Inspirations
– Amazon “Simple Queue Services” (FOSS Implementation)
– Tom Cramer, Stanford Library “Work Do” workflow (via Hydra)
– Richard Rogers, MIT Libraries “Cloud Task Replica”
– NSDL NCORE
Example FSF-SQS Application
[Diagram components: Browser; Portable Ingest Client; Custom Ingest Client; Request Queue; Response Queue; Simple Ingest Service; Validation Service (e.g.); File System or Duraspace or Naked Akubra or Fedora Repository]
Example Chained FSF-SQS Application
[Diagram components: Portable Ingest Client; Request Queue and Response Queue (two pairs); Simple Ingest Service; Appraisal Service (e.g.); Validation Service (e.g.); Staging or Institutional Store; Fedora Repository Service]
Example Replication FSF-SQS Application
[Diagram components: Existing Client; Notification Polling Service; Request Queue and Response Queue (two pairs); Transform Service; Fedora Ingest Service; DSpace; Metadata; Bitstreams; Fedora Repository]
[Diagram components: Fedora Repository Service; GSearch; OAI; Ingest; Simple JMS]
Service Integration
More…
First, we are providing simple messaging (via ActiveMQ in Fedora 3.0):
• The repository publishes events
• Services listen for and consume events or other messages
Next, lightweight integration with workflow engine(s); orchestration
Original FSF Messaging Concept
Did not get implemented
No message ingest method
Collective Experience
• Domain Characterization (reference Mellon ESB Study):
– Limited governance structures
– High developer turnover
– Rapid environment changes
– Cost-sensitive
• Examples:
– RepoMMan and Remap (BPEL)
– Hydra (three approaches)(Dlib)
– eSciDoc plus others (Red Hat jBPM)
• Northwestern Books
• Trident Project
• Conclusion:
– Using full-featured workflow systems will be difficult for the majority of our targeted organizations
Amazon’s Simple Queue Service
• Amazon SQS
• Implemented as a service within Amazon’s Cloud
• Less capable but much simpler than direct JMS
• Limited to an 8K message body with no attachments
• SOAP and Query (aka Web) API
• Messages are durable for 4 days
• Messages are locked while processing
Amazon’s SQS API
• CreateQueue: Create queues for use with your AWS account.
• ListQueues: List your existing queues.
• DeleteQueue: Delete one of your queues.
• SendMessage: Add any data entries to a specified queue.
• ReceiveMessage: Return one or more messages from a specified queue.
• ChangeMessageVisibility: Change the visibility timeout of a previously received message.
• DeleteMessage: Remove a previously received message from a specified queue.
• SetQueueAttributes: Control queue settings like the amount of time that messages are locked after being read so they cannot be read again.
• GetQueueAttributes: See information about a queue like the number of messages in it.
• AddPermission: Add queue sharing for another AWS account for a specified queue.
• RemovePermission: Remove an AWS account from queue sharing for a specified queue.
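The send, receive, lock and delete cycle listed above can be illustrated with a small in-memory model. This is only a sketch of the semantics, not Amazon's implementation or client library: the class name `SimpleQueue`, its method names and the 30-second default timeout are assumptions made for the example.

```python
import time
import uuid


class SimpleQueue:
    """In-memory model of SQS-style queue semantics (illustrative only)."""

    def __init__(self, visibility_timeout=30.0, clock=time.monotonic):
        self.visibility_timeout = visibility_timeout
        self._clock = clock
        # message id -> [body, time at which the message becomes visible]
        self._messages = {}

    def send_message(self, body):
        """Add a message; it is visible to receivers immediately."""
        message_id = str(uuid.uuid4())
        self._messages[message_id] = [body, self._clock()]
        return message_id

    def receive_message(self):
        """Return (id, body) of one visible message and lock it, or None."""
        now = self._clock()
        for message_id, entry in self._messages.items():
            if entry[1] <= now:
                # Lock the message: hide it until the timeout expires.
                entry[1] = now + self.visibility_timeout
                return message_id, entry[0]
        return None

    def change_message_visibility(self, message_id, timeout):
        """Reset the lock on a previously received message."""
        self._messages[message_id][1] = self._clock() + timeout

    def delete_message(self, message_id):
        """Remove a message once the receiver has finished processing it."""
        self._messages.pop(message_id, None)
```

A receiver that fails before calling `delete_message` simply lets the lock expire, after which the message becomes visible to another receiver. That is how the "locked while processing" behaviour gives at-least-once delivery without a transaction protocol.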
Rogers’ “Cloud Task Replica”
• OR09 Presentation
• Oriented to Cloud characteristics
• Uses lightweight interfaces and queuing, highly-decoupled
• Primarily focuses on replication use cases
• At prototype stage
“CTR” - Roles
• decompose work into distinct replaceable agents
• archive = content home
• replicator = manages copies
• auditor = implements and enforces policy
• role != institution
“CTR” - Process Model
• a message queue for each role
• message post triggers activity asynchronously
• bucket brigade - message is a handoff or acknowledgment
• storage is abstracted
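The process model above (one queue per role, with each message acting as a handoff or an acknowledgment) might be sketched as follows. The order of handoffs (archive to replicator to auditor, then an acknowledgment back to the archive), the message fields and the function name `run_brigade` are assumptions for illustration and are not taken from the CTR prototype. The loop is also synchronous for simplicity, whereas the real model is asynchronous.

```python
from collections import deque

ROLES = ("archive", "replicator", "auditor")


def run_brigade(package_uri):
    """Pass one package along the role queues and return the activity log."""
    queues = {role: deque() for role in ROLES}
    log = []

    # The archive (content home) starts by handing the package to the replicator.
    queues["replicator"].append(
        {"type": "handoff", "sender": "archive", "package": package_uri}
    )

    # Keep draining until every role's queue is empty.
    while any(queues.values()):
        for role in ROLES:
            if not queues[role]:
                continue
            message = queues[role].popleft()
            log.append((role, message["type"]))
            if message["type"] != "handoff":
                continue  # acknowledgments end the chain
            if role == "replicator":
                # Copies made (storage is abstracted away); pass to the auditor.
                queues["auditor"].append(
                    {"type": "handoff", "sender": role, "package": package_uri}
                )
            elif role == "auditor":
                # Policy checked; acknowledge back to the content home.
                queues["archive"].append(
                    {"type": "ack", "sender": role, "package": package_uri}
                )
    return log
```

Because each role only reads its own queue, any agent can be replaced without the others knowing, which is the point of "role != institution".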
“CTR” - Workflow: Replication
[Diagram components: archive; replicator; auditor; S3]
“CTR” - Message Semantics
• web-standard URI addressing
• entities: packages, ORE maps
• content model agnostic
• entity checksums for integrity
• standard identifiers for actors
Stanford’s “Work Do” Workflow
• Puts the resource management state inside the Fedora digital object
• Each application reads the object and performs its function
• Able to support both human workflow and BPE
• Uses logical queues to manage workflow (no messaging SW)
• Depends on applications doing the right thing
• Simplifies governance to resource management semantics and representation
“Work Do” - Approach
• Each object in DOR has:
– a locally defined resource-management metadata
– a special Datastream to describe processing conditions and their state for that object.
• Places work-related information in the object:
– it can be indexed (using SOLR or other search engines)
– co-located alongside other useful processing information
– contains collection and selector identity to mark records ready for a particular process.
“Work Do” – Process Model
• Simple queries are used to:
– establish logical queues
– queues define the work ready for a particular robot or human interaction at any given time.
• Queries also provide:
– ongoing management information about the flow of objects through the system
– can be exposed as facets in an administrative discovery environment
• Simple REST based interactions based on Fedora service calls are used to identify queues and update state.
“Work Do” – Process Data
A workflow datastream in each object describes processing requirements and status

<workflow id="googleScannedBookWF" status="active" …>
  <process name="register-object" status="completed" attempts="1" />
  <process name="desc-metadata" status="completed" attempts="1" />
  <process name="google-convert" status="completed" attempts="1" />
  <process name="google-download" status="exception" message="Item for barcode 0339518 not found" attempts="3" />
  <process name="create-pages" status="waiting" attempts="0" />
  <process name="ingest" status="waiting" attempts="0" />
  <process name="shelve" status="waiting" attempts="0" />
  <process name="cleanup" status="waiting" attempts="0" />
</workflow>
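As a rough illustration of how logical queues could be derived from such workflow datastreams, the sketch below parses the XML and groups objects by the step they are ready for. It makes several assumptions that the slides do not state: a step counts as "ready" only when it is `waiting` and every earlier step is `completed`; the object identifiers (`demo:1`, `demo:2`) and function names are hypothetical; and the elided attributes on `<workflow>` are left out. Stanford's approach runs queries against an index rather than parsing in memory, so this shows only the selection logic.

```python
import xml.etree.ElementTree as ET

# Cleaned copy of the slide's example (elided attributes omitted).
WORKFLOW_XML = """
<workflow id="googleScannedBookWF" status="active">
  <process name="register-object" status="completed" attempts="1" />
  <process name="desc-metadata" status="completed" attempts="1" />
  <process name="google-convert" status="completed" attempts="1" />
  <process name="google-download" status="exception"
           message="Item for barcode 0339518 not found" attempts="3" />
  <process name="create-pages" status="waiting" attempts="0" />
  <process name="ingest" status="waiting" attempts="0" />
  <process name="shelve" status="waiting" attempts="0" />
  <process name="cleanup" status="waiting" attempts="0" />
</workflow>
"""


def ready_process(workflow_xml):
    """Return the name of the step this object is ready for, or None."""
    root = ET.fromstring(workflow_xml.strip())
    for process in root.findall("process"):
        status = process.get("status")
        if status == "completed":
            continue
        # The first unfinished step decides: ready if waiting, blocked otherwise.
        return process.get("name") if status == "waiting" else None
    return None  # every step is completed


def build_queues(datastreams):
    """Map each step name to the ids of the objects ready for that step."""
    queues = {}
    for pid, workflow_xml in datastreams.items():
        name = ready_process(workflow_xml)
        if name is not None:
            queues.setdefault(name, []).append(pid)
    return queues
```

Under these assumptions the slide's example object appears in no queue, because its `google-download` step is in the `exception` state and blocks everything after it.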
FSF-SQS Development Approach
• Merge selected aspects of Amazon, Stanford “Work Do”, and MIT “Cloud Task Replica” approaches
• Enable moving notifications and data between repository services
• Mostly integration of existing FOSS, minimal new build
• Extends existing ActiveMQ implementation
– Adds tools for moving data
– Adds additional language bindings likely using Stomp
– Realizes promise of completing asynchronous messaging
– Can be extended later to include business rules engine, full workflow
– Can be extended to Cloud implementations (Amazon, Eucalyptus)
– Note: No FOSS implementation currently available for Amazon SQS
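Stomp is attractive for language bindings because its wire format is plain text: a command line, `key:value` headers, a blank line, the body, and a terminating NUL byte. A minimal sketch of building a `SEND` frame is shown below. The queue name `/queue/fsf.request` and the helper name `stomp_send_frame` are hypothetical, and a real client would also need the `CONNECT` handshake and a socket, which are omitted here.

```python
def stomp_send_frame(destination, body, headers=None):
    """Build the bytes of a STOMP SEND frame for the given destination."""
    lines = ["SEND", f"destination:{destination}"]
    for key, value in (headers or {}).items():
        lines.append(f"{key}:{value}")
    # A blank line separates the headers from the body; NUL ends the frame.
    head = "\n".join(lines) + "\n\n"
    return head.encode("utf-8") + body.encode("utf-8") + b"\x00"


if __name__ == "__main__":
    frame = stomp_send_frame("/queue/fsf.request", "validate demo:1")
    print(frame)
```

Because the frame is just text, a client in Python, Ruby or another scripting language can talk to the broker with little more than a socket.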
Targeted Use Cases
• Bi-directional replication between Fedora repositories
– initial and ongoing
– possibly update
• Unidirectional replication from DSpace to Fedora
– initial and ongoing
• One-time ingest (ETL) from legacy repositories
• Validation services
• Selected workflows (TBD)
FSF-SQS Implementation
• Would prefer to use FOSS implementation of Amazon SQS interface
• Fallback is to use other products directly
• Under investigation:
– ActiveMQ integrations including Apache CXF
– Mule
– Apache Camel
– FUSE ESB 4 (Apache ServiceMix – Mellon ESB top recommendation)
• Note: “Bus In the Cloud”
• Note: “Is Eucalyptus ready to be your private cloud?”
Don’t Need to Build
• Messaging (ActiveMQ)
• Language Bindings, Brokers/Gateways (e.g. Stomp)
• ESB (e.g. Camel, Mule) or Workflow (e.g. jBPM, Kepler)
• Most services
• Business integration patterns (but will have to choose)
– Document (send object, action and content through)
– Disconnected (temporarily put the content in storage or in Fedora and incrementally perform actions)
– Notification (events only)
Do Need to Build
• Service Wrappers (or request from community)
• FSF-SQS based on Amazon SQS in ActiveMQ possibly with Mule
• Message payload formats include resource processing state
• DSpace to Fedora extract, transform, transfer and load flow
• Replacement for Diringest service (maybe)
– Chris Wilper wants this work done
– Needs to handle content without requiring FOXML wrapper, manifest
– Good to use Fedora Content Models where feasible
– Be extensible
– Needs some common components with FC-REPO WebDAV
– Support Messaging and Web end-point (brokers/gateways)
• Portable client (partial SIP builder replacement)(maybe)
– Works both client or server-side (consider Python, Ruby, Flex)
– Works with or without manifest, synchronous and asynchronous
– Simple, Simple, Simple on-ramp client for entry-level users
Advantages and Drawbacks
• Advantages
– Messaging is the simplest of the enterprise methods
– Low risk since simplifying approaches may be taken at many points
– Has been requested many times by large repository users
– Immediately useful
– Fits overall Mellon goals
• Drawbacks
– Does not include a named “workflow” product though workflow term used by Amazon and others to describe this approach
– Meat and potatoes type implementation does not excite people
Details
Integrate a Simple Queue Service
• Demonstrates a lightweight ingest pipeline using off-the-shelf open source technology (ActiveMQ with REST brokers/gateways)
• Performs the services selected by the Simple Ingest Service web application
• Work consists mostly of integration tasks with building some service wrappers
• Service code is to be selected only from existing off-the-shelf FOSS
• Provides a model for integration with the Fedora Repository
• The specific products/languages for services to be determined when the use cases and partners are well characterized
FSF-SQS Integration Patterns
• Enterprise Integration Patterns
• Document (object, actions/state and content in message)
• Disconnected (object and content stored in file systems, Akubra, DuraCloud or Fedora during processing, actions/state in message)
• Notification (actions in message, state, object and content elsewhere)
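The difference between the three patterns comes down to what travels in the message. The sketch below shows one possible payload for each. All field names, the JSON encoding and the example location are assumptions for illustration; the slides do not define a payload format.

```python
import base64
import json


def document_message(pid, actions, content):
    """Document: object id, actions/state and the content itself in the message."""
    return json.dumps({
        "pattern": "document",
        "pid": pid,
        "actions": actions,
        # Binary content is base64-encoded so it survives a text transport.
        "content": base64.b64encode(content).decode("ascii"),
    })


def disconnected_message(pid, actions, content_location):
    """Disconnected: actions/state in the message, content stored elsewhere."""
    return json.dumps({
        "pattern": "disconnected",
        "pid": pid,
        "actions": actions,
        "content_location": content_location,
    })


def notification_message(pid, action):
    """Notification: only the action; state, object and content live elsewhere."""
    return json.dumps({"pattern": "notification", "pid": pid, "action": action})
```

With a message size limit such as the 8K body noted earlier for Amazon SQS, the Document pattern only suits small content, which is one reason a choice among the patterns has to be made per use case.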
Potential Demonstration Services
• Create derivative forms
• Format conversion
• Verify Checksum
• Virus scanning
• Validate object
• Validate datastream format (and label or check FORMAT_URI and MIME-type)
• Get non-Fedora PID
• Metadata feature services (feature extraction with write into FOXML or datastream)
– JHOVE
– iVia (Descriptive metadata generation plus other services)
• Many other services possible but a few key selections should be incorporated leaving room for later additions
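Of the services listed, checksum verification is the simplest to wrap. A minimal sketch follows, assuming the service receives the content bytes and an expected hex digest and returns a small result record. The function name, the result fields and the SHA-256 default are choices made for the example, not part of the proposal.

```python
import hashlib


def verify_checksum(content, expected_hex, algorithm="sha256"):
    """Compare the digest of content with the expected hex digest."""
    computed = hashlib.new(algorithm, content).hexdigest()
    return {
        "service": "verify-checksum",
        "algorithm": algorithm,
        "computed": computed,
        # Hex digests are case-insensitive, so normalise before comparing.
        "valid": computed == expected_hex.strip().lower(),
    }


if __name__ == "__main__":
    data = b"example datastream content"
    expected = hashlib.sha256(data).hexdigest()
    print(verify_checksum(data, expected)["valid"])
```

A wrapper of this shape can sit behind a queue: it takes a request message, runs the check, and posts the result record to the response queue.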
Workflow States
• Object State
– State of a data object at a point in time
– Can be contained in the object and reflected on
• Process State
– State of an instance of a processing flow
– Workflow engines designed to handle this
– Long running vs. short running
• Event State
– General notion of “event” is a statement which is reflected on
– PREMIS-like “preservation event” is more of a process
• Person State
– Characteristics of a person (actor) with respect to objects, processes, or events
– (e.g. requirements fulfilled by a Ph.D. student to graduate)
Build a Simple Ingest Service
• Directory/file ingest (Diringest replacement)
• Web application (server-side service)
• Generates FOXML for transferred content
• Supports content models where practical (also needed for WebDAV interface)
• Use lightweight ingest pipeline described below to perform the pre-ingest preparation services
Build a Portable Ingest Client
• Ingest a single file or a directory
• Choose the content model (if any) from menu
• Choose what pre-ingest services to perform on the content from menu
• Works both as a Web App and as a Desktop App
• Communicates by Web (REST) and messaging via broker/gateway
• Later can be extended more towards FedoraShare concept
• Consider scripting framework Python, Ruby, Flex
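To give a feel for how small the client's core could be in a scripting language, the sketch below turns a file or directory, an optional content model and a list of chosen pre-ingest services into a request record. The function name, the record fields and the service names are all hypothetical; sending the record over REST or a broker is left out.

```python
from pathlib import Path


def build_ingest_request(path, content_model=None, services=()):
    """Describe a file or directory to ingest, with the chosen options."""
    root = Path(path)
    if root.is_file():
        files = [root.name]
    else:
        # List every file under the directory, relative to its root.
        files = sorted(
            p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
        )
    return {
        "source": str(root),
        "files": files,
        "content_model": content_model,  # None means no content model chosen
        "services": list(services),
    }
```

The same record could be posted to a web endpoint or placed on a request queue, which fits the goal of one client that works both synchronously and asynchronously.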