Kazoup software appliance - A technical deep dive
Technical Deep Dive
Radek Dymacz
Co-founder and CTO
Twitter: @kazoup @ChroniclesOfRD
Demo: https://demo.kazoup.com
Kazoup software.
Provides analytics, enterprise search and policy-based data management.
Delivered as a virtual appliance behind the corporate network.
Discovers and indexes large volumes of unstructured file data.
Overview.
Files
Currently supported file stores are SMB (Samba) shares, including Windows and Linux.
Kazoup Appliance
All indexed data is stored locally on the Kazoup appliance behind the customer network. It is accessible via a secure web interface providing search, analytics and archive services.
Customer infrastructure
Cloud Object Storage
When archiving, data leaving the customer network is de-duplicated, compressed at file level, and encrypted in transit and at rest.
Encrypted files are stored with supported object storage providers (AWS, Azure, Google, CenturyLink) or in private object storage.
AES 256 Envelope Data Encryption
SSL Connection over WAN
Kazoup virtual appliance.
Docker
Elasticsearch, Postgres, RabbitMQ, Celery, Tika, Natural Language Processing
Backend App (Python)
API (REST)
Frontend (Polymer + D3.js)
Preconfigured Ubuntu virtual appliance for VMware and Hyper-V with Docker service
Analytics.
[Diagram labels: Files, Local Archive, Backend App (Metadata, Content, NLP), Elasticsearch Index, Analytics Engine (Data Aggregations), Web Frontend Presentation (D3.js), Web Interface]
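The analytics slide pairs Elasticsearch data aggregations with a D3.js frontend. As a rough illustration of what such an aggregation request could look like, the sketch below builds a request body that sums file sizes per extension. The field names `extension` and `size` are assumptions based on the metadata listed later in the deck, not the actual Kazoup index mapping.

```python
import json


def build_size_by_extension_query(top_n: int = 10) -> dict:
    """Return an Elasticsearch request body that sums file sizes per extension.

    The field names "extension" and "size" are hypothetical.
    """
    return {
        "size": 0,  # return aggregation buckets only, no individual hits
        "aggs": {
            "by_extension": {
                "terms": {"field": "extension", "size": top_n},
                "aggs": {"total_bytes": {"sum": {"field": "size"}}},
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_size_by_extension_query(5), indent=2))
```

A frontend chart would then read the returned buckets (one per extension, each with a `total_bytes` value) and draw them.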
Discovery and indexing.
Elasticsearch
DISCOVERY AND METADATA SCAN
● Discovery Worker: recursively walks directories, finds files in the data source and passes them to the meta scan queue.
● Meta Scan Worker: extracts file metadata and updates the Elasticsearch document.
CHECKSUM AND CONTENT EXTRACTION SCAN
● Checksum Worker: finds files which haven’t been checksummed, computes the checksum and passes them to the queue for content extraction with Tika.
● Tika Worker: extracts content plus some additional metadata based on the file type, updates Elasticsearch and passes the text to the NLP queue.
● NLP Worker: performs Named Entity Recognition on the extracted text and updates the Elasticsearch document.
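The first stage of the pipeline above, a worker that walks a directory tree and hands each file to the next queue, can be sketched roughly as follows. This is a minimal single-process illustration using the standard library `queue` module as a stand-in for the RabbitMQ/Celery queues named in the deck; the function name `discover` is hypothetical.

```python
import os
import queue
import tempfile


def discover(root: str, meta_scan_queue: "queue.Queue[str]") -> int:
    """Walk `root` recursively, enqueue every file path, return how many were found."""
    found = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            meta_scan_queue.put(os.path.join(dirpath, name))
            found += 1
    return found


if __name__ == "__main__":
    # Build a small throwaway tree so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "projects", "2016"))
        for rel in ("readme.txt", os.path.join("projects", "2016", "plan.doc")):
            with open(os.path.join(root, rel), "w") as fh:
                fh.write("sample")
        q = queue.Queue()
        print(discover(root, q), "files queued for the meta scan worker")
```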
Extracted Metadata
● Name
● Location
● Size
● Access Time
● Modification Time
● Created Time
● MIME Type
● Extension
● Category
● SMB Extended Attributes
● AD ACL
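Several of the metadata fields listed above can be read straight from the file system. The sketch below shows one possible way to collect them with the Python standard library; the dictionary keys are illustrative assumptions, and Category, SMB extended attributes and AD ACLs are left out because they need SMB and Active Directory access.

```python
import mimetypes
import os
from datetime import datetime, timezone


def _iso(timestamp: float) -> str:
    """Format a POSIX timestamp as an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()


def extract_metadata(path: str) -> dict:
    """Collect basic file metadata; key names are hypothetical."""
    st = os.stat(path)
    mime_type, _encoding = mimetypes.guess_type(path)  # guessed from the file name
    return {
        "name": os.path.basename(path),
        "location": os.path.dirname(os.path.abspath(path)),
        "size": st.st_size,
        "access_time": _iso(st.st_atime),
        "modification_time": _iso(st.st_mtime),
        # st_ctime is the creation time on Windows but the inode change time on Unix.
        "created_time": _iso(st.st_ctime),
        "mime_type": mime_type,
        "extension": os.path.splitext(path)[1].lower(),
    }
```

A meta scan worker could pass the resulting dictionary to Elasticsearch as a partial document update.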
Extracted Content.
● Checksum
● Content raw text
● Tika Metadata
● Language
● Named Entity Extraction
○ Location
○ Organisation
○ Places
○ Person
○ Money
○ Percent
○ Date
○ Time
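The checksum in the list above has to be computed over files of arbitrary size, so a worker would normally read in chunks rather than load the whole file. The deck does not say which hash algorithm Kazoup uses; SHA-256 and the function name `file_checksum` are assumptions for this sketch.

```python
import hashlib


def file_checksum(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps calling read() until it returns b"" at EOF.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the digest is updated incrementally, memory use stays bounded by `chunk_size` regardless of file size.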
Search.
Elasticsearch
AD ACL
AD Authentication
User search
Search supports Active Directory integration, which allows users to log on to the web interface with their AD credentials. All search results are scoped to the AD permissions of the logged-on user. The web interface is optimised for desktop and mobile access.
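One common way to scope results to a user's permissions is to index the ACL with each document and add a filter to every query. The sketch below builds such an Elasticsearch request body. It is an assumption about how this could work, not a description of Kazoup's implementation; the field names `content` and `acl.allowed_sids` are hypothetical.

```python
from typing import Sequence


def build_scoped_search(text: str, user_sids: Sequence[str]) -> dict:
    """Build a full-text query restricted to documents the user may read.

    `user_sids` holds the security identifiers of the user and their groups.
    """
    return {
        "query": {
            "bool": {
                # Scored full-text match on the extracted content.
                "must": [{"match": {"content": text}}],
                # Non-scoring filter: at least one of the user's SIDs must be allowed.
                "filter": [{"terms": {"acl.allowed_sids": list(user_sids)}}],
            }
        }
    }
```

Putting the ACL check in the `filter` clause keeps it out of relevance scoring, so permissions restrict the result set without changing its ranking.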
Elasticsearch
Archive Job
Archive Workers
Checksum
Encrypt
Multipart upload
Validate Checksum
Files
Cloud or Private Object Storage
C7A2BE207FE44286AB12F22DFBA360E3
818705DB0AB94DD5B284336E9DAE39D4
AB76D89A01BB4C388F821A3E763AE44F
….
Compress
Deduplicate
SSL Connection over WAN
AES 256 Data Encryption
ZLIB Compression
The Kazoup archive has a policy-based engine with two types of policy. A mirror policy copies files but does not delete them from the original location, so files exist both locally and in the cloud. An archive policy copies files and deletes them from the original location once they have been archived successfully. You can mix and match archive and mirror policies.
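The difference between the two policy types comes down to whether the source file is removed after a successful upload. A minimal sketch of that distinction follows. The deck does not describe how policies select files, so the `min_age_days` rule and all names here are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class PolicyType(Enum):
    MIRROR = "mirror"    # copy to object storage, keep the original
    ARCHIVE = "archive"  # copy to object storage, then delete the original


@dataclass(frozen=True)
class Policy:
    name: str
    kind: PolicyType
    min_age_days: int  # hypothetical selection rule: file age threshold

    def matches(self, file_age_days: int) -> bool:
        """Return True when a file is old enough for this policy."""
        return file_age_days >= self.min_age_days

    @property
    def delete_source_after_upload(self) -> bool:
        """Only archive policies remove the file from its original location."""
        return self.kind is PolicyType.ARCHIVE
```

For example, a file that is 45 days old would match `Policy("keep-copy", PolicyType.MIRROR, 30)` and stay in place, while `Policy("cold", PolicyType.ARCHIVE, 365)` would not yet apply to it.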
The Archive Job runs daily, finds all files that match a given policy and sends them to the archive workers. The workers make sure files are checksummed before applying envelope encryption and compression. Once that is complete, a multipart upload transfers the data to the chosen object storage service. After a successful upload the file checksum is validated and the file is marked as sent to archive. After the daily encrypted backup of the Kazoup appliance finishes, files are marked as archived.
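The worker steps described above (checksum, de-duplicate, compress, upload, validate) can be sketched in a few lines. This is a simplified illustration with several assumptions: SHA-256 as the checksum, a plain dictionary standing in for the object store and its multipart upload, and the encryption step left as a comment. The status strings and the function name are hypothetical.

```python
import hashlib
import zlib


def archive_file(key: str, data: bytes, store: dict, seen_checksums: set) -> str:
    """Run one file through a simplified archive pipeline and return its status."""
    checksum = hashlib.sha256(data).hexdigest()
    if checksum in seen_checksums:
        return "deduplicated"  # identical content was already uploaded

    payload = zlib.compress(data)
    # Envelope encryption of `payload` would happen here; omitted in this sketch.
    store[key] = payload  # stand-in for a multipart upload to object storage

    # Validate: read the object back and compare against the original checksum.
    restored = zlib.decompress(store[key])
    if hashlib.sha256(restored).hexdigest() != checksum:
        del store[key]
        raise ValueError("checksum mismatch after upload for " + key)

    seen_checksums.add(checksum)
    return "sent_to_archive"
```

Tracking checksums of uploaded content is what lets a second file with identical bytes be skipped instead of stored twice.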
Deep Dive into Archive.
Kazoup Appliance
Cloud Object Storage
Encryption is important.
Key Generator → Envelope Key
Data + Envelope Key → Encryption → Encrypted Data
Envelope Key + Master Key → Encryption → Encrypted Envelope Key
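The envelope scheme in this diagram encrypts the data with a freshly generated envelope key and then encrypts that envelope key with the master key. The sketch below illustrates the idea with AES-256-GCM from the third-party `cryptography` package. The deck only states AES 256, so the GCM mode, the 12-byte nonces, the dictionary layout and the function names are all assumptions.

```python
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM


def envelope_encrypt(master_key: bytes, data: bytes) -> dict:
    """Encrypt data with a fresh envelope key, then wrap that key with the master key."""
    envelope_key = AESGCM.generate_key(bit_length=256)
    data_nonce, key_nonce = os.urandom(12), os.urandom(12)
    return {
        "data_nonce": data_nonce,
        "ciphertext": AESGCM(envelope_key).encrypt(data_nonce, data, None),
        "key_nonce": key_nonce,
        "encrypted_envelope_key": AESGCM(master_key).encrypt(key_nonce, envelope_key, None),
    }


def envelope_decrypt(master_key: bytes, blob: dict) -> bytes:
    """Unwrap the envelope key with the master key, then decrypt the data."""
    envelope_key = AESGCM(master_key).decrypt(
        blob["key_nonce"], blob["encrypted_envelope_key"], None
    )
    return AESGCM(envelope_key).decrypt(blob["data_nonce"], blob["ciphertext"], None)


def rotate_master_key(old_master: bytes, new_master: bytes, blob: dict) -> dict:
    """Re-wrap only the envelope key; the (large) data ciphertext is left untouched."""
    envelope_key = AESGCM(old_master).decrypt(
        blob["key_nonce"], blob["encrypted_envelope_key"], None
    )
    key_nonce = os.urandom(12)
    rotated = dict(blob)
    rotated["key_nonce"] = key_nonce
    rotated["encrypted_envelope_key"] = AESGCM(new_master).encrypt(
        key_nonce, envelope_key, None
    )
    return rotated
```

The rotation function shows why the scheme makes key rotation cheap: changing the master key only requires re-encrypting the small envelope key, not the archived file content.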
Cloud stored object.
● Name - random, based on UUID, e.g. C7A2BE207FE44286AB12F22DFBA360E3
● Metadata encrypted with master key
○ x-amz-meta-x-kazoup-iv
○ x-amz-meta-x-kazoup-compression-level
○ x-amz-meta-x-kazoup-original-location
○ x-amz-meta-x-kazoup-envelope-key
○ x-amz-meta-x-kazoup-master-key-hash
○ x-amz-meta-x-kazoup-compression
● Content encrypted with envelope key
● Object Storage Provider metadata
○ size
○ last modified
○ object storage class
○ etc.
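To make the object layout above concrete, the sketch below generates a UUID-based object name in the same 32-character uppercase hex form and assembles the six metadata headers from the slide. The header names come from the slide; the function names, parameter names and value encodings are assumptions, and the values are shown as plain strings because the deck does not specify how they are encrypted or encoded.

```python
import uuid

META_PREFIX = "x-amz-meta-x-kazoup-"


def new_object_name() -> str:
    """Return a random 32-character uppercase hex name derived from a UUID4."""
    return uuid.uuid4().hex.upper()


def build_object_metadata(
    iv: str,
    compression: str,
    compression_level: int,
    original_location: str,
    envelope_key: str,
    master_key_hash: str,
) -> dict:
    """Map the values onto the metadata header names shown on the slide."""
    fields = {
        "iv": iv,
        "compression": compression,
        "compression-level": compression_level,
        "original-location": original_location,
        "envelope-key": envelope_key,  # would hold the encrypted envelope key
        "master-key-hash": master_key_hash,
    }
    return {META_PREFIX + name: str(value) for name, value in fields.items()}
```

A random name means the object key itself reveals nothing about the original file name or path; that information only appears in the metadata.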
Business Continuity.
When file archiving is enabled, the appliance configuration and index are encrypted and backed up daily to the Cloud Object Storage service. These can be easily recovered to a fresh Kazoup appliance installation in a DR scenario.
If the entire customer site or appliance is lost, the software can be reinstalled from backups at the existing customer site, at a new customer location or even in the cloud, providing access to the archived files.
When the original file data source is not accessible, all archived data is still available via the appliance.
Appliance updates
● Appliance sends usage statistics every hour for billing purposes
○ appliance software version
○ size of analysed data
○ no sensitive information is stored or transferred to the Kazoup backend
■ such as file or folder names, content, ACLs etc.
● Automatic updates are pulled over HTTPS and do not require opening any additional ports
Integrated support.
● Appliance web frontend contains a built-in support chat application
● Remote session can be initiated by the client via the appliance console
○ gives access to the appliance behind the firewall for troubleshooting and diagnosis of issues without the need to open any additional ports
○ can only be initiated by the client
Kazoup Security.
Index data is stored behind the customer network
Any data leaving the customer network is encrypted in transit with SSL and at rest with AES-256 envelope encryption
Appliance updates are pulled over HTTPS and do not require opening any additional ports
Search integration with Active Directory ACLs
Remote support can only be initiated by the client
Technology roadmap
Cloud Archive
Cloud data sources: OneDrive, Dropbox, Box, Google Drive, Slack, Gmail
Desktop and Mobile search integration
Additional data sources: Office 365, SharePoint, Egnyte, Salesforce
Cloud SaaS version
We are here
Automated document classification, automatic anomaly detection, Speech API and Audio to text
Natural Language API, Content Classification, Translation, Distributed ML (leverage spare compute of users' workstations for analytics)
Machine Learning with TensorFlow, Vision API, OCR and Image Content Analysis
We are strong believers in open source, and our software wouldn’t be possible without it. The following technologies help us make Kazoup happen:
● Elasticsearch - powers our analytics and search engine.
● Google Polymer and Web Components - the future of web development is here and we are using it.
● Docker - our development, staging, QA, production, deployment and automatic updates hero.
● RabbitMQ - robust messaging for our application.
● HTTP2 - delivers our application at speed even on high latency mobile networks.
● Envelope Encryption - keeps customer data safe and allows us to rotate the encryption key at any time.
● Machine Learning - helps us make sense of the large volumes of data.
● The Twelve-Factor App and TDD - guide our development process.
● Github/CircleCI - Continuous Integration and Deployment made right.
● Intercom - helps us communicate directly with our users.
● AWS - provides our backend infrastructure.
● Python and Go Programming Language - make us happy.
● Slack - helps us communicate internally.