Kazoup software appliance - A technical deep dive

Technical Deep Dive

Radek Dymacz

Co-founder and CTO

Twitter: @kazoup @ChroniclesOfRD

Demo: https://demo.kazoup.com

Kazoup software.

Provides analytics, enterprise search and policy-based data management.

Delivered as a virtual appliance behind the corporate network.

Discovers and indexes large volumes of unstructured file data.

Overview.

Files

Currently supported file stores are Samba (SMB) shares, including those hosted on Windows and Linux.

Kazoup Appliance

All indexed data is stored locally on the Kazoup appliance, behind the customer network. It is accessible via a secure web interface providing search, analytics and archive services.

Customer infrastructure

Cloud Object Storage

When archiving, data leaving the customer network is de-duplicated, compressed at file level, and encrypted in transit and at rest.

Encrypted files are stored with supported object storage providers: AWS, Azure, Google, CenturyLink and private object storage.

AES 256 Envelope Data Encryption

SSL Connection over WAN

Kazoup virtual appliance.

Docker

Elasticsearch, Postgres, RabbitMQ, Celery, Tika, Natural Language Processing

Backend App (Python)

API (REST)

Frontend (Polymer + D3.js)

Preconfigured Ubuntu virtual appliance for VMware and Hyper-V with Docker service

Analytics.

Elasticsearch

Data Aggregations

D3.js

Metadata Content

Web Interface

Local Archive Files

Index

Web Frontend Presentation

Analytics Engine

Data

NLP

Backend App

Discovery and indexing.

Elasticsearch

Discovery Worker

Meta Scan Worker

Checksum Worker

Tika Worker

NLP Worker

Recursively walks directories, finds files in the data source and passes them to the meta scan queue.

Extracts file metadata and updates the Elasticsearch document.

Finds files which haven't been checksummed, runs the checksum and passes them to the queue for content extraction with Tika.

Extracts content plus some additional metadata based on the file type. Updates Elasticsearch and passes the text to the NLP queue.

Natural Language Processing (NLP) of the extracted text performs Named Entity Recognition and updates the Elasticsearch document.
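The first hand-off in this pipeline, from the discovery worker to the meta scan queue, could look roughly like the sketch below. The appliance uses RabbitMQ and Celery for queuing; this sketch swaps in Python's in-process `queue.Queue` purely as a stand-in, and the names `discover` and `drain` are hypothetical.

```python
import os
import queue


def discover(root, meta_scan_queue):
    """Walk `root` recursively and enqueue every file path found.

    Returns the number of files passed to the meta scan queue.
    """
    found = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            meta_scan_queue.put(os.path.join(dirpath, name))
            found += 1
    return found


def drain(q):
    """Collect everything currently on the queue (stand-in for a worker loop)."""
    items = []
    while True:
        try:
            items.append(q.get_nowait())
        except queue.Empty:
            return items
```

Each later stage (checksum, Tika, NLP) would follow the same shape: take an item from one queue, do its work, update the index and pass the item on.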

DISCOVERY AND METADATA SCAN

CHECKSUM AND CONTENT EXTRACTION SCAN

Extracted Metadata

● Name
● Location
● Size
● Access Time
● Modification Time
● Created Time
● MIME Type
● Extension
● Category
● SMB Extended Attributes
● AD ACL
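As an illustration of the simpler fields in this list, here is a minimal sketch using only the Python standard library. The function name `extract_metadata` and the dictionary keys are assumptions, not the appliance's actual schema. Created time, category, SMB extended attributes and AD ACLs are left out because they need platform- or protocol-specific calls.

```python
import mimetypes
import os


def extract_metadata(path):
    """Build a metadata record for one file from os.stat and its name."""
    st = os.stat(path)
    mime_type, _encoding = mimetypes.guess_type(path)
    return {
        "name": os.path.basename(path),
        "location": os.path.dirname(os.path.abspath(path)),
        "size": st.st_size,                 # bytes
        "access_time": st.st_atime,         # seconds since the epoch
        "modification_time": st.st_mtime,   # seconds since the epoch
        "extension": os.path.splitext(path)[1].lstrip(".").lower(),
        "mime_type": mime_type,             # guessed from the name, may be None
    }
```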

Extracted Content.

● Checksum
● Content raw text
● Tika Metadata
● Language
● Named Entity Extraction
○ Location
○ Organisation
○ Places
○ Person
○ Money
○ Percent
○ Date
○ Time
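The checksum step can be sketched as a chunked hash, so that large files never have to fit in memory. The slides do not name the hash algorithm; SHA-256 is an assumption here, as is the function name `file_checksum`.

```python
import hashlib


def file_checksum(path, chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in fixed-size chunks.

    SHA-256 is an assumption; the source does not say which hash is used.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```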

Search.

Elasticsearch

AD ACL

AD Authentication

User search

Search supports Active Directory integration, which allows users to log on to the web interface with their AD credentials. All search results are scoped to the AD permissions of the logged-on user. The web interface is optimised for desktop and mobile access.
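One common way to scope results to a user's permissions is to add a filter clause to the Elasticsearch query. The sketch below builds such a query body with the standard `bool`, `match` and `terms` queries. The field names `content` and `acl_allowed`, and the idea of storing allowed principals on each document, are assumptions for illustration only.

```python
def build_scoped_query(text, user_principals):
    """Build a search body that only matches documents the user may read.

    `user_principals` is the set of identifiers (user and group names or
    SIDs) resolved for the logged-on user. Field names are hypothetical.
    """
    return {
        "query": {
            "bool": {
                # Full-text relevance comes from the `must` clause.
                "must": [{"match": {"content": text}}],
                # The filter does not affect scoring, it only restricts hits.
                "filter": [{"terms": {"acl_allowed": sorted(user_principals)}}],
            }
        }
    }
```

Keeping the permission check in `filter` rather than `must` means it cannot change the ranking of the documents the user is allowed to see.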

Elasticsearch

Archive Job

Archive Workers

Checksum

Encrypt

Multipart upload

Validate Checksum

Files

Cloud or Private Object Storage

C7A2BE207FE44286AB12F22DFBA360E3818705DB0AB94DD5B284336E9DAE39D4AB76D89A01BB4C388F821A3E763AE44F….

Compress

Deduplicate

SSL Connection over WAN

AES 256 Data Encryption

ZLIB Compression

Kazoup archive has a policy-based engine with two types of policy. A mirror policy copies files but does not delete them from the original location, so files exist both locally and in the cloud. An archive policy copies files and deletes them from the original location once they have been archived successfully. You can mix and match archive and mirror policies.
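The difference between the two policy types comes down to what happens to the original after a successful copy. A minimal sketch, assuming hypothetical names throughout and a made-up matching rule (file age in days), since the slides do not say what criteria a policy can match on:

```python
from dataclasses import dataclass
from enum import Enum


class PolicyType(Enum):
    MIRROR = "mirror"    # copy to object storage, keep the original
    ARCHIVE = "archive"  # copy to object storage, then delete the original


@dataclass(frozen=True)
class Policy:
    name: str
    policy_type: PolicyType
    min_age_days: int  # hypothetical matching rule, not from the source

    def matches(self, file_age_days):
        """Return True if a file of this age falls under the policy."""
        return file_age_days >= self.min_age_days


def should_delete_original(policy, upload_succeeded):
    """Only archive policies delete, and only after a successful upload."""
    return policy.policy_type is PolicyType.ARCHIVE and upload_succeeded
```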

The Archive Job runs daily, finds all files that match a given policy and sends them to the archive workers. The workers make sure files are checksummed before applying envelope encryption and compression. Once that is complete, a multipart upload transfers the data to the chosen object storage service. After a successful upload the file checksum is validated and the file is marked as sent to archive. Once the daily encrypted backup of the Kazoup appliance finishes, the files are marked as archived.
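The checksum, de-duplication, compression and validation steps can be sketched with the standard library alone. This is a simplified illustration: it works on in-memory bytes, assumes SHA-256 as the checksum, and deliberately leaves out the encryption and multipart upload steps. All function names are hypothetical.

```python
import hashlib
import zlib


def prepare_for_upload(files, level=6):
    """Checksum, de-duplicate and compress file contents.

    `files` maps a path to its bytes. Returns (blobs, index) where `blobs`
    maps checksum -> compressed bytes (one entry per unique content) and
    `index` maps each path to the checksum of its content.
    """
    blobs, index = {}, {}
    for path, data in files.items():
        checksum = hashlib.sha256(data).hexdigest()
        index[path] = checksum
        if checksum not in blobs:  # identical content is stored only once
            blobs[checksum] = zlib.compress(data, level)
    return blobs, index


def validate(blob, expected_checksum):
    """Decompress a stored blob and confirm it still matches its checksum."""
    return hashlib.sha256(zlib.decompress(blob)).hexdigest() == expected_checksum
```

Keying the blobs by checksum is what makes file-level de-duplication fall out naturally: two paths with identical content resolve to the same stored object.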

Deep Dive into Archive.

Kazoup Appliance

Cloud Object Storage

Encryption is important.

Key Generator → Envelope Key

Data + Envelope Key → Encryption → Encrypted Data

Envelope Key + Master Key → Encryption → Encrypted Envelope Key

Cloud stored object.

● Name - random, based on UUID, e.g. C7A2BE207FE44286AB12F22DFBA360E3
● Metadata encrypted with master key
○ x-amz-meta-x-kazoup-iv
○ x-amz-meta-x-kazoup-compression-level
○ x-amz-meta-x-kazoup-original-location
○ x-amz-meta-x-kazoup-envelope-key
○ x-amz-meta-x-kazoup-master-key-hash
○ x-amz-meta-x-kazoup-compression
● Content encrypted with envelope key
● Object Storage Provider metadata
○ size
○ last modified
○ object storage class
○ etc.
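A random object name of the shape shown above, and the prefixed metadata keys, could be produced along these lines. This is a sketch under assumptions: the slides say only that the name is "random based on UUID", so a version 4 UUID is a guess, and the metadata values passed in are assumed to have been encrypted already. The function names are hypothetical.

```python
import uuid

# Prefix taken from the metadata key names listed on the slide.
META_PREFIX = "x-amz-meta-x-kazoup-"


def new_object_name():
    """Return 32 uppercase hex characters derived from a random UUID."""
    return uuid.uuid4().hex.upper()


def build_metadata(fields):
    """Prefix each metadata field name, e.g. 'iv' -> 'x-amz-meta-x-kazoup-iv'.

    Values are passed through unchanged; they are assumed to be strings
    that were encrypted with the master key before this call.
    """
    return {META_PREFIX + key: value for key, value in fields.items()}
```

Because the object name carries no trace of the original file name, the original location is recoverable only from the encrypted metadata.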

AWS example.

Business Continuity.

When file archiving is enabled, the appliance configuration and index are encrypted and backed up daily to the Cloud Object Storage service. These can easily be recovered to a fresh Kazoup appliance installation in a DR scenario.

If the entire customer site or the appliance is lost, the software can be reinstalled from backups at the existing customer site, at a new customer location or even in the cloud, providing access to the archived files.

When the original file data source is not accessible, all archived data is still available via the appliance.

Appliance updates

● Appliance sends usage statistics every hour for billing purposes
○ appliance software version
○ size of analysed data
○ no sensitive information is stored or transferred to the Kazoup backend
■ such as file or folder names, content, ACLs etc.
● Automatic updates are pulled over HTTPS and don't require opening any additional ports

Integrated support.

● Appliance web frontend contains a built-in support chat application
● Remote session can be initiated by the client via the appliance console
○ gives access to the appliance behind the firewall for troubleshooting and diagnosis of issues, without the need to open any additional ports
○ can only be initiated by the client

Kazoup Security.

Index data stored behind customer network

Any data leaving the customer network is encrypted in transit with SSL and at rest with AES-256 envelope encryption

Appliance updates are pulled over HTTPS and don't require opening any additional ports

Search integration with Active Directory ACLs

Remote support can only be initiated by the client

Technology roadmap

Cloud Archive

Cloud data sources: OneDrive, Dropbox, Box, Google Drive, Slack, Gmail

Desktop and Mobile search integration

Additional data sources: Office 365, SharePoint, Egnyte, Salesforce

Cloud SaaS version

We are here

Automated document classification, automatic anomaly detection, Speech API and audio to text

Natural Language API, Content Classification, Translation, Distributed ML (leveraging spare compute on users' workstations for analytics)

Machine Learning with TensorFlow, Vision API, OCR and Image Content Analysis

We are strong believers in open source, and our software wouldn't be possible without it. The following technologies help us make Kazoup happen:

● Elasticsearch - powers our analytics and search engine.
● Google Polymer and Web Components - the future of web development is here and we are using it.
● Docker - our development, staging, QA, production, deployment and automatic updates hero.
● RabbitMQ - robust messaging for our application.
● HTTP/2 - delivers our application at speed, even on high-latency mobile networks.
● Envelope Encryption - keeps customer data safe and allows us to rotate the encryption key at any time.
● Machine Learning - helps us make sense of large volumes of data.
● The Twelve-Factor App and TDD - guide our development process.
● GitHub/CircleCI - Continuous Integration and Deployment made right.
● Intercom - helps us communicate directly with our users.
● AWS - provides our backend infrastructure.
● Python and Go - make us happy.
● Slack - helps us communicate internally.

Intelligent file data management.