Lucene solrrev documentlevelsecurity_rajanimaski_final

Post on 17-Dec-2014


Apache Solr Search Engine


Rajani Maski - Senior Software Engineer

DOCUMENT LEVEL SECURITY IN SEARCH BASED APPLICATIONS

Agenda

• Introduction to Search Based Applications
• Requirement Analysis of Document Level Security
• Access Control Lists
• Multiple Solutions
• Summary

Search Based Applications

Search based applications are software applications in which a search engine platform is used as the core infrastructure for information access and reporting.

E-commerce web applications and content management systems are examples of search based applications.

Overview of Search Based System

Authentication
• The user is authenticated before being given access to the application.

Application
• Presents a full-fledged user interface.
• Performs user operations such as uploading documents, sending emails, and searching.

Unified Data Layer
• Search server.
• Indexes content across the sources.
• Retrieves data at very high speed.

Data Storage
• A volume of data sources from different repositories.

[Diagram: User Authentication System, Search Based Application Server, and Unified Data Layer, fed by Archives, Documents, Emails, and File Server]

So Far, So Good!

What’s the problem?

Common Access To Unified Data Layer

[Diagram: User Authentication System, Search Based Application, and Unified Data Layer, fed by Archives, Documents, Emails, and File Servers]

How is this a threat?

Consider a Sample Use Case

User A:
- Logs in to the application.
- Performs a search operation with keywords such as ‘Pay Slips’, ‘Personal’ or ‘appraisal’.

Sample results demonstrated for “appraisal”.

[Screenshot: search results, including unauthorized results]

Observations

Relevant search results [Correct]:
- User A was returned relevant search results for the query, such as exact matches, "more like this" keywords, and synonym keywords.

Unauthorized search results [Wrong]:
- A few of the retrieved search results were documents that the user was not authorized to view.

Threats:
• Exposure of other users’ confidential documents.
• Access to unauthorized information.

How are we doing with this?

Problem Definition

• To develop a search platform where every user has access only to the documents he or she is authorized to view.
• To ensure that confidential uploaded data is not globally searchable unless it is intended to be globally accessible.

How can we achieve this?

Solution

Maintain an Access Control List mapped to each document object.

Access Control List?

Access Control List

• Access controls are security features that control how users [subjects] and documents [objects] communicate and interact with one another.
• Subject: an active entity [user] that requests access to an object [document].
• Object: a passive entity [document] that contains information.

[Diagram: Subject interacts with Object (Document)]

Data Model

Let’s first understand the data model of a search engine.

How are documents stored in a search engine? With a document-oriented approach.

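Document-oriented record:

Alec_1167 {
  _id: "1167",
  Name: "Ale C",
  Agent: "Miller",
  Place: "NY, NJ, CA",
  Units: 570
}

Relational rows (main table):

3424   Kiwi   reds     340
5612   Reh    Mo’s     664
1167   Alec   Miller   570

Relational rows (places for record 1167):

1167   1   NY
1167   2   NJ
1167   3   CA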
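The relationship between the relational rows and the flat document above can be sketched as follows. This is a minimal illustration, not the author's code: the column names (`id`, `name`, `agent`, `units`, `seq`, `place`) are assumptions, since the slide does not label the columns.

```python
def to_flat_document(record, places):
    """Merge one main-table row with its place rows into a single flat document."""
    # Keep only the place rows that belong to this record, in sequence order.
    ordered = sorted(
        (p for p in places if p["id"] == record["id"]),
        key=lambda p: p["seq"],
    )
    return {
        "_id": str(record["id"]),
        "Name": record["name"],
        "Agent": record["agent"],
        "Place": ", ".join(p["place"] for p in ordered),
        "Units": record["units"],
    }


# Hypothetical rows mirroring the slide's sample data.
record = {"id": 1167, "name": "Alec", "agent": "Miller", "units": 570}
places = [
    {"id": 1167, "seq": 2, "place": "NJ"},
    {"id": 1167, "seq": 3, "place": "CA"},
    {"id": 1167, "seq": 1, "place": "NY"},
]
doc = to_flat_document(record, places)
```

The point of the flat structure is that the search engine never joins at query time: everything needed to match and display a record travels with the document.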

Indexing and Storing Document Object

• User A uploads a document into the system.
• Text is extracted.
• The document is converted to a flat structure.
• It is input to the search engine.

[Diagram: Document -> Text Extract -> Search Engine -> Document Saved]

• We missed capturing something!

• What did we miss?
  – User information for each document:
    • Who uploaded the document?
    • With whom did the user share it?

• How do we maintain this information?
  – With an access control list attached to each document object.


Conventional Solution

• Access control lists are kept for each user.
• At the time of search:
  – Retrieve the search results,
  – Perform a check on each document for the user’s authorization,
  – And finally return the results.

[Diagram: Search Engine -> Security Filter on Each Document -> Return Results to User]
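The conventional check-after-retrieval flow can be sketched in a few lines. This is only an illustration under assumptions: the ACL is modeled as an in-memory dictionary from document ID to the set of users allowed to see it, and documents without an ACL entry are denied by default. Neither choice comes from the slides.

```python
def filter_authorized(results, user_id, acl):
    """Keep only the results whose ACL entry lists user_id (default deny)."""
    return [doc_id for doc_id in results if user_id in acl.get(doc_id, set())]


# Hypothetical ACL and search results.
acl = {
    "doc1": {"userA", "userB"},
    "doc2": {"userB"},
    "doc3": {"userA"},
}
results = ["doc1", "doc2", "doc3", "doc4"]
visible = filter_authorized(results, "userA", acl)
```

Because every hit is checked individually after the search has already run, the cost grows with the size of the result set, which is what motivates the alternatives in the rest of the deck.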

Multiple Solutions

Access Control Models

Solutions depend on the access control model we choose.

Two important types of access control models:

1. Non-Discretionary Access Control (Role Based)
2. Discretionary Access Control (DAC)

1. Non-Discretionary (Role Based)

Definition:

• Non-discretionary access control uses an administered set of rules to determine how users and documents interact.
• It is referred to as non-discretionary because access is decided by centrally administered roles rather than at the document owner’s discretion; every user must be assigned to a role.

[Diagram: roles (Sales, Manager, Super User) mapped to Sales Documents, Marketing Documents, Engineering Documents, and Admin Documents]

Solution For Role Based ACL - Type 1

For a system that has:
• Roles defined at design time and a static ACL set on each document.
• We choose “Early Binding with ACL bound to Document Objects”.

In such systems:
• Document objects include a multi-valued role-Ids field that contains the list of role IDs that have access to the document.

Documents with ACLs (index time):

Document 1   role-Ids: ["1", "2", "3"]
Document 2   role-Ids: ["2", "3"]

Early Binding (Continued…)

At the time of search:
• The user’s search query should be appended with the user’s role ID.
• Solr’s filter query feature and its caching techniques give the most efficient solution for such ACL techniques. This approach is called the ‘Early Binding’ approach.

[Diagram: Query Request with User Role-Id -> SolrJ Client -> Query Response]
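As a rough sketch of the search-time step, the function below builds Solr request parameters that leave the user's query in `q` and put the role restriction in a filter query (`fq`), so that Solr can cache the filter independently of the query text. The field name `role_ids` is an assumption standing in for the slide's role-Ids field, and the alphanumeric check is a simplification in place of proper query escaping.

```python
def build_role_filtered_params(user_query, role_ids, field="role_ids"):
    """Return Solr request parameters restricted to the given role IDs."""
    if not role_ids:
        # A user without roles must not fall through to an unfiltered search.
        raise ValueError("at least one role ID is required")
    if not all(str(r).isalnum() for r in role_ids):
        raise ValueError("role IDs are assumed to be alphanumeric")
    clause = " OR ".join('"%s"' % r for r in role_ids)
    return {"q": user_query, "fq": "%s:(%s)" % (field, clause)}


params = build_role_filtered_params("appraisal", ["2", "3"])
```

With the sample documents from the previous slide, a user holding only role "1" would match Document 1 but not Document 2.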

Solution For Role Based ACL - Type 2

For systems that have:
• Roles that change often; data is normalized by segregating access control information into separate tables.
• This approach is called ‘Early Binding with Externalized ACL’.

In such systems:
• Role IDs are not attached to the document object.
• Instead they are stored in separate tables with a foreign key relation.
• Pseudo joins are used at the time of search.

Document table:   D1 (Document 1)

ACL table:        Doc ID   Role-Ids
                  D1       1, 2, 3, N
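One way to express such a pseudo join is Solr's join query parser, used as a filter query. The sketch below only assembles the filter string. The core name `acl` and the field names `doc_id`, `id`, and `role_ids` are assumptions chosen for illustration, and the slides do not say which join mechanism the author used.

```python
def build_join_filter(role_ids, acl_core="acl", from_field="doc_id",
                      to_field="id", role_field="role_ids"):
    """Build a join filter: ACL rows matching the roles select the documents."""
    if not role_ids:
        raise ValueError("at least one role ID is required")
    if not all(str(r).isalnum() for r in role_ids):
        raise ValueError("role IDs are assumed to be alphanumeric")
    clause = " OR ".join('"%s"' % r for r in role_ids)
    return "{!join from=%s to=%s fromIndex=%s}%s:(%s)" % (
        from_field, to_field, acl_core, role_field, clause)


fq = build_join_filter(["2"])
```

Because the ACL lives in its own index, a role change only requires re-indexing the small ACL record, not the document itself.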

2. Discretionary Access Control

Definition:
• Discretionary: the document owner has the authority to control access to the document.
• A system that enables the document owner to specify the set of users with access to a set of documents.

[Diagram: Owner specifies the users/groups who can access the Object]

Solution for Discretionary ACL - Type 1

For a system in which:
• The ACL changes frequently,
• The ACL is defined for each user and document pair,
• We choose the ‘Late Binding Approach with Externalized ACL’.

In such systems:
• The ACL is a 2D matrix with users along its rows and documents along its columns.

Users     Doc 1   Doc 2   …   Doc M
User A    1       1           1
User B    0       1           1
…
User N

Encoded values: 0 = no access, 1 = access.
N: number of users, M: number of documents.

Continued…

For implementation, the ACL matrix can be represented as an array of bits, one row per user.

This compact representation improves search efficiency and reduces memory overhead.

Users    Doc 1   Doc 2   Doc M      Bit array
User A   1       1       1          [1] 111
User B   0       1       1          [2] 011

Example

Consider:
• The search system holds at most 5 documents, with document IDs {1, 2, 3, 4, 5}.
• There are at most 2 users, with IDs {1, 2}.
• User 1 has access to documents {1, 2, 3}.
• User 2 has access to documents {1, 2, 3, 4, 5}.

ACL matrix and array representation:

User   Doc 1   Doc 2   Doc 3   Doc 4   Doc 5     Bit array
1      1       1       1       0       0         [1] 11100
2      1       1       1       1       1         [2] 11111
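The example above can be reproduced with a small sketch that stores each user's ACL row as an integer bitset. The helper names are hypothetical, and using bit position `doc_id - 1` is an implementation choice made here, not something the slides specify.

```python
def encode_acl(doc_ids):
    """Pack a set of accessible document IDs (1-based) into an integer bitset."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << (doc_id - 1)
    return bits


def has_access(bits, doc_id):
    """Check a single document ID against a user's bitset."""
    return (bits >> (doc_id - 1)) & 1 == 1


def as_slide_string(bits, max_docs):
    """Render the bitset the way the slide prints it: document 1 leftmost."""
    return "".join("1" if has_access(bits, d) else "0"
                   for d in range(1, max_docs + 1))


# One row per user, as in the example: user 1 sees docs 1-3, user 2 sees all.
acl = {1: encode_acl({1, 2, 3}), 2: encode_acl({1, 2, 3, 4, 5})}
```

A lookup is then a shift and a mask, which is why the bit array representation is cheap to apply to every matched document.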

Solr Implementation

Solution 1
• Solr has a PostFilter interface that can be implemented to develop a custom plugin.
• The filter supplies a collector with a method called collect().
• collect() is called for each document matched by the user’s search query:
  – Get the document ID from the field cache and apply the ACL using the bit array.
• Code snippets: https://gist.github.com/rajanim/7197154
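The delegating pattern behind a post filter can be illustrated without Solr. The sketch below is a plain Python stand-in, not the Solr API and not the code from the gist: `AclFilterCollector` plays the role of the filtering collector, `ListCollector` plays the role of the downstream collector, and `doc_id_lookup` stands in for the field cache that maps internal document numbers to application document IDs. All three names are hypothetical.

```python
class ListCollector:
    """Stand-in for the downstream collector: remembers what it was given."""

    def __init__(self):
        self.collected = []

    def collect(self, doc):
        self.collected.append(doc)


class AclFilterCollector:
    """Forwards a matched document only if the user's ACL bit is set."""

    def __init__(self, delegate, user_acl_bits, doc_id_lookup):
        self.delegate = delegate
        self.user_acl_bits = user_acl_bits
        self.doc_id_lookup = doc_id_lookup  # internal doc number -> document ID

    def collect(self, doc):
        doc_id = self.doc_id_lookup[doc]
        if (self.user_acl_bits >> (doc_id - 1)) & 1:
            self.delegate.collect(doc)


# User 1 from the example may see document IDs 1, 2 and 3 (bits 0-2 set).
sink = ListCollector()
collector = AclFilterCollector(sink, 0b00111, doc_id_lookup=[3, 1, 5, 2, 4])
for internal_doc in range(5):  # pretend every document matched the query
    collector.collect(internal_doc)
```

Only internal documents 0, 1 and 3 reach the downstream collector, because they map to document IDs 3, 1 and 2.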

Other Implementation Solution

Solution 2
• Use BitSet utilities.
• Get the bitset of documents matched by the search query from the search engine.
• Get the user’s ACL bitset instance.
• Obtain the intersection of the two bitsets [intersect(bitset other)].

[Diagram: matched documents 1 1 1 0 0 AND user ACL 1 1 1 0 0 = 1 1 1 0 0]
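The intersection step amounts to a bitwise AND. A minimal sketch, using Python integers as bitsets and hypothetical helper names, with bit strings written document 1 first as on the slides:

```python
def from_bit_string(text):
    """Parse a slide-style bit string (document 1 leftmost) into an integer."""
    bits = 0
    for position, char in enumerate(text):
        if char == "1":
            bits |= 1 << position
    return bits


def doc_ids(bits):
    """List the 1-based document IDs whose bits are set."""
    ids, doc_id = [], 1
    while bits:
        if bits & 1:
            ids.append(doc_id)
        bits >>= 1
        doc_id += 1
    return ids


matched = from_bit_string("10110")   # the query matched documents 1, 3, 4
user_acl = from_bit_string("11100")  # the user may see documents 1, 2, 3
authorized = matched & user_acl      # intersection of the two bitsets
```

Unlike the per-document collector, this variant filters the whole result set in one operation, at the cost of materializing both bitsets first.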

Solution for Discretionary ACL - Type 2

• For discretionary ACL systems that have a static ACL,
• We choose “Early Binding with ACL bound to Document Objects”.

In such systems:
• Document objects include a multi-valued user-id field that contains the list of user IDs with access to the document.
• The user-id field has to be indexed.

Continued…

• This solution requires the ACL and document data to be de-normalized into a flat structure.

Index time: [Diagram: Parse Document -> Add List of Users Who Have Access]
Search time: [Diagram: Query Request with User ID -> SolrJ Client -> Query Response]
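Both halves of this approach can be sketched together: at index time the owner and the users the document is shared with are flattened into one multi-valued field, and at search time the requesting user's ID becomes a filter query. The field name `user_ids` and the document layout are assumptions, and the alphanumeric check is a simplification in place of proper query escaping.

```python
def build_index_document(doc_id, text, owner, shared_with):
    """Flatten the document and its ACL into one indexable record."""
    user_ids = sorted({owner, *shared_with})  # owner always keeps access
    return {"id": doc_id, "text": text, "user_ids": user_ids}


def build_user_filtered_params(user_query, user_id, field="user_ids"):
    """Return Solr request parameters restricted to one user's documents."""
    if not str(user_id).isalnum():
        raise ValueError("user IDs are assumed to be alphanumeric")
    return {"q": user_query, "fq": '%s:"%s"' % (field, user_id)}


index_doc = build_index_document("D1", "appraisal notes", "userA", ["userB"])
params = build_user_filtered_params("appraisal", "userB")
```

The trade-off is the one the slide names: any change to who may see a document means re-indexing that document, so this fits systems where the ACL is static.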

Summary

• Discretionary ACL with the late binding solution is a complex model and requires extensive verification.
• Leverage Solr’s smart caching capability.
• Since an ACL always adds overhead, it has to be optimized to keep the added delay to a minimum.


Thank You