Rajani Maski - Senior Software Engineer
DOCUMENT LEVEL SECURITY IN SEARCH BASED APPLICATIONS
• Introduction to Search Based Applications
• Requirement Analysis of Document Level Security
• Access Control Lists
• Multiple Solutions
• Summary
Agenda
Search-based applications are software applications in which a search engine
platform is the core infrastructure for information access and reporting.
E-commerce web applications and content management systems are typical examples of search-based applications.
Search Based Applications
Authentication
• The user is authenticated before being given access to the application
Application
• Presents a full-fledged user interface
• Performs user operations such as uploading documents, sending emails, searching, etc.
Unified Data Layer
• Search server
• Indexes content across the sources
• Retrieves data at very high speed
Data Storage
• Volume of data sources from different repositories
Overview of Search Based System
[Diagram: User Authentication System, Search Based Application Server, and Unified Data Layer over the data sources: Archives, Documents, Emails, File Server]
So Far, So Good!
What’s the problem?
[Diagram: User Authentication System, Search Based Application, and Unified Data Layer over the data sources: Archives, Documents, Emails, File Servers]
Common Access to Unified Data Layer
How is this a threat?
User A:
- Logs in to the application.
- Performs a search with keywords such as 'Pay Slips', 'Personal' or 'appraisal'.
Sample results demonstrated for “appraisal”
Consider a Sample Use Case
Unauthorized Results
Search Results
Relevant search results: [Correct]
- User A received relevant search results for his query, such as exact matches, "more like this" keywords, synonym keywords, etc.
Unauthorized search results: [Wrong]
- A few of the search results retrieved were documents he was not authorized to view.
Threats:
• Exposure of other users' confidential documents
• Access to unauthorized information
Observations
How are we doing with this?
• To develop a search platform where every user has access only to those documents he/she is authorized to view.
• To ensure that confidential data uploaded is not globally searchable unless it is intended to be globally accessible.
Problem Definition
How can we achieve this?
Solution
Maintain an Access Control List mapped to each document object.
Access Control List?
• Access controls are security features that control how users [subjects] and documents [objects] communicate and interact with one another.
• Subject: An active entity [user] that requests access to an object [document].
• Object: A passive entity [document] that contains information.
Access Control List
[Diagram: Subject interacts with Document Object]
Let's first understand the data model of a search engine.
How are documents stored in a search engine?
Document-oriented approach.
Data Model
Document-oriented record:
Alec_1167 {
  _id: "1167",
  Name: "Ale C",
  Agent: "Miller",
  Place: "NY, NJ, CA",
  Units: 570
}

Relational rows (column headers not preserved):
3424  Kiwi  reds    340
5612  Reh   Mo's    664
1167  Alec  Miller  570

1167  1  NY
1167  2  NJ
1167  3  CA
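As a rough illustration of the document-oriented approach, the sketch below folds the relational rows from the slide into one flat document. The column meanings (id, name, agent, units; id, sequence, place) are assumptions inferred from the sample record, and `to_document` is a hypothetical helper, not part of any search engine API.

```python
# Rows as they might sit in two relational tables (column meanings are assumed).
accounts = [
    (3424, "Kiwi", "reds", 340),
    (5612, "Reh", "Mo's", 664),
    (1167, "Alec", "Miller", 570),
]
places = [(1167, 2, "NJ"), (1167, 3, "CA"), (1167, 1, "NY")]


def to_document(account, place_rows):
    """Fold one account row and its place rows into a single flat document."""
    account_id, name, agent, units = account
    # Order the places by their sequence number before joining them.
    ordered = sorted(
        (seq, place) for acc_id, seq, place in place_rows if acc_id == account_id
    )
    return {
        "_id": str(account_id),
        "Name": name,
        "Agent": agent,
        "Place": ", ".join(place for _, place in ordered),
        "Units": units,
    }


document = to_document(accounts[2], places)
print(document)
```

The search engine stores and retrieves the whole record as one unit, so no join is needed at query time.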
• User A uploads a document into the system
• Text extraction
• Convert it to a flat structure
• Input it to the search engine
Indexing and Storing Document Object
[Diagram: Document -> Text Extract -> Search Engine -> Document Saved]
• We missed something!
• What did we miss?
  – Capturing the user information for each document!
    • Who uploaded the document?
    • With whom did the user share it?
• How do we maintain this information?
  – An access control list for each document object.
[Diagram: Document -> Text Extract -> Search Engine -> Document Saved]
• Access control lists for each user.
• At the time of search,
  – Retrieve the search results,
  – Check each document against the user's authorization, and
  – Finally return the results.
Conventional Solution
Search Engine
Security Filter Each Document
Return Results to User
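The conventional flow (search, check each hit, return) can be sketched as below. This is a minimal in-memory sketch: the `ACL` mapping, the document ids, and `security_filter` are hypothetical stand-ins, and the search engine call is replaced by a fixed list of hits.

```python
from typing import Dict, List, Set

# Hypothetical ACL: document id -> set of user ids allowed to view it.
ACL: Dict[str, Set[str]] = {
    "doc-1": {"userA", "userB"},
    "doc-2": {"userB"},
    "doc-3": {"userA"},
}


def security_filter(user_id: str, hits: List[str], acl: Dict[str, Set[str]]) -> List[str]:
    """Keep only the hits the user may view, preserving the ranking order.

    Documents with no ACL entry are treated as not accessible.
    """
    return [doc_id for doc_id in hits if user_id in acl.get(doc_id, set())]


# Pretend the search engine returned these ids, best match first.
raw_hits = ["doc-2", "doc-1", "doc-3"]
visible_to_a = security_filter("userA", raw_hits, ACL)
print(visible_to_a)
```

The cost of this approach grows with the number of hits, since every matched document is checked on every search.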
Multiple Solutions.
The solution depends on the access control model we choose.
Two important types of access control models:
1. Non-Discretionary Access Control (Role Based)
2. Discretionary Access Control (DAC)
Access Control Models
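Definition:
• Non-discretionary access control uses a centrally administered set of rules to determine how users and documents interact.
• It is referred to as non-discretionary because assigning a user to a role is unavoidable: access follows from the administered role, not from the document owner's discretion.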
1. Non-Discretionary (Role Based)
[Diagram: roles (Sales, Manager, Super User) and document sets (Sales Documents, Marketing Documents, Engineering Documents, Admin Documents)]
For a system that has
• Roles defined at design time and a static ACL set on each document,
• We choose "Early Binding with ACL bound to Document Objects".
In such systems,
• Document objects include a multi-valued role-id field containing the list of role-ids that have access to the document.
Solution For Role Based ACL - Type 1
Documents with ACLs (Index Time)
Document 1 role-Ids: ["1", "2", "3"]
Document 2 role-Ids: ["2", "3"]
Continued…
At the time of search,
• The user's search query should be appended with the user's role-id.
• Solr's filter query feature and its caching make this efficient: the role filter is cached and reused across queries.
This approach is called the 'Early Binding' approach.
[Diagram: Query Request with User Role-Id -> SolrJ Client -> Query Response]
Early Binding
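A minimal sketch of the early-binding search, assuming the multi-valued field is named `role_ids` (a hypothetical field name). It builds the `q` and `fq` request parameters as plain strings and mimics the filter in memory against the two documents from the index-time slide; it does not talk to a Solr server.

```python
from urllib.parse import parse_qs, urlencode


def build_query_params(user_query, role_ids):
    """Append the user's role ids to the search as a filter query (fq)."""
    if not role_ids:
        # Refuse to run an unfiltered search for a user without roles.
        raise ValueError("user has no roles")
    role_clause = " OR ".join(str(r) for r in role_ids)
    return {"q": user_query, "fq": f"role_ids:({role_clause})"}


# The two documents from the index-time slide.
INDEX = [
    {"id": "Document 1", "role_ids": ["1", "2", "3"]},
    {"id": "Document 2", "role_ids": ["2", "3"]},
]


def visible_docs(index, user_role_ids):
    """In-memory stand-in for the filter: keep documents sharing a role id."""
    wanted = {str(r) for r in user_role_ids}
    return [doc["id"] for doc in index if wanted & set(doc["role_ids"])]


params = build_query_params("appraisal", [2, 3])
query_string = urlencode(params)
print(params["fq"])
print(visible_docs(INDEX, [1]))
```

Because the filter depends only on the role ids and not on the keywords, the same filter can be reused for every search by users holding those roles.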
For systems that have
• Roles that change often; data is normalized by segregating access control information into separate tables.
• This approach is called 'Early Binding with Externalized ACL'.
In such systems:
• Role-ids are not attached to the document object.
• Instead, they are stored in separate tables with a foreign-key relation.
• Pseudo-joins are used at the time of search.
Solution For Role Based ACL - Type 2
Document 1 (D1)
Doc ID   Role-Ids
D1       1, 2, 3, N
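A sketch of the externalized variant, under stated assumptions: the ACL rows live in a separate core called `acl` with fields `doc_id` and `role_id`, and documents are keyed by `id` (all hypothetical names). The filter string uses the syntax of Solr's join query parser; the in-memory `pseudo_join` only mimics what such a join selects.

```python
def build_join_filter(role_id, acl_core="acl"):
    """Build a filter that joins ACL rows for a role back to documents."""
    return f"{{!join from=doc_id to=id fromIndex={acl_core}}}role_id:{role_id}"


# Externalized ACL table: one (doc_id, role_id) row per grant.
ACL_ROWS = [("D1", 1), ("D1", 2), ("D1", 3), ("D2", 2), ("D2", 3)]


def pseudo_join(matched_doc_ids, role_id, acl_rows):
    """Keep the matched documents that have an ACL row for the given role."""
    allowed = {doc_id for doc_id, rid in acl_rows if rid == role_id}
    return [doc_id for doc_id in matched_doc_ids if doc_id in allowed]


join_filter = build_join_filter(2)
print(join_filter)
print(pseudo_join(["D1", "D2"], 1, ACL_ROWS))
```

When a role changes, only the ACL rows are rewritten; the documents themselves do not need to be reindexed.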
Definition:
• Discretionary: the document owner has the authority to control access to the document.
• A system that enables the document owner to specify the set of users with access to a set of documents.
2. Discretionary Access Control
[Diagram: Owner specifies users/groups who can access the Object]
For a system that has
• Frequent changes in the ACL,
• An ACL defined for each user and document,
• We choose the 'Late Binding Approach with Externalized ACL'.
In such systems,
• The ACL is a 2D matrix with users along its rows and documents along its columns.
Solution for Discretionary ACL - Type 1
Users    Doc1  Doc2  Doc N
User A   1     1     1
User B   0     1     1
User M
Encoded values: 0 = no access, 1 = access
M: number of users, N: number of documents
For implementation, the ACL matrix can be represented as an array of bits.
This compact representation improves search efficiency and reduces memory overhead.
Continued…
Users   Doc1  Doc2  Doc N   Bit array
UserA   1     1     1       [1] 111
UserB   0     1     1       [2] 011
Consider,
• The search system holds at most 5 documents, with document ids {1, 2, 3, 4, 5}
• There are at most 2 users, with ids {1, 2}
• User 1 has access to documents {1, 2, 3}
• User 2 has access to documents {1, 2, 3, 4, 5}
• ACL matrix and array representation:

User   1  2  3  4  5
1      1  1  1  0  0
2      1  1  1  1  1

[1] 11100
[2] 11111
Example
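The example above can be reproduced with a small sketch that packs each user's row of the matrix into one integer used as a bit array. The helper names are hypothetical; the printed strings put document 1 leftmost to match the slide.

```python
MAX_DOCS = 5


def encode_acl(doc_ids):
    """Pack a set of document ids into an integer; bit (id - 1) set means access."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << (doc_id - 1)
    return bits


def has_access(bits, doc_id):
    """Check a single document id against the packed ACL row."""
    return ((bits >> (doc_id - 1)) & 1) == 1


def to_bit_string(bits, max_docs=MAX_DOCS):
    """Render the row with document 1 leftmost, as on the slide."""
    return "".join("1" if has_access(bits, d) else "0" for d in range(1, max_docs + 1))


# One packed row per user id.
acl = {1: encode_acl({1, 2, 3}), 2: encode_acl({1, 2, 3, 4, 5})}
for user_id, bits in acl.items():
    print(f"[{user_id}] {to_bit_string(bits)}")
```

Each user's row costs one bit per document, and a single access check is a shift and a mask.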
Solution 1
• Solr has a PostFilter interface that can be implemented to develop a custom plugin.
• The post filter supplies a collector whose 'collect()' method is called for each document matched by the user's search query.
  – For each document, get the document-id from the field cache and apply the ACL using the bit array.
• Code Snippets: https://gist.github.com/rajanim/7197154
Solr Implementation
[Diagram: user ACL bit array 1 1 1 0 0]
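The per-document check can be sketched as below. This is a plain Python imitation of the collector pattern, not Solr's API: `AclCollector`, `ListCollector`, and the `doc_id_lookup` list (standing in for the field cache) are all hypothetical names.

```python
class ListCollector:
    """Downstream collector: simply records the documents it receives."""

    def __init__(self):
        self.docs = []

    def collect(self, doc):
        self.docs.append(doc)


class AclCollector:
    """Forwards a matched document downstream only if the user's bit is set."""

    def __init__(self, delegate, user_bits, doc_id_lookup):
        self.delegate = delegate
        self.user_bits = user_bits          # packed ACL row for the current user
        self.doc_id_lookup = doc_id_lookup  # internal doc number -> document id

    def collect(self, doc):
        doc_id = self.doc_id_lookup[doc]
        if (self.user_bits >> (doc_id - 1)) & 1:
            self.delegate.collect(doc)


# Internal doc numbers 0..4 map to document ids 1..5.
doc_id_lookup = [1, 2, 3, 4, 5]
user1_bits = 0b00111  # access to document ids 1, 2, 3 (11100 with doc 1 leftmost)

results = ListCollector()
collector = AclCollector(results, user1_bits, doc_id_lookup)
for matched in [4, 0, 2, 3]:  # internal numbers of documents matching the query
    collector.collect(matched)
print(results.docs)
```

Only authorized documents reach the downstream collector, so unauthorized hits never appear in the result list or its counts.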
Solution 2
• Use BitSet utilities
• Get the bitset of documents matched by the search query from the search engine
• Get the user's ACL bitset instance
• Obtain the intersection of the two bitsets [intersect(bitset other)]
Other Implementation Solution
[Diagram: intersection of bitsets 1 1 1 0 0 and 1 1 1 0 0 gives 1 1 1 0 0]
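A minimal sketch of the intersection step, using Python integers as bitsets in place of a BitSet utility class. `to_bitset` and `from_bitset` are hypothetical helpers, and the matched set is made up for illustration.

```python
def to_bitset(doc_ids):
    """Pack document ids into an integer; bit (id - 1) represents document id."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << (doc_id - 1)
    return bits


def from_bitset(bits):
    """Unpack an integer bitset into a sorted list of document ids."""
    doc_ids, doc_id = [], 1
    while bits:
        if bits & 1:
            doc_ids.append(doc_id)
        bits >>= 1
        doc_id += 1
    return doc_ids


matched = to_bitset({2, 3, 5})   # documents matched by the search query
user_acl = to_bitset({1, 2, 3})  # documents the user may view
authorized = matched & user_acl  # intersection of the two bitsets
print(from_bitset(authorized))
```

The whole result set is filtered with a single bitwise AND instead of one check per document.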
• For discretionary ACL systems that have a static ACL,
• We choose "Early Binding with ACL bound to Document Objects".
In such systems,
• Document objects include a multi-valued user-id field that contains the list of user-ids with access to the document.
• The user-id field has to be indexed.
Solution for Discretionary ACL - Type 2
• This solution requires the ACL and document data to be denormalized into a flat structure.
Continued…
Index Time: Parse Document -> Add List of Users Who Have Access
Search Time: Query Request with User ID -> SolrJ Client -> Query Response
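The index-time and search-time halves can be sketched together as below. This is an in-memory sketch under stated assumptions: the field name `user_ids`, the sample documents, and the share rows are hypothetical, and the keyword match is a plain substring test rather than a real search.

```python
from collections import defaultdict


def denormalize(documents, share_rows):
    """Index time: attach the list of user ids with access to each document."""
    users_by_doc = defaultdict(list)
    for doc_id, user_id in share_rows:
        users_by_doc[doc_id].append(user_id)
    return [{**doc, "user_ids": sorted(users_by_doc[doc["id"]])} for doc in documents]


def search(flat_docs, keyword, user_id):
    """Search time: match the keyword and filter on the indexed user-id field."""
    return [
        doc["id"]
        for doc in flat_docs
        if keyword in doc["text"].lower() and user_id in doc["user_ids"]
    ]


documents = [
    {"id": "D1", "text": "Appraisal letter 2013"},
    {"id": "D2", "text": "Team appraisal summary"},
    {"id": "D3", "text": "Lunch menu"},
]
share_rows = [("D1", "userA"), ("D2", "userA"), ("D2", "userB")]

flat_docs = denormalize(documents, share_rows)
print(search(flat_docs, "appraisal", "userB"))
```

The trade-off is that every change to who can see a document means reindexing that document.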
Summary
• Discretionary ACL with the late binding solution is a complex model, and it requires extensive verification.
• Leverage Solr's smart caching capability.
• Since ACL checks always add overhead, they have to be optimized to keep the added delay minimal.
Thank You