Knowledge Discovery in Protected Vertical Information Dr. William Perrizo University Distinguished...

Knowledge Discovery in

Protected Vertical Information

Dr. William Perrizo

University Distinguished Professor of Computer Science

North Dakota State University, Fargo

[email protected]


Vertical Data and pTrees

Traditionally, a file or dataset comprises one or more horizontal records, each record comprises one or more fields, and each field comprises some number of bits that represent a piece of data.

Traditionally, a dataset is structured horizontally as a set of such horizontal records, and is processed vertically, record-by-record.

Alternatively, a file or dataset may be structured vertically as a set of columns and processed horizontally, column-by-column.

A column-structured dataset may be further vertically decomposed as a set of bit-vectors or bitslices and processed horizontally, bitslice-by-bitslice using logical operations.

A pTree is a tree-like organization of these bitslices having a zero bit value for each internal node and a zero or one bit value for each of its leaves (external nodes).

The three basic pTree operations are AND, OR and complement.
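As a minimal sketch of the vertical decomposition described above, assuming fixed-width unsigned integer fields and plain Python lists as uncompressed bit-vectors: the helper names (bit_slices, p_and, p_or, p_not) and the column values are illustrative and do not come from the slides.

```python
def bit_slices(column, width):
    """Split a column of unsigned ints into `width` bitslices.

    Slice 0 holds the most significant bit of every value. Each slice is a
    list of 0/1 values, one per record, in record order.
    """
    return [[(v >> (width - 1 - i)) & 1 for v in column] for i in range(width)]


def p_and(a, b):
    """Bitwise AND of two equal-length slices."""
    return [x & y for x, y in zip(a, b)]


def p_or(a, b):
    """Bitwise OR of two equal-length slices."""
    return [x | y for x, y in zip(a, b)]


def p_not(a):
    """Complement of a slice."""
    return [1 - x for x in a]


# Illustrative 3-bit attribute values (an assumption for this sketch).
column = [2, 3, 2, 2, 3, 2, 7, 7]
s1, s2, s3 = bit_slices(column, 3)

print(s1)                     # [0, 0, 0, 0, 0, 0, 1, 1]
print(p_or(s1, s3))           # [0, 1, 0, 0, 1, 0, 1, 1]
print(p_and(s1, p_not(s3)))   # [0, 0, 0, 0, 0, 0, 0, 0]
```

The operations here run on uncompressed slices; a pTree implementation would apply the same three operations to the compressed trees.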

An example follows.

A pTree example

Predicate trees (pTrees): slice by attribute or column, then by bit position.

The relation R(A1 A2 A3 A4) holds eight records of four 3-bit attributes:

    Base 10          Base 2
    A1 A2 A3 A4      R[A1] R[A2] R[A3] R[A4]
    2  7  6  1       010   111   110   001
    3  7  6  0       011   111   110   000
    2  6  5  1       010   110   101   001
    2  7  5  7       010   111   101   111
    3  2  1  4       011   010   001   100
    2  2  1  5       010   010   001   101
    7  0  1  4       111   000   001   100
    7  0  1  4       111   000   001   100

Vertically slice off each bit position (12 vertical structures):

    R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43
    0   1   0   1   1   1   1   1   0   0   0   1
    0   1   1   1   1   1   1   1   0   0   0   0
    0   1   0   1   1   0   1   0   1   0   0   1
    0   1   0   1   1   1   1   0   1   1   1   1
    0   1   1   0   1   0   0   0   1   1   0   0
    0   1   0   0   1   0   0   0   1   1   0   1
    1   1   1   0   0   0   0   0   1   1   0   0
    1   1   1   0   0   0   0   0   1   1   0   0

Then compress each bit slice into a tree using a predicate. Record the truth of the predicate "purely 1-bits" in a tree, recursively on halves, until the half is pure.

E.g., the compression of R11 = 0 0 0 0 0 0 1 1 into the pTree P11:

1. Whole thing pure1? false = 0

2. Left half pure1? false = 0 (but it's pure0, so this branch ends)

3. Right half pure1? false = 0

4. Left half of right half pure1? false = 0

5. Right half of right half pure1? true = 1

    P11:        0
              /   \
             0     0
                  / \
                 0   1

[Figure: the twelve pTrees P11, P12, P13, P21, P22, P23, P31, P32, P33, P41, P42, P43, one per bit slice.]

1st, Vertical Processing of Horizontal Data (VPHD): for horizontally structured, record-oriented data, one must scan vertically, e.g., to find the number of occurrences of 7 0 1 4 (= 2).

2nd, using pTrees, find the number of occurrences of 7 0 1 4.

To count (7,0,1,4)s, use the bit pattern 111 000 001 100: AND the twelve pTrees, complementing (') each one whose pattern bit is 0:

P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43

The root of the resulting pTree is weighted 2^3, the next level 2^2, then 2^1 and 2^0. Its levels are 0, then 0 0, then 0 1, so its only 1-bit sits at the 2^1 level and the count is 1 * 2^1 = 2.
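The compression of R11 and the count of (7,0,1,4) can be checked with a short sketch. It assumes slice lengths that are powers of two and represents a node as a (bit, left, right) tuple; build_ptree and count_ones are hypothetical helper names. For brevity the AND is done on uncompressed slices, whereas a pTree implementation would AND the trees directly.

```python
def build_ptree(bits):
    """Compress a bit slice (length a power of two) into a pure-1 pTree.

    A node's bit is 1 only if its span is purely 1-bits. A mixed span gets
    two children (its halves); pure-0 and pure-1 spans are leaves.
    """
    if all(bits):
        return (1, None, None)
    if not any(bits):
        return (0, None, None)
    mid = len(bits) // 2
    return (0, build_ptree(bits[:mid]), build_ptree(bits[mid:]))


def count_ones(node, span):
    """Count the 1-bits a pTree represents; `span` is the slice length."""
    bit, left, right = node
    if left is None:               # leaf: pure-0 or pure-1 span
        return bit * span
    return count_ones(left, span // 2) + count_ones(right, span // 2)


# The eight records of R as 12-bit strings (base-2 table of the example).
rows = [
    "010111110001", "011111110000", "010110101001", "010111101111",
    "011010001100", "010010001101", "111000001100", "111000001100",
]
slices = [[int(r[j]) for r in rows] for j in range(12)]   # R11 .. R43

p11 = build_ptree(slices[0])
print(p11)   # (0, (0, None, None), (0, (0, None, None), (1, None, None)))

# AND all twelve slices, complementing those whose pattern bit is 0.
pattern = "111000001100"           # 7 0 1 4
matches = [1] * len(rows)
for j, p in enumerate(pattern):
    s = slices[j] if p == "1" else [1 - b for b in slices[j]]
    matches = [x & y for x, y in zip(matches, s)]

count = count_ones(build_ptree(matches), len(rows))
print(count)  # 2
```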

Securing Vertical pTree Data

How can a pTree database be made secure without slowing down processing (by, for instance, requiring a slow decryption step prior to processing)?

A first level of security may be provided by re-ordering (e.g., permuting) the pTrees. The re-ordering method or permutation becomes the security key (those who know it can use the data and those who do not, cannot). For big data, key compression techniques might be necessary.
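A minimal sketch of this first level, assuming the pTrees (or bit slices) are held in a dictionary keyed by name and stored in a flat list of slots. The function shuffle_store and the slice names are illustrative only; the permutation is drawn from the operating system's random source through secrets.SystemRandom.

```python
import secrets


def shuffle_store(slices):
    """Store named slices in permuted slots; the permutation is the key.

    Returns (store, key), where store is the list as physically laid out and
    key maps each logical name to its slot. Only holders of `key` know which
    slot holds which slice.
    """
    names = sorted(slices)
    slots = list(range(len(names)))
    secrets.SystemRandom().shuffle(slots)
    key = dict(zip(names, slots))
    store = [None] * len(names)
    for name, slot in key.items():
        store[slot] = slices[name]
    return store, key


slices = {
    "P11": [0, 0, 0, 0, 0, 0, 1, 1],
    "P12": [1, 1, 1, 1, 1, 1, 1, 1],
    "P13": [0, 1, 0, 0, 1, 0, 1, 1],
}
store, key = shuffle_store(slices)

# Locating a slice is still a single lookup, so no decryption step is needed.
assert store[key["P13"]] == slices["P13"]
```

With n pTrees the key is a permutation of n slots, which is why key compression may matter for big data.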

A second level of security can be provided through random bit padding (at the bottom and/or top). Padding randomly places the true starting and ending positions of the pTree somewhere in the middle of the structure. This second level is intended to prevent an adversary from discovering the re-ordering by focusing on a smaller part of the dataset (e.g., the first bit or the last bit only).

A third level of security is also possible when needed. That third level would involve re-ordering the bits within the actual pTrees. For datasets in which the ordering of the record instances is random or otherwise unknown, this would be unnecessary. However, if there is a known ordering, it might be possible for a clever adversary to discover the starting and ending positions of the pTrees.

At this time, this third level seems problematic, since it would probably require encryption and therefore a time-consuming decryption step prior to processing. There may be a way to re-order the pTree bits without introducing significant additional processing time. This is "future work".

The "bottom" padding might be unnecessary from a security perspective (it adds little or no additional security). However, we believe it will make it much easier to secure volatile datasets, by providing a large bottom pad that can be overwritten as the dataset grows in cardinality.
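The padding idea, including the growable bottom pad, might look like the following sketch. It assumes a bit slice stored as a Python list; embed, load and append_record are hypothetical names, and the pad sizes are arbitrary choices for illustration.

```python
import secrets


def embed(slice_bits, top_pad, bottom_pad):
    """Surround a slice with random bits.

    Returns (padded, start, length). The pair (start, length) is the secret
    that locates the true slice inside the padded structure.
    """
    top = [secrets.randbits(1) for _ in range(top_pad)]
    bottom = [secrets.randbits(1) for _ in range(bottom_pad)]
    return top + list(slice_bits) + bottom, top_pad, len(slice_bits)


def load(padded, start, length):
    """Read the true slice; as fast as an unpadded read once start is known."""
    return padded[start:start + length]


def append_record(padded, start, length, bit):
    """Grow the slice by one record by overwriting the first bottom-pad bit."""
    if start + length >= len(padded):
        raise ValueError("bottom pad exhausted")
    padded[start + length] = bit
    return length + 1


r11 = [0, 0, 0, 0, 0, 0, 1, 1]
padded, start, length = embed(r11, top_pad=secrets.randbelow(64), bottom_pad=32)
assert load(padded, start, length) == r11

# A new record arrives: the structure keeps its size, only `length` changes.
length = append_record(padded, start, length, 1)
assert load(padded, start, length) == r11 + [1]
```

Because the overall structure keeps its size when a record is appended, growth of the dataset is not visible from the length of the stored bit string.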

Implementations

This security scheme can be implemented per dataset or over the entire database as a whole.

That is, for each dataset, there could be a massive string of bits in which all the pTrees of just that dataset are embedded (a separate massive string of bits for each dataset), or there could be a much more massive string of bits in which all the pTrees of the database are embedded (and possibly intermingled for additional security).

The latter might be the best choice since it does not introduce any additional processing time overhead and it does provide additional security.

In general, a security monitor subsystem would be used to implement this security scheme.

The security monitor would be responsible for identification, authentication and authorization.

The security monitor could manage free space within the massive bit strings (be they per dataset or per database).
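One way to picture such a monitor is the sketch below: a single database-wide bit string pre-filled with random bits, a private directory of (start, length) entries, and a simple grant table. The class SecurityMonitor and all of its methods are assumptions made for illustration. Identification and authentication are not modelled; only authorization and free-space management are.

```python
import secrets


class SecurityMonitor:
    """Hypothetical monitor embedding all slices in one random bit string."""

    def __init__(self, size):
        self.bits = [secrets.randbits(1) for _ in range(size)]
        self.directory = {}   # secret: (dataset, name) -> (start, length)
        self.grants = set()   # (user, dataset) pairs allowed to load

    def _is_free(self, start, length):
        """True if [start, start + length) overlaps no stored slice."""
        end = start + length
        return all(end <= s or start >= s + n
                   for s, n in self.directory.values())

    def store(self, dataset, name, slice_bits, tries=1000):
        """Place a slice at a random free offset and record it privately."""
        length = len(slice_bits)
        if length > len(self.bits):
            raise ValueError("slice longer than the bit string")
        for _ in range(tries):
            start = secrets.randbelow(len(self.bits) - length + 1)
            if self._is_free(start, length):
                self.bits[start:start + length] = slice_bits
                self.directory[(dataset, name)] = (start, length)
                return
        raise RuntimeError("no free space found")

    def grant(self, user, dataset):
        self.grants.add((user, dataset))

    def load(self, user, dataset, name):
        """Return a slice, but only for users authorized on the dataset."""
        if (user, dataset) not in self.grants:
            raise PermissionError(f"{user} may not read {dataset}")
        start, length = self.directory[(dataset, name)]
        return self.bits[start:start + length]


monitor = SecurityMonitor(256)
monitor.store("R", "P11", [0, 0, 0, 0, 0, 0, 1, 1])
monitor.store("S", "P11", [1, 1, 1, 1, 0, 0, 0, 0])
monitor.grant("alice", "R")
print(monitor.load("alice", "R", "P11"))   # [0, 0, 0, 0, 0, 0, 1, 1]
```

Random placement with retries is the simplest free-space policy; it intermingles the pTrees of different datasets but becomes slow as the bit string fills up.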

Conclusion

This vertical database scheme involves at least two levels of security.

The first level is a matter of disguising the location of each vertical bitslice structure (e.g., permuting the ordering of the bitslices).

The second level is a matter of disguising the bit position location of the pTree within a bigger bit string (to make it harder to focus on the first bits only, etc.).

An important fact is that these two schemes impose almost no additional processing time. The reason we can say that is that pTree processing always involves

1. identifying those pTrees that are to be involved in the processing;

2. locating those pTrees (which is always a matter of following a pointer; we simply shuffle the order of those pointers so that those who know the shuffle can find the data and those who do not, cannot);

3. loading the pertinent pTrees (which is just as fast, as long as you know at which bit position to start the load; those who know that start position can load and those who do not, cannot).

Conclusion continued

A third possible level of security was mentioned, that of encrypting or otherwise anonymizing each pTree itself.

Again, this may be unnecessary if there is no known or expected record ordering involved.

One simple way to [somewhat] anonymize individual pTrees is to shuffle the order of the levels within each pTree. The shuffle would be metadata (managed by the security monitor) but, once known, could be used with little or no additional processing delay. Each level of a multi-level pTree is processed separately, so locating a level is just a matter of following a pointer, and no additional processing is required on that pointer; the correct level of the correct pTree is simply not in the expected place within the pTree as stored.

Other partial anonymizations, short of full encryption, might suffice in most cases.
