Introduction to Medical Computing
Data Representations
• A reductionist view of data: it is all bits
• Data in main memory, data in files.
• All about interpreting bits in memory.
UWO CS 2125 © Stephen M. Watt
Basic Data Types
• Characters ‘a’ ‘é’ ‘中’
• Character Strings “Hello” “Κωνστάντζα” “孔丘”
• Integers 123
• Floating point numbers 123.7
Characters
• Older operating systems stored them as ASCII or EBCDIC, typically as 7- or 8-bit bytes, e.g. in ASCII ‘a’ is stored as 97 in an 8-bit byte, i.e. 0110 0001
• Problems with multiple alphabets.
• ISO/IEC extended 8-bit encodings, e.g. Latin-1, Latin-Thai, etc. Typically use first 128 characters for ASCII + second 128 characters for additional letters. http://en.wikipedia.org/wiki/ISO/IEC_8859-1
• Problem working with multiple alphabets at once.
Unicode
• Represent all character sets at once.
• Initially 16 bits.
• Now 17 planes of 2^16 code points each (0 .. 10FFFF), which needs 21 bits.
• Typically represented with variable length encodings:
– UTF-8 (1 to 4 8-bit bytes)
– UTF-16 (1 or 2 16-bit chunks)
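As a rough illustration of variable-length encodings, the sketch below uses Python's built-in str.encode to count how much space the sample characters ‘a’, ‘é’ and ‘中’ take in each encoding. The helper name encoded_sizes is made up for this example.

```python
def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 16-bit chunk count) for one character."""
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")  # big-endian, without a byte-order mark
    return len(utf8), len(utf16) // 2

# "a", "é" (U+00E9) and "中" (U+4E2D), written as escapes to avoid ambiguity
for ch in ["a", "\u00e9", "\u4e2d"]:
    print(ch, hex(ord(ch)), encoded_sizes(ch))
```

All three characters fit in a single 16-bit chunk, but they need 1, 2 and 3 bytes respectively in UTF-8.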
Numbers Base 16
• Numbers in base 2 are long and error prone to write.
• It is easy to work in base 16 by grouping the digits of base 2 numbers 4 at a time.
0000 -> 0   0001 -> 1   0010 -> 2   0011 -> 3
0100 -> 4   0101 -> 5   0110 -> 6   0111 -> 7
1000 -> 8   1001 -> 9   1010 -> a   1011 -> b
1100 -> c   1101 -> d   1110 -> e   1111 -> f
• So 60 (base 10) = 11 1100 (base 2) = 3c (base 16)
• 3a1.5 (base 16) means 3 × 256 + a × 16 + 1 × 1 + 5 × 1/16 = 3 × 16^2 + 10 × 16^1 + 1 × 16^0 + 5 × 16^-1
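The grouping rule can be sketched in a few lines of Python. The function to_hex is a hypothetical helper written for this example; int(..., 16) and float.fromhex are standard Python and are used here only to check the slide's numbers.

```python
def to_hex(n):
    """Convert a non-negative integer to base 16, taking 4 bits at a time."""
    digits = "0123456789abcdef"
    if n == 0:
        return "0"
    out = ""
    while n > 0:
        out = digits[n & 0xF] + out  # lowest 4 bits give one hex digit
        n >>= 4                      # move on to the next group of 4 bits
    return out

print(to_hex(0b111100))           # 60 in base 10, written in base 2
print(int("3c", 16))              # back from base 16 to an integer
print(float.fromhex("0x3a1.5"))   # 3*256 + 10*16 + 1 + 5/16
```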
UTF-16
• Represent a character as one or two 16-bit chunks.
• The values in the range D800 .. DFFF are special.
• They are not used to represent characters, but are instead used to store parts of characters that need more than 16 bits, that is in the range 10000..10FFFF.
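A minimal sketch of how a character above FFFF is split across two chunks from the special range, assuming the standard UTF-16 surrogate rule (subtract 10000 hex, then put the top 10 bits after D800 and the bottom 10 bits after DC00). The function name to_utf16_units is invented for this example.

```python
def to_utf16_units(code_point):
    """Return the list of 16-bit chunks that represent one code point."""
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("values D800..DFFF do not represent characters")
    if code_point > 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if code_point < 0x10000:
        return [code_point]              # fits in one chunk
    offset = code_point - 0x10000        # a 20-bit number
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return [high, low]

print([hex(u) for u in to_utf16_units(0x4E2D)])
print([hex(u) for u in to_utf16_units(0x1F600)])
```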
Integers
• Typically represented as 16 bit, 32 bit or 64 bit.
– Different sizes have different ranges.
  • 8 bits: -128 to 127 or 0 to 255
  • 16 bits: -32,768 to 32,767 or 0 to 65,535
  • 32 bits: -2,147,483,648 to 2,147,483,647 or 0 to 4,294,967,295
  • 64 bits: -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 or 0 to 18,446,744,073,709,551,615
– 16 bits used when space is at a premium, e.g. large data sets or small devices.
• Arbitrarily large integers possible with dynamic storage allocation, but used mainly for advanced math software and cryptography.
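The ranges above follow directly from the number of bits. The sketch below assumes the usual two's complement layout for signed values, which is consistent with the table; the function names are invented for this example.

```python
def signed_range(bits):
    """Smallest and largest signed value, assuming two's complement."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

def unsigned_range(bits):
    """Smallest and largest unsigned value."""
    return 0, 2 ** bits - 1

for bits in (8, 16, 32, 64):
    print(bits, signed_range(bits), unsigned_range(bits))

# Python integers are arbitrarily large, so this does not overflow.
print(2 ** 100)
```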
Floating Point Numbers
• Represents quantities as fraction × 2^power
• fraction = 1.<fraction bits>
• power = exponent - 1023 (in the 64-bit double precision format)
• See http://en.wikipedia.org/wiki/Double_precision_floating-point_format
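To make the formula concrete, the sketch below pulls apart the 64 bits of a double using Python's struct module and then rebuilds the value. It assumes the double precision layout of 1 sign bit, 11 exponent bits and 52 fraction bits, and only handles ordinary (normal) numbers; decode_double and rebuild are names made up for this example.

```python
import struct

def decode_double(x):
    """Split a double into (sign, exponent, fraction bits)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]  # the same 8 bytes, read as an integer
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    fraction_bits = bits & ((1 << 52) - 1)
    return sign, exponent, fraction_bits

def rebuild(sign, exponent, fraction_bits):
    """Apply fraction × 2^power with fraction = 1.<fraction bits>."""
    fraction = 1 + fraction_bits / 2 ** 52
    power = exponent - 1023
    return (-1) ** sign * fraction * 2.0 ** power

parts = decode_double(123.7)
print(parts, rebuild(*parts))
```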
Compound Data in Programs
• Programming languages have ways to represent collections of data.
• These may be used to represent single things with many properties, or tables of many things.
Homogeneous Collections
• “arrays”
• A collection of things that are all the same.
double boneDensity[1000];
Heterogeneous Collections
• “records”, “structs”
struct patient {
int patientNumber;
char familyName[20];
char givenName[20];
int billingCode;
…
};
Combinations
• Arrays may have elements that are themselves arrays or other structured data.
double doses[100][100];
struct patient studyGroup[100];
Compound Data in Files
• Text files
• Binary files
• XML files
(actually, these are all just interpretations of bits)
Text Files
• Files of ASCII or Unicode data.
• Typically one line per item, with fields separated by blanks or commas.
• May be in “fixed format”, i.e. specific data lies in certain columns,
2034632Smith Jane   22 Main St        London
3927321Doe   John   1004 Peppercorn WaMontreal
2379820Brown Charles2001 King St      Toronto
or free-form that is parsed
2034632, Smith, Jane, 22 Main St, London
3927321, Doe, John, 1004 Peppercorn Way, Montreal
2379820, Brown, Charles, 2001 King St, Toronto
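The two styles can be read as follows. This is only a sketch: the column positions in FIELDS are an assumption (7 characters for the number, 6 for the family name, 7 for the given name, 18 for the address), and parse_fixed and parse_free_form are hypothetical helper names. The csv module is part of the Python standard library.

```python
import csv
import io

# Assumed column layout: field name -> (start, end) character positions.
FIELDS = {
    "id": (0, 7),
    "family": (7, 13),
    "given": (13, 20),
    "address": (20, 38),
    "city": (38, None),
}

def parse_fixed(line):
    """Cut a fixed-format line at the assumed column positions."""
    return {name: line[start:end].strip() for name, (start, end) in FIELDS.items()}

def parse_free_form(text):
    """Split comma-separated lines, ignoring the blank after each comma."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [row for row in reader if row]

print(parse_fixed("3927321Doe   John   1004 Peppercorn WaMontreal"))
print(parse_free_form("3927321, Doe, John, 1004 Peppercorn Way, Montreal\n"))
```

Note how the fixed format cuts the address off at its column width, while the free-form line keeps it whole.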
Text Files
• Easy for programs to construct and to read.
• Easy for people to check and debug.
• Many programs use them as an exchange format, e.g.
– Most Unix/Linux programs
– Excel CSV
Binary Files
• Store numbers (e.g. integers and floating point numbers) so the bytes in the file are the same as the bytes in the representation in main memory.
• Pros:
– Store data more compactly, i.e. less space required (e.g. 4 bytes vs 10 digits).
– Faster to read and write.
• Cons:
– Unforgiving format
– Harder to program
– Usually specific to one program or family of programs.
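The "4 bytes vs 10 digits" comparison can be checked with Python's struct module. The record layout below is an assumption made for this sketch (a 32-bit number, two 20-byte names and a 32-bit billing code, little-endian with no padding), loosely modelled on the patient struct shown earlier; it is not a real file format.

```python
import struct

n = 2147483647
print(len(struct.pack("<i", n)), "bytes in binary")
print(len(str(n)), "characters as text")

# Assumed record layout: int, 20 bytes, 20 bytes, int.
RECORD = struct.Struct("<i20s20si")

def pack_patient(number, family, given, billing):
    """Turn one record into exactly RECORD.size bytes."""
    return RECORD.pack(number, family.encode("ascii"), given.encode("ascii"), billing)

def unpack_patient(data):
    """Inverse of pack_patient; strips the zero padding from the names."""
    number, family, given, billing = RECORD.unpack(data)
    return (number,
            family.rstrip(b"\0").decode("ascii"),
            given.rstrip(b"\0").decode("ascii"),
            billing)

data = pack_patient(102001, "Jones", "Veronica", 7993321)
print(len(data), unpack_patient(data))
```

A reader that assumes a different layout or byte order gets garbage without any error, which is what makes the format unforgiving.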
XML
• “Extensible Markup Language”
• Textual representation of structured data, very easy to parse.
• GML -> SGML -> HTML -> XML
• Can represent complex data objects.
• Important to know, easy to learn. E.g. http://www.w3schools.com/xml/
XML Basics
<?xml version="1.0"?>
<patient>
  <patientNumber>102001</patientNumber>
  <name>
    <family>Jones</family>
    <middle initialOnly="yes">M</middle>
    <given>Veronica</given>
  </name>
  <billingCode>7993321</billingCode>
</patient>
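As a small illustration of how easy such a document is to read from a program, the sketch below parses the patient record with xml.etree.ElementTree from the Python standard library. The variable names are chosen for this example only.

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0"?>
<patient>
  <patientNumber>102001</patientNumber>
  <name>
    <family>Jones</family>
    <middle initialOnly="yes">M</middle>
    <given>Veronica</given>
  </name>
  <billingCode>7993321</billingCode>
</patient>"""

root = ET.fromstring(DOC)                       # root is the <patient> element
patient_number = int(root.findtext("patientNumber"))
family = root.findtext("name/family")           # path through nested elements
middle = root.find("name/middle")

print(root.tag, patient_number, family)
print(middle.text, middle.get("initialOnly"))   # element text and attribute value
```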
XML Basics
• XML is not a programming language
– It doesn’t do anything
– It represents data
• Designed to represent and transport data
• Lets you design your own tags.
• Is a W3C “recommendation” (standard).
XML Basics
• Tags <patient>
• Attributes initialOnly=“yes”
• Elements <patient> ….. </patient>
• Text Jones
• Comments <!-- Do not disturb -->
XML Formats
• Represents data as a tree
• What is acceptable is specified by a grammar in the form of a DTD (old) or Schema (new)
• Various standards are specifications of XML grammars, e.g. MathML, InkML, CML (Chemical Markup Language), …
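The tree view can be made visible by walking the elements recursively. This is a sketch using xml.etree.ElementTree from the Python standard library; the function outline and the shortened sample document are made up for this example.

```python
import xml.etree.ElementTree as ET

def outline(element, depth=0):
    """Return one indented line per element, children below their parent."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(
    "<patient><name><family>Jones</family><given>Veronica</given></name>"
    "<billingCode>7993321</billingCode></patient>"
)
print("\n".join(outline(root)))
```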
Cryptography
• Some things should be public, and some things should not be.
• Can secure data by physical means, e.g. locks, guards.
• Can secure data by access controls, e.g. passwords.
• Can secure data by encryption.
Types of Cryptography
• Secret Key Cryptography
• Public Key Cryptography
• Hash Functions
Some Vocabulary
• Plain text – the original data in unencrypted form
• Cipher text – the data in encrypted form
• Key – a piece of data used to do the encryption, like a code word.
• Dramatis Personae:
– Alice and Bob: want to exchange secret info
– Eve: an eavesdropper
Secret Key Cryptography
• Single “key” is used for both encryption and decryption.
• Encrypt(plain text, key) -> cipher text
• Decrypt(cipher text, key) -> plain text
• E.g. (simple)
Encrypt(char, offset) -> (char + offset) mod 256
Decrypt(char, offset) -> (char - offset) mod 256
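The simple offset scheme can be written out directly. This sketch treats the data as bytes so that every value is in the range 0 to 255; it illustrates the idea only and offers no real security.

```python
def encrypt(data, offset):
    """Add the secret offset to every byte, wrapping around at 256."""
    return bytes((b + offset) % 256 for b in data)

def decrypt(data, offset):
    """Subtract the same offset to recover the plain text."""
    return bytes((b - offset) % 256 for b in data)

cipher_text = encrypt(b"Hello", 3)
print(cipher_text, decrypt(cipher_text, 3))
```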
Secret Key Cryptography
• Electronic Codebook (ECB)
– Data divided into blocks and each encrypted separately.
– Pro: Simple. Con: Same plaintext -> same ciphertext
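The weakness of ECB shows up even with a toy cipher. In the sketch below, toy_encrypt_block is a stand-in invented for this example (it just adds the key bytes mod 256, like the simple offset scheme) and is not a real block cipher; the block size of 4 bytes is likewise an assumption.

```python
BLOCK = 4  # toy block size in bytes

def toy_encrypt_block(block, key):
    """Stand-in for a real block cipher: add key bytes mod 256. NOT secure."""
    return bytes((b + k) % 256 for b, k in zip(block, key))

def ecb_encrypt(plain_text, key):
    """Encrypt each block separately with the same key."""
    if len(plain_text) % BLOCK != 0 or len(key) != BLOCK:
        raise ValueError("this sketch needs whole blocks and a BLOCK-byte key")
    blocks = [plain_text[i:i + BLOCK] for i in range(0, len(plain_text), BLOCK)]
    return [toy_encrypt_block(block, key) for block in blocks]

cipher_blocks = ecb_encrypt(b"ABCDABCDWXYZ", bytes([1, 2, 3, 4]))
print(cipher_blocks)  # the two identical plaintext blocks encrypt identically
```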
Secret Key Cryptography
• Cipher Block Chaining (CBC)
• Invented at IBM in the 1970s
• Each ciphertext block is used to modify the input of the next: it is XORed with the next plaintext block before that block is encrypted.
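A minimal sketch of chaining, under the same kind of assumptions as a classroom toy: the block cipher is a made-up stand-in (adding key bytes mod 256), the block size is 4 bytes, and iv is the starting value used in place of a previous block. None of this is secure.

```python
BLOCK = 4  # toy block size in bytes

def toy_encrypt_block(block, key):
    """Stand-in for a real block cipher: add key bytes mod 256. NOT secure."""
    return bytes((b + k) % 256 for b, k in zip(block, key))

def toy_decrypt_block(block, key):
    return bytes((b - k) % 256 for b, k in zip(block, key))

def xor_bytes(x, y):
    return bytes(a ^ b for a, b in zip(x, y))

def cbc_encrypt(plain_text, key, iv):
    previous, out = iv, []
    for i in range(0, len(plain_text), BLOCK):
        mixed = xor_bytes(plain_text[i:i + BLOCK], previous)  # chain in the previous block
        previous = toy_encrypt_block(mixed, key)
        out.append(previous)
    return out

def cbc_decrypt(cipher_blocks, key, iv):
    previous, out = iv, b""
    for block in cipher_blocks:
        out += xor_bytes(toy_decrypt_block(block, key), previous)
        previous = block
    return out

key, iv = bytes([1, 2, 3, 4]), bytes([9, 8, 7, 6])
blocks = cbc_encrypt(b"ABCDABCD", key, iv)
print(blocks, cbc_decrypt(blocks, key, iv))
```

Unlike ECB, the two identical plaintext blocks now produce different ciphertext blocks.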
Resources
• http://www.garykessler.net/library/crypto.html
• http://en.wikipedia.org/wiki/Block_cipher_modes_of_operation
Public Key Cryptography
• Most significant advance in cryptography in hundreds of years.
• First described publicly by Stanford professor Martin Hellman and graduate student Whitfield Diffie in 1976.
Public Key Cryptography
• Uses idea of functions that are hard to invert, “one way” functions.
• E.g. Multiplication vs Factorization
– Multiplication is easy. Takes time proportional to b log b log log b to multiply two b-bit numbers.
– Factorization is thought to be hard. The best known algorithm for a b-bit number, the general number field sieve, takes time that grows faster than any polynomial in b.
Public Key Cryptography
• Uses idea of functions that are hard to invert, “one way” functions.
• E.g. Exponentiation vs logarithms
– Easy to compute 3^6 to get 729
– Hard to take 729 and find 3 and 6.
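The asymmetry is easiest to see in modular arithmetic, which is the setting used by Diffie-Hellman. In this sketch, Python's built-in pow(g, x, p) does the easy direction, while discrete_log (a name invented here) recovers the exponent only by trying every candidate, which becomes hopeless when p has hundreds of digits.

```python
def discrete_log(g, target, p):
    """Find x with g^x mod p == target by brute force; None if there is none."""
    value = 1
    for x in range(p - 1):
        if value == target:
            return x
        value = (value * g) % p
    return None

print(3 ** 6)                  # easy direction, no modulus
print(pow(5, 6, 23))           # easy direction mod 23
print(discrete_log(5, 8, 23))  # hard direction: search for the exponent
```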
Public and Private Keys
• In both cases (multiplication, exponentiation) we have 2 pieces of information combining to give a result from which it is hard to find the pieces.
• Each participant can have a private number and reveal something publicly that does not give away the private number.
Original Diffie Hellman
• Multiplying integers mod p, a prime.
• Need g, a “primitive root” mod p.
– That is a number g such that { g^1 mod p, g^2 mod p, g^3 mod p, …, g^(p-1) mod p } gives the values {1, 2, …, p-1} in some order.
Original Diffie Hellman
• E.g. 3 is a primitive root mod 7 because
3^1 = 3 = 3 (mod 7)
3^2 = 9 = 2 (mod 7)
3^3 = 27 = 6 (mod 7)
3^4 = 81 = 4 (mod 7)
3^5 = 243 = 5 (mod 7)
3^6 = 729 = 1 (mod 7)
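The definition translates into a direct check, sketched here for small primes only. The function name is_primitive_root is made up for this example; pow with three arguments is Python's built-in modular exponentiation.

```python
def is_primitive_root(g, p):
    """True if the powers g^1 .. g^(p-1) mod p hit every value 1 .. p-1."""
    powers = {pow(g, k, p) for k in range(1, p)}
    return powers == set(range(1, p))

print([pow(3, k, 7) for k in range(1, 7)])  # the table above
print(is_primitive_root(3, 7))
print(is_primitive_root(2, 7))              # 2^3 = 8 = 1 (mod 7), so the powers repeat early
```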
Original Diffie Hellman
• Alice and Bob agree to use a prime p and base g.
• Alice chooses secret a.
• Bob chooses secret b.
• Alice sends Bob A = g^a (mod p).
• Bob sends Alice B = g^b (mod p).
• Alice computes s = B^a = g^(ab) (mod p).
• Bob computes s = A^b = g^(ab) (mod p).
• Now Alice and Bob share a secret to use for Secret Key Crypto.
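The exchange can be sketched with two small helpers. The names dh_keypair and dh_shared are invented for this example, the tiny prime is for illustration only, and the secrets module from the Python standard library is used to pick the private numbers; a real system would use a much larger p.

```python
import secrets

def dh_keypair(p, g):
    """Pick a random private number and compute the value that is sent."""
    private = secrets.randbelow(p - 3) + 2   # somewhere in 2 .. p-2
    public = pow(g, private, p)              # g^private mod p
    return private, public

def dh_shared(their_public, my_private, p):
    """Combine the other side's public value with my own secret."""
    return pow(their_public, my_private, p)

p, g = 23, 5                  # agreed in the open
a, A = dh_keypair(p, g)       # Alice
b, B = dh_keypair(p, g)       # Bob
print(dh_shared(B, a, p), dh_shared(A, b, p))  # both sides get the same s
```

Eve sees p, g, A and B, but not a or b.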
Example
• With p = 23, g = 5, a = 6 and b = 15, Alice sends A = 5^6 mod 23 = 8 and Bob sends B = 5^15 mod 23 = 19.
• Alice computes s = B^a mod p
– s = 19^6 mod 23
– s = 47,045,881 mod 23
– s = 2
• Bob computes s = A^b mod p
– s = 8^15 mod 23
– s = 35,184,372,088,832 mod 23
– s = 2
Example
• Alice and Bob now share a secret: s = 2.
• Somebody who had known both these private integers might also have calculated s as follows:
– s = 5^(6*15) mod 23
– s = 5^(15*6) mod 23
– s = 5^90 mod 23
– s = 807,793,566,946,316,088,741,610,050,849,573,099,185,363,389,551,639,556,884,765,625 mod 23
– s = 2
http://en.wikipedia.org/wiki/Diffie%E2%80%93Hellman_key_exchange
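The numbers in this example can be checked in a few lines. The values p = 23, g = 5, a = 6 and b = 15 are read off from the calculations on the slides rather than stated there as a list, so treat them as an inference.

```python
p, g = 23, 5      # public prime and base
a, b = 6, 15      # Alice's and Bob's private integers

A = pow(g, a, p)  # what Alice sends
B = pow(g, b, p)  # what Bob sends

s_alice = pow(B, a, p)
s_bob = pow(A, b, p)
s_direct = pow(g, a * b, p)  # only possible when both private integers are known

print(A, B, s_alice, s_bob, s_direct)
```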
Public Key Cryptography
• Can use Diffie-Hellman for public key crypto.
– Alice chooses her “private key” a
– Alice publishes a “public key” (A = g^a mod p, g, p)
– Bob chooses a random b and sends Alice (B = g^b mod p, message encrypted with A^b mod p)
– Alice computes the same key as B^a mod p and decrypts the message.
RSA Cryptography
• Alice’s key generation:
– Choose two distinct random primes of similar size p and q.
– Compute N = p q and φ = (p-1)(q-1)
– Choose e such that 1 < e < φ and gcd(e, φ) = 1.
– Compute d = e^(-1) mod φ, i.e. d e = 1 (mod φ).
– e is the public key exponent, d is the private key exponent.
– Alice’s public key is (N, e).
• Communication:
– Bob sends Alice the message M by sending c = M^e mod N.
– Alice decrypts the message by computing M = c^d mod N.
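The steps can be traced with deliberately tiny numbers. In this sketch the primes 61 and 53, the exponent 17 and the message 65 are arbitrary illustrative choices, rsa_keygen is a made-up helper name, and pow(e, -1, phi) for the modular inverse needs Python 3.8 or later. Real keys use primes hundreds of digits long and padding schemes not shown here.

```python
from math import gcd

def rsa_keygen(p, q, e):
    """Return ((N, e), d) following the key generation steps above."""
    n = p * q
    phi = (p - 1) * (q - 1)
    if not (1 < e < phi and gcd(e, phi) == 1):
        raise ValueError("e must satisfy 1 < e < phi and gcd(e, phi) = 1")
    d = pow(e, -1, phi)  # modular inverse of e mod phi
    return (n, e), d

public_key, d = rsa_keygen(61, 53, 17)
n, e = public_key

message = 65
c = pow(message, e, n)     # Bob encrypts with Alice's public key
recovered = pow(c, d, n)   # Alice decrypts with her private exponent
print(public_key, d, c, recovered)
```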
Cryptography in Medicine
Examples:
• Piotr Kasztelowicz, Marek Czubenko, Iwona Zięba, Security of Medical Data Transfer and Storage in Internet. Cryptography, Antiviral Security and Electronic Signature Problems, which Must Be Solved in Nearest Future in Practical Context. Pol J Pathol 2003, 54, 3, 209-214.
• Johannes Heurix, Thomas Neubauer, Privacy-Preserving Storage and Access of Medical Data through Pseudonymization and Encryption.
• Y. Zhou, K. Panetta, S. Agaian, A lossless encryption method for medical images using edge maps. Engineering in Medicine and Biology Society, 2009. EMBC 2009. Annual International Conference of the IEEE, Sept 2009.