Post on 14-Dec-2015
1
Smart Software with F#
Joel Pobar
Language Geek
http://callvirt.net/blog
2
Agenda
What is it?
F# Intro
Algorithms:
  Search
  Fuzzy Matching
  Classification (SVM)
  Recommendations
Q&A
3
All This in 45 mins?
This is an awareness session!
Lots of content, very broad, very fast
You’ll get all demos, pointers, and slide deck to take offline and digest
Two takeaways:
  F# is a great language for data
  Smart algorithms aren’t hard: use them, explore more!
4
F# is
...a functional, object-oriented, imperative and explorative programming language for .NET
what is Functional Programming?
http://callvirt.net/jaoo.zip
5
What is Functional Programming?
Wikipedia: “A programming paradigm that treats computation as the evaluation of mathematical functions and avoids state and mutable data”
-> Emphasizes functions
-> Emphasizes shapes of data, rather than implementation
-> Modeled on lambda calculus
-> Reduced emphasis on imperative
-> Safely raises level of abstraction
6
Motivation for Functional
Simplicity in life is good: cheaper, easier, faster, better.
We typically achieve simplicity in software in two ways:
  By raising the level of abstraction (OO was one design to raise abstraction)
  By increasing modularity
Increasing signal to noise is another good strategy:
  Communicate more in less time with more clarity
  Better composition and modularity == reuse
7
Functional Programming
Safer, while still being useful
[Chart: horizontal axis Unsafe to Safe, vertical axis Not Useful to Useful; plotted languages: C#, C++, …; V.Next#; Haskell; F#]
8
What is F# for?
F# is a General Purpose language
  Can be used for a broad range of programming tasks
  Superset of imperative and dynamic features
Great for learning FP concepts
Some particularly important domains
  Financial modeling and analysis
  Data mining
  Scientific data analysis
  Domain-specific modeling
  Academic
9
Let
‘Let’ binds values to identifiers
let helloWorld = "Hello, World"
printfn "%A" helloWorld

let myNum = 12

let myAddFunction x y =
    let sum = x + y
    sum
Type inference. The static typing of C# with
the succinctness of a scripting language
10
Tuples
Simple, and most useful data structure
let site1 = ("msdn.com", 10)
let site2 = ("abc.net.au", 12)
let site3 = ("news.com.au", 22)
let allSites = (site1, site2, site3)

let fst (a, b) = a
let snd (a, b) = b
11
Lists, Arrays, Seq and Options
Lists & Arrays are first-class citizens
Options provide a some-or-nothing capability

let list1 = ["Joel"; "Luke"]
let array = [|2; 3; 5|]
let myseq = seq [0; 1; 2]

let option1 = Some("Joel")
let option2 = None
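To make the some-or-nothing idea concrete, here is a minimal sketch that looks a site up in a list of tuples. The `tryCount` helper is hypothetical and not from the talk; `List.tryFind` and `Option.map` are standard F# core functions.

```fsharp
let sites = [ ("msdn.com", 10); ("abc.net.au", 12); ("news.com.au", 22) ]

// Hypothetical lookup: Some count when the site is known, None otherwise.
let tryCount name =
    sites
    |> List.tryFind (fun (site, _) -> site = name)
    |> Option.map snd

// The caller is forced to handle both cases.
match tryCount "msdn.com" with
| Some count -> printfn "msdn.com: %d" count
| None -> printfn "msdn.com: not found"
```

Because the result is an option rather than a null, the compiler warns when a match forgets the `None` case.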
12
Records
Simple concrete type definition
type Person =
    { Name: string
      DateOfBirth: System.DateTime }

let n =
    { Name = "Joel"
      DateOfBirth =
        System.DateTime.ParseExact(
            "13/04/81", "dd/MM/yy",
            System.Globalization.CultureInfo.InvariantCulture) }
13
Immutability (by default)
Values may not be changed
Data is immutable by default
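A short sketch of what immutable by default means in practice. The `Person` record mirrors the Records slide; the date value (13 April 1981) and the `renamed` and `counter` names are assumptions made for illustration.

```fsharp
type Person = { Name: string; DateOfBirth: System.DateTime }

// Assumed date, used only for illustration.
let joel = { Name = "Joel"; DateOfBirth = System.DateTime(1981, 4, 13) }

// Copy-and-update builds a new record; 'joel' itself is unchanged.
let renamed = { joel with Name = "Joel P." }

// Mutation is opt-in: it needs the 'mutable' keyword and the '<-' operator.
let mutable counter = 0
counter <- counter + 1

printfn "%s / %s / %d" joel.Name renamed.Name counter
```

Writing `joel.Name <- "Someone"` would not compile, because record fields are immutable unless declared `mutable`.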
14
Discriminated Unions
Great for representing the structure of data
type Make = string
type Model = string
type Transport =
    | Car of Make * Model
    | Bicycle

let me = Car ("Holden", "Barina")
let you = Bicycle
Both of these identifiers are of type “Transport”
15
Functions
Functions: like delegates + unified and simple
Deep type inference

(fun x -> x + 1)

let myFunc x = x + 1
// val myFunc : int -> int

let rec factorial n =
    if n > 1 then n * factorial (n - 1)
    else 1

let data = [5; 3; 4; 4; 5]
List.sortWith (fun x y -> x - y) data
16
Pattern Matching
let (fst, _) = ("first", "second")
System.Console.WriteLine(fst)

let switchOnType (a: obj) =
    match a with
    | :? System.Int32 -> printfn "int!"
    | :? Transport -> printfn "Transport"
    | _ -> printfn "Everything Else!"

Very important part of F#
Helps deal with the ‘teasing apart’ of data
Works best with Discriminated Unions & Records
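As a sketch of why pattern matching pairs well with discriminated unions, the snippet below takes apart the `Transport` type from the earlier slide. The `describe` function is a hypothetical helper, not something shown in the talk.

```fsharp
type Make = string
type Model = string

type Transport =
    | Car of Make * Model
    | Bicycle

// Hypothetical helper: each union case gets its own branch,
// and the Car branch binds the tuple fields to names.
let describe transport =
    match transport with
    | Car (make, model) -> sprintf "Car: %s %s" make model
    | Bicycle -> "Bicycle"

printfn "%s" (describe (Car ("Holden", "Barina")))
printfn "%s" (describe Bicycle)
```

If a new case is later added to `Transport`, the compiler flags `describe` as an incomplete match, which is a large part of the appeal.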
17
Lists, Types, Interactive
demo
18
Search
Given a search term and a large document corpus, rank and return a list of the most relevant results…
19
Blog Crawler
20
Search
Words
  Stemming? Tokenize?
  E.g. ‘Python/Ruby’
Markup
  Title, Author, Date
  Headings (h1, h2 etc.)
  Paragraphs
Links
  A sign of strength?
Let’s explore something simple…
21
Search
Simplify:
  For easy machine/language manipulation
  … and most importantly, easy computation
Vectors: nature’s own quality data structure
  Convenient machine representation (lists/arrays)
  Lots of existing vector math algorithms
Sample post:
  After a loving incubation period, moonlight 2.0 has been released. <a href="whatever">source code</a><br><a href="something else">FireFox binaries</a> …

Term counts:
  after       2
  incubation  1
  loving      1
  moonlight   6
  firefox     4
  linux       6
  binaries    2
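Turning a post into term counts can be sketched roughly as follows. This is a minimal illustration, not the crawler from the demo: the `termCounts` name, the crude tag-stripping regex, and the "letters and digits" definition of a word are all assumptions, and no stemming is attempted.

```fsharp
open System.Text.RegularExpressions

// Hypothetical helper: strip markup, lower-case, split into words, count.
let termCounts (html: string) =
    // Crude tag removal; real HTML would need a proper parser.
    let text = Regex.Replace(html, "<[^>]+>", " ")
    Regex.Matches(text.ToLowerInvariant(), "[a-z0-9]+")
    |> Seq.cast<Match>
    |> Seq.countBy (fun m -> m.Value)
    |> Map.ofSeq

let counts = termCounts "After a loving incubation period, <b>moonlight</b> is out. Moonlight!"
counts |> Map.iter (fun term count -> printfn "%s: %d" term count)
```

The resulting map is the sparse form of the term vector: any term not in the map has a count of zero.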
22
Term Count
Document1: Linux post:
  the 9, incubation 1, crazy 1, moonlight 6, firefox 4, linux 6, penguin 2
Document2: Animal post:
  crazy 2, the 2, dog 1, penguin 5
Vector space:
  term:    the  incubation  crazy  moonlight  firefox  linux  dog  penguin
  Linux:   9    1           1      6          4        6      0    2
  Animal:  2    0           2      0          0        0      1    5
23
Term Count Issues
‘the dog penguin’
  Linux: 9+0+2 = 11
  Animal: 2+1+5 = 8
‘the’ is overweight
Enter TF-IDF: Term Frequency Inverse Document Frequency
  A weight to evaluate how important a word is to a document in a corpus
  i.e. if ‘the’ occurs in 98% of all documents, we shouldn’t weight it very highly in the total query
  term:    the  incubation  crazy  moonlight  firefox  linux  dog  penguin
  Linux:   9    1           1      6          4        6      0    2
  Animal:  2    0           2      0          0        0      1    5
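The raw term-count scoring for the query ‘the dog penguin’ can be sketched like this. The counts are the ones from the slides; the `termCountScore` name and the use of a `Map` per document are assumptions for illustration.

```fsharp
// Term counts for the two example posts.
let linuxPost =
    Map.ofList
        [ ("the", 9); ("incubation", 1); ("crazy", 1); ("moonlight", 6);
          ("firefox", 4); ("linux", 6); ("penguin", 2) ]

let animalPost =
    Map.ofList [ ("the", 2); ("crazy", 2); ("dog", 1); ("penguin", 5) ]

// Hypothetical scorer: sum the counts of each query term, 0 if absent.
let termCountScore (query: string list) (doc: Map<string, int>) =
    query |> List.sumBy (fun term -> defaultArg (Map.tryFind term doc) 0)

let query = [ "the"; "dog"; "penguin" ]
printfn "Linux: %d" (termCountScore query linuxPost)    // 9 + 0 + 2 = 11
printfn "Animal: %d" (termCountScore query animalPost)  // 2 + 1 + 5 = 8
```

The Linux post wins even though it never mentions a dog, purely because ‘the’ is so frequent in it.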
24
TF-IDF
Normalise the term count:
  tf = termCount / docWordCount
Measure importance of term:
  idf = log ( |D| / termDocumentCount )
  where |D| is the total number of documents in the corpus
tfidf = tf * idf
A high weight is reached by a high term frequency and a low document frequency
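The three formulas translate almost directly into F#. In this sketch a document is assumed to be a list of already-tokenized words and the corpus a list of documents; the zero guards for empty documents and unseen terms are an added assumption, not something the slide specifies.

```fsharp
// tf = termCount / docWordCount
let tf (term: string) (doc: string list) =
    if List.isEmpty doc then 0.0
    else
        let termCount = doc |> List.filter (fun w -> w = term) |> List.length
        float termCount / float (List.length doc)

// idf = log ( |D| / termDocumentCount )
let idf (term: string) (corpus: string list list) =
    let termDocumentCount =
        corpus |> List.filter (List.contains term) |> List.length
    if termDocumentCount = 0 then 0.0   // assumption: unseen terms get no weight
    else log (float (List.length corpus) / float termDocumentCount)

let tfidf term doc corpus = tf term doc * idf term corpus

let corpus =
    [ [ "the"; "linux"; "penguin" ]
      [ "the"; "dog"; "penguin" ]
      [ "the"; "cat" ] ]

printfn "the: %f" (tfidf "the" corpus.[1] corpus)
printfn "dog: %f" (tfidf "dog" corpus.[1] corpus)
```

Because ‘the’ appears in every document of this toy corpus, its idf is log 1 = 0 and it stops contributing to the ranking, while ‘dog’ keeps a positive weight.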
25
Search Engine in under 10 mins
demo
26
Fuzzy Matching
String similarity algorithms:
  SoundEx; Metaphone
  Jaro-Winkler Distance; Cosine similarity; Sellers; Euclidean distance; …
We’ll look at the Levenshtein Distance algorithm
  Defined as: the minimum number of edit operations that transform string1 into string2
27
Fuzzy Matching
Edit costs:
  In-place copy: cost 0
  Delete a character in string1: cost 1
  Insert a character in string2: cost 1
  Substitute a character for another: cost 1
Transform ‘kitten’ into ‘sitting’
  kitten -> sitten (cost 1: replace k with s)
  sitten -> sittin (cost 1: replace e with i)
  sittin -> sitting (cost 1: add g)
Levenshtein distance: 3
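The edit costs above lead to the classic dynamic-programming formulation. The sketch below is one plain way to write it, keeping only two rows of the table in arrays; it is not necessarily the code used in the demo, and it does not implement the bounded-distance optimisation.

```fsharp
let levenshtein (s: string) (t: string) =
    // previous.[j] = distance between the first (i - 1) chars of s and the first j chars of t
    let mutable previous = Array.init (t.Length + 1) id
    for i in 1 .. s.Length do
        let current : int[] = Array.zeroCreate (t.Length + 1)
        current.[0] <- i                      // delete all i chars of s
        for j in 1 .. t.Length do
            let substitution = if s.[i - 1] = t.[j - 1] then 0 else 1
            let delete = previous.[j] + 1
            let insert = current.[j - 1] + 1
            let replace = previous.[j - 1] + substitution
            current.[j] <- min (min delete insert) replace
        previous <- current
    previous.[t.Length]

printfn "%d" (levenshtein "kitten" "sitting")
```

Working on integer arrays rather than building intermediate strings keeps allocation low, which matters when comparing one word against a whole dictionary.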
28
Fuzzy Matching
Estimated string similarity computation costs:
  Hard on the GC (lots of temporary strings created and thrown away); use arrays if possible.
  Levenshtein can be computed in O(kl) time, where ‘l’ is the length of the shortest string and ‘k’ is the maximum distance.
  Parallelisable: split the set of words to compare across n cores.
  Can do approximately 10,000 compares per second on a standard single-core laptop.
29
Did You Mean?
demo
30
Classification
Support Vector Machines (SVM)
  Supervised learning for binary classification
  Training inputs: ‘in’ and ‘out’ vectors
  SVM will then find a separating ‘hyperplane’ in an n-dimensional space
Training costs, but classification is cheap
Can retrain on the fly in some cases
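To illustrate why classification is cheap once training is done, here is a sketch of the decision step for a linear SVM: a dot product, a bias, and a sign check. The weights and bias below are made-up numbers standing in for a trained model; training itself is not shown.

```fsharp
// A trained linear model is just a weight vector w and a bias b.
// The hyperplane is the set of points where w·x + b = 0.
let classify (w: float[]) (b: float) (x: float[]) =
    let score = Array.fold2 (fun acc wi xi -> acc + wi * xi) b w x
    if score >= 0.0 then "in" else "out"

// Hypothetical model values, for illustration only.
let weights = [| 1.0; -1.0 |]
let bias = 0.0

printfn "%s" (classify weights bias [| 2.0; 1.0 |])
printfn "%s" (classify weights bias [| 1.0; 2.0 |])
```

Each classification costs one pass over the input vector, regardless of how many examples were used for training.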
31
SVM Classification
32
SVM Issues
Classification on 2 dimensions is easy, but most input is multi-dimensional
Some ‘tricks’ are needed to transform the input data
33
SVM Classifier
demo
34
F# and Algorithms
Netflix Demo
Netflix Prize - $1 million USD
  Must beat Netflix prediction algorithm by 10%
  480k users
  100 million ratings
  18,000 movies
Great example of deriving value out of large datasets
Earns Netflix loads and loads of $$$!
35
Nearest Neighbour
Find neighbours who like what I like

MovieId  CustomerId  Rating
Clerks   444444      5
Clerks   2093393     4
Clerks   999         5
Clerks   8668478     1
Dogma    2432114     3
Dogma    444444      5
Dogma    999         5
...      ...         ...
36
Netflix Data Format
Netflix Demo

MovieId  CustomerId  Rating
Clerks   444444      5
Clerks   2093393     4
Clerks   999         5
Clerks   8668478     1
Dogma    2432114     3
Dogma    444444      5
Dogma    999         5
...      ...         ...
37
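Nearest Neighbour Algorithm
Find all my neighbours’ movies
Find the best movies my neighbours agree on

[Table: customers by movies rating matrix. Columns: CustomerId, then movie IDs 302, 4418, 3, 56, 732. Ratings per customer, in row order:]
  444444   5 4 5 2
  999      5 5 1
  111211   3 5 3
  66666    5 5
  1212121  5 4
  5656565  1
  454545   5 5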
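The two steps above can be sketched on the small ratings table from the data-format slide. This is a deliberately naive version, not the demo's implementation: the `Rating` record, the "liked" threshold of 4 stars, and ranking by the number of agreeing neighbours are all assumptions.

```fsharp
type Rating = { Movie: string; Customer: int; Stars: int }

let ratings =
    [ { Movie = "Clerks"; Customer = 444444; Stars = 5 }
      { Movie = "Clerks"; Customer = 2093393; Stars = 4 }
      { Movie = "Clerks"; Customer = 999; Stars = 5 }
      { Movie = "Clerks"; Customer = 8668478; Stars = 1 }
      { Movie = "Dogma"; Customer = 2432114; Stars = 3 }
      { Movie = "Dogma"; Customer = 444444; Stars = 5 }
      { Movie = "Dogma"; Customer = 999; Stars = 5 } ]

let recommend (threshold: int) (me: int) =
    let mine = ratings |> List.filter (fun r -> r.Customer = me)
    let seen = mine |> List.map (fun r -> r.Movie) |> Set.ofList
    let myLikes =
        mine
        |> List.filter (fun r -> r.Stars >= threshold)
        |> List.map (fun r -> r.Movie)
        |> Set.ofList
    // Neighbours: other customers who liked at least one movie I liked.
    let isNeighbourLike (r: Rating) =
        r.Customer <> me && r.Stars >= threshold && Set.contains r.Movie myLikes
    let neighbours =
        ratings |> List.filter isNeighbourLike |> List.map (fun r -> r.Customer) |> Set.ofList
    // Candidates: unseen movies my neighbours liked, most agreed-upon first.
    let isCandidate (r: Rating) =
        Set.contains r.Customer neighbours && r.Stars >= threshold && not (Set.contains r.Movie seen)
    ratings
    |> List.filter isCandidate
    |> List.countBy (fun r -> r.Movie)
    |> List.sortByDescending snd
    |> List.map fst

printfn "%A" (recommend 4 2093393)
```

Customer 2093393 liked Clerks, as did customers 444444 and 999; both of those also liked Dogma, so Dogma is the recommendation.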
38
Netflix Recommendations
demo
39
A Short Stop-over at Vector Math
A (x1,y1)
B (x2,y2)
C (x0,y0)
If we want to calculate the distance between A and B, we call on Euclidean Distance
We can represent the points in the same way using Vectors: Magnitude and Direction.
Having this Vector representation allows us to work in ‘n’ dimensions, yet still achieve Euclidean Distance/Angle calculations.
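A minimal sketch of both calculations over n-dimensional vectors, using the two term-count vectors from the search slides as sample input. The function names are illustrative; both functions assume the two arrays have the same length (`Array.map2` throws otherwise).

```fsharp
// Straight-line distance between two points in n dimensions.
let euclideanDistance (a: float[]) (b: float[]) =
    Array.map2 (fun x y -> (x - y) * (x - y)) a b
    |> Array.sum
    |> sqrt

let magnitude (v: float[]) =
    v |> Array.sumBy (fun x -> x * x) |> sqrt

// Cosine of the angle between two vectors: 1.0 means same direction.
let cosineSimilarity (a: float[]) (b: float[]) =
    let dot = Array.map2 (fun x y -> x * y) a b |> Array.sum
    dot / (magnitude a * magnitude b)

// Term-count vectors: the, incubation, crazy, moonlight, firefox, linux, dog, penguin
let linux = [| 9.0; 1.0; 1.0; 6.0; 4.0; 6.0; 0.0; 2.0 |]
let animal = [| 2.0; 0.0; 2.0; 0.0; 0.0; 0.0; 1.0; 5.0 |]

printfn "distance: %f" (euclideanDistance linux animal)
printfn "cosine:   %f" (cosineSimilarity linux animal)
```

Cosine similarity ignores vector length, so a long post and a short post on the same topic can still score as close, which is why it is a common choice for comparing documents.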