Page 1: Smart Software with F#

Joel Pobar
Language Geek
http://callvirt.net/blog

Page 2: Agenda

- What is it?
- F# Intro
- Algorithms:
  - Search
  - Fuzzy Matching
  - Classification (SVM)
  - Recommendations
- Q&A

Page 3: All This in 45 mins?

This is an awareness session!
- Lots of content, very broad, very fast
- You’ll get all demos, pointers, and slide deck to take offline and digest

Two takeaways:
- F# is a great language for data
- Smart algorithms aren’t hard – use them, explore more!

Page 4: F# is

...a functional, object-oriented, imperative and explorative programming language for .NET

What is Functional Programming?

http://callvirt.net/jaoo.zip

Page 5: What is Functional Programming?

Wikipedia: “A programming paradigm that treats computation as the evaluation of mathematical functions and avoids state and mutable data”

-> Emphasizes functions
-> Emphasizes shapes of data, rather than implementation
-> Modeled on lambda calculus
-> Reduced emphasis on imperative
-> Safely raises level of abstraction

Page 6: Motivation for Functional

Simplicity in life is good: cheaper, easier, faster, better.

We typically achieve simplicity in software in two ways:
- By raising the level of abstraction (and OO was one design to raise abstraction)
- By increasing modularity

Increasing signal to noise is another good strategy:
- Communicate more in less time with more clarity
- Better composition and modularity == reuse

Page 7: Functional Programming: Safer, while still being useful

[Diagram: languages plotted on two axes, Unsafe to Safe and Not Useful to Useful. Labels: C#, C++, …; V.Next#; Haskell; F#]

Page 8: What is F# for?

F# is a General Purpose language
- Can be used for a broad range of programming tasks
- Superset of imperative and dynamic features

Great for learning FP concepts

Some particularly important domains
- Financial modeling and analysis
- Data mining
- Scientific data analysis
- Domain-specific modeling
- Academic

Page 9: Let

‘Let’ binds values to identifiers

    let helloWorld = "Hello, World"
    printfn "%s" helloWorld

    let myNum = 12
    let myAddFunction x y =
        let sum = x + y
        sum

Type inference: the static typing of C# with the succinctness of a scripting language

Page 10: Tuples

Simple, and most useful data structure

    let site1 = ("msdn.com", 10)
    let site2 = ("abc.net.au", 12)
    let site3 = ("news.com.au", 22)
    let allSites = (site1, site2, site3)

    let fst (a, b) = a
    let snd (a, b) = b

Page 11: Lists, Arrays, Seq and Options

- Lists & Arrays are first-class citizens
- Options provide a some-or-nothing capability

    let list1 = ["Joel"; "Luke"]
    let array = [| 2; 3; 5 |]
    let myseq = seq [0; 1; 2]

    let option1 = Some("Joel")
    let option2 = None

Page 12: Records

Simple concrete type definition

    type Person =
        { Name: string
          DateOfBirth: System.DateTime }

    // DateOfBirth is a DateTime, so it cannot be assigned a string such as "13/04/81"
    let n = { Name = "Joel"; DateOfBirth = System.DateTime(1981, 4, 13) }

Page 13: Immutability (by default)

Values may not be changed

Data is immutable by default
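A minimal sketch of what "immutable by default" looks like in code. The names `total`, `counter` and `Point` are illustrative only and do not come from the talk's demos.

```fsharp
// Values bound with 'let' cannot be reassigned.
let total = 10
// total <- 11        // would not compile: this value is not mutable

// Mutation is opt-in through the 'mutable' keyword.
let mutable counter = 0
counter <- counter + 1

// Records are immutable as well; 'with' builds a modified copy.
type Point = { X: int; Y: int }
let p1 = { X = 1; Y = 2 }
let p2 = { p1 with Y = 5 }

printfn "%d %d %A %A" total counter p1 p2
```

The original `p1` is untouched after `p2` is created, which is what makes sharing data between functions (and threads) safer.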

Page 14: Discriminated Unions

Great for representing the structure of data

    type Make = string
    type Model = string
    type Transport =
        | Car of Make * Model
        | Bicycle

    let me = Car ("Holden", "Barina")
    let you = Bicycle

Both of these identifiers are of type “Transport”

Page 15: Functions

- Functions: like delegates + unified and simple
- Deep type inference

    (fun x -> x + 1)

    let myFunc x = x + 1
    // val myFunc : int -> int

    let rec factorial n =
        if n > 1 then n * factorial (n - 1)
        else 1

    // List.sortWith takes the comparison function; List.sort takes only the list
    let data = [5; 3; 4; 4; 5]
    List.sortWith (fun x y -> x - y) data

Page 16: Pattern Matching

    let (fst, _) = ("first", "second")
    System.Console.WriteLine(fst)

    let switchOnType (a: obj) =
        match a with
        | :? int -> printfn "int!"
        | :? Transport -> printfn "Transport"
        | _ -> printfn "Everything Else!"

- Very important part of F#
- Helps deal with the ‘teasing apart’ of data
- Works best with Discriminated Unions & Records
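A short sketch of pattern matching against a discriminated union, reusing the `Transport` type from the Discriminated Unions slide. The `describe` function is a hypothetical example, not part of the talk.

```fsharp
type Make = string
type Model = string

type Transport =
    | Car of Make * Model
    | Bicycle

// Each case is matched by its shape; patterns can mix literals and
// variables, and the compiler warns when a case is left unhandled.
let describe (t: Transport) =
    match t with
    | Car ("Holden", model) -> sprintf "A Holden %s" model
    | Car (make, model) -> sprintf "A %s %s" make model
    | Bicycle -> "A bicycle"

printfn "%s" (describe (Car ("Holden", "Barina")))
printfn "%s" (describe Bicycle)
```

Patterns are tried top to bottom, so the more specific `Car ("Holden", model)` case has to come before the general `Car (make, model)` case.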

Page 17: Lists, Types, Interactive

demo

Page 18: Search

Given a search term and a large document corpus, rank and return a list of the most relevant results…

Page 19: Blog Crawler

Page 20: Search

- Words
  - Stemming? Tokenize? E.g. ‘Python/Ruby’
- Markup
  - Title, Author, Date
  - Headings (h1, h2 etc.)
  - Paragraphs
- Links
  - A sign of strength?

Let’s explore something simple…

Page 21: Search

Simplify:
- For easy machine/language manipulation
- … and most importantly, easy computation

Vectors: nature’s own quality data structure
- Convenient machine representation (lists/arrays)
- Lots of existing vector math algorithms

Example document:

    After a loving incubation period, moonlight 2.0 has been released. <a href="whatever">source code</a><br><a href="something else">FireFox binaries</a> …

Term counts:

    after  incubation  loving  moonlight  firefox  linux  binaries
        2           1       1          6        4      6         2
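Turning a document into a term-count vector can be sketched in a few lines. This is an illustration only: the `tokenize` and `termCounts` names are assumptions, and the tokenizer is deliberately naive (no stemming, no HTML stripping).

```fsharp
open System.Text.RegularExpressions

// Lower-case the text and split on runs of non-alphanumeric characters.
let tokenize (text: string) =
    Regex.Split(text.ToLowerInvariant(), "[^a-z0-9]+")
    |> Array.filter (fun w -> w <> "")

// Count occurrences of each term: a sparse term vector for the document.
let termCounts (text: string) =
    tokenize text |> Seq.countBy id |> Map.ofSeq

let counts =
    termCounts "After a loving incubation period, Moonlight 2.0 has been released. Moonlight!"

printfn "%A" counts
```

A `Map` keeps only the terms that occur; a dense vector over the whole vocabulary can be produced later by looking up each vocabulary term and defaulting to 0.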

Page 22: Term Count

Document1 (Linux post):

    the 9, incubation 1, crazy 1, moonlight 6, firefox 4, linux 6, penguin 2

Document2 (Animal post):

    the 2, dog 1, penguin 5, crazy 2

Vector space:

                 the  incubation  crazy  moonlight  firefox  linux  dog  penguin
    Document1      9           1      1          6        4      6    0        2
    Document2      2           0      2          0        0      0    1        5

Page 23: Term Count Issues

Query: ‘the dog penguin’
- Linux: 9+0+2 = 11
- Animal: 2+1+5 = 8

‘the’ is overweight

Enter TF-IDF: Term Frequency Inverse Document Frequency
- A weight to evaluate how important a word is to a document in a corpus
- i.e. if ‘the’ occurs in 98% of all documents, we shouldn’t weight it very highly in the total query

                 the  incubation  crazy  moonlight  firefox  linux  dog  penguin
    Document1      9           1      1          6        4      6    0        2
    Document2      2           0      2          0        0      0    1        5

Page 24: TF-IDF

Normalise the term count:

    tf = termCount / docWordCount

Measure importance of term:

    idf = log ( |D| / termDocumentCount )

where |D| is the total number of documents in the corpus, and termDocumentCount is the number of documents that contain the term

    tfidf = tf * idf

A high weight is reached by a high term frequency and a low document frequency
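The three formulas translate almost directly into F#. The sketch below is not the talk's demo code: the `Document` alias, the `score` function (summing tf-idf over the query terms) and the tiny corpus are assumptions made for illustration.

```fsharp
// A document is an array of already-tokenized words.
type Document = string[]

// tf = termCount / docWordCount
let tf (term: string) (doc: Document) =
    if doc.Length = 0 then 0.0
    else
        let termCount = doc |> Array.filter ((=) term) |> Array.length
        float termCount / float doc.Length

// idf = log(|D| / termDocumentCount); defined as 0 when no document has the term
let idf (term: string) (corpus: Document list) =
    let termDocumentCount =
        corpus |> List.filter (Array.exists ((=) term)) |> List.length
    if termDocumentCount = 0 then 0.0
    else log (float corpus.Length / float termDocumentCount)

let tfidf term doc corpus = tf term doc * idf term corpus

// Assumed scoring rule: sum the tf-idf weight of every query term.
let score (query: string list) doc corpus =
    query |> List.sumBy (fun term -> tfidf term doc corpus)

// Hypothetical three-document corpus.
let linux = [| "the"; "linux"; "penguin"; "the" |]
let animal = [| "the"; "dog"; "penguin" |]
let other = [| "the"; "moonlight" |]
let corpus = [ linux; animal; other ]
let query = [ "the"; "dog"; "penguin" ]

printfn "linux: %f, animal: %f" (score query linux corpus) (score query animal corpus)
```

Because ‘the’ appears in every document, its idf is log(1) = 0 and it no longer contributes to the score, so the animal post now outranks the Linux post for this query.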

Page 25: Search Engine in under 10 mins

demo

Page 26: Fuzzy Matching

String similarity algorithms:
- SoundEx; Metaphone
- Jaro Winkler Distance; Cosine similarity; Sellers; Euclidean distance; …
- We’ll look at the Levenshtein Distance algorithm

Defined as: the minimum number of edit operations that transforms string1 into string2

Page 27: Fuzzy Matching

Edit costs:
- In-place copy – cost 0
- Delete a character in string1 – cost 1
- Insert a character in string2 – cost 1
- Substitute a character for another – cost 1

Transform ‘kitten’ into ‘sitting’:
- kitten -> sitten (cost 1 – replace k with s)
- sitten -> sittin (cost 1 – replace e with i)
- sittin -> sitting (cost 1 – add g)

Levenshtein distance: 3
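One way to compute the distance with the edit costs listed above is the standard dynamic-programming formulation. The version sketched here keeps only two rows of the table in integer arrays, so it allocates no temporary strings; it is an illustrative implementation, not necessarily the one used in the demo.

```fsharp
// Levenshtein distance with costs: copy 0, delete 1, insert 1, substitute 1.
let levenshtein (s: string) (t: string) =
    let m, n = s.Length, t.Length
    // prev.[j] holds the distance between the first i-1 chars of s
    // and the first j chars of t; curr is the row being filled in.
    let mutable prev : int[] = Array.init (n + 1) id
    let mutable curr : int[] = Array.zeroCreate (n + 1)
    for i in 1 .. m do
        curr.[0] <- i
        for j in 1 .. n do
            let substitution = prev.[j - 1] + (if s.[i - 1] = t.[j - 1] then 0 else 1)
            let deletion = prev.[j] + 1
            let insertion = curr.[j - 1] + 1
            curr.[j] <- min substitution (min deletion insertion)
        // Swap the rows so 'prev' is the row just computed.
        let tmp = prev
        prev <- curr
        curr <- tmp
    prev.[n]

printfn "%d" (levenshtein "kitten" "sitting")
```

This plain version runs in time proportional to the product of the two string lengths; bounding the maximum distance allows the table to be pruned further.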

Page 28: Fuzzy Matching

Estimated string similarity computation costs:
- Hard on the GC: lots of temporary strings are created and thrown away, so use arrays if possible.
- Levenshtein can be computed in O(kl) time, where ‘l’ is the length of the shortest string and ‘k’ is the maximum distance.
- Parallelisable: split the set of words to compare across n cores.
- Can do approximately 10,000 compares per second on a standard single core laptop.

Page 29: Did You Mean?

demo

Page 30: Classification

Support Vector Machines (SVM)
- Supervised learning for binary classification
- Training inputs: ‘in’ and ‘out’ vectors
- SVM will then find a separating ‘hyperplane’ in an n-dimensional space

Training costs, but classification is cheap
- Can retrain on the fly in some cases
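Why classification is cheap can be seen from the linear case: once training has produced a hyperplane, classifying a vector is a single dot product. The sketch below covers only that classification step; training is not shown, and the `Hyperplane` record and its weights are hypothetical values chosen for illustration.

```fsharp
// A trained linear classifier reduces to a hyperplane: weights w and bias b.
type Hyperplane = { Weights: float[]; Bias: float }

let dot (a: float[]) (b: float[]) =
    Array.fold2 (fun acc x y -> acc + x * y) 0.0 a b

// Which side of the hyperplane does x fall on?
let classify (h: Hyperplane) (x: float[]) =
    if dot h.Weights x + h.Bias >= 0.0 then "in" else "out"

// Hypothetical hyperplane x + y - 3 = 0 in two dimensions.
let plane = { Weights = [| 1.0; 1.0 |]; Bias = -3.0 }

printfn "%s %s" (classify plane [| 4.0; 2.0 |]) (classify plane [| 0.5; 1.0 |])
```

The same `classify` function works unchanged for any number of dimensions, as long as the input vector has the same length as the weight vector.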

Page 31: SVM Classification

Page 32: SVM Issues

- Classification on 2 dimensions is easy, but most input is multi-dimensional
- Some ‘tricks’ are needed to transform the input data

Page 33: SVM Classifier

demo

Page 34: F# and Algorithms: Netflix Demo

Netflix Prize - $1 million USD
- Must beat Netflix prediction algorithm by 10%
- 480k users
- 100 million ratings
- 18,000 movies

Great example of deriving value out of large datasets
- Earns Netflix loads and loads of $$$!

Page 35: Nearest Neighbour

    MovieId  CustomerId  Rating
    Clerks       444444       5
    Clerks      2093393       4
    Clerks          999       5
    Clerks      8668478       1
    Dogma       2432114       3
    Dogma        444444       5
    Dogma           999       5
    ...             ...     ...

Find neighbours who like what I like

Page 36: Netflix Data Format (Netflix Demo)

    MovieId  CustomerId  Rating
    Clerks       444444       5
    Clerks      2093393       4
    Clerks          999       5
    Clerks      8668478       1
    Dogma       2432114       3
    Dogma        444444       5
    Dogma           999       5
    ...             ...     ...

Page 37: Nearest Neighbour Algorithm

Ratings matrix (columns are movie IDs 302, 4418, 3, 56, 732; each row lists one customer’s ratings, with unrated cells omitted):

    CustomerId  Ratings
    444444      5 4 5 2
    999         5 5 1
    111211      3 5 3
    66666       5 5
    1212121     5 4
    5656565     1
    454545      5 5

- Find all my neighbours’ movies
- Find the best movies my neighbours agree on
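The two steps above can be sketched over the (MovieId, CustomerId, Rating) format. This is a deliberately naive illustration, not the demo's algorithm: the "liked" threshold, the definition of a neighbour (anyone who liked at least one movie I liked) and the extra `Mallrats` and `Jaws` rows are all assumptions.

```fsharp
// (MovieId, CustomerId, Rating), the same shape as the Netflix data format.
let ratings =
    [ ("Clerks", 444444, 5); ("Clerks", 2093393, 4); ("Clerks", 999, 5)
      ("Clerks", 8668478, 1); ("Dogma", 2432114, 3); ("Dogma", 444444, 5)
      ("Dogma", 999, 5)
      // Hypothetical extra rows so there is something to recommend.
      ("Mallrats", 999, 5); ("Mallrats", 2093393, 4); ("Jaws", 8668478, 5) ]

// Movies a customer rated at or above the threshold.
let liked threshold customer =
    ratings
    |> List.filter (fun (_, c, r) -> c = customer && r >= threshold)
    |> List.map (fun (m, _, _) -> m)
    |> Set.ofList

// Movies a customer has rated at all.
let seen customer =
    ratings
    |> List.filter (fun (_, c, _) -> c = customer)
    |> List.map (fun (m, _, _) -> m)
    |> Set.ofList

let recommend threshold me =
    let myLikes = liked threshold me
    let mySeen = seen me
    // Neighbours: other customers who liked at least one movie I liked.
    let neighbours =
        ratings
        |> List.map (fun (_, c, _) -> c)
        |> List.distinct
        |> List.filter (fun c ->
            c <> me && not (Set.isEmpty (Set.intersect myLikes (liked threshold c))))
    // Their liked movies that I have not seen, ranked by how many neighbours agree.
    neighbours
    |> List.collect (fun c -> Set.toList (liked threshold c - mySeen))
    |> List.countBy id
    |> List.sortByDescending snd

printfn "%A" (recommend 4 444444)
```

A real recommender would weight neighbours by how similar their rating vectors are rather than treating every neighbour equally.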

Page 38: Netflix Recommendations

demo

Page 39: A Short Stop-over at Vector Math

[Diagram: points A (x1,y1), B (x2,y2) and C (x0,y0)]

If we want to calculate the distance between A and B, we call on Euclidean Distance.

We can represent the points in the same way using Vectors: Magnitude and Direction.

Having this Vector representation allows us to work in ‘n’ dimensions, yet still achieve Euclidean Distance/Angle calculations.
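A small sketch of the distance and angle calculations for vectors of any dimension. The function names are illustrative; both inputs are assumed to have the same length, and the cosine calculation assumes neither vector is all zeros.

```fsharp
let dot (a: float[]) (b: float[]) =
    Array.fold2 (fun acc x y -> acc + x * y) 0.0 a b

let magnitude (a: float[]) = sqrt (dot a a)

// Euclidean distance: the magnitude of the difference vector.
let euclideanDistance (a: float[]) (b: float[]) =
    magnitude (Array.map2 (-) a b)

// Cosine of the angle between two vectors: (a . b) / (|a| |b|).
let cosineSimilarity (a: float[]) (b: float[]) =
    dot a b / (magnitude a * magnitude b)

printfn "%f" (euclideanDistance [| 0.0; 0.0 |] [| 3.0; 4.0 |])
printfn "%f" (cosineSimilarity [| 1.0; 2.0; 3.0 |] [| 2.0; 4.0; 6.0 |])
```

The same functions apply to term-count or rating vectors: a small distance or a cosine close to 1 indicates similar documents or similar customers.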

Page 40: Q & A

Any questions?
http://callvirt.net/[email protected]

THANKS!