A Search Engine That Learns

A Search Engine That Learns Jeff Elser – [email protected] John Paxton – [email protected] Montana State University - Bozeman


Page 1: A Search Engine That Learns

A Search Engine That Learns

Jeff Elser – [email protected]
John Paxton – [email protected]

Montana State University - Bozeman

Page 2: A Search Engine That Learns

Presentation Outline
I. Problem
II. Background Information
III. Approach
IV. Preliminary Results
V. Future Work
VI. Summary
VII. Questions

Page 3: A Search Engine That Learns

I. Problem
RightNow software use
Spidering and searching
Website optimization
  • Page by page is tedious and time consuming
  • Dual ownership should allow perfect optimization
Solutions
  • Search engine adjustments
  • Suggesting specific web page changes

Page 4: A Search Engine That Learns

II. Background – Search Engine
Spidering
Indexing
Weighting factors

Weight Identifier    Default Value    First Test Case Results
backlink             1000.0           510.0
description          150.0            980.0
keywords             100.0            66.0
title                100.0            180.0
meta-description     50.0             920.0
heading 1            5.0              130.0
author               1.0              440.0
multi-match          1.0              170.0
text                 1.0              0.0
url text             1.0              540.0
date                 0.35             140.0

Page 5: A Search Engine That Learns

II. Background – Genetic Algorithms
Goldberg’s Simple GA
  • Mutation
  • Crossover
  • Elitism
  • Non-overlapping populations
  • Several fitness functions

[Figure: two example individuals shown as 4-bit strings (0 0 0 0 and 1 1 1 1). Individual 1: Fitness = 2. Individual 2: Fitness = 4.]

Page 6: A Search Engine That Learns

III. Approach
A. Architecture
B. Training data
C. Testing controls (website source)
D. GA specifics
E. Fitness functions

Page 7: A Search Engine That Learns

A. Architecture

Page 8: A Search Engine That Learns

B. Training Data
Website source
  • 20000 newsgroup articles from UCI Knowledge Discovery in Databases Archive
  • Hand formatted HTML
  • Chosen for word count and structure

Page 9: A Search Engine That Learns

C. Testing Controls
Webmaster provides training data
  • List of important keywords
  • Associated ranked pages
  • Tedious, but trivial compared to optimizing all pages

Page 10: A Search Engine That Learns

D. GA Specifics
Random initial population
  • Population size 1000
  • Used GAlib’s built in random number generator
Genome
  • 16 real numbers corresponding to the 16 weighting factors
  • Range 0.0 – 1000.0

Page 11: A Search Engine That Learns

D. GA Specifics
GA executes for 10000 generations
Elitism is turned on
Mutation probability = 0.01
Crossover probability = 0.6
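The settings on the two GA slides can be put together in a small sketch. This is not the GAlib code used in the project; it is a minimal Python illustration under stated assumptions: single-point crossover, per-gene random-reset mutation, and size-2 tournament selection are all choices made here for brevity (the slides do not say which operators or selection scheme were used), and `fitness` is any caller-supplied function where larger is better. The default population size and generation count are reduced so the sketch runs quickly.

```python
import random

GENOME_LENGTH = 16                    # one real number per weighting factor
WEIGHT_MIN, WEIGHT_MAX = 0.0, 1000.0  # range from the slide
P_MUTATION = 0.01
P_CROSSOVER = 0.6


def random_genome(rng):
    """One individual: 16 weights drawn uniformly from the allowed range."""
    return [rng.uniform(WEIGHT_MIN, WEIGHT_MAX) for _ in range(GENOME_LENGTH)]


def crossover(a, b, rng):
    """Single-point crossover (assumed operator), applied with P_CROSSOVER."""
    if rng.random() < P_CROSSOVER:
        point = rng.randrange(1, GENOME_LENGTH)
        return a[:point] + b[point:], b[:point] + a[point:]
    return a[:], b[:]


def mutate(genome, rng):
    """Reset each gene to a fresh random weight with probability P_MUTATION."""
    return [rng.uniform(WEIGHT_MIN, WEIGHT_MAX) if rng.random() < P_MUTATION else g
            for g in genome]


def tournament(population, scores, rng):
    """Size-2 tournament selection (assumed; not taken from the slides)."""
    i, j = rng.randrange(len(population)), rng.randrange(len(population))
    return population[i] if scores[i] >= scores[j] else population[j]


def evolve(fitness, population_size=50, generations=100, seed=0):
    """Non-overlapping generations with elitism; returns the best genome found."""
    rng = random.Random(seed)
    population = [random_genome(rng) for _ in range(population_size)]
    for _ in range(generations):
        scores = [fitness(g) for g in population]
        best = max(zip(scores, population), key=lambda pair: pair[0])[1]
        next_population = [best[:]]  # elitism: the best individual always survives
        while len(next_population) < population_size:
            p1 = tournament(population, scores, rng)
            p2 = tournament(population, scores, rng)
            c1, c2 = crossover(p1, p2, rng)
            next_population.extend([mutate(c1, rng), mutate(c2, rng)])
        population = next_population[:population_size]
    return max(population, key=fitness)
```

In the setting of the talk, `fitness` would run the search engine with the candidate weights and compare the resulting rankings with the webmaster's desired rankings; with elitism on, the best score never gets worse from one generation to the next.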

Page 12: A Search Engine That Learns

E. Fitness Function 1
Based on ∑D, where D = |(actual ranking) – (desired ranking)|
+1 is added to avoid division by 0
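As a rough illustration of Fitness Function 1, the sketch below sums the rank distance D over the webmaster's pages. The slide's exact formula did not survive extraction, so the reciprocal form 1 / (∑D + 1) is an assumption suggested only by the "+1 to avoid division by 0" note. The function names and the dictionary layout (URL mapped to rank) are hypothetical, and every desired page is assumed to appear in the actual results.

```python
def rank_distance_sum(actual, desired):
    """Sum of D = |actual rank - desired rank| over all desired pages."""
    return sum(abs(actual[url] - rank) for url, rank in desired.items())


def fitness_1(actual, desired):
    """Assumed form: 1 / (sum of D + 1), so a perfect ranking scores 1.0."""
    return 1.0 / (rank_distance_sum(actual, desired) + 1)


desired = {"index.htm": 1, "ga.htm": 2, "seo.htm": 3}
actual = {"index.htm": 1, "ga.htm": 3, "seo.htm": 2}  # two pages swapped
print(fitness_1(actual, desired))  # sum of D is 2, so this prints 1/3
```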

Page 13: A Search Engine That Learns

E. Fitness Function 2
+100 penalty for pages that don’t appear
-10 reward for pages with a perfect fit
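One way to read Fitness Function 2 is as a rank-distance total with two adjustments. The sketch below assumes that the penalty and reward are added to a summed distance where lower is better; the slide gives only the two constants, so how they combine with the distance is an assumption, and the names and dictionary layout are hypothetical.

```python
MISSING_PENALTY = 100   # page does not appear in the results at all
PERFECT_REWARD = -10    # page appears at exactly the desired rank


def distance_2(actual, desired):
    """Lower is better: rank distance with a penalty and a reward (assumed form)."""
    total = 0
    for url, wanted_rank in desired.items():
        got_rank = actual.get(url)
        if got_rank is None:
            total += MISSING_PENALTY
        elif got_rank == wanted_rank:
            total += PERFECT_REWARD
        else:
            total += abs(got_rank - wanted_rank)
    return total


desired = {"index.htm": 1, "ga.htm": 2, "seo.htm": 3}
actual = {"index.htm": 1, "ga.htm": 3}   # seo.htm is missing
print(distance_2(actual, desired))       # -10 + 1 + 100 = 91
```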

Page 14: A Search Engine That Learns

IV. Preliminary Results
12 tests using fitness function #2
  • 1 realistic set of desired rankings
  • 11 random sets
4 tests obtained perfect rankings
4 improved rankings, but did not achieve optimal
4 tests showed no improvement

Page 15: A Search Engine That Learns

IV. Preliminary Results

[Plot: Distance (y-axis, 2.5 to 7.5) versus Generation Number (x-axis, 0 to 100). Series: Htdig default weights; Fitness Function #2.]

Page 16: A Search Engine That Learns

IV. Preliminary Results

[Plot: Fitness Value (y-axis, 0 to 0.4) versus Generation Number (x-axis, 0 to 175). Series: Fitness Function #2; Htdig default weights.]

Page 17: A Search Engine That Learns

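V. Future Work – Fitness Function 3
Levenshtein Distance
D = string 1; A = string 2
Construct an m × n matrix (M), where m = |D|+1 and n = |A|+1
M[i,0] = i and M[0,j] = j
For each remaining cell M[i,j] (1 ≤ i ≤ |D|, 1 ≤ j ≤ |A|), with D[i] and A[j] the i-th and j-th characters:
  • If D[i] == A[j] then cost = 0
  • If D[i] != A[j] then cost = 1
  • M[i,j] = MIN {a, b, c} where
      a = M[i-1,j] + 1
      b = M[i,j-1] + 1
      c = M[i-1,j-1] + cost
Distance = M[|D|,|A|]

Example matrix for FROM and FARM (origin at lower right; Distance = 2):

2 3 3 3 4   M
2 2 2 2 3   O
2 1 1 1 2   R
3 2 1 0 1   F
4 3 2 1 0
M R A F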
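The matrix construction on this slide can be sketched directly in Python. This is a plain, unoptimized dynamic-programming version. The function name is hypothetical, and the code uses 0-based string indexing, so character `d[i - 1]` is compared when filling row `i`.

```python
def levenshtein(d, a):
    """Edit distance between strings d and a via the (|d|+1) x (|a|+1) matrix."""
    rows, cols = len(d) + 1, len(a) + 1
    m = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        m[i][0] = i          # deleting i characters from d
    for j in range(cols):
        m[0][j] = j          # inserting j characters of a
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if d[i - 1] == a[j - 1] else 1
            m[i][j] = min(m[i - 1][j] + 1,         # deletion
                          m[i][j - 1] + 1,         # insertion
                          m[i - 1][j - 1] + cost)  # match or substitution
    return m[rows - 1][cols - 1]


print(levenshtein("FROM", "FARM"))  # 2
```

For the two strings that label the slide's example matrix, FROM and FARM, the result is 2, which matches the corner cell of that matrix.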

Page 18: A Search Engine That Learns

V. Future Work – Fitness Function 3
Levenshtein Distance
Reduce the url comparison to string comparison
Experiment further using LD as a fitness function
  • Sigmoid weighting function to increase the importance of the front of the string

F A R M
↓
www.url.com/index.htm
www.url.com/ga.htm
www.url.com/seo.htm
www.url.com/etc.htm
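Reducing URL comparison to string comparison can be sketched as follows: each distinct URL is given one symbol, and a ranked list of URLs becomes a string of those symbols. The function name and the choice of uppercase letters are assumptions made for this illustration (which therefore handles at most 26 distinct URLs); the slide does not specify the encoding.

```python
import string


def encode_rankings(desired_urls, actual_urls):
    """Map each distinct URL to a letter and return both rankings as strings."""
    symbols = {}

    def symbol_for(url):
        # Assign letters in order of first appearance: A, B, C, ...
        if url not in symbols:
            symbols[url] = string.ascii_uppercase[len(symbols)]
        return symbols[url]

    desired = "".join(symbol_for(u) for u in desired_urls)
    actual = "".join(symbol_for(u) for u in actual_urls)
    return desired, actual


desired_urls = ["www.url.com/index.htm", "www.url.com/ga.htm",
                "www.url.com/seo.htm", "www.url.com/etc.htm"]
actual_urls = ["www.url.com/index.htm", "www.url.com/seo.htm",
               "www.url.com/other.htm", "www.url.com/etc.htm"]
print(encode_rankings(desired_urls, actual_urls))  # ('ABCD', 'ACED')
```

The two resulting strings can then be passed to any edit-distance routine, so that a single number summarizes how far the actual ranking is from the desired one.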

Page 19: A Search Engine That Learns

V. Future Work
Create more extensive test sets
  • dare.com, studentaid.ed.gov, fafsa.ed.gov, americorps.org

Page 20: A Search Engine That Learns

V. Future Work

Page 21: A Search Engine That Learns

V. Future Work

Page 22: A Search Engine That Learns

V. Future Work
For pages that still do not rank properly, create optimization suggestions
Use custom meta tags to properly rank outliers
Use implicit user feedback to find the desired rankings

Page 23: A Search Engine That Learns

VI. Summary
Proof of concept
Testing on real world websites will strengthen results and open other areas of study.

Page 24: A Search Engine That Learns

VII. Questions
Thanks for attending

Any questions?