A Search Engine That Learns

Post on 10-Feb-2016


Jeff Elser – jelser@cs.montana.edu
John Paxton – paxton@cs.montana.edu

Montana State University - Bozeman

Presentation Outline
I. Problem
II. Background Information
III. Approach
IV. Preliminary Results
V. Future Work
VI. Summary
VII. Questions

I. Problem
RightNow software use
Spidering and searching
Website optimization
• Page by page is tedious and time-consuming
• Dual ownership should allow perfect optimization
Solutions
• Search engine adjustments
• Suggesting specific web page changes

II. Background – Search Engine
Spidering
Indexing
Weighting factors

Weight Identifier    Default Value    First Test Case Results
backlink             1000.0           510.0
description          150.0            980.0
keywords             100.0            66.0
title                100.0            180.0
meta-description     50.0             920.0
heading 1            5.0              130.0
author               1.0              440.0
multi-match          1.0              170.0
text                 1.0              0.0
url text             1.0              540.0
date                 0.35             140.0
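To illustrate why changing weighting factors can reorder results, here is a minimal sketch. It assumes a page's score is simply a weighted sum of per-field match counts. That formula, the helper names `score_page` and `rank_pages`, and the two sample pages are all assumptions for illustration; this is not ht://Dig's actual scoring code. The weight values are taken from the table above.

```python
# Hypothetical illustration: score = sum(weight[field] * matches[field]).
# This is NOT ht://Dig's real scoring algorithm; it only shows how a
# change of weights can flip the order of two pages.

# Three rows from the table: default values and first-test-case results.
DEFAULT_WEIGHTS = {"title": 100.0, "heading 1": 5.0, "text": 1.0}
TUNED_WEIGHTS = {"title": 180.0, "heading 1": 130.0, "text": 0.0}


def score_page(match_counts, weights):
    """Weighted sum of how often the query matched in each field."""
    return sum(weights.get(field, 0.0) * count
               for field, count in match_counts.items())


def rank_pages(pages, weights):
    """Return page names ordered from highest to lowest score."""
    return sorted(pages,
                  key=lambda name: score_page(pages[name], weights),
                  reverse=True)


# Made-up match counts for two made-up pages.
pages = {
    "a.htm": {"title": 1, "text": 2},
    "b.htm": {"heading 1": 1, "text": 150},
}

default_order = rank_pages(pages, DEFAULT_WEIGHTS)
tuned_order = rank_pages(pages, TUNED_WEIGHTS)
print(default_order)
print(tuned_order)
```

Under these assumptions the body-text-heavy page wins with the default weights, while the page with a title match wins once the text weight drops to 0.0.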

II. Background – Genetic Algorithms
Goldberg’s Simple GA
• Mutation
• Crossover
• Elitism
• Non-overlapping populations
• Several fitness functions

Individual 1
• Fitness = 2

Individual 2
• Fitness = 4

0 0 0 0

1 1 1 1
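The crossover and mutation operators named on this slide can be sketched on 4-bit individuals like the ones shown. This is a minimal illustration, not GAlib code: the function names, the fixed crossover point, and the count-the-ones toy fitness are assumptions.

```python
import random


def one_point_crossover(parent_a, parent_b, point):
    """Swap the tails of two equal-length bit lists at index `point`."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


def mutate(bits, probability, rng):
    """Flip each bit independently with the given probability."""
    return [1 - b if rng.random() < probability else b for b in bits]


def count_ones(bits):
    """Toy fitness (an assumption): the number of 1 bits."""
    return sum(bits)


rng = random.Random(0)  # seeded so the run is repeatable
a, b = [0, 0, 0, 0], [1, 1, 1, 1]
child_a, child_b = one_point_crossover(a, b, point=2)
mutated = mutate(child_a, probability=0.01, rng=rng)
print(child_a, child_b, count_ones(child_a))
```

Crossing the two parents at the midpoint yields children that each carry half of each parent; mutation then occasionally flips a bit so the search can reach values neither parent had.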

III. Approach
A. Architecture
B. Training data
C. Testing controls (website source)
D. GA specifics
E. Fitness functions

A. Architecture

B. Training Data
Website source
• 20,000 newsgroup articles from the UCI Knowledge Discovery in Databases Archive
• Hand-formatted HTML
• Chosen for word count and structure

C. Testing Controls
Webmaster provides training data
• List of important keywords
• Associated ranked pages
• Tedious, but trivial compared to optimizing all pages

D. GA Specifics
Random initial population
• Population size 1000
• Used GAlib’s built-in random number generator
Genome
• 16 real numbers corresponding to the 16 weighting factors
• Range 0.0 – 1000.0

D. GA Specifics
GA executes for 10000 generations
Elitism is turned on
Mutation probability = 0.01
Crossover probability = 0.6
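The settings above can be put together in a small generational loop. The sketch below is a plain-Python stand-in, not GAlib: the genome length, value range, and the two probabilities come from the slides, while the tournament selection, the replace-a-gene mutation, the toy cost function, and the reduced population size and generation count are assumptions chosen to keep the example short.

```python
import random

GENOME_LENGTH = 16        # from the slides: 16 weighting factors
LOW, HIGH = 0.0, 1000.0   # from the slides: range 0.0 to 1000.0
P_MUTATION = 0.01         # from the slides
P_CROSSOVER = 0.6         # from the slides


def random_genome(rng):
    return [rng.uniform(LOW, HIGH) for _ in range(GENOME_LENGTH)]


def tournament(population, scores, rng):
    """Assumed selection scheme: better (lower cost) of two random picks."""
    i, j = rng.randrange(len(population)), rng.randrange(len(population))
    return population[i] if scores[i] <= scores[j] else population[j]


def breed(mother, father, rng):
    """One-point crossover with P_CROSSOVER, then per-gene mutation."""
    child = list(mother)
    if rng.random() < P_CROSSOVER:
        point = rng.randrange(1, GENOME_LENGTH)
        child = mother[:point] + father[point:]
    # Assumed mutation: replace a gene with a fresh random value.
    return [rng.uniform(LOW, HIGH) if rng.random() < P_MUTATION else g
            for g in child]


def evolve(cost, population_size, generations, seed=0):
    """Minimise `cost`; returns (best genome, best cost per generation)."""
    rng = random.Random(seed)
    population = [random_genome(rng) for _ in range(population_size)]
    history = []
    for _ in range(generations):
        scores = [cost(g) for g in population]
        best = population[scores.index(min(scores))]
        history.append(min(scores))
        # Non-overlapping populations: build a whole new generation and
        # carry over only the single best genome (elitism).
        population = [best] + [
            breed(tournament(population, scores, rng),
                  tournament(population, scores, rng), rng)
            for _ in range(population_size - 1)
        ]
    scores = [cost(g) for g in population]
    history.append(min(scores))
    return population[scores.index(min(scores))], history


def toy_cost(genome):
    """Made-up objective: distance from an arbitrary target of 500.0."""
    return sum(abs(g - 500.0) for g in genome)


best, history = evolve(toy_cost, population_size=50, generations=30)
print(history[0], history[-1])
```

Because the best genome is always copied into the next generation, the best cost recorded in `history` can never get worse from one generation to the next. In the real system the cost would come from running the search engine with the candidate weights and comparing the resulting rankings with the desired ones.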

E. Fitness Function 1
∑D, where D = |(actual ranking) – (desired ranking)|
+1 to avoid division by 0
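As a worked example of fitness function 1, the sketch below sums the rank distances for a small set of pages. The dictionary representation of a ranking and the page names are made up for illustration. The reciprocal form 1 / (∑D + 1) is an assumption about how the "+1 to avoid division by 0" is used, since the slide text does not show the full expression.

```python
def total_distance(actual, desired):
    """Sum of |actual rank - desired rank| over the pages in `desired`."""
    return sum(abs(actual[page] - rank) for page, rank in desired.items())


def fitness_1(actual, desired):
    # Assumed form: the +1 keeps the denominator non-zero when the two
    # rankings match exactly (total distance 0).
    return 1.0 / (total_distance(actual, desired) + 1)


# Made-up rankings: page name -> position in the result list.
desired = {"index.htm": 1, "ga.htm": 2, "seo.htm": 3}
actual = {"index.htm": 1, "ga.htm": 3, "seo.htm": 2}

print(total_distance(actual, desired))   # 2
print(fitness_1(desired, desired))       # 1.0
```

Two pages are each one position away from where they should be, so ∑D = 2; a perfect ranking gives ∑D = 0 and, under the assumed reciprocal form, the maximum fitness of 1.0.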

E. Fitness Function 2
+100 penalty for pages that don’t appear
-10 reward for pages with a perfect fit
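The penalty and reward above can be sketched as adjustments to a summed rank distance. This is one possible reading, not the authors' code: it assumes the +100 and -10 terms replace the per-page distance for missing and perfectly placed pages, and that lower totals are better. The function name and sample rankings are made up.

```python
MISSING_PENALTY = 100   # from the slide: page does not appear in the results
PERFECT_REWARD = -10    # from the slide: page sits exactly at its desired rank


def adjusted_distance(actual, desired):
    """Lower is better. `actual` and `desired` map page name -> rank."""
    total = 0
    for page, wanted_rank in desired.items():
        if page not in actual:
            total += MISSING_PENALTY
        elif actual[page] == wanted_rank:
            total += PERFECT_REWARD
        else:
            total += abs(actual[page] - wanted_rank)
    return total


# Made-up rankings; "etc.htm" is missing from the actual results.
desired = {"index.htm": 1, "ga.htm": 2, "seo.htm": 3, "etc.htm": 4}
actual = {"index.htm": 1, "ga.htm": 3, "seo.htm": 2}

print(adjusted_distance(actual, desired))   # -10 + 1 + 1 + 100 = 92
```

The large penalty makes a missing page far more costly than a misplaced one, which pushes the search toward weight sets that at least return every desired page.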

IV. Preliminary Results
12 tests using fitness function #2
• 1 realistic set of desired rankings
• 11 random sets
4 tests obtained perfect rankings
4 tests improved rankings, but did not achieve optimal
4 tests showed no improvement

IV. Preliminary Results
[Chart: Distance (y-axis, 2.5 to 7.5) vs. Generation Number (x-axis, 0 to 100); series: Htdig default weights, Fitness Function #2]

IV. Preliminary Results
[Chart: Fitness Value (y-axis, 0 to 0.4) vs. Generation Number (x-axis, 0 to 175); series: Fitness Function #2, Htdig default weights]

V. Future Work – Fitness Function 3
Levenshtein Distance
D = string 1; A = string 2
Construct an m × n matrix (M), where m = |D|+1 and n = |A|+1
M[i,0] = i and M[0,j] = j
For each remaining cell M[i,j] (characters indexed from 1):
• If D[i] == A[j], then cost = 0
• If D[i] != A[j], then cost = 1
• M[i,j] = MIN {a, b, c}, where
  a = M[i-1,j] + 1
  b = M[i,j-1] + 1
  c = M[i-1,j-1] + cost
Distance = M[|D|,|A|]

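Example: FROM vs. FARM, with the matrix origin at the lower right (distance = 2, upper left)

2 3 3 3 4   M
2 2 2 2 3   O
2 1 1 1 2   R
3 2 1 0 1   F
4 3 2 1 0
M R A F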
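The matrix construction above can be sketched directly in code. This is the standard dynamic-programming form of Levenshtein distance with the origin at the upper left; the function name is an assumption, and Python strings are indexed from 0, so the character comparison uses `i - 1` and `j - 1`.

```python
def levenshtein(d, a):
    """Edit distance between strings d and a (insert, delete, substitute)."""
    rows, cols = len(d) + 1, len(a) + 1
    m = [[0] * cols for _ in range(rows)]

    # First column and first row: distance from the empty prefix.
    for i in range(rows):
        m[i][0] = i
    for j in range(cols):
        m[0][j] = j

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if d[i - 1] == a[j - 1] else 1
            m[i][j] = min(
                m[i - 1][j] + 1,         # a: deletion
                m[i][j - 1] + 1,         # b: insertion
                m[i - 1][j - 1] + cost,  # c: match or substitution
            )
    return m[rows - 1][cols - 1]


print(levenshtein("FROM", "FARM"))   # 2
```

The result for FROM and FARM agrees with the corner value of the example matrix.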

V. Future Work – Fitness Function 3
Levenshtein Distance
Reduce the URL comparison to string comparison
Experiment further using LD as a fitness function
• Sigmoid weighting function to increase the importance of the front of the string

F A R M
↓
www.url.com/index.htm
www.url.com/ga.htm
www.url.com/seo.htm
www.url.com/etc.htm
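One way to reduce a URL ranking to a string is to give every URL its own letter and then spell out each ranking in order. The sketch below is an illustration only: the helper names and the first-seen letter assignment are assumptions, and this simple alphabet handles at most 26 distinct URLs. The resulting strings could then be compared with a Levenshtein distance function.

```python
import string


def build_alphabet(urls):
    """Assign each distinct URL one uppercase letter, in first-seen order.

    Limited to 26 URLs; a larger site would need a bigger symbol set.
    """
    alphabet = {}
    for url in urls:
        if url not in alphabet:
            alphabet[url] = string.ascii_uppercase[len(alphabet)]
    return alphabet


def ranking_to_string(ranking, alphabet):
    """Spell out a ranked list of URLs as one letter per position."""
    return "".join(alphabet[url] for url in ranking)


desired = [
    "www.url.com/index.htm",
    "www.url.com/ga.htm",
    "www.url.com/seo.htm",
    "www.url.com/etc.htm",
]
# Made-up search result in which the middle two pages are swapped.
actual = [desired[0], desired[2], desired[1], desired[3]]

alphabet = build_alphabet(desired)
print(ranking_to_string(desired, alphabet))   # ABCD
print(ranking_to_string(actual, alphabet))    # ACBD
```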

V. Future Work
Create more extensive test sets
• dare.com, studentaid.ed.gov, fafsa.ed.gov, americorps.org


V. Future Work
For pages that still do not rank properly, create optimization suggestions
Use custom meta tags to properly rank outliers
Use implicit user feedback to find the desired rankings

VI. Summary
Proof of concept
Testing on real-world websites will strengthen results and open other areas of study.

VII. Questions
Thanks for attending
Any questions?