A Search Engine That Learns
Jeff Elser – [email protected] Paxton – [email protected]
Montana State University - Bozeman
Presentation Outline
I. Problem
II. Background Information
III. Approach
IV. Preliminary Results
V. Future Work
VI. Summary
VII. Questions
I. Problem
RightNow software use
Spidering and searching
Website optimization
• Page by page is tedious and time consuming
• Dual ownership should allow perfect optimization
Solutions
• Search engine adjustments
• Suggesting specific web page changes
II. Background – Search Engine
Spidering
Indexing
Weighting factors

Weight Identifier    Default Value    First Test Case Results
backlink                    1000.0                      510.0
description                  150.0                      980.0
keywords                     100.0                       66.0
title                        100.0                      180.0
meta-description              50.0                      920.0
heading 1                      5.0                      130.0
author                         1.0                      440.0
multi-match                    1.0                      170.0
text                           1.0                        0.0
url text                       1.0                      540.0
date                           0.35                     140.0
II. Background – Genetic Algorithms
Goldberg’s Simple GA
• Mutation
• Crossover
• Elitism
• Non-overlapping populations
• Several fitness functions
Individual 1• Fitness = 2
Individual 2• Fitness = 4
0 0 0 0
1 1 1 1
III. Approach
A. Architecture
B. Training data
C. Testing controls (website source)
D. GA specifics
E. Fitness functions
A. Architecture
B. Training Data
Website source
• 20000 newsgroup articles from UCI Knowledge Discovery in Databases Archive
• Hand formatted HTML
• Chosen for word count and structure
C. Testing Controls
Webmaster provides training data
• List of important keywords
• Associated ranked pages
• Tedious, but trivial compared to optimizing all pages
D. GA Specifics
Random initial population
• Population size 1000
• Used GAlib’s built in random number generator
Genome
• 16 real numbers corresponding to the 16 weighting factors
• Range 0.0 – 1000.0
D. GA Specifics
GA executes for 10000 generations
Elitism is turned on
Mutation probability = 0.01
Crossover probability = 0.6
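The settings above can be illustrated with a minimal plain-Python sketch. It is not GAlib and not the presenters' code: the function name `run_ga`, the binary tournament selection, the single-point crossover, and the small default population and generation counts are all assumptions chosen to keep the example short. Only the genome shape (16 reals in 0.0 – 1000.0), elitism, the full replacement of each generation, and the two probabilities come from the slides.

```python
import random


def run_ga(fitness, genome_len=16, pop_size=50, generations=100,
           p_mut=0.01, p_cross=0.6, lo=0.0, hi=1000.0, seed=0):
    """Maximize `fitness` over real-valued genomes and return the best one."""
    rng = random.Random(seed)
    # Random initial population: each genome is a list of weighting factors.
    pop = [[rng.uniform(lo, hi) for _ in range(genome_len)]
           for _ in range(pop_size)]

    def pick(scored):
        # Binary tournament selection (an assumption, not from the slides).
        a, b = rng.sample(scored, 2)
        return a[1] if a[0] >= b[0] else b[1]

    for _ in range(generations):
        scored = [(fitness(g), g) for g in pop]
        best = max(scored, key=lambda s: s[0])[1]
        nxt = [best[:]]  # elitism: the best individual always survives
        while len(nxt) < pop_size:
            p1, p2 = pick(scored), pick(scored)
            if rng.random() < p_cross:
                cut = rng.randrange(1, genome_len)  # single-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Per-gene mutation: replace the gene with a fresh random value.
            child = [rng.uniform(lo, hi) if rng.random() < p_mut else x
                     for x in child]
            nxt.append(child)
        pop = nxt  # non-overlapping populations: full replacement
    return max(pop, key=fitness)
```

In the setting described on the slides, `fitness` would score a set of 16 weights by re-running the search engine and comparing actual rankings with the desired ones; any function that takes a list of floats works for experimenting.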
E. Fitness Function 1
D = |(actual ranking) – (desired ranking)|
Fitness is based on ∑D, with +1 added to avoid division by 0
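A small sketch of how this distance could be computed. The slide gives D and the +1 guard but the full formula did not survive, so the reciprocal form 1 / (∑D + 1) used here is an assumption, as are the function names and the use of dictionaries mapping page to rank.

```python
def total_distance(actual, desired):
    """Sum of |actual rank - desired rank| over every page with a desired rank."""
    return sum(abs(actual[page] - desired[page]) for page in desired)


def fitness_1(actual, desired):
    # Assumed form: reciprocal of the summed distance, with +1 in the
    # denominator so a perfect ranking (sum of D = 0) does not divide by 0.
    return 1.0 / (total_distance(actual, desired) + 1)
```

For example, with desired ranks `{"a": 1, "b": 2, "c": 3}` and actual ranks `{"a": 2, "b": 1, "c": 3}`, two pages are each off by one position, so ∑D is 2. This version assumes every desired page appears in the actual results.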
E. Fitness Function 2
+100 penalty for pages that don’t appear
-10 reward for pages with a perfect fit
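One way the penalty and reward might combine with the ranking distance is sketched below. The slide states only the two constants, so the way they are folded into a single score (lower is better), the name `score_2`, and the page-to-rank dictionaries are assumptions.

```python
def score_2(actual, desired, missing_penalty=100, perfect_reward=10):
    """Distance-style score, lower is better (hypothetical combination)."""
    total = 0
    for page, want in desired.items():
        got = actual.get(page)
        if got is None:
            total += missing_penalty   # page does not appear in the results
        elif got == want:
            total -= perfect_reward    # page sits exactly where desired
        else:
            total += abs(got - want)   # ordinary ranking distance
    return total
```

With desired ranks `{"a": 1, "b": 2, "c": 3}` and actual ranks `{"a": 1, "b": 3}`, page a earns the reward, page b is off by one, and page c is missing.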
IV. Preliminary Results
12 tests using fitness function #2
• 1 realistic set of desired rankings
• 11 random sets
4 tests obtained perfect rankings
4 improved rankings, but did not achieve optimal
4 tests showed no improvement
IV. Preliminary Results
[Figure: Distance (axis 2.5 to 7.5) versus Generation Number (axis 0 to 100). Series: Htdig default weights; Fitness Function #2]
IV. Preliminary Results
[Figure: Fitness Value (axis 0 to 0.4) versus Generation Number (axis 0 to 175). Series: Fitness Function #2; Htdig default weights]
V. Future Work – Fitness Function 3
Levenshtein Distance
D = string 1; A = string 2
Construct an m x n matrix (M) where m = |D|+1 and n = |A|+1
M[i,0] = i and M[0,j] = j
For each remaining cell:
• If D[i] == A[j] then cost = 0
• If D[i] != A[j] then cost = 1
• M[i,j] = MIN {a, b, c} where
  a = M[i-1,j] + 1
  b = M[i,j-1] + 1
  c = M[i-1,j-1] + cost
Distance = M[|D|,|A|] (the last cell of the matrix)

Example matrix for FARM and FROM (distance = 2):
      F  A  R  M
   0  1  2  3  4
F  1  0  1  2  3
R  2  1  1  1  2
O  3  2  2  2  2
M  4  3  3  3  2
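The matrix construction described on this slide can be written as a short function. This is a straightforward sketch of the standard dynamic-programming method; the function name `levenshtein` and the zero-based string indexing are choices made for the example.

```python
def levenshtein(d, a):
    """Levenshtein distance between strings d and a via the full matrix."""
    rows, cols = len(d) + 1, len(a) + 1
    m = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        m[i][0] = i          # first column: delete i characters
    for j in range(cols):
        m[0][j] = j          # first row: insert j characters
    for i in range(1, rows):
        for j in range(1, cols):
            # Python strings are 0-indexed, hence the i-1 / j-1 lookups.
            cost = 0 if d[i - 1] == a[j - 1] else 1
            m[i][j] = min(m[i - 1][j] + 1,         # deletion
                          m[i][j - 1] + 1,         # insertion
                          m[i - 1][j - 1] + cost)  # match or substitution
    return m[rows - 1][cols - 1]
```

Calling `levenshtein("FARM", "FROM")` fills a 5 x 5 matrix and returns 2.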
V. Future Work – Fitness Function 3
Levenshtein Distance
Reduce the url comparison to string comparison
Experiment further using LD as a fitness function
• Sigmoid weighting function to increase the importance of the front of the string
F A R M
↓
www.url.com/index.htm
www.url.com/ga.htm
www.url.com/seo.htm
www.url.com/etc.htm
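A sketch of how a ranked list of URLs might be reduced to a string so that two rankings can be compared by Levenshtein distance. The letter assignment scheme, the helper names, and the 26-URL limit are assumptions made for illustration; the sigmoid weighting mentioned above is not included.

```python
import string


def encode_rankings(desired_urls, actual_urls):
    """Map each distinct URL to a letter and return both rankings as strings."""
    symbols = {}

    def sym(url):
        # Hypothetical scheme: letters A, B, C, ... in order of first
        # appearance, which limits this sketch to 26 distinct URLs.
        if url not in symbols:
            symbols[url] = string.ascii_uppercase[len(symbols)]
        return symbols[url]

    d = "".join(sym(u) for u in desired_urls)
    a = "".join(sym(u) for u in actual_urls)
    return d, a


def levenshtein(d, a):
    """Levenshtein distance, keeping only the previous matrix row."""
    prev = list(range(len(a) + 1))
    for i, dc in enumerate(d, start=1):
        cur = [i]
        for j, ac in enumerate(a, start=1):
            cost = 0 if dc == ac else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


desired = ["www.url.com/index.htm", "www.url.com/ga.htm",
           "www.url.com/seo.htm", "www.url.com/etc.htm"]
actual = ["www.url.com/index.htm", "www.url.com/seo.htm",
          "www.url.com/ga.htm", "www.url.com/etc.htm"]
d, a = encode_rankings(desired, actual)
distance = levenshtein(d, a)
```

Here the desired ranking becomes "ABCD" and the actual ranking "ACBD"; swapping two adjacent pages costs two edits.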
V. Future Work
Create more extensive test sets
• dare.com, studentaid.ed.gov, fafsa.ed.gov, americorps.org
V. Future Work
For pages that still do not rank properly, create optimization suggestions
Use custom meta tags to properly rank outliers
Use implicit user feedback to find the desired rankings
VI. Summary
Proof of concept
Testing on real world websites will strengthen results and open other areas of study.
VII. Questions
Thanks for attending
Any questions?