
Searching by Authority

Artificial Intelligence

CMSC 25000

February 12, 2008

“A Conversation with Students”

• Speaker: Bill Gates

• Title: Bill Gates Unplugged: On Software, Innovation, Entrepreneurship, and Giving Back

• Date: February 20, 2008

• Tickets: By lottery

• http://studentactivities.uchicago.edu/billgates

Authoritative Sources

• Based on vector space alone, what would you expect to get searching for “search engine”?
– Would you expect to get Google?

Conferring Authority

• Authorities rarely link to each other
– Competition

• Hubs:
– Relevant sites point to prominent sites on topic
• Often not prominent themselves
• Professional or amateur

• Good Hubs → Good Authorities

Google’s PageRank

• Identifies authorities
– Important pages are those pointed to by many other pages
• Better pointers, higher rank

– Ranks search results

– t: page pointing to A; C(t): number of outbound links of t
– d: damping measure

– Actual ranking on logarithmic scale
– Iterate

$pr(A) = (1 - d) + d\left(\frac{pr(t_1)}{C(t_1)} + \cdots + \frac{pr(t_n)}{C(t_n)}\right)$
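As a concrete illustration, here is a minimal Python sketch of this iteration; the dict-based graph encoding and all names are my own assumptions, and dangling pages (no outbound links) are ignored for simplicity:

    # Minimal iterative PageRank sketch; graph encoding assumed, not the lecture's.
    def pagerank(links, d=0.85, iterations=50):
        """links: dict mapping each page to the list of pages it points to."""
        pages = list(links)
        pr = {p: 1.0 for p in pages}  # initial rank for every page
        for _ in range(iterations):   # iterate toward a fixed point
            # pr(A) = (1 - d) + d * sum over pages t linking to A of pr(t)/C(t)
            pr = {p: (1 - d) + d * sum(pr[t] / len(links[t])
                                       for t in pages if p in links[t])
                  for p in pages}
        return pr

    # Tiny example: A and B both point to C; C points back to A.
    print(pagerank({"A": ["C"], "B": ["C"], "C": ["A"]}))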

Contrasts

• Internal links
– Large sites carry more weight

• If well-designed

– H&A ignores site-internals

• Outbound links explicitly penalized

• Lots of tweaks….

Web Search

• Search by content
– Vector space model
• Word-based representation
• “Aboutness” and “Surprise”
• Enhancing matches
• Simple learning model

• Search by structure
– Authorities identified by link structure of web
• Hubs confer authority
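For the content side, a minimal sketch of vector-space scoring, reading “aboutness” as term frequency and “surprise” as inverse document frequency; the toy documents, tokenization, and weighting details are my assumptions, not the lecture’s exact scheme:

    import math
    from collections import Counter

    docs = ["the search engine crawls the web",
            "google is a popular search engine",
            "decision trees learn rules from examples"]

    def tfidf(tokens, df, n_docs):
        tf = Counter(tokens)
        # tf ~ "aboutness"; log(n/df) ~ "surprise" (rarer words weigh more)
        return {w: tf[w] * math.log(n_docs / df[w]) for w in tf}

    def cosine(u, v):
        dot = sum(u[w] * v.get(w, 0.0) for w in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    tokens = [d.split() for d in docs]
    df = Counter(w for t in tokens for w in set(t))  # document frequency
    vecs = [tfidf(t, df, len(docs)) for t in tokens]
    query = tfidf("search engine".split(), df, len(docs))
    print([round(cosine(query, v), 3) for v in vecs])  # third doc scores 0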

Medical Decision Making
Learning: Decision Trees

Artificial Intelligence

CMSC 25000

February 12, 2008

Agenda

• Decision Trees:
– Motivation: Medical Experts: Mycin
– Basic characteristics
– Sunburn example
– From trees to rules
– Learning by minimizing heterogeneity
– Analysis: Pros & Cons

Expert Systems

• Classic example of classical AI
– Narrow but very deep knowledge of a field
• E.g. diagnosis of bacterial infections

– Manual knowledge engineering
• Elicit detailed information from human experts

Expert Systems

• Knowledge representation
– If-then rules
• Antecedent: conjunction of conditions
• Consequent: conclusion to be drawn
– Axioms: initial set of assertions

• Reasoning process
– Forward chaining: from assertions and rules, generate new assertions
– Backward chaining: from rules and goal assertions, derive evidence supporting the assertion
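A minimal forward-chaining sketch in Python; the rule format and the medical facts below are invented for illustration, not Mycin’s actual rules:

    # Rules are (set_of_antecedents, consequent) pairs; fire any rule whose
    # conditions all hold and whose conclusion is new, until nothing changes.
    def forward_chain(facts, rules):
        facts = set(facts)
        changed = True
        while changed:
            changed = False
            for antecedents, consequent in rules:
                if antecedents <= facts and consequent not in facts:
                    facts.add(consequent)
                    changed = True
        return facts

    rules = [({"fever", "gram_negative"}, "suspect_bacteremia"),
             ({"suspect_bacteremia", "immunocompromised"}, "broad_spectrum_rx")]
    print(forward_chain({"fever", "gram_negative", "immunocompromised"}, rules))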

Medical Expert Systems: Mycin

• Mycin:
– Rule-based expert system
– Diagnosis of blood infections
– 450 rules: roughly expert-level, better than junior MDs
– Rules acquired by extensive expert interviews
• Captures some elements of uncertainty

Medical Expert Systems: Issues

• Works well, but…
– Only diagnoses blood infections
• NARROW

– Requires extensive expert interviews
• EXPENSIVE to develop

– Difficult to update, can’t handle new cases
• BRITTLE

Modern AI Approach

• Machine learning
– Learn diagnostic rules from examples
– Use a general learning mechanism
– Integrate new rules, less elicitation

• Decision Trees
– Learn rules
– Duplicate MYCIN-style diagnosis
• Automatically acquired
• Readily interpretable (cf. Neural Nets / Nearest Neighbor)

Learning: Identification Trees

• (aka Decision Trees)

• Supervised learning

• Primarily classification

• Rectangular decision boundaries
– More restrictive than nearest neighbor

• Robust to irrelevant attributes, noise

• Fast prediction

Sunburn Example

Name Hair Height Weight Lotion Result

Sarah Blonde Average Light No Burn

Dana Blonde Tall Average Yes None

Alex Brown Short Average Yes None

Annie Blonde Short Average No Burn

Emily Red Average Heavy No Burn

Pete Brown Tall Heavy No None

John Brown Average Heavy No None

Katie Blonde Short Light Yes None
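For use in the sketches below, here is the table encoded as a small Python dataset; the encoding and helper names are mine, the values are from the slide:

    SUNBURN = [
        # (name, hair, height, weight, lotion, result)
        ("Sarah", "Blonde", "Average", "Light",   "No",  "Burn"),
        ("Dana",  "Blonde", "Tall",    "Average", "Yes", "None"),
        ("Alex",  "Brown",  "Short",   "Average", "Yes", "None"),
        ("Annie", "Blonde", "Short",   "Average", "No",  "Burn"),
        ("Emily", "Red",    "Average", "Heavy",   "No",  "Burn"),
        ("Pete",  "Brown",  "Tall",    "Heavy",   "No",  "None"),
        ("John",  "Brown",  "Average", "Heavy",   "No",  "None"),
        ("Katie", "Blonde", "Short",   "Light",   "Yes", "None"),
    ]
    FEATURES = {"hair": 1, "height": 2, "weight": 3, "lotion": 4}  # column indices
    LABEL = 5  # index of the Burn/None result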

Learning about Sunburn

• Goal:
– Train on labeled examples
– Predict Burn/None for new instances

• Solution??
– Exact match: same features, same output
• Problem: 2 × 3³ = 54 feature combinations
• Could be much worse

– Nearest-neighbor style
• Problem: What’s close? Which features matter?
• Many match on two features but differ on result

Learning about Sunburn

• Better solution:
– Identification tree
– Training:
• Divide examples into subsets based on feature tests
• Sets of samples at leaves define the classification
– Prediction:
• Route the NEW instance through the tree to a leaf based on feature tests
• Assign the same value as the samples at that leaf

Sunburn Identification Tree

Hair Color?
– Blonde → Lotion Used?
  – No → Sarah: Burn; Annie: Burn
  – Yes → Dana: None; Katie: None
– Red → Emily: Burn
– Brown → Alex: None; John: None; Pete: None

Simplicity

• Occam’s Razor:
– The simplest explanation that covers the data is best

• Occam’s Razor for ID trees:
– The smallest tree consistent with the samples will be the best predictor for new data

• Problem:
– Finding all trees & finding the smallest: expensive!

• Solution:
– Greedily build a small tree

Building ID Trees

• Goal: Build a small tree such that all samples at leaves have same class

• Greedy solution:
– At each node, pick the test whose branches are closest to having the same class
• Split into subsets with least “disorder”

– (Disorder ~ Entropy)

– Find test that minimizes disorder

Minimizing Disorder

Hair Color:
– Blonde: Sarah: B, Dana: N, Annie: B, Katie: N
– Red: Emily: B
– Brown: Alex: N, Pete: N, John: N

Height:
– Short: Alex: N, Annie: B, Katie: N
– Average: Sarah: B, Emily: B, John: N
– Tall: Dana: N, Pete: N

Weight:
– Light: Sarah: B, Katie: N
– Average: Dana: N, Alex: N, Annie: B
– Heavy: Emily: B, Pete: N, John: N

Lotion:
– No: Sarah: B, Annie: B, Emily: B, Pete: N, John: N
– Yes: Dana: N, Alex: N, Katie: N

Minimizing Disorder

(Splitting the Blonde subset: Sarah, Dana, Annie, Katie)

Height:
– Short: Annie: B, Katie: N
– Average: Sarah: B
– Tall: Dana: N

Weight:
– Light: Sarah: B, Katie: N
– Average: Dana: N, Annie: B
– Heavy: (none)

Lotion:
– No: Sarah: B, Annie: B
– Yes: Dana: N, Katie: N

Measuring Disorder

• Problem:
– In general, tests on large DBs don’t yield homogeneous subsets

• Solution:
– A general information-theoretic measure of disorder
– Desired features:
• Homogeneous set: least disorder = 0
• Even split: most disorder = 1

Measuring Entropy

• If split m objects into 2 bins size m1 & m2, what is the entropy?

$\text{Entropy} = -\frac{m_1}{m}\log_2\frac{m_1}{m} - \frac{m_2}{m}\log_2\frac{m_2}{m} = \sum_i -\frac{m_i}{m}\log_2\frac{m_i}{m}$

[Plot: disorder as a function of m1/m, rising from 0 at m1/m = 0 to a maximum of 1 at m1/m = 0.5 and falling back to 0 at m1/m = 1]

Measuring Disorder: Entropy

• Entropy (disorder) of a split:

$\text{Entropy} = \sum_i -p_i \log_2 p_i$

– $p_i = m_i / m$: the probability of being in bin $i$
– $\sum_i p_i = 1$, with $0 \le p_i \le 1$
– Assume $0 \log_2 0 = 0$

p1 = ½, p2 = ½: -½ log2 ½ - ½ log2 ½ = ½ + ½ = 1
p1 = ¼, p2 = ¾: -¼ log2 ¼ - ¾ log2 ¾ = 0.5 + 0.311 = 0.811
p1 = 1, p2 = 0: -1 log2 1 - 0 log2 0 = 0 - 0 = 0
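A minimal Python version of this entropy measure, reproducing the values above; the helper name is mine:

    import math

    def entropy(counts):
        """Entropy of a class distribution given as an iterable of counts."""
        total = sum(counts)
        return sum(-c / total * math.log2(c / total)  # 0 log 0 treated as 0
                   for c in counts if c > 0)

    print(entropy([1, 1]))  # even split -> 1.0
    print(entropy([1, 3]))  # 1/4 vs 3/4 -> ~0.811
    print(entropy([4, 0]))  # homogeneous -> 0.0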

Computing Disorder

$\text{AvgDisorder} = \sum_{i=1}^{k} \frac{n_i}{n_t} \sum_{c \in \text{classes}} -\frac{n_{i,c}}{n_i} \log_2 \frac{n_{i,c}}{n_i}$

– $n_i / n_t$: fraction of the samples going down branch $i$
– inner sum: disorder of the class distribution on branch $i$

[Diagram: $N$ instances split into Branch 1 and Branch 2; branch 1 holds $N_{1a}$ instances of class a and $N_{1b}$ of class b, branch 2 holds $N_{2a}$ and $N_{2b}$]
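A sketch of this weighted average in Python, reusing entropy() and the SUNBURN encoding from the earlier sketches; the helper names are mine:

    from collections import Counter, defaultdict

    def avg_disorder(samples, feature_index):
        """Fraction down each branch times the entropy of that branch's labels."""
        branches = defaultdict(list)
        for row in samples:
            branches[row[feature_index]].append(row[LABEL])
        total = len(samples)
        return sum(len(labels) / total * entropy(Counter(labels).values())
                   for labels in branches.values())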

Entropy in Sunburn Example

$\text{AvgDisorder} = \sum_{i=1}^{k} \frac{n_i}{n_t} \sum_{c \in \text{classes}} -\frac{n_{i,c}}{n_i} \log_2 \frac{n_{i,c}}{n_i}$

Hair color = 4/8 (-2/4 log2 2/4 - 2/4 log2 2/4) + 1/8 * 0 + 3/8 * 0 = 0.5
Height = 0.69
Weight = 0.94
Lotion = 0.61
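These numbers can be checked with the helpers sketched above:

    for name, idx in FEATURES.items():
        print(name, round(avg_disorder(SUNBURN, idx), 2))
    # -> hair 0.5, height 0.69, weight 0.94, lotion 0.61: test hair color first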

Entropy in Sunburn Example

$\text{AvgDisorder} = \sum_{i=1}^{k} \frac{n_i}{n_t} \sum_{c \in \text{classes}} -\frac{n_{i,c}}{n_i} \log_2 \frac{n_{i,c}}{n_i}$

Within the Blonde branch (4 samples):

Height = 2/4 (-1/2 log2 1/2 - 1/2 log2 1/2) + 1/4 * 0 + 1/4 * 0 = 0.5
Weight = 2/4 (-1/2 log2 1/2 - 1/2 log2 1/2) + 2/4 (-1/2 log2 1/2 - 1/2 log2 1/2) = 1
Lotion = 0
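The same check inside the Blonde branch, the one remaining mixed leaf:

    blonde = [row for row in SUNBURN if row[FEATURES["hair"]] == "Blonde"]
    for name, idx in FEATURES.items():
        if name != "hair":
            print(name, round(avg_disorder(blonde, idx), 2))
    # -> height 0.5, weight 1.0, lotion 0.0: test lotion next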

Building ID Trees with Disorder

• Until each leaf is as homogeneous as possible:
– Select an inhomogeneous leaf node
– Replace that leaf node by a test node creating the subsets with least average disorder

• Effectively creates a set of rectangular regions
– Repeatedly draws lines in different axes
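Putting the pieces together, a minimal greedy builder over the SUNBURN data; the recursive structure and names are mine, but the selection criterion is the least-average-disorder test from the slides:

    def build_tree(samples, features):
        labels = {row[LABEL] for row in samples}
        if len(labels) <= 1 or not features:  # homogeneous leaf (or no tests left)
            return labels.pop() if labels else None
        # greedy step: pick the test whose branches have least average disorder
        best = min(features, key=lambda f: avg_disorder(samples, features[f]))
        idx = features[best]
        rest = {f: i for f, i in features.items() if f != best}
        return (best, {value: build_tree([r for r in samples if r[idx] == value], rest)
                       for value in {row[idx] for row in samples}})

    print(build_tree(SUNBURN, FEATURES))
    # splits on hair color, then (for Blonde) on lotion, matching the tree above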

Features in ID Trees: Pros

• Feature selection:
– Tests features that yield low disorder

• E.g. selects features that are important!

– Ignores irrelevant features

• Feature type handling:
– Discrete type: 1 branch per value
– Continuous type: branch on >= value

• Need to search to find best breakpoint

• Absent features: Distribute uniformly

Features in ID Trees: Cons

• Features
– Assumed independent
– If a group effect is wanted, it must be modeled explicitly

• E.g. make new feature AorB

• Feature tests conjunctive

From Trees to Rules

• Tree:
– Branches from root to leaves = tests => classifications
– Tests = if-antecedents; leaf labels = consequents
– All ID trees -> rules; not all rule sets -> trees

From ID Trees to Rules

Hair Color?
– Blonde → Lotion Used?
  – No → Sarah: Burn; Annie: Burn
  – Yes → Dana: None; Katie: None
– Red → Emily: Burn
– Brown → Alex: None; John: None; Pete: None

(if (equal haircolor blonde) (equal lotionused yes) (then None))
(if (equal haircolor blonde) (equal lotionused no) (then Burn))
(if (equal haircolor red) (then Burn))
(if (equal haircolor brown) (then None))
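The same path-to-rule mapping can be sketched mechanically over the tree built earlier; the output format loosely mirrors the slide’s Lisp-style rules:

    def tree_to_rules(tree, conditions=()):
        if not isinstance(tree, tuple):  # leaf: one rule per root-to-leaf path
            tests = " ".join(f"(equal {f} {v.lower()})" for f, v in conditions)
            return [f"(if {tests} (then {tree}))"]
        feature, branches = tree
        return [rule
                for value, subtree in branches.items()
                for rule in tree_to_rules(subtree, conditions + ((feature, value),))]

    for rule in tree_to_rules(build_tree(SUNBURN, FEATURES)):
        print(rule)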

Identification Trees

• Train:
– Build the tree by forming subsets of least disorder

• Predict:
– Traverse the tree based on feature tests
– Assign the leaf node’s sample label

• Pros: robust to irrelevant features and some noise, fast prediction, perspicuous rule reading

• Cons: poor at feature combination and feature dependency; building the optimal tree is intractable

C4.5 vs Mycin

• C4.5: decision tree implementation

• Learning diagnosis:
– Trains on symptom set + diagnosis for blood infections (like Mycin)
– Constructs decision trees/rules
– Classification accuracy comparable to Mycin

• Diagnosis training requires only records
– Automatically manages rule ranking
– Automatically extracts expert-type rules