
1

2-Layered HMMs for Search Interface Segmentation

Ritu Khare (under the supervision of Dr. Yuan An, Assistant Professor, iSchool)

2

Order of Presentation

Background
  Deep Web
  What is Search Interface Understanding?
  What is Interface Segmentation?
  Why is Segmentation Challenging?

Our Approach for Segmentation
  Interface Representation
  HMM: The Artificial Designer
  2-Layered Approach
  Architecture

Experimentation
  Parameters
  Results

Contributions

Future Work

References

3

Background: Deep Web

What is the deep Web?

The data that exists on the Web but is not returned by search engines through traditional crawling and indexing. The primary way to access this data is by filling in HTML forms on search interfaces.

Characteristics [6]: a large proportion of structured databases; diversity of domains; and growing scale.

Researchers have many goals for the deep Web:

design intra-domain meta-search engines [22, 8, 15, 5, 21];

increase content visibility on existing search engines [17, 12];

derive ontologies from search interfaces [1].

A prerequisite for attaining these goals is an understanding of the search interfaces (slide 4). In this project, we propose an approach to address the segmentation (slide 5) portion of the problem of search interface understanding.

4

Background: What is Search Interface Understanding?

Understanding the semantics of a search interface (shown in the figure) is an intricate process [4]. It involves 4 stages.

1. Representation: A suitable interface representation scheme is chosen, and the semantic labels (slide 8) to be assigned to interface components are decided. An interface component is any text or HTML form element (textbox, textarea, selection list, radiobutton, checkbox, file input) that exists inside an HTML form.

2. Parsing: Components are parsed into a suitable structure.

3. Segmentation: The interface components are assigned semantic labels, and related components are grouped together. Questions like “Which surrounding text is associated with which form element?” are also answered in this stage. (In figure 2, “Gene ID” is associated with the textbox placed next to it.)

4. Segment-processing: Additional information about each segment component, such as domain, constraints, and data type, is extracted.
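
The parsing stage can be illustrated with a minimal sketch, assuming Python's standard `html.parser` module rather than whatever parser the project actually used. It walks a form in document order (which matches a depth-first traversal of the DOM) and emits one entry per interface component. The class name `ComponentParser`, the `INPUT_KINDS` mapping, and the sample form are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser

# Map <input type=...> values to the component names listed on this slide.
INPUT_KINDS = {"text": "textbox", "radio": "radiobutton",
               "checkbox": "checkbox", "file": "file input"}


class ComponentParser(HTMLParser):
    """Collects (kind, detail) pairs for text and form elements inside <form>."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.control_depth = 0  # > 0 while inside <select> or <textarea>
        self.components = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.in_form = True
        elif not self.in_form:
            return
        elif tag == "input":
            kind = INPUT_KINDS.get((attrs.get("type") or "text").lower())
            if kind:  # submit buttons, hidden fields, etc. are skipped
                self.components.append((kind, attrs.get("name") or ""))
        elif tag in ("select", "textarea"):
            self.control_depth += 1
            kind = "selection list" if tag == "select" else "textarea"
            self.components.append((kind, attrs.get("name") or ""))

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False
        elif tag in ("select", "textarea") and self.control_depth:
            self.control_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Option text inside a selection list is not a standalone component.
        if self.in_form and self.control_depth == 0 and text:
            self.components.append(("text", text))


# Hypothetical form, loosely modeled on the "Gene ID" / "Gene Name" example.
SAMPLE = """
<form>
  Gene ID <input type="text" name="gene_id">
  Gene Name
  <select name="op"><option>contains</option><option>equals</option></select>
  <input type="text" name="gene_name">
  <input type="submit" value="Search">
</form>
"""

parser = ComponentParser()
parser.feed(SAMPLE)
parser.close()
components = parser.components
for component in components:
    print(component)
```

The resulting flat sequence of components is the kind of observation sequence that the segmentation stage would then label and group.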

5

What is Interface Segmentation?

This project focuses on segmentation, the 3rd stage of this process. The figure shows a segmented interface in which related components are grouped together.

The left segment has 7 components. The right segment has 4 components (“cM Position:”, a selection list, a textbox, and “e.g., “10.0-40.0””).

6

Why is Segmentation Challenging?

From a user’s (or designer’s) standpoint: by looking at the visual arrangement of components, and based on past experience, the user creates a logical boundary around related components because they appear to belong to the same atomic query.

A machine, on the other hand, is unable to “see” a segment for the following reasons:

Components that are visually close to each other might be located very far apart in the HTML source code.

A machine does not implicitly have any search experience that can be leveraged to identify a segment boundary.

This project aims to investigate whether a machine can “learn” how to understand and segment an interface.

Existing works have two shortcomings:

Some [9, 13, 17] do not group all related components together, i.e., they do not create complete segments.

Others [23, 7] use rules and heuristics to segment a search interface. These techniques have problems handling scalability and heterogeneity [10].

7

Our Approach for Segmentation

We incorporate the first-hand implicit knowledge that a human designer is assumed to have used in designing an interface.

This is accomplished by building an artificial designer using Hidden Markov Models (refer to week 9’s slides for an introduction to HMMs).

We view segmentation as a two-fold problem:

Identification of the boundaries of logical attributes (slide 9).

Assignment of semantic labels (attribute-name, operator, and operand, described in slide 9) to interface components.

8

Interface Representation

In the figure, each component of the lower segment is marked with a label, which we term a semantic label.

The semantic label for a particular component denotes the meaning of the component from a user’s or designer’s standpoint.

[Figure labels: Search Entity; Logical Attribute; Logical Attribute; Attribute-name; Operator; Operand.]

9

Interface Representation

Attribute-name: Denotes a criterion available for searching a particular entity, e.g., the entity “Genes” can be searched by “Gene ID” and by “Gene Name”.

Operand: An attribute-name is usually associated with operand(s), the value(s) entered by the user that is (are) matched against the corresponding field value(s) in the underlying database.

Operator: The user may also be given the option of specifying an operator that further qualifies an operand.

Filling in an HTML form is similar to writing SQL queries. Assuming the underlying database table is named “Gene”, the SQL queries for the figure would be:

SELECT * FROM Gene WHERE Gene_ID = 'PF11_0344';
SELECT * FROM Gene WHERE Gene_Name LIKE 'maggie';

Logical Attribute: The predicate in the WHERE clause of each query is created by a group of related components. We combine the semantic roles (attribute-name, operator(s), and operand(s)) of these components to create a composite semantic label called a logical attribute. Our approach assumes that a segment corresponds to a logical attribute.
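
The correspondence between a logical attribute and a WHERE predicate can be sketched in Python as follows. This is only an illustration under stated assumptions: the `LogicalAttribute` dataclass, the `to_query` helper, the operator whitelist, and the single in-memory row are hypothetical and not part of the original work. The table and column names follow the SQL on this slide.

```python
import sqlite3
from dataclasses import dataclass

# Operators this sketch accepts (an assumption; only "=" and LIKE appear on the slide).
ALLOWED_OPERATORS = {"=", "LIKE"}


@dataclass
class LogicalAttribute:
    attribute_name: str  # column behind the label, e.g. "Gene_ID"
    operator: str        # "=" when the interface offers no operator choice
    operand: str         # value entered by the user


def to_query(table, attr):
    """Build a parameterized SELECT whose predicate is one logical attribute."""
    if attr.operator not in ALLOWED_OPERATORS:
        raise ValueError(f"unsupported operator: {attr.operator}")
    # Identifiers are interpolated for brevity; the operand is bound as a parameter.
    sql = f"SELECT * FROM {table} WHERE {attr.attribute_name} {attr.operator} ?"
    return sql, (attr.operand,)


# Toy database with one made-up row, only so the queries can run.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Gene (Gene_ID TEXT, Gene_Name TEXT)")
conn.execute("INSERT INTO Gene VALUES ('PF11_0344', 'maggie')")

by_id = LogicalAttribute("Gene_ID", "=", "PF11_0344")
by_name = LogicalAttribute("Gene_Name", "LIKE", "maggie")
rows_by_id = conn.execute(*to_query("Gene", by_id)).fetchall()
rows_by_name = conn.execute(*to_query("Gene", by_name)).fetchall()
print(to_query("Gene", by_id)[0])
print(to_query("Gene", by_name)[0])
```

Each segment of the interface maps to one `LogicalAttribute` instance in this picture, which is why recovering segments is equivalent to recovering the query predicates.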

10

HMM: The Artificial Designer

We assume that an HMM can act like a human designer who is able to design an interface using acquired knowledge and to determine (decode) the segment boundaries and semantic labels of components.

The designing process is similar to statistically choosing one component from a bag of components (a superset of all possible components) and placing it on the Web page while keeping the semantic role (attribute-name, operand, or operator) of the component in mind.
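
The designing process described above can be sketched as sampling from an HMM: pick a semantic role, draw a component for that role from the bag, then move to the next role. All probabilities below are invented for illustration and are not the trained values from the project. The names `START`, `TRANS`, `EMIT`, `draw`, and `design` are hypothetical.

```python
import random

# Toy parameters (assumptions, not trained values).
START = {"attribute-name": 1.0, "operator": 0.0, "operand": 0.0}
TRANS = {
    "attribute-name": {"attribute-name": 0.0, "operator": 0.3, "operand": 0.7},
    "operator": {"attribute-name": 0.0, "operator": 0.0, "operand": 1.0},
    "operand": {"attribute-name": 0.6, "operator": 0.0, "operand": 0.4},
}
# The "bag of components": which component types each role may place on the page.
EMIT = {
    "attribute-name": {"text": 1.0},
    "operator": {"radiobutton group": 0.5, "selection list": 0.5},
    "operand": {"textbox": 0.8, "selection list": 0.2},
}


def draw(dist, rng):
    """Draw one key from a {key: probability} dictionary."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]


def design(n, rng):
    """Generate n (semantic role, component) pairs, one per placed component."""
    state = draw(START, rng)
    layout = []
    for _ in range(n):
        layout.append((state, draw(EMIT[state], rng)))
        state = draw(TRANS[state], rng)
    return layout


layout = design(6, random.Random(0))
for role, component in layout:
    print(f"{role:15s} -> {component}")
```

A machine looking at the finished page sees only the second column; decoding is the task of recovering the first.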

[Figure: the 2-layered HMM (artificial designer) draws on a bag of components and knowledge of semantic labels. Designing produces a search interface; decoding a search interface yields segments and tagged components.]

11

HMM: The Artificial Designer

While the components are observable, their semantic roles appear hidden to a machine. The succession of one semantic label by another is similar to the transitioning of HMM states. In the figure, ovals are states (semantic labels) and rectangles are emitted symbols (components).

The designing ability is provided by training the HMM with suitable algorithms. Once an HMM is trained, it can be used for decoding, i.e., for explaining the design of a given search interface.

[Figure: states (ovals): Attribute Name, Operand, Operator, Attribute Name, Operand. Emitted symbols (rectangles): Text (Gene ID), Textbox, Text (Gene Name), RadioButton Group, Textbox.]
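
For reference, the standard first-order HMM factorization (in the notation of Rabiner's tutorial [16]) makes the analogy between label succession and state transitions concrete. The symbols are not used on the slides and are introduced here as assumptions: $s_t$ is the hidden semantic label of the $t$-th component, $o_t$ is the observed component, $\pi$ is the initial state distribution, $a$ holds the transition probabilities, and $b$ holds the emission probabilities.

```latex
P(o_1,\dots,o_T,\; s_1,\dots,s_T)
  \;=\; \pi_{s_1}\, b_{s_1}(o_1) \prod_{t=2}^{T} a_{s_{t-1} s_t}\, b_{s_t}(o_t)
```

Decoding then amounts to finding the label sequence $s_1,\dots,s_T$ that maximizes this quantity for the observed components.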

12

2-Layered HMM

The decoding problem that we address in this paper is two-fold, involving segmentation as well as the assignment of semantic labels to components. Hence, we employ a layered HMM [14] with 2 layers.

The first layer, T-HMM, tags each component with an appropriate semantic label (attribute-name, operator, or operand).

The second layer, S-HMM, segments the interface into logical attributes.
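
The layering can be sketched as two Viterbi passes, where the label sequence predicted by the first layer becomes the observation sequence of the second. Everything numeric below is a made-up toy parameter, and the S-HMM state set ("B" for the first component of a logical attribute, "I" for a component inside one) is an assumption of this sketch. The slides do not specify the S-HMM states. The T-HMM here is also limited to the three labels named above.

```python
def viterbi(obs, states, start, trans, emit):
    """Return the most probable state path for obs under a discrete HMM."""
    # best[t][s] = (probability of the best path ending in s at step t, previous state)
    best = [{s: (start[s] * emit[s].get(obs[0], 0.0), None) for s in states}]
    for o in obs[1:]:
        row = {}
        for s in states:
            prob, prev = max((best[-1][p][0] * trans[p][s], p) for p in states)
            row[s] = (prob * emit[s].get(o, 0.0), prev)
        best.append(row)
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for row in reversed(best[1:]):  # follow back-pointers to the first step
        state = row[state][1]
        path.append(state)
    return path[::-1]


# Layer 1 (T-HMM): component types -> semantic labels. Toy numbers.
T_STATES = ["attribute-name", "operator", "operand"]
T_START = {"attribute-name": 0.8, "operator": 0.1, "operand": 0.1}
T_TRANS = {
    "attribute-name": {"attribute-name": 0.1, "operator": 0.3, "operand": 0.6},
    "operator": {"attribute-name": 0.05, "operator": 0.05, "operand": 0.9},
    "operand": {"attribute-name": 0.6, "operator": 0.1, "operand": 0.3},
}
T_EMIT = {
    "attribute-name": {"text": 0.95, "selection list": 0.05},
    "operator": {"radiobutton group": 0.6, "selection list": 0.4},
    "operand": {"textbox": 0.7, "selection list": 0.2, "text": 0.1},
}

# Layer 2 (S-HMM): semantic labels -> segment marks. States are hypothetical.
S_STATES = ["B", "I"]
S_START = {"B": 1.0, "I": 0.0}
S_TRANS = {"B": {"B": 0.1, "I": 0.9}, "I": {"B": 0.5, "I": 0.5}}
S_EMIT = {
    "B": {"attribute-name": 0.9, "operand": 0.1},
    "I": {"attribute-name": 0.1, "operator": 0.3, "operand": 0.6},
}

components = ["text", "textbox", "text", "radiobutton group", "textbox"]
tags = viterbi(components, T_STATES, T_START, T_TRANS, T_EMIT)
marks = viterbi(tags, S_STATES, S_START, S_TRANS, S_EMIT)

# Start a new segment at every "B" mark.
segments = []
for component, tag, mark in zip(components, tags, marks):
    if mark == "B" or not segments:
        segments.append([])
    segments[-1].append((component, tag))

for segment in segments:
    print(segment)
```

With these toy parameters the five components fall into two segments, mirroring the "Gene ID" and "Gene Name" logical attributes. One consequence of the layering is that a tagging error in the first pass propagates into the second.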

13

2-Layered Approach Architecture

[Figure: architecture. Training and test interfaces pass through DOM-tree parsing. T-HMM layer: T-HMM training on manually tagged state sequences produces the T-HMM specs; T-HMM testing on test interfaces produces predicted state sequences. S-HMM layer: S-HMM training on manually tagged state sequences produces the S-HMM specs; S-HMM testing on test interfaces produces predicted state sequences.]

14

Experimentation Parameters

Data set: 200 interfaces (NAR collection), http://www3.oup.co.uk/nar/database/c/

Parsing: DOM-trees [3] of components. Trees were traversed in depth-first order.

Training and testing data: The examples were randomly divided into 20 equal-sized sets. We conducted 20 experiments, each with 190 training and 10 testing examples.

Training and testing algorithms: In both layers, training and testing were performed using the maximum likelihood method and the Viterbi algorithm, respectively.
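
When state sequences are manually tagged, maximum likelihood training of a discrete HMM reduces to counting and normalizing. The sketch below shows that idea on two invented training sequences. The function names `normalize` and `train_ml` and the toy data are hypothetical and do not come from the NAR data set.

```python
from collections import Counter, defaultdict


def normalize(counter):
    """Turn raw counts into relative frequencies."""
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}


def train_ml(tagged_sequences):
    """Estimate start, transition, and emission probabilities by counting.

    Each sequence is a list of (component, state) pairs in interface order.
    """
    start = Counter()
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    for seq in tagged_sequences:
        if not seq:
            continue
        start[seq[0][1]] += 1
        for component, state in seq:
            emit[state][component] += 1
        for (_, a), (_, b) in zip(seq, seq[1:]):
            trans[a][b] += 1
    return (normalize(start),
            {s: normalize(c) for s, c in trans.items()},
            {s: normalize(c) for s, c in emit.items()})


# Two made-up, manually tagged interfaces.
training = [
    [("text", "attribute-name"), ("textbox", "operand")],
    [("text", "attribute-name"), ("radiobutton group", "operator"),
     ("textbox", "operand"), ("text", "attribute-name"),
     ("selection list", "operand")],
]

start, trans, emit = train_ml(training)
print(start)
print(trans)
print(emit)
```

Any state-component pair that never occurs in the training data gets no entry, i.e., an emission probability of zero, which is the issue the Future Work slide raises.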

15

Results

Semantic label | Total | Correctly identified | Accuracy (%) | Reason for misidentification
Segment | 459 | 395 | 86.05 | Misidentification of one of the states by the T-HMM.
Operator | 47 | 40 | 85.10 | An operator state was mistaken for one of the other 3 states.
Operand | 429 | 423 | 98.60 | Misidentified as an operator state.
Attribute-name | 546 | 492 | 90.109 | In the rest of the cases, attribute-name was misidentified as text-trivial.
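
The accuracy column follows from the two count columns as correctly identified divided by total, times 100. A short check of that arithmetic, using only the counts reported above:

```python
# (total, correctly identified) per semantic label, copied from the results table.
results = {
    "Segment": (459, 395),
    "Operator": (47, 40),
    "Operand": (429, 423),
    "Attribute-name": (546, 492),
}

accuracy = {label: 100 * correct / total
            for label, (total, correct) in results.items()}

for label, value in accuracy.items():
    print(f"{label}: {value:.2f}%")
```

Rounded to two decimals this gives 86.06, 85.11, 98.60, and 90.11, each within 0.01 of the reported figures, which appear to be truncated rather than rounded.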

16

Contributions

We studied a challenging stage (segmentation) of the process of search interface understanding. In the context of the deep Web, this is the third formal empirical study (after [23] and [7]) that groups components belonging to the same logical attribute together.

We incorporated the first-hand knowledge of the designer for interface segmentation and component tagging. To the best of our knowledge, this is the first work to apply HMMs to deep Web search interfaces.

The interface has been represented in terms of the underlying database, which helped in extracting database querying semantics.

Moreover, we tested our method on a less-explored domain (biology) and found promising results.

17

Future Work

To recover the schema of deep Web databases by extracting finer details, such as the data types and constraints of logical attributes.

To do justice to the balanced domain distribution of the deep Web [6], we want to test this method on interfaces from other less-explored domains.

To improve the degree of automation, we want to investigate the use of the Baum-Welch training algorithm.

To minimize zero emission probabilities, we want to investigate the use of Synset-HMM [20].

18

References

1. Benslimane, S. M., Malki, M., Rahmouni, M. K., & Benslimane, D. (2007). Extracting personalised ontology from data-intensive web application: An HTML forms-based reverse engineering approach. Informatica, 18(4), 511-534.

2. Freitag, D., & McCallum, A. K. (1999). Information extraction with HMMs and shrinkage. AAAI-99 Workshop on Machine Learning for Information Extraction, Orlando, Florida. 31-36.

3. Gupta, S., Kaiser, G. E., Grimm, P., Chiang, M. F., & Starren, J. (2005). Automating content extraction of HTML documents. World Wide Web, 8(2), 179-224.

4. Halevy, A. Y. (2005). Why your data won't mix: Semantic heterogeneity. Queue, 3, 50-58.

5. He, B., & Chang, K. C. (2003). Statistical schema matching across web query interfaces. 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California. 217-228.

6. He, B., Patel, M., Zhang, Z., & Chang, K. C. (2007a). Accessing the deep web. Communications of the ACM, 50(5), 94-101.

7. He, H., Meng, W., Lu, Y., Yu, C., & Wu, Z. (2007b). Towards deeper understanding of the search interfaces of the deep web. World Wide Web, 10(2), 133-155.

8. He, H., Meng, W., Yu, C., & Wu, Z. (2004). Automatic integration of web search interfaces with WISE-Integrator. The VLDB Journal: The International Journal on Very Large Data Bases, 13(3), 256-273.

9. Kaljuvee, O., Buyukkokten, O., Garcia-Molina, H., & Paepcke, A. (2001). Efficient web form entry on PDAs. Proceedings of the 10th International Conference on World Wide Web, Hong Kong, Hong Kong.

10. Kushmerick, N. (2002). Finite-state approaches to web information extraction. 3rd Summer Convention on Information Extraction, 77-91.

11. Kushmerick, N. (2003). Learning to invoke web forms. On the move to meaningful internet systems 2003 (pp. 997-1013). Springer Berlin / Heidelberg.

12. Madhavan, J., Ko, D., Kot, L., Ganapathy, V., Rasmussen, A., & Halevy, A. Y. (2008). Google's deep web crawl. Proceedings of the VLDB Endowment, 1(2), 1241-1252.

19

References

13. Nguyen, H., Nguyen, T., & Freire, J. (2008). Learning to extract form labels. Proceedings of the VLDB Endowment, Auckland, New Zealand, 1(1), 684-694.

14. Oliver, N., Garg, A., & Horvitz, E. (2004). Layered representations for learning and inferring office activity from multiple sensory channels. Computer Vision and Image Understanding, 96(2), 163-180.

15. Pei, J., Hong, J., & Bell, D. (2006). A robust approach to schema matching over web query interfaces. Proceedings of the 22nd International Conference on Data Engineering Workshops (ICDEW'06), Atlanta, Georgia. 46-55.

16. Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257-286.

17. Raghavan, S., & Garcia-Molina, H. (2001). Crawling the hidden web. Proceedings of the 27th International Conference on Very Large Data Bases, Rome, Italy. 129-138.

18. Russell, S. J., & Norvig, P. (2002). Artificial intelligence: A modern approach. Prentice Hall.

19. Seymore, K., McCallum, A. K., & Rosenfeld, R. (1999). Learning hidden Markov model structure for information extraction. AAAI-99 Workshop on Machine Learning for Information Extraction, Orlando, Florida. 37-42.

20. Tran-Le, M. S., Vo-Dang, T. T., Ho-Van, Q., & Dang, T. K. (2008). Automatic information extraction from the web: An HMM-based approach. Modeling, simulation and optimization of complex processes (pp. 575-585). Springer Berlin Heidelberg.

21. Wang, J., Wen, J., Lochovsky, F., & Ma, W. (2004). Instance-based schema matching for web databases by domain-specific query probing. Thirtieth International Conference on Very Large Data Bases, 30, 408-419.

22. Wu, W., Yu, C., Doan, A., & Meng, W. (2004). An interactive clustering-based approach to integrating source query interfaces on the deep web. Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, Paris, France. 95-106.

23. Zhang, Z., He, B., & Chang, K. C. (2004). Understanding web query interfaces: Best-effort parsing with hidden syntax. Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, Paris, France. 107-118.

24. Zhong, P., & Chen, J. (2006). A generalized hidden Markov model approach for web information extraction. Web Intelligence, 2006. WI 2006, Hong Kong, China. 709-718.

20

Thank You

Questions, Comments, Ideas???