1
2-Layered HMMs for Search Interface Segmentation
Ritu Khare (under the supervision of Dr. Yuan An, Assistant Professor, iSchool)
2
Order of Presentation
Background
  Deep Web
  What is Search Interface Understanding?
  What is Interface Segmentation?
  Why is Segmentation Challenging?
Our Approach for Segmentation
  Interface Representation
  HMM: The Artificial Designer
  2-Layered Approach
  Architecture
Experimentation
  Parameters
  Results
Contributions
Future Work
References
3
Background: Deep Web
What is the Deep Web: The data that exists on the Web but is not returned by search engines through traditional crawling and indexing. The primary way to access this data is by filling in HTML forms on search interfaces.
Characteristics [6]: a large proportion of structured databases; diversity of domains; and growing scale.
Researchers have many goals for the deep Web: designing intra-domain meta-search engines [22, 8, 15, 5, 21], increasing content visibility on existing search engines [17, 12], and deriving ontologies from search interfaces [1].
A prerequisite for attaining these goals is an understanding of the search interfaces (slide 4). In this project, we propose an approach to the segmentation portion (slide 5) of the search interface understanding problem.
4
Background: What is Search Interface Understanding?
Understanding the semantics of a search interface (shown in the figure) is an intricate process [4]. It involves 4 stages.
1. Representation: A suitable interface representation scheme is chosen, and the semantic labels (slide 8) to be assigned to interface components are decided. An interface component is any text or HTML form element (textbox, textarea, selection list, radiobutton, checkbox, file input) that exists inside an HTML form.
2. Parsing: Components are parsed into a suitable structure.
3. Segmentation: The interface components are assigned semantic labels, and related components are grouped together. Questions such as “Which surrounding text is associated with which form element?” are also answered in this stage (in figure 2, “Gene ID” is associated with the textbox placed next to it).
4. Segment-processing: Additional information about each segment component, such as domain, constraints, and data type, is extracted.
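The representation and parsing stages can be illustrated with a minimal sketch that lists the components of an HTML form in document order (equivalent to a depth-first, pre-order walk of the DOM tree). This is not the parser used in the project: the class name `ComponentExtractor`, the helper `extract_components`, the sample form, and the choice to ignore option text and submit buttons are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Input types counted as components, mapped to the names used on the slide.
INPUT_KINDS = {"text": "textbox", "radio": "radiobutton",
               "checkbox": "checkbox", "file": "file input"}
OTHER_KINDS = {"select": "selection list", "textarea": "textarea"}


class ComponentExtractor(HTMLParser):
    """Collects texts and form elements that appear inside <form> tags."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.ignore_text = 0   # > 0 while inside <option> or <textarea>
        self.components = []   # (kind, value) pairs in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.in_form = True
            return
        if not self.in_form:
            return
        if tag == "input":
            kind = INPUT_KINDS.get((attrs.get("type") or "text").lower())
            if kind:  # skips submit buttons, hidden fields, etc.
                self.components.append((kind, attrs.get("name", "")))
        elif tag in OTHER_KINDS:
            self.components.append((OTHER_KINDS[tag], attrs.get("name", "")))
        if tag in ("option", "textarea"):
            self.ignore_text += 1  # their text is a value, not a label

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False
        elif tag in ("option", "textarea") and self.ignore_text:
            self.ignore_text -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.in_form and not self.ignore_text and text:
            self.components.append(("text", text))


def extract_components(html):
    parser = ComponentExtractor()
    parser.feed(html)
    parser.close()
    return parser.components


# Hypothetical form, loosely modeled on the "Gene ID" / "Gene Name" example.
SAMPLE = """
<form action="/search">
  Gene ID <input type="text" name="gene_id">
  Gene Name
  <select name="match"><option>contains</option><option>equals</option></select>
  <input type="text" name="gene_name">
  <input type="submit" value="Search">
</form>
"""

components = extract_components(SAMPLE)
for kind, value in components:
    print(kind, repr(value))
```

The resulting flat sequence of components is the kind of input that the segmentation stage then has to label and group.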
5
What is Interface Segmentation?
This project focuses on segmentation, the 3rd stage of this process. The figure shows a segmented interface in which the related components are grouped together.
The left segment has 7 components. The right segment has 4 components (“cM Position:”, a selection list, a textbox, and “e.g., “10.0-40.0””).
6
Why is Segmentation Challenging?
A user (or designer), looking at the visual arrangement of components and drawing on past experience, creates a logical boundary around related components because they appear to belong to the same atomic query.
A machine, on the other hand, is unable to “see” a segment for the following reasons:
Components that are visually close to each other might be located very far apart in the HTML source code.
A machine does not implicitly have any search experience that can be leveraged to identify a segment boundary.
This project aims to investigate whether a machine can “learn” how to understand and segment an interface.
Existing works have two shortcomings:
Some [9, 13, 17] do not group all related components together, i.e. they do not create complete segments.
Others [23, 7] use rules and heuristics to segment a search interface. These techniques have problems with scalability and heterogeneity [10].
7
Our Approach for Segmentation
We incorporate the first-hand implicit knowledge with which a human designer is assumed to have designed an interface.
This is accomplished by building an artificial designer using Hidden Markov Models (refer to week 9’s slides on HMM introduction).
We view segmentation as a two-fold problem:
Identification of the boundaries of logical attributes (slide 9).
Assignment of semantic labels (attribute-name, operator, and operand, described in slide 9) to interface components.
8
Interface Representation
In the figure, each component of the lower segment is marked with a label, which we term a semantic label.
The semantic label of a component denotes the meaning of the component from a user’s or designer’s standpoint.
[Figure labels: Search Entity; Logical Attribute; Logical Attribute; Attribute-name; Operator; Operand]
9
Interface Representation
Attribute-name: Denotes the criteria available for searching a particular entity, e.g. the entity “Genes” can be searched by “Gene ID” and by “Gene Name”.
Operand: An attribute-name is usually associated with operand(s), the value(s) entered by the user and matched against the corresponding field value(s) in the underlying database.
Operator: The user may also be given the option of specifying an operator that further qualifies an operand.
Filling in an HTML form is similar to writing SQL queries. Assuming the underlying database table is named “Gene”, the SQL queries for the figure would be:
SELECT * FROM Gene WHERE Gene_ID = 'PF11_0344';
SELECT * FROM Gene WHERE Gene_Name LIKE 'maggie';
Logical Attribute: The predicate in the WHERE clause of each query is created by a group of related components. We combine the semantic roles (attribute-name, operator(s), and operand(s)) of these components to create a composite semantic label called logical attribute. Our approach assumes that a segment corresponds to a logical attribute.
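The correspondence between a logical attribute and a WHERE predicate can be sketched as a small data structure. The names `LogicalAttribute`, `to_predicate`, and `to_query` are hypothetical, and mapping a label such as “Gene ID” to the column `Gene_ID` by replacing spaces is an assumption; the string building is for illustration only, and real code should use parameterized queries.

```python
from dataclasses import dataclass


@dataclass
class LogicalAttribute:
    attribute_name: str  # label text, e.g. "Gene ID"
    operator: str        # e.g. "=" or "LIKE"; "=" when the form offers no choice
    operand: str         # value entered or selected by the user


def to_predicate(attr):
    """Render one logical attribute as a WHERE-clause predicate."""
    column = attr.attribute_name.strip().replace(" ", "_")
    value = attr.operand.replace("'", "''")  # escape single quotes
    return f"{column} {attr.operator} '{value}'"


def to_query(table, attr):
    return f"SELECT * FROM {table} WHERE {to_predicate(attr)};"


gene_id = LogicalAttribute("Gene ID", "=", "PF11_0344")
gene_name = LogicalAttribute("Gene Name", "LIKE", "maggie")
print(to_query("Gene", gene_id))
print(to_query("Gene", gene_name))
```

Each segment on the interface thus maps to exactly one such object, which is the assumption the approach relies on.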
10
HMM: The Artificial Designer
We assume that an HMM can act like a human designer who is able to design an interface using acquired knowledge and to determine (decode) the segment boundaries and semantic labels of components.
The designing process is similar to statistically choosing one component from a bag of components (a superset of all possible components) and placing it on the Web page while keeping the semantic role (attribute-name, operand, or operator) of the component in mind.
[Figure labels: Bag of Components; Knowledge of Semantic Labels; 2-Layered HMM (Artificial Designer); Designing; Decoding; Search Interface; Segments & Tagged Components]
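The "designing" direction can be sketched as sampling from an HMM: pick a semantic role, draw a component for that role from the bag, then move to the next role. The probabilities below are made-up toy values, not parameters learned in the project, and the function name `sample_interface` is hypothetical.

```python
import random


def sample_interface(start_p, trans_p, emit_p, length, rng):
    """Generate `length` (semantic role, component) pairs by walking the HMM."""
    def draw(dist):
        outcomes = list(dist)
        weights = [dist[o] for o in outcomes]
        return rng.choices(outcomes, weights=weights)[0]

    state = draw(start_p)
    design = []
    for _ in range(length):
        design.append((state, draw(emit_p[state])))  # place a component
        state = draw(trans_p[state])                 # choose the next role
    return design


# Toy parameters (assumed for illustration only).
start_p = {"attribute-name": 1.0}
trans_p = {
    "attribute-name": {"operator": 0.3, "operand": 0.7},
    "operator": {"operand": 1.0},
    "operand": {"attribute-name": 0.8, "operand": 0.2},
}
emit_p = {
    "attribute-name": {"text": 1.0},
    "operator": {"radiobutton group": 0.5, "selection list": 0.5},
    "operand": {"textbox": 0.7, "selection list": 0.3},
}

design = sample_interface(start_p, trans_p, emit_p, 6, random.Random(0))
for role, component in design:
    print(f"{component:18s} <- {role}")
```

Decoding runs the same model in the opposite direction: the components are given and the roles have to be inferred.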
11
HMM: The Artificial Designer
While the components are observable, their semantic roles are hidden from a machine. The succession of one semantic label by another is similar to the transitioning of HMM states. In the figure, ovals are states (semantic labels) and rectangles are emitted symbols (components).
The designing ability is provided by training the HMM with suitable algorithms. Once an HMM is trained, it can be used for decoding, i.e. for explaining the design of a given search interface.
[Figure: states (ovals): Attribute Name, Operand, Operator; emitted symbols (rectangles): Text (Gene ID), Textbox, Text (Gene Name), RadioButton Group, Textbox]
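When manually tagged sequences are available, one suitable training algorithm is maximum likelihood estimation, which reduces to counting and normalizing. The sketch below shows the idea under that assumption; the function names and the two tiny training sequences are invented for illustration and are not the project's data.

```python
from collections import Counter, defaultdict


def normalize(counter):
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}


def train_ml(tagged_sequences):
    """Estimate start, transition, and emission probabilities by counting.

    Each sequence is a list of (state, component) pairs.
    """
    start = Counter()
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    for seq in tagged_sequences:
        states = [state for state, _ in seq]
        start[states[0]] += 1
        for state, component in seq:
            emit[state][component] += 1
        for prev, nxt in zip(states, states[1:]):
            trans[prev][nxt] += 1
    return (normalize(start),
            {s: normalize(c) for s, c in trans.items()},
            {s: normalize(c) for s, c in emit.items()})


# Two hypothetical manually tagged interfaces.
training = [
    [("attribute-name", "text"), ("operand", "textbox"),
     ("attribute-name", "text"), ("operator", "radiobutton group"),
     ("operand", "textbox")],
    [("attribute-name", "text"), ("operand", "selection list")],
]

start_p, trans_p, emit_p = train_ml(training)
print(trans_p["attribute-name"])
print(emit_p["operand"])
```

A component never seen with a given state gets no entry, that is, an emission probability of zero.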
12
2-Layered HMM
The decoding problem that we address in this paper is two-fold, involving segmentation as well as the assignment of semantic labels to components. Hence, we employ a layered HMM [14] with 2 layers.
The first layer, T-HMM, tags each component with an appropriate semantic label (attribute-name, operator, or operand).
The second layer, S-HMM, segments the interface into logical attributes.
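How the outputs of the two layers combine can be sketched as follows. The slides do not list the states of S-HMM, so the per-component boundary labels `"begin"` and `"inside"` used here are an assumption, as are the function name `group_segments` and the example sequences.

```python
def group_segments(components, tags, boundaries):
    """Group tagged components into segments (logical attributes).

    components: component descriptions, in interface order
    tags:       semantic label per component (first-layer output)
    boundaries: "begin" or "inside" per component (assumed second-layer output)
    """
    segments = []
    for component, tag, boundary in zip(components, tags, boundaries):
        if boundary == "begin" or not segments:
            segments.append([])  # open a new logical attribute
        segments[-1].append((component, tag))
    return segments


components = ["Gene ID", "textbox", "Gene Name", "radiobutton group", "textbox"]
tags = ["attribute-name", "operand", "attribute-name", "operator", "operand"]
boundaries = ["begin", "inside", "begin", "inside", "inside"]

segments = group_segments(components, tags, boundaries)
for number, segment in enumerate(segments, start=1):
    print(number, segment)
```

Because the second layer consumes the tags produced by the first, a tagging mistake can propagate into a segmentation mistake.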
13
2-Layered Approach Architecture
[Figure: training and test interfaces pass through DOM-tree parsing. T-HMM layer: T-HMM training on manually tagged state sequences produces the T-HMM specs; T-HMM testing on test interfaces produces predicted state sequences. S-HMM layer: S-HMM training on manually tagged state sequences produces the S-HMM specs; S-HMM testing on test interfaces produces predicted state sequences.]
14
Experimentation: Parameters
Data-Set: 200 interfaces (NAR collection), http://www3.oup.co.uk/nar/database/c/
Parsing: DOM-trees [3] of components. Trees were traversed in depth-first search order.
Testing and Training Data: The examples were randomly divided into 20 equal-sized sets. We conducted 20 experiments, each having 190 training and 10 testing examples.
Testing and Training Algorithms: In both layers, training was performed using the Maximum Likelihood method and testing using the Viterbi algorithm.
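The testing step can be sketched with a textbook Viterbi decoder working in log space. This is a generic implementation, not the project's code, and the model parameters are toy values assumed for illustration.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observed components."""
    def lg(p):
        return math.log(p) if p > 0 else float("-inf")

    # score[t][s]: best log-probability of any path that ends in s at time t
    score = [{s: lg(start_p.get(s, 0)) + lg(emit_p[s].get(obs[0], 0))
              for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in states:
            prev, best = max(
                ((p, score[t - 1][p] + lg(trans_p.get(p, {}).get(s, 0)))
                 for p in states),
                key=lambda pair: pair[1])
            score[t][s] = best + lg(emit_p[s].get(obs[t], 0))
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: score[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Toy parameters (assumed for illustration only).
states = ["attribute-name", "operator", "operand"]
start_p = {"attribute-name": 0.9, "operand": 0.1}
trans_p = {
    "attribute-name": {"operator": 0.3, "operand": 0.7},
    "operator": {"operand": 1.0},
    "operand": {"attribute-name": 0.8, "operand": 0.2},
}
emit_p = {
    "attribute-name": {"text": 1.0},
    "operator": {"radiobutton group": 0.6, "selection list": 0.4},
    "operand": {"textbox": 0.8, "selection list": 0.2},
}

observed = ["text", "textbox", "text", "radiobutton group", "textbox"]
decoded = viterbi(observed, states, start_p, trans_p, emit_p)
print(list(zip(observed, decoded)))
```

With these toy values every component can be emitted by only one state, so the decoded path is forced; with real parameters the decoder trades off transition and emission probabilities.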
15
Results
Semantic label   Total   Correctly identified   Accuracy (%)   Reason for misidentification
Segment          459     395                    86.05          Misidentification of one of the states by T-HMM.
Operator         47      40                     85.10          An operator state was mistaken for one of the other 3 states.
Operand          429     423                    98.60          Misidentified as an operator state.
Attribute-name   546     492                    90.109         In the rest of the cases, attribute-name was misidentified as text-trivial.
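The accuracy column is the share of correctly identified items, 100 × correct / total. The short check below recomputes it from the counts in the table; the reported figures agree with the recomputed ones to within 0.01 percentage points.

```python
# (total, correctly identified, accuracy as reported in the table)
RESULTS = {
    "Segment": (459, 395, 86.05),
    "Operator": (47, 40, 85.10),
    "Operand": (429, 423, 98.60),
    "Attribute-name": (546, 492, 90.109),
}


def accuracy(total, correct):
    """Percentage of items that were identified correctly."""
    return 100.0 * correct / total


for label, (total, correct, reported) in RESULTS.items():
    print(f"{label:15s} {accuracy(total, correct):7.3f}  (reported {reported})")
```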
16
Contributions
We studied a challenging stage (segmentation) of the process of search interface understanding. In the context of the deep Web, this is the third formal empirical study (after [23] and [7]) that groups components belonging to the same logical attribute together.
We incorporated the first-hand knowledge of the designer for interface segmentation and component tagging. To the best of our knowledge, this is the first work to apply HMMs to deep Web search interfaces.
The interface has been represented in terms of the underlying database. This helped in extracting database querying semantics.
Moreover, we tested our method on a less-explored domain (biology), and found promising results.
17
Future Work
To recover the schema of deep Web databases by extracting finer details, such as the data type and constraints of each logical attribute.
To do justice to the balanced domain distribution of the deep Web [6], we want to test this method on interfaces from other less-explored domains.
To improve the degree of automation, we want to investigate the use of the Baum-Welch training algorithm.
To minimize zero emission probabilities, we want to investigate the use of Synset-HMM [20].
18
References
1. Benslimane, S. M., Malki, M., Rahmouni, M. K., & Benslimane, D. (2007). Extracting personalised ontology from data-intensive web application: An HTML forms-based reverse engineering approach. Informatica, 18(4), 511-534.
2. Freitag, D., & McCallum, A. K. (1999). Information extraction with HMMs and shrinkage. AAAI-99 Workshop on Machine Learning for Information Extraction, Orlando, Florida. 31-36.
3. Gupta, S., Kaiser, G. E., Grimm, P., Chiang, M. F., & Starren, J. (2005). Automating content extraction of HTML documents. World Wide Web, 8(2), 179-224.
4. Halevy, A. Y. (2005). Why your data won't mix: Semantic heterogeneity. Queue, 3, 50-58.
5. He, B., & Chang, K. C. (2003). Statistical schema matching across web query interfaces. 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California. 217-228.
6. He, B., Patel, M., Zhang, Z., & Chang, K. C. (2007a). Accessing the deep web. Communications of the ACM, 50(5), 94-101.
7. He, H., Meng, W., Lu, Y., Yu, C., & Wu, Z. (2007b). Towards deeper understanding of the search interfaces of the deep web. World Wide Web, 10(2), 133-155.
8. He, H., Meng, W., Yu, C., & Wu, Z. (2004). Automatic integration of web search interfaces with WISE-integrator. The VLDB Journal, 13(3), 256-273.
9. Kaljuvee, O., Buyukkokten, O., Garcia-Molina, H., & Paepcke, A. (2001). Efficient web form entry on PDAs. Proceedings of the 10th International Conference on World Wide Web, Hong Kong, Hong Kong.
10. Kushmerick, N. (2002). Finite-state approaches to web information extraction. 3rd Summer Convention on Information Extraction, 77-91.
11. Kushmerick, N. (2003). Learning to invoke web forms. On the move to meaningful internet systems 2003 (pp. 997-1013). Springer Berlin / Heidelberg.
12. Madhavan, J., Ko, D., Kot, L., Ganapathy, V., Rasmussen, A., & Halevy, A. Y. (2008). Google's deep web crawl. Proceedings of the VLDB Endowment, 1(2), 1241-1252.
19
References
13. Nguyen, H., Nguyen, T., & Freire, J. (2008). Learning to extract form labels. Proceedings of the VLDB Endowment, Auckland, New Zealand, 1(1), 684-694.
14. Oliver, N., Garg, A., & Horvitz, E. (2004). Layered representations for learning and inferring office activity from multiple sensory channels. Computer Vision and Image Understanding, 96(2), 163-180.
15. Pei, J., Hong, J., & Bell, D. (2006). A robust approach to schema matching over web query interfaces. Proceedings of the 22nd International Conference on Data Engineering Workshops (ICDEW'06), Atlanta, Georgia. 46-55.
16. Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257-286.
17. Raghavan, S., & Garcia-Molina, H. (2001). Crawling the hidden web. Proceedings of the 27th International Conference on Very Large Data Bases, Rome, Italy. 129-138.
18. Russell, S. J., & Norvig, P. (2002). Artificial intelligence: A modern approach. Prentice Hall.
19. Seymore, K., McCallum, A. K., & Rosenfeld, R. (1999). Learning hidden Markov model structure for information extraction. AAAI-99 Workshop on Machine Learning for Information Extraction, Orlando, Florida. 37-42.
20. Tran-Le, M. S., Vo-Dang, T. T., Ho-Van, Q., & Dang, T. K. (2008). Automatic information extraction from the web: An HMM-based approach. Modeling, simulation and optimization of complex processes (pp. 575-585). Springer Berlin Heidelberg.
21. Wang, J., Wen, J., Lochovsky, F., & Ma, W. (2004). Instance-based schema matching for web databases by domain-specific query probing. Thirtieth International Conference on Very Large Data Bases, 30, 408-419.
22. Wu, W., Yu, C., Doan, A., & Meng, W. (2004). An interactive clustering-based approach to integrating source query interfaces on the deep web. Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, Paris, France. 95-106.
23. Zhang, Z., He, B., & Chang, K. C. (2004). Understanding web query interfaces: Best-effort parsing with hidden syntax. Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, Paris, France. 107-118.
24. Zhong, P., & Chen, J. (2006). A generalized hidden Markov model approach for web information extraction. Web Intelligence, 2006 (WI 2006), Hong Kong, China. 709-718.
20
Thank You
Questions, Comments, Ideas???