Data Mining for XML Query (Repaired)


CHAPTER 1 INTRODUCTION

1.1 OVERVIEW

In recent years the database research field has concentrated on XML (eXtensible Markup Language) as a flexible hierarchical model suitable to represent huge amounts of data with no absolute and fixed schema, and a possibly irregular and incomplete structure. There are two main approaches to XML document access: keyword-based search and query-answering. The first comes from the tradition of information retrieval, where most searches are performed on the textual content of the document; this means that no advantage is derived from the semantics conveyed by the document structure. As for query-answering, since query languages for semistructured data rely on the document structure to convey its semantics, users need to know this structure in advance for query formulation to be effective, which is often not the case. In fact, it is not mandatory for an XML document to have a defined schema: 50% of the documents on the web do not possess one. When users specify queries without knowing the document structure, they may fail to retrieve information which was there, but under a different structure. This limitation is a crucial problem which did not emerge in the context of relational database management systems.

Frequent, dramatic outcomes of this situation are either the information overload problem, where too much data is included in the answer because the set of keywords specified for the search captures too many meanings, or the information deprivation problem, where either the use of inappropriate keywords or the wrong formulation of the query prevents the user from receiving the correct answer. This paper addresses the need to get the gist of the document before querying it, both in terms of content and structure.
Discovering recurrent patterns inside XML documents provides high-quality knowledge about the document content. Frequent patterns are in fact intensional information about the data contained in the document itself; that is, they specify the document in terms of a set of properties rather than by means of data. As opposed to the detailed and precise information conveyed by the data, this information is partial and often approximate, but synthetic, and concerns both the document structure and its content. In particular, the idea of mining association rules to provide summarized representations of XML documents has been investigated in many proposals, either by using languages (e.g. XQuery) and techniques developed in the XML context, or by implementing graph- or tree-based algorithms.

In this paper I introduce a proposal for mining and storing TARs (Tree-based Association Rules) as a means to represent intensional knowledge in native XML. Intuitively, a TAR represents intensional knowledge in the form SB → SH, where SB is the body tree and SH the head tree of the rule, and SB is a subtree of SH. The rule SB → SH states that, if the tree SB appears in an XML document D, it is likely that the wider tree SH also appears in D. Graphically, we render the nodes of the body of a rule by means of black circles, and the nodes of the head by empty circles. TARs can be queried to obtain fast, although approximate, answers. This is particularly useful not only when quick answers are needed but also when the original documents are unavailable. In fact, once extracted, TARs can be stored in a (smaller) document and be accessed independently of the dataset they were extracted from.
Summarizing, TARs are extracted for two main purposes: to get a concise idea of the gist of both the structure and the content of an XML document, and to use them for intensional query answering, that is, allowing the user to query the extracted TARs rather than the original document. By querying such summaries, investigators obtain initial knowledge about specific entities in the vast dataset, and are able to devise more specific queries for deeper investigation. An important side-effect of using such a technique is that only the most promising specific queries are issued towards the integrated data, dramatically reducing time and cost.

1.2 PROBLEM DEFINITION

For query-answering, since query languages for semistructured data rely on the document structure to convey its semantics, users need to know this structure in advance for query formulation to be effective, which is often not the case. In fact, it is not mandatory for an XML document to have a defined schema: 50% of the documents on the web do not possess one. When users specify queries without knowing the document structure, they may fail to retrieve information which was there, but under a different structure. This limitation is a crucial problem which did not emerge in the context of relational database management systems. Frequent, dramatic outcomes of this situation are either the information overload problem, where too much data is included in the answer because the set of keywords specified for the search captures too many meanings, or the information deprivation problem, where either the use of inappropriate keywords or the wrong formulation of the query prevents the user from receiving the correct answer.

1.3 OBJECTIVE

This project provides a method for deriving intensional knowledge from XML documents in the form of TARs, and then storing these TARs as an alternative, synthetic dataset to be queried for providing quick and summarized answers. The procedure is characterized by the following key aspects:
a) It works directly on the XML documents, without transforming the data into any intermediate format;
b) It looks for general association rules, without the need to impose what should be contained in the antecedent and consequent of the rule;
c) It stores association rules in XML format; and
d) It translates the queries on the original dataset into queries on the TARs set.
The aim of our proposal is to provide a way to use intensional knowledge as a substitute for the original document during querying, and not to improve the execution time of queries over the original XML dataset.

CHAPTER 2 LITERATURE SURVEY

Literature survey is the most important step in the software development process. Before developing the tool it is necessary to determine the time factor, economy and company strength. Once these things are satisfied, the next steps are to determine which operating system and language can be used for developing the tool. Once the programmers start building the tool, they need a lot of external support. This support can be obtained from senior programmers, from books or from websites.

2.1 Answering XML queries by means of data summaries

E. Baralis, P. Garza, E. Quintarelli, and L. Tanca, published in 2007. The idea of using association rules as summarized representations of XML documents was also introduced in this work, which is based on the extraction of rules both on the structure (schema patterns) and on the content (instance patterns) of XML datasets. The limitations of this approach are: i) the root of the rule is established a priori, and ii) the patterns, used to describe general properties of the schema applying to all instances, are not mined, but derived as an abstraction of similar instance patterns, and are thus less precise and reliable.

2.2 Relational computation for mining association rules from XML data

H. C. Liu and J. Zeleznikow, published in 2005. It uses a fixpoint operator which works only on the relational format. In this technique, the XML document is first preprocessed to transform it into an object-relational database.

2.3 A new method for mining association rules from a collection of XML documents

J. Paik, H. Y. Youn, and U. M. Kim, published in 2005. An approach to extract association rules, i.e. to mine all frequent rules, without any a-priori knowledge of the XML dataset. The paper introduced HoPS, an algorithm for extracting association rules from a set of XML documents. Such rules are called XML association rules and are implications of the form X → Y, where X and Y are fragments of an XML document. The two trees X and Y have to be disjoint; moreover, both X and Y are embedded subtrees of the XML documents, which means that they do not always represent the actual structure of the data. Another limitation of the proposal is that it does not consider the possibility of mining general association rules within a single XML dataset; achieving this feature is one of our goals.

2.4 Extracting association rules from XML documents using XQuery

J. W. W. Wan and G. Dobbie, published in 2003. They use XQuery to extract association rules from simple XML documents, proposing a set of functions, written in XQuery, which implement the Apriori algorithm. This approach performs well on simple XML documents but is very difficult to apply to complex XML documents with an irregular structure. The PathJoin algorithm was used to find frequent subtrees in XML documents; this algorithm performs exponentially, which is a disadvantage.

2.5 Discovering interesting information in XML data with association rules

D. Braga, A. Campi, S. Ceri, M. Klemettinen, and P. Lanzi, published in 2003. It overcomes the disadvantage of XQuery with data mining and knowledge discovery capabilities by introducing XMINE RULE, an operator for mining association rules from native XML documents. They formalize the syntax and semantics of the operator and propose some examples of complex association rules.

CHAPTER 3 SYSTEM ANALYSIS

3.1 EXISTING SYSTEM

Extracting information from semistructured documents is a very hard task, and is going to become more and more critical as the amount of digital information available on the Internet grows. Indeed, documents are often so large that the dataset returned as the answer to a query may be too big to convey interpretable knowledge. The existing version of the TARs extraction algorithm, introduced in earlier work, was based on PathJoin and CMTreeMiner to mine frequent subtrees from XML documents.

3.2 PROPOSED SYSTEM

This work describes an approach based on Tree-based Association Rules (TARs): mined rules which provide approximate, intensional information on both the structure and the contents of XML documents, and which can be stored in XML format as well. TARs can be queried to obtain fast, although approximate, answers. This is particularly useful not only when quick answers are needed but also when the original documents are unavailable. In fact, once extracted, TARs can be stored in a (smaller) document and be accessed independently of the dataset they were extracted from. Summarizing, TARs are extracted for two main purposes: 1) to get a concise idea of the gist of both the structure and the content of an XML document, and 2) to use them for intensional query answering.

The intensional information embodied in TARs provides valid support in several cases:

(i) When a user faces a dataset for the first time, s/he does not know its features, and frequent patterns provide a way to understand quickly what is contained in the dataset.

(ii) Besides intrinsically unstructured documents, there is a significant portion of XML documents which have some structure, but only implicitly, that is, their structure has not been declared via a DTD or an XML Schema. Since most work on XML query languages has focused on documents having a known structure, querying the above-mentioned documents is quite difficult because users have to guess the structure to specify the query conditions correctly. TARs represent a DataGuide that helps users to be more effective in query formulation.

(iii) It supports query optimization design, first of all because recurrent structures can be used for physical query optimization, to support the construction of indexes and the design of efficient access methods for frequent queries, and also because frequent patterns allow the discovery of hidden integrity constraints, which can be used for semantic optimization.

(iv) For privacy reasons, a document answer might expose a controlled set of TARs instead of the original document, as a summarized view that masks sensitive details.

CHAPTER 4 PROJECT DESCRIPTION

4.1 ARCHITECTURE

Figure 4.1

The purpose of this framework is to perform data mining on XML and obtain intensional knowledge. The intensional knowledge is also in the form of XML: rules with support and confidence. In other words, the result of data mining is TARs (Tree-based Association Rules). The framework provides data mining support for XML query answering. When an XML file is given as input, a DOM parser checks it for well-formedness and validity. If the given XML document is valid, it is parsed and loaded into a DOM object which can be navigated easily. The parsed XML file is given to the data mining subsystem, which is responsible for subtree generation and also TAR extraction. The generated TARs are used by the query processor subsystem. This module takes an XML query from the end user and makes use of the mined knowledge to answer the query quickly.
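The parsing step can be illustrated with a minimal sketch. The report's implementation is .NET-based, so this Python stand-in, with a hypothetical sample document, only shows the idea of loading a well-formed document into a navigable DOM object (note that minidom checks well-formedness only, not validity against a DTD or schema):

```python
from xml.dom.minidom import parseString

# A tiny sample document standing in for the input dataset
# (hypothetical content; the report does not fix a schema).
SAMPLE = "<library><book><title>XML Mining</title></book></library>"

def load_document(xml_text):
    """Parse XML text; parseString raises if the text is not well-formed."""
    return parseString(xml_text)

doc = load_document(SAMPLE)
root = doc.documentElement
print(root.tagName)                                            # library
print(root.getElementsByTagName("title")[0].firstChild.data)   # XML Mining
```

The resulting DOM object can then be handed to the data mining subsystem for subtree generation.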

4.2 TREE-BASED ASSOCIATION RULES

Association rules describe the co-occurrence of data items in a large amount of collected data and are represented as implications of the form X → Y, where X and Y are two arbitrary sets of data items such that X ∩ Y = ∅. The quality of an association rule is measured by means of support and confidence. Support corresponds to the frequency of the set X ∪ Y in the dataset, while confidence corresponds to the conditional probability of finding Y, having found X, and is given by supp(X ∪ Y)/supp(X).

In this work we extend the notion of association rule introduced in the context of relational databases to adapt it to the hierarchical nature of XML documents. Following the Infoset conventions, we represent an XML document as a tree ⟨N, E, r, ℓ, c⟩, where N is the set of nodes, r ∈ N is the root of the tree, E is the set of edges, ℓ : N → L is the label function which returns the tag of nodes, and c : N → C ∪ {⊥} is the content function which returns the content of nodes. We consider the element-only Infoset content model [28], where XML nonterminal tags include only other elements and/or attributes, while the text is confined to terminal elements. We are interested in finding relationships among subtrees of XML documents. Thus, since both the textual content of leaf elements and the values of attributes convey content, we do not distinguish between them. As a consequence, for the sake of readability, we do not report the edge label and the node type label in the figures. Attributes and elements are characterized by empty circles, whereas the textual content of elements, or the value of attributes, is reported under the outgoing edge of the element or attribute it refers to.
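The classical support and confidence definitions above can be checked on a toy transactional dataset (hypothetical data, used only to make the two formulas concrete):

```python
# Transactions stand in for the "collected data"; a rule X -> Y relates
# two disjoint item sets X and Y.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
]

def supp(itemset, data):
    """Support: fraction of transactions containing every item of the set."""
    return sum(itemset <= t for t in data) / len(data)

def conf(body, head, data):
    """Confidence: conditional probability supp(X U Y) / supp(X)."""
    return supp(body | head, data) / supp(body, data)

print(supp({"a", "b"}, transactions))    # 0.5
print(conf({"a"}, {"b"}, transactions))  # supp({a,b})/supp({a}) = 0.5/0.75
```

Here the rule {a} → {b} holds in 2 of the 3 transactions containing a, giving confidence 2/3.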

4.3 FUNDAMENTAL CONCEPTS

Given two trees T = ⟨N_T, E_T, r_T, ℓ_T, c_T⟩ and S = ⟨N_S, E_S, r_S, ℓ_S, c_S⟩, S is an induced subtree of T if and only if there exists a mapping φ : N_S → N_T such that, for each node n_i ∈ N_S with φ(n_i) = n_j, ℓ_S(n_i) = ℓ_T(n_j) and c_S(n_i) = c_T(n_j), and for each edge e = (n_1, n_2) ∈ E_S, (φ(n_1), φ(n_2)) ∈ E_T. Moreover, S is a rooted subtree of T if and only if S is an induced subtree of T and the root of S is mapped to the root of T, i.e. φ(r_S) = r_T. Without loss of generality, we do not consider namespaces, ordering labels, the referencing formalism through ID-IDREF attributes, URIs, links and entity nodes, because they are not relevant to the present work.

Given a tree T = ⟨N_T, E_T, r_T, ℓ_T, c_T⟩, a subtree of T, t = ⟨N_t, E_t, r_t, ℓ_t, c_t⟩, and a user-fixed support threshold smin: (i) t is frequent if its support is greater than or equal to smin; (ii) t is maximal if it is frequent and none of its proper supertrees is frequent; (iii) t is closed if none of its proper supertrees has support greater than that of t.

A Tree-based Association Rule (TAR) is a tuple of the form Tr = ⟨SB, SH, sTr, cTr⟩, where SB = ⟨NB, EB, rB, ℓB, cB⟩ and SH = ⟨NH, EH, rH, ℓH, cH⟩ are trees, and sTr and cTr are real numbers in the interval [0,1] representing the support and confidence of the rule respectively. A TAR describes the co-occurrence of the two trees SB and SH in an XML document. For the sake of readability we shall often use the short notation SB → SH; SB is called the body or antecedent of Tr while SH is the head or consequent of the rule. Furthermore, SB is a subtree of SH with an additional property on the node labels: the set of tags of SB is contained in the set of tags of SH, with the addition of the empty label ε: ℓB(NB) ⊆ ℓH(NH) ∪ {ε}.
The empty label ε is introduced because the body of a rule may contain nodes with unspecified tags, that is, blank nodes. A rooted TAR (RTAR) is a TAR such that SB is a rooted subtree of SH; an extended TAR (ETAR) is a TAR such that SB is an induced subtree of SH. Let count(S, D) denote the number of occurrences of a subtree S in the tree D, and cardinality(D) denote the number of nodes of D. We formally define the support of a TAR SB → SH as count(SH, D)/cardinality(D) and its confidence as count(SH, D)/count(SB, D). Notice that TARs, in addition to associations between data values, also provide information about the structure of frequent portions of XML documents; thus they are more expressive than classical association rules, which only provide frequent correlations of flat values. It is worth pointing out that TARs are different from XML association rules as defined in [24] because, given a rule X → Y where both X and Y are subtrees of an XML document, that paper requires that neither tree is a subtree of the other, i.e. the two trees X and Y have to be disjoint; on the contrary, TARs require X to be an induced subtree of Y.

Given an XML document, we extract two types of TARs. A TAR is a structure TAR (sTAR) if and only if, for each node n contained in SH, cH(n) = ⊥; that is, no data value is present in sTARs, i.e. they provide information only on the structure of the document. A TAR SB → SH is an instance TAR (iTAR) if and only if SH contains at least one node n such that cH(n) ≠ ⊥; that is, iTARs provide information both on the structure and on the data values contained in a document. According to the definitions above we have: structure-Rooted-TARs (sRTARs), structure-Extended-TARs (sETARs), instance-Rooted-TARs (iRTARs) and instance-Extended-TARs (iETARs).
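As a rough sketch, the two measures can be computed for a simple parent/child pattern over a toy document tree (hypothetical tags; a real implementation must count arbitrary induced subtrees, not just two-node patterns):

```python
# A document tree as (tag, children) tuples (hypothetical data):
# a library with three books, two of which have a title.
DOC = ("library",
       [("book", [("title", [])]),
        ("book", [("title", [])]),
        ("book", [])])

def nodes(tree):
    """Yield every (tag, children) node of the tree, root first."""
    tag, children = tree
    yield tree
    for c in children:
        yield from nodes(c)

def count_pattern(tree, parent, child):
    """count(S, D) for the two-node head tree parent -> child."""
    return sum(1 for tag, ch in nodes(tree)
               if tag == parent and any(t == child for t, _ in ch))

def count_tag(tree, tag_name):
    """count(S, D) for the single-node body tree tag_name."""
    return sum(1 for tag, _ in nodes(tree) if tag == tag_name)

cardinality = sum(1 for _ in nodes(DOC))    # number of nodes of D = 6
head = count_pattern(DOC, "book", "title")  # count(SH, D) = 2
body = count_tag(DOC, "book")               # count(SB, D) = 3
print(head / cardinality)                   # support  = 2/6
print(head / body)                          # confidence = 2/3
```

So the rule "a book node has a title child" has support 1/3 and confidence 2/3 on this toy document.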
Since TARs provide an approximate view of both the content and the structure of an XML document, (1) sTARs can be used as an approximate DataGuide of the original document, to help users formulate queries; (2) iTARs can be used to provide intensional, approximate answers to user queries. Consider a sample XML document and some sTARs mined from it, where rules (1) and (3) are rooted sTARs and rule (2) is an extended sTAR. Rule (1) states that, if there is a node labeled A in the document, with 86% probability that node has a child labeled B. Rule (2) states that, if there is a node labeled B, with 75% probability its parent is labeled A. Finally, rule (3) states that, if a node A is the grandparent of a node C, with 75% probability the child of A and parent of C is labeled B. By observing sTARs users can guess the structure of an XML document, and thus use this approximate schema to formulate a query when no DTD or schema is available: as DataGuides, sTARs represent a concise structural summary of XML documents.
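A minimal sketch of answering such a structural question intensionally, assuming mined sTARs are held as simple (parent, child, confidence) triples mirroring rules (1) and (2) above; this flat representation is an illustrative simplification of the tree-shaped rules:

```python
# Mined sTARs as (parent tag, child tag, confidence) triples.
stars = [
    ("A", "B", 0.86),   # rule (1): a node A has a child B with conf. 0.86
    ("B", "C", 0.75),   # hypothetical extra rule for illustration
]

def likely_children(tag, rules, minconf=0.5):
    """Approximate answer: children of `tag` suggested by mined sTARs,
    returned with their confidence, without touching the original document."""
    return [(child, c) for parent, child, c in rules
            if parent == tag and c >= minconf]

print(likely_children("A", stars))   # [('B', 0.86)]
```

The answer is approximate by construction: it reports what is frequent in the document, not what is guaranteed by a schema.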

4.4 ALGORITHM

TAR mining is a process composed of two steps: 1) mining frequent subtrees, that is, subtrees with a support above a user-defined threshold, from the XML document; 2) computing interesting rules, that is, rules with a confidence above a user-defined threshold, from the frequent subtrees.

Algorithm 1 presents our extension to a generic frequent-subtree mining algorithm in order to compute interesting TARs. The inputs of Algorithm 1 are the XML document D, the threshold for the support of the frequent subtrees, minsupp, and the threshold for the confidence of the rules, minconf. Algorithm 1 finds frequent subtrees and then hands each of them over to a function that computes all the possible rules. Depending on the number of frequent subtrees and their cardinality, the amount of rules generated by a naive Compute-Rules function may be very high: given a subtree with n nodes, we could generate 2^n − 2 rules, making the algorithm exponential. This explosion occurs in the relational context too; thus, based on similar considerations, it is possible to state the following property, which allows us to propose the optimized version of Compute-Rules shown in Function 2.
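The strategy of the optimized Compute-Rules can be sketched on a flat-itemset analogue, where induced-subtree checks are replaced by set inclusion; the supports below are hypothetical and the body enumeration runs from the largest bodies down, pruning every body contained in an already-failed one:

```python
from itertools import combinations

def compute_rules(frequent_set, supp, minconf):
    """Flat-set sketch of the optimized Compute-Rules: enumerate rule
    bodies from largest to smallest; once a rule fails the confidence
    test, every body that is a subset of the failed body is skipped."""
    head = frozenset(frequent_set)
    rules, failed = [], []
    for size in range(len(frequent_set) - 1, 0, -1):   # largest bodies first
        for body in combinations(sorted(frequent_set), size):
            body = frozenset(body)
            if any(body <= f for f in failed):         # pruned by the property
                continue
            c = supp[head] / supp[body]                # conf = supp(H)/supp(B)
            if c >= minconf:
                rules.append((body, head, c))
            else:
                failed.append(body)
    return rules

# Hypothetical supports for the frequent set {A, B, C} and its subsets.
supp = {
    frozenset("ABC"): 0.2,
    frozenset("AB"): 0.25, frozenset("AC"): 0.5, frozenset("BC"): 0.3,
    frozenset("A"): 0.6, frozenset("B"): 0.3, frozenset("C"): 0.7,
}
rules = compute_rules(set("ABC"), supp, minconf=0.5)
print(sorted("".join(sorted(b)) for b, _, _ in rules))   # ['AB', 'B', 'BC']
```

Here the body {A, C} fails the confidence test, so the smaller bodies {A} and {C} are never evaluated, exactly the saving the property guarantees.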

If the confidence of a rule SB → SH is below the established threshold minconf, then the confidence of every other rule SBi → SHi, such that its body SBi is an induced subtree of the body SB, is no greater than minconf. Consider a frequent subtree and three possible TARs mined from it, where all three rules have the same support k and confidence to be determined. Let the support of the body tree of rule (1) be s. Since the body trees of rules (2) and (3) are subtrees of the body tree of rule (1), their support is at least s, and possibly higher. This means that the confidences of rules (2) and (3) are equal to, or lower than, the confidence of rule (1).

In Function 2, TARs are mined by generating first the rules with the highest number of nodes in the body tree. Consider two rules Tr1 and Tr2 whose body trees contain one and three nodes respectively, and suppose both rules have confidence below the fixed threshold. If the algorithm considers rule Tr2 first, all rules whose bodies are induced subtrees of the body of Tr2 will be discarded when Tr2 is eliminated. Therefore, it is more convenient to generate rule Tr2 first and, in general, to start the mining process from the rules with a larger body. Using this solution we can lower the complexity of the algorithm, though not enough to make it perform better than exponentially. However, notice that the process of deriving TARs from XML documents is only performed periodically. Since intensional knowledge represents frequent information, it is desirable to update it only after a large number of updates have been made on the original document. Therefore, in the case of stable documents the algorithm has to be applied a few times, or only once.

Once the mining process has finished and frequent TARs have been extracted, they are stored in XML format. This decision has been taken to allow the use of the same language, such as XQuery, for querying both the original dataset and the mined rules.
Each rule is saved inside an element which contains three attributes for the ID, support and confidence of the rule.
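This storage step can be sketched with Python's ElementTree; the element and attribute names used here ("rules", "rule", "id", "support", "confidence") are assumptions for illustration, not names fixed by the report:

```python
import xml.etree.ElementTree as ET

def rules_to_xml(rules):
    """Serialize (support, confidence) pairs as one element per rule,
    with ID, support and confidence stored as attributes."""
    root = ET.Element("rules")
    for i, (support, confidence) in enumerate(rules, start=1):
        ET.SubElement(root, "rule",
                      id=str(i),
                      support=str(support),
                      confidence=str(confidence))
    return ET.tostring(root, encoding="unicode")

xml_text = rules_to_xml([(0.06, 0.75), (0.10, 0.86)])
print(xml_text)
```

Because the result is plain XML, the stored rules can then be queried with the same language (e.g. XQuery) used for the original dataset, as the text above motivates.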

CHAPTER 5 SYSTEM REQUIREMENTS

5.1 SOFTWARE REQUIREMENTS

Language : ASP.NET, C#.NET
Technologies : Microsoft .NET Framework
IDE : Visual Studio 2008
Operating System : Microsoft Windows XP SP2 or later
Backend : XML

5.2 HARDWARE REQUIREMENTS

Processor : Intel Pentium or above
RAM : 512 MB (minimum)
Processor Speed : 3.06 GHz
Hard Disk Drive : 250 GB
Floppy Disk Drive : Sony
CD-ROM Drive : Sony
Monitor : 17 inches
Keyboard : TVS Gold
Mouse : Logitech

CHAPTER 6 MODULE DESCRIPTION

Implementation is the stage of the project when the theoretical design is turned into a working system. Thus it can be considered the most critical stage in achieving a successful new system and in giving the user confidence that the new system will work and be effective. The implementation stage involves careful planning, investigation of the existing system and its constraints on implementation, design of methods to achieve the changeover, and evaluation of changeover methods.

1. Admin
2. User
3. XML Query Answering

6.1 Admin

The admin maintains the total information about the whole application, and maintains the data in XML format only. Only the admin has the authority to create a username and password for new users. This is mainly for the purpose of security, so that no one can misuse the application.

Figure 6.1

We can retrieve data as a whole file, or retrieve only the specific data needed. The module also maintains the table of registered users for further use. The admin has a retrieve module which is used to retrieve the whole XML document; if needed, a specific root node can be retrieved from the whole document.

Figure 6.2

6.2 User

Users submit search queries and receive the reply in XML format. Each user is provided with a unique id and password by the administrator for security purposes. A user can log in with this id and password to accomplish two tasks. In the first, data is created using Create Data: the subnode and all the child nodes are filled with data by the user, and the data is stored in XML format.

Figure 6.3

The application is also designed in such a way that data under the same root node is appended to the same, single XML file. In the second task, the stored XML data is retrieved by root node: the user is asked for a root node, which selects the specific data from the entire XML document.

Figure 6.4

6.3 XML Query Answering

In this project the user searches for information in the semistructured document and receives the reply in XML format only. This module is designed with the aim of providing fast and efficient retrieval of data in XML format. The data is stored in XML as a tree-structured format so that retrieval is fast and efficient.

Figure 6.5

CHAPTER 7 SYSTEM DESIGN

7.1 ARCHITECTURAL DESIGN

Figure 7.1

The purpose of this framework is to perform data mining on XML and obtain intensional knowledge. The intensional knowledge is also in the form of XML: rules with support and confidence. In other words, the result of data mining is TARs (Tree-based Association Rules).

7.2 UML DIAGRAMS

7.2.1 DATA FLOW DIAGRAMS

A data flow diagram is a graphical tool used to describe and analyze the movement of data through a system. These are known as logical data flow diagrams.

ADMIN

Figure 7.2

In this figure, the admin and the user log in using an id and password to access the application.

Figure 7.3

This data flow diagram shows how the admin uses the semistructured document from which TAR rules are extracted and XML data is efficiently retrieved.

USER MODULE

Figure 7.4

This figure shows how the user searches for and retrieves data, and how the searched data is displayed.

Figure 7.5

This figure shows how the user enters an XML query; the system then searches the data to answer the query.

7.2.2 USE CASE DIAGRAM

ADMIN

Figure 7.6

This figure represents users, their operations, and how they are related. Here the admin has the right to enter data and upload it.

Figure 7.7

This figure represents user operations such as login, registration, search and XML query answering. Here users can create and retrieve data.

7.2.3 CLASS DIAGRAM

Figure 7.8

The class diagram shows the class members and class variables. In this figure there are three classes: User, Admin and XML Query Answering.

7.2.4 ER-DIAGRAM

Figure 7.9

This diagram depicts the relationship between the user, the admin and the database. Both the user and the admin can retrieve and view data.

7.2.5 SEQUENCE DIAGRAMS:A sequence diagram is a kind of interaction diagram in UML that shows how processes operate one with another and in what order. A sequence diagram shows, as parallel vertical lines ("lifelines"), different processes or objects that live simultaneously, and, as horizontal arrows, the messages exchanged between them, in the order in which they occur.

Figure 7.10

This diagram depicts the sequential flow between the admin, the user and the database. The admin sends XML data to the database, and the user can search and retrieve data.

7.2.6 COLLABORATION DIAGRAM

Figure 7.11

This diagram shows the operations in collaboration. Here numbering is used to indicate the correct sequence.

CHAPTER 8 TESTING

8.1 SOFTWARE TESTING

Testing is a process of executing a program with the explicit intention of finding errors, that is, making the program fail. Software testing is the process of testing the functionality and correctness of software by running it.

A good test case is one that has a high probability of finding an as-yet-undiscovered error, and a successful test is one that uncovers such an error. Software testing is usually performed for one of two reasons: defect detection or reliability estimation.

8.1.1 Black Box Testing:

Black box testing applies to software systems or modules, testing functionality in terms of inputs and outputs at interfaces. The test reveals whether the software function is fully operational with reference to the requirements specification.

8.1.2 White Box Testing:

White box testing requires knowing the internal workings, i.e., testing whether all internal operations are performed according to the program structures and data structures, and whether all internal components have been adequately exercised.

8.2 Software Testing Strategies

A strategy for software testing will proceed in the following order:
1. Unit testing
2. Integration testing
3. Validation testing
4. System testing

UNIT TESTING: It concentrates on each unit of the software as implemented in source code and is white-box oriented. Using the component-level design description as a guide, important control paths are tested to uncover errors within the boundary of the module. In unit testing, this step can be conducted in parallel for multiple components.

INTEGRATION TESTING: Here the focus is on the design and construction of the software architecture. Integration testing is a systematic technique for constructing the program structure while at the same time conducting tests to uncover errors associated with interfacing. The objective is to take unit-tested components and build a program structure that has been dictated by the design.

VALIDATION TESTING: In this phase, the requirements established as part of software requirements analysis are validated against the software that has been constructed, i.e., validation succeeds when the software functions in a manner that can reasonably be expected by the customer.

CHAPTER 9 CONCLUSION

The main goals we have achieved in this work are:

1) mine all frequent association rules without imposing any a-priori restriction on the structure and the content of the rules;

2) store mined information in XML format;

3) use extracted knowledge to gain information about the original datasets.

We have developed a C++ prototype that has been used to test the effectiveness of our proposal. We have not discussed the updatability of both the document storing TARs and their index.

FUTURE ENHANCEMENT

In the future, I will study how to incrementally update mined TARs when the original XML datasets change, and how to further optimize our mining algorithm. Moreover, for the moment I deal with a (substantial) fragment of XQuery; I would like to find the exact fragment of XQuery which lends itself to translation into intensional queries.

10. APPENDIX

10.1 SNAPSHOTS

Figure 10.1

This is the home page of the project. It has three main modules: admin, user, and retrieve XML data. The Admin button redirects to the administrator page; the User button redirects to the user login page.

Figure 10.2

This figure shows the sign-in to the admin page. It redirects to the admin page, where we can register new users, retrieve the total information in XML format, and also retrieve only some specific details.

Figure 10.3

In this page, a user can log in using their unique id and password. A new user can register by clicking new user registration. This page redirects to the create data page.

Figure 10.4

This page is used by users to create data. First of all, a root node is created, and then the root element and further node elements are stored by the user. It is also designed in such a way that new data can be appended to the existing data.
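The append behaviour described above can be sketched with System.Xml's XmlDocument, which the sample code in Section 10.2 also uses. The element name and text here are illustrative, not taken from the project:

```csharp
using System;
using System.Xml;

class AppendDemo
{
    static void Main()
    {
        // Start from a minimal document with a single root, as the page does.
        var doc = new XmlDocument();
        doc.LoadXml("<root></root>");

        // Append a new element under the root; repeating this call appends
        // further data alongside the existing nodes rather than replacing them.
        XmlElement book = doc.CreateElement("book");
        book.InnerText = "XML Mining";
        doc.DocumentElement.AppendChild(book);

        Console.WriteLine(doc.OuterXml); // <root><book>XML Mining</book></root>
    }
}
```

In the real page the document is loaded from and saved back to a file with XmlDocument.Load and XmlDocument.Save, so the appended nodes persist between requests.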

Figure 10.5

This page is used to search the document using the key element. The stored XML data is retrieved by using the root node: the user is asked for the root node, which selects the specific data from the entire XML document.

Figure 10.6

This page is used to find and display the paths of the document searched using the root element. Using these paths, we can easily and efficiently retrieve data.

Figure 10.7

This page is used by the user to retrieve information which is stored in XML format. It can retrieve the full XML document or only some specific information, as needed. The Retrieve button is clicked to invoke the function.

Figure 10.8

On clicking the Retrieve button, the needed information is displayed in the small drop-down list box.

Figure 10.9

This page shows how the data is stored in the XML file. The data entered by users is automatically stored as tags in tree format.

10.2 SAMPLE CODE

USER LOGIN

using System;
using System.Collections;
using System.Configuration;
using System.Data;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.HtmlControls;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Data.SqlClient;

public partial class Newuser : System.Web.UI.Page
{
    SqlConnection conn = new SqlConnection(@"Data Source=.\SQLEXPRESS;AttachDbFilename=E:\data_mining\xml query answering\App_Data\ASPNETDB.MDF;Integrated Security=True;User Instance=True");

    protected void Button1_Click(object sender, EventArgs e)
    {
        try
        {
            conn.Open();
            SqlCommand cmd = new SqlCommand("insert into login values('" + TextBox1.Text + "','" + TextBox2.Text + "')", conn);
            cmd.ExecuteNonQuery();
            conn.Close();
            Label6.Visible = true;
            Label6.Text = "Successfully Created";
            Button2.Visible = true;
        }
        catch (Exception)
        {
            // The insert fails when the username already exists.
            Label6.Text = "Username Already Exist !!!";
        }
    }
}

CREATE DATA

using System;
using System.Collections;
using System.Configuration;
using System.Data;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.HtmlControls;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Data.SqlClient;
using System.Xml;

public partial class CreateData : System.Web.UI.Page
{
    string path1, path2, path3, path4, path5;
    SqlConnection cnn = new SqlConnection(@"Data Source=.\SQLEXPRESS;AttachDbFilename=E:\data_mining\xml query answering\App_Data\ASPNETDB.MDF;Integrated Security=True;User Instance=True");

    protected void Button1_Click(object sender, EventArgs e)
    {
        // Append a new element, named after the user's input, to the per-user XML file...
        string xmlFile = Server.MapPath("~/Files/" + TextBox1.Text + ".xml");
        XmlDocument xd = new XmlDocument();
        xd.Load(xmlFile);
        XmlNode rootelement = xd.CreateNode(XmlNodeType.Element, TextBox1.Text, null);
        XmlNodeList list = xd.GetElementsByTagName("root");
        list[0].AppendChild(rootelement);
        xd.Save(xmlFile);

        // ...and mirror the same element in the shared XMLFile.xml.
        string xmlFile1 = Server.MapPath("~/XMLFile.xml");
        XmlDocument xd1 = new XmlDocument();
        xd1.Load(xmlFile1);
        XmlNode rootelement1 = xd1.CreateNode(XmlNodeType.Element, TextBox1.Text, null);
        XmlNodeList list1 = xd1.GetElementsByTagName("root");
        list1[0].AppendChild(rootelement1);
        xd1.Save(xmlFile1);

        Label4.Visible = true;
        Label4.Text = "node created successfully";
    }
    protected void Button3_Click1(object sender, EventArgs e)
    {
        Session["gold"] = TextBox2.Text;
        ViewState["count"] = Convert.ToInt32(ViewState["count"]) + 1;
        Response.Write(Convert.ToInt32(ViewState["count"]));

        string xmlFile = Server.MapPath("~/Files/" + TextBox1.Text + ".xml");
        XmlDocument xd = new XmlDocument();
        xd.Load(xmlFile);
        XmlNode rootelement = xd.CreateNode(XmlNodeType.Element, TextBox1.Text, null);
        XmlNodeList list = xd.GetElementsByTagName("root");
        list[0].AppendChild(rootelement);
        xd.Save(xmlFile);

        // Build the node's child elements and record their dotted paths.
        XmlNode node1 = xd.CreateNode(XmlNodeType.Element, TextBox3.Text, null);
        XmlElement ele1 = xd.CreateElement(TextBox4.Text);
        ele1.InnerText = TextBox9.Text;
        path1 = TextBox2.Text + '.' + TextBox3.Text + '.' + TextBox4.Text;
        XmlElement ele2 = xd.CreateElement(TextBox5.Text);
        ele2.InnerText = TextBox10.Text;
        path2 = TextBox2.Text + '.' + TextBox3.Text + '.' + TextBox5.Text;
        XmlElement ele3 = xd.CreateElement(TextBox6.Text);
        ele3.InnerText = TextBox11.Text;
        path3 = TextBox2.Text + '.' + TextBox3.Text + '.' + TextBox6.Text;
        XmlElement ele4 = xd.CreateElement(TextBox7.Text);
        ele4.InnerText = TextBox12.Text;
        path4 = TextBox2.Text + '.' + TextBox3.Text + '.' + TextBox7.Text;
        XmlElement ele5 = xd.CreateElement(TextBox8.Text);
        ele5.InnerText = TextBox13.Text;
        path5 = TextBox2.Text + '.' + TextBox3.Text + '.' + TextBox8.Text;

        node1.AppendChild(ele1);
        node1.AppendChild(ele2);
        node1.AppendChild(ele3);
        node1.AppendChild(ele4);
        node1.AppendChild(ele5);

        // Reuse the earlier list variable to locate the parent node.
        list = xd.GetElementsByTagName(TextBox2.Text);
        list[0].AppendChild(node1);
        xd.Save(xmlFile);

        xmlcreate();

        // Record the dotted path of each new element in the paths table.
        string sql =
            "insert into paths(pathname,pathexp) values('" + TextBox1.Text + "','" + path1 + "') " +
            "insert into paths(pathname,pathexp) values('" + TextBox1.Text + "','" + path2 + "') " +
            "insert into paths(pathname,pathexp) values('" + TextBox1.Text + "','" + path3 + "') " +
            "insert into paths(pathname,pathexp) values('" + TextBox1.Text + "','" + path4 + "') " +
            "insert into paths(pathname,pathexp) values('" + TextBox1.Text + "','" + path5 + "')";
        SqlCommand cmd1 = new SqlCommand(sql, cnn);
        cnn.Open();
        cmd1.ExecuteNonQuery();
        Label12.Visible = true;
        Label12.Text = "created data successfully in XMLFile.xml";
        cnn.Close();
    }

}
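One caveat about the pages above: they build SQL statements by concatenating raw TextBox values into the query text. The self-contained sketch below (the helper and the inputs are hypothetical, mirroring the concatenation pattern of the Newuser page) shows how a crafted input changes the statement itself, which is why parameterized commands (SqlParameter) are generally preferred:

```csharp
using System;

class InjectionDemo
{
    static string BuildLoginInsert(string user, string pass)
    {
        // Same concatenation pattern as the Newuser page's Button1_Click.
        return "insert into login values('" + user + "','" + pass + "')";
    }

    static void Main()
    {
        // A malicious username closes the quoted value and injects extra SQL.
        string sql = BuildLoginInsert("x','y'); drop table login;--", "pw");
        Console.WriteLine(sql.Contains("drop table")); // prints True
    }
}
```

With a parameterized command the user input travels as a value, not as query text, so it cannot alter the statement.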

RETRIEVE DATA

using System;
using System.Collections;
using System.Configuration;
using System.Data;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.HtmlControls;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Xml;
using System.Data.SqlClient;

public partial class UserRetiveData : System.Web.UI.Page
{
    SqlConnection con = new SqlConnection(@"Data Source=.\SQLEXPRESS;AttachDbFilename=E:\data_mining\xml query answering\App_Data\ASPNETDB.MDF;Integrated Security=True;User Instance=True");

    protected void btnreteive_Click(object sender, EventArgs e)
    {
        ListBox1.Items.Clear();
        con.Open();
        SqlCommand cmd = new SqlCommand("select pathexp from paths where pathname='" + txtrname.Text + "'", con);
        SqlDataReader dr = cmd.ExecuteReader();
        if (dr.HasRows)
        {
            gv.DataSource = dr;
            gv.DataBind();
        }
        else
        {
            Response.Write("alert('Incorrect Root element ')");
            gv.DataSource = dr;
            gv.DataBind();
        }
        dr.Close();

        XmlTextReader reader = new XmlTextReader(Server.MapPath("~/Files/" + txtrname.Text + ".xml"));
        reader.WhitespaceHandling = WhitespaceHandling.None;

        // Load the file into the XmlDocument.
        XmlDocument xmlDoc = new XmlDocument();
        xmlDoc.Load(reader);

        // Close off the connection to the file.
        reader.Close();

        // Add an item representing the document to the listbox.
        ListBox1.Items.Add("XML Document");

        // Find the root node, and add it together with its children.
        XmlNode xnod = xmlDoc.DocumentElement;
        AddWithChildren(xnod, 1);
    }
}

11. REFERENCES

(1) R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proc. of the 20th Int. Conf. on Very Large Data Bases, Morgan Kaufmann Publishers Inc., 1994.

(2) T. Asai, K. Abe, S. Kawasoe, H. Arimura, H. Sakamoto, and S. Arikawa. Efficient substructure discovery from large semi-structured data. In Proc. of the SIAM Int. Conf. on Data Mining, 2002.

(3) T. Asai, H. Arimura, T. Uno, and S. Nakano. Discovering frequent substructures in large unordered trees. Technical Report DOI-TR 216, Department of Informatics, Kyushu University, 2003.

(4) E. Baralis, P. Garza, E. Quintarelli, and L. Tanca. Answering XML queries by means of data summaries. ACM Transactions on Information Systems, 25(3):10, 2007.

(5) D. Barbosa, L. Mignet, and P. Veltri. Studying the XML web: Gathering statistics from an XML sample. World Wide Web, 8(4):413-438, 2005.

SITES REFERRED

http://www.i.kyushuu.ac
http://www.ieee.org
http://www.computer.org/publications/dlib