MDSN Report
Transcript of MDSN Report
-
7/31/2019 MDSN Report
1/59
Chapter 1
Introduction
Currently, the World Wide Web is the largest source of information. A huge amount of data is present on the Web, and more is added constantly. Users search for the required information using particular keywords. We deal specifically with news. Given the large number of news articles available on the Web, there is a corresponding need to provide high-quality summaries that allow the user to quickly locate the desired information, i.e., to get a summary of different news items from a variety of newspapers about the same topic, as per the query specification.
Summarization is the process of condensing a source text into a shorter version while preserving its
information content. It can serve several goals, from a survey analysis of a scientific field to quick
indicative notes on the general topic of a text. In other words, summarization is the process of
automatically creating a compressed version of a given text that provides useful information for the
user. The information content of a summary depends on the user's needs. Topic-oriented summaries
focus on a user's topic of interest, and extract the information in the text that is related to the
specified topic. Indicative summaries, which can be used to quickly decide whether a text is worth
reading, are naturally easier to produce.
Query-oriented summarization (QS) tries to extract a summary for a given query. It is a
common task in many text mining applications. For example, a user submits a query to a search
engine, and the search engine usually returns many result documents. Clicking and viewing each of
the returned documents is obviously tedious and in many cases infeasible. One challenging issue is
how to help the user digest the returned documents. Typically, the documents discuss different
perspectives of the query. An ideal solution might be a system that automatically generates a
concise and informative summary for each perspective of the query. Much work has been done on
document summarization. Generally, document summarization can be classified into three
categories:
1. Single document summarization (SDS)
2. Multi-document summarization (MDS)
3. Query oriented summarization (QS)
SDS extracts a summary from a single document, while MDS extracts a summary from multiple
documents. The two tasks have been intensively investigated and many methods have been
Multi-Document Extractive Summarization for News Page 1 of 59
proposed. The methods for document(s) summarization can be further categorized into two groups:
unsupervised and supervised. The unsupervised method is mainly based on scoring sentences in the
documents by combining a set of predefined features.
In the supervised method, summarization is treated as a classification or a sequential
labeling problem and the task is formalized as identifying whether a sentence should be included in
the summary or not. However, the method requires training examples. Query-oriented
summarization (QS) is different from the SDS and the MDS tasks. The document cluster denotes
the information source and the query denotes the information need. A document cluster is a subset
of the entire document collection. A compelling application of document summarization is the
snippets generated by Web search engines for each query result, which assist users in further
exploring individual results. The Information Retrieval (IR) community has largely viewed text
documents as linear sequences of words for the purpose of summarization. Although this model
has proven quite successful in efficiently answering keyword queries, it is clearly not optimal since
it ignores the inherent structure in documents.
Furthermore, most summarization techniques are query-independent and follow one of the
following two extreme approaches:
1. They simply extract relevant passages, viewing the document as an unstructured set of
passages; or
2. They employ Natural Language Processing techniques.
The former approach ignores the structural information of documents while the latter is too
expensive for large datasets (e.g., the Web) and sensitive to the writing style of the documents.
Here we discuss a method to add structure, in the form of a graph, to text documents in order to
allow effective query-specific summarization. That is, a document is viewed as a set of
interconnected text fragments.
Our main focus is on keyword queries, since keyword search is the most popular information
discovery method on documents because of its power and ease of use. This technique has the
following key steps: First, at the preprocessing stage, a structure is added to every document,
which can then be viewed as a labeled, weighted graph, called the document graph. Then, at query
time, given a set of keywords, a keyword proximity search is performed on the document graphs to
discover how the keywords are associated in the document graphs. For each document its summary
is the minimum spanning tree on the corresponding document graph that contains all the keywords
(or equivalent based on a thesaurus). So data from the minimum spanning tree nodes is collected
and presented as a summary of the document.
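The keyword-proximity step just described can be sketched in a few lines. The project itself is implemented in C#/.NET, so the Python below is only an illustrative sketch; the node texts, edge weights, and function name are assumptions. It builds a spanning tree with Prim's algorithm and then repeatedly prunes keyword-free leaves, a simple approximation of the minimum tree connecting all keyword-bearing nodes.

```python
# Illustrative sketch (not the project's C#/.NET code): build a spanning
# tree over the document graph with Prim's algorithm, then prune leaves
# that contain no query keyword, leaving the subtree that connects the
# keyword-bearing fragments. nodes: {id: text}, edges: {(u, v): weight}.

def mst_summary(nodes, edges, keywords):
    ids = list(nodes)
    in_tree = {ids[0]}
    adj = {u: set() for u in ids}      # adjacency of the tree built so far
    while len(in_tree) < len(ids):     # the graph is assumed connected
        # Cheapest edge with exactly one endpoint inside the tree.
        w, u, v = min((w, u, v) for (u, v), w in edges.items()
                      if (u in in_tree) != (v in in_tree))
        in_tree.add(v if u in in_tree else u)
        adj[u].add(v)
        adj[v].add(u)

    def has_keyword(i):
        text = nodes[i].lower()
        return any(k.lower() in text for k in keywords)

    # Repeatedly drop keyword-free leaves until only the connecting
    # subtree remains.
    keep, changed = set(ids), True
    while changed:
        changed = False
        for i in list(keep):
            degree = sum(1 for j in adj[i] if j in keep)
            if degree <= 1 and not has_keyword(i):
                keep.discard(i)
                changed = True
    return [nodes[i] for i in sorted(keep)]
```

The document's summary is then the text of the surviving nodes, in document order.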
Automatic summarization is the creation, by a computer program, of a shortened version of a text
that contains the important information of the original documents.
1.1 History:
1. 1950s: First systems, surface-level approaches
Term frequency (Luhn, Rath)
2. 1960s: First entity-level approaches
Syntactic analysis
Surface level: location features (Edmundson 1969)
3. 1970s:
Surface level: cue phrases (Pollock and Zamora)
Entity level
First discourse level: story grammars
4. 1980s:
Entity level (AI): use of scripts, logic and production rules, semantic networks (Dejong 1982,
Fum et al. 1985)
Hybrid (Aretoulaki 1994)
5. From the 1990s on: explosion of all approaches
1.2 Literature survey:
1.2.1 Aim:
Our aim is to achieve multi-document news summarization.
1. We parse the HTML document(s) and extract the text file(s) from them. As we are dealing
with text only, we have chosen the nearest-neighbor algorithm for clustering, since it is less
complex and sufficient for text.
2. We produce an extractive summary along with the query specification.
1.2.2 Extractive and Abstractive Summarization
Extractive Summarization:
Produces a summary by selecting indicative sentences, passages or paragraphs from an original
document according to a predefined target summarization ratio.
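As a toy illustration of extractive selection at a fixed ratio, the sketch below scores sentences by average word frequency and keeps the top fraction. The scoring scheme and function name are assumptions for illustration; the project's own selection criteria differ, and the project itself is implemented in C#/.NET.

```python
# Minimal extractive summarizer: score sentences by the frequency of
# their words across the whole document, then keep the top fraction
# given by the target summarization ratio. The frequency-based scoring
# is an illustrative choice, not this project's actual criterion.
from collections import Counter

def extractive_summary(sentences, ratio=0.3):
    words = [w.lower() for s in sentences for w in s.split()]
    freq = Counter(words)

    def score(s):
        ws = s.lower().split()
        return sum(freq[w] for w in ws) / max(len(ws), 1)

    k = max(1, round(len(sentences) * ratio))
    top = sorted(sentences, key=score, reverse=True)[:k]
    # Present the selected sentences in their original document order.
    return [s for s in sentences if s in top]
```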
Abstractive summarization:
Provides a fluent and concise abstract of a certain length that reflects the key concepts of the
document. This requires highly sophisticated techniques, including semantic representation and
inference, as well as natural language generation. In recent years, researchers have tended to focus
on the extractive end of the text summarization research spectrum.
1.2.3 A System for Query-Specific Document Summarization
Mr. Ramakrishna Varadarajan & Vangelis Hristidis presented a method to create query specific
summaries by identifying the most query-relevant fragments and combining them using the
semantic associations within the document. In particular, structure is added to the documents in the
preprocessing stage and they are converted to document graphs. Then, the best summaries are
computed by calculating the top spanning trees on the document graphs. This paper presents and
experimentally evaluates efficient algorithms that support computing summaries in interactive
time. Furthermore, the quality of the summarization method is compared to current approaches
using a user survey.
In this work a structure-based technique is presented to create query-specific summaries
for text documents. In particular, the document graph of a document is created to represent the
hidden semantic structure of the document, and keyword proximity search is then performed on this
graph. The paper shows, through a user survey, that the approach performs better than other
state-of-the-art approaches. Furthermore, the feasibility of the approach is demonstrated with a
performance evaluation.
In that approach a document graph was built and processing was done on text documents. We
implement a somewhat similar methodology, but in addition an HTML-to-text parser is added,
i.e., we process HTML files.
1.2.4 An Incremental Summary Generation System
Mr. C Ravindranath Chowdary & P Sreenivasa Kumar presented an algorithm that finds a
pair of sentences, one from the current summary and the other from a new document, that are
swapped to improve the quality of the summary. For a given query, the quality of a summary is
determined by its informativeness, coherence and completeness. A scoring function that captures
these features to calculate the quality of a summary is proposed. The process of
updating/improving the summary continues iteratively until the improvement in the quality measure
becomes negligible. Experimental results, both qualitative and quantitative, show that the
performance of the proposed approach for incremental summary generation is quite encouraging.
This paper deals with updating the available extractive summary in the scenario where the
initial documents used for summarization are not accessible. The proposed algorithm updates the
available summary as and when a new document is made available to the system.
In that approach extractive summarization is used, but the original documents are not
accessible. We also deal with extractive summarization, but the original documents are accessible;
moreover, a highlighting feature is added for the convenience of the user.
1.2.5 Automatic Text Summarization
Mr. Mohamed Abdel Fattah & Fuji Ren investigate the effect of each sentence feature on the
summarization task. They then used the combined feature score function to train genetic algorithm
(GA) and mathematical regression (MR) models to obtain a suitable combination of feature weights.
The performance of the proposed approach is measured at several compression rates on a data
corpus composed of 100 English religious articles. The results of the proposed approach are promising.
This paper investigates the use of a genetic algorithm (GA) and mathematical regression (MR)
for the automatic text summarization task. This new approach is applied to a sample of 100 English
religious articles, and its results outperform those of the baseline approach. The approach uses
feature extraction criteria that give researchers the opportunity to use many varieties of these
features, depending on the language and text type used.
In that approach the algorithm used for summarization was a genetic algorithm, while we
use the nearest-neighbor algorithm; moreover, it is an automatic summarization strategy, while we
use an extractive summarization strategy.
1.2.6 Multi-topic based Query-oriented Summarization
Mr. Jie Tang, Limin Yao & Dewei Chen try to break the limitations of the existing methods and
study a new setup of the problem of multi-topic based query-oriented summarization. More
specifically, the paper proposes two strategies to incorporate the query information into a
probabilistic model. Experimental results on two different genres of data show that the proposed
approach can effectively extract a multi-topic summary from a document collection, and its
summarization performance is better than that of baseline methods. The approach is quite general
and can be applied to many other mining tasks, for example product opinion analysis and question
answering.
This paper investigates the problem of multi-topic based query-oriented summarization.
The paper formalizes the major tasks and proposes a probabilistic approach to solve the tasks. Two
strategies are studied for simultaneously modeling document contents and the query information.
We also deal with query-oriented multi-document summarization, and we have specialized it
for news.
1.2.7 Proposed modules:
1. HTML to text parser
2. Processing the input text file and creating the document graph
3. Adding weighted edges to document graph
4. Document Clustering
5. Create clustered document graph
6. Adding weight to nodes in clustered document graph
7. Generate closure graph and find minimal clusters
8. Result
1.2.8 Clustering:
Clustering is one of the most important unsupervised learning processes; it organizes
objects into groups whose members are similar in some way. Clustering finds structure in a
collection of unlabeled data. A cluster is a collection of objects which are similar to one another
and dissimilar to the objects belonging to other clusters.
1.2.8.1 Uses of Clustering:
If a collection is well clustered, we can search only the cluster that will contain relevant
documents.
Searching a smaller collection should improve effectiveness and efficiency.
1.2.8.2 Nearest Neighbour Algorithm
1. Nearest Neighbor Algorithm is an agglomerative approach (bottom-up).
2. Starts with n nodes (n is the size of our sample), merges the 2 most similar nodes at each
step, and stops when the desired number of clusters is reached.
Step 1: Nearest Neighbor, Level 2, k = 7 clusters.
Step 2: Nearest Neighbor, Level 3, k = 6 clusters.
Step 3: Nearest Neighbor, Level 4, k = 5 clusters.
Step 4: Nearest Neighbor, Level 5, k = 4 clusters.
Step 5: Nearest Neighbor, Level 6, k = 3 clusters.
Step 6: Nearest Neighbor, Level 7, k = 2 clusters.
Step 7: Nearest Neighbor, Level 8, k = 1 cluster.
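The merge sequence illustrated in the steps above can be sketched as follows. This is an illustrative Python sketch (the project itself is implemented in C#/.NET), and the word-overlap similarity is an assumed stand-in for whatever similarity measure the nodes actually use: start with one cluster per node and repeatedly merge the two most similar clusters until k clusters remain.

```python
# Sketch of the nearest-neighbor (agglomerative, bottom-up) clustering
# described above. Word-overlap similarity and single-link merging are
# illustrative assumptions, not this project's exact measures.

def word_overlap(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def nearest_neighbor_clusters(texts, k):
    clusters = [[t] for t in texts]        # one cluster per node to start

    def sim(c1, c2):
        # single-link: similarity of the closest pair of members
        return max(word_overlap(a, b) for a in c1 for b in c2)

    while len(clusters) > k:
        # Find the two most similar clusters and merge them.
        _, i, j = max((sim(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```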
1.3 Advantages:
1. Because of multi-document news summarization, there is no need to go through all the
newspapers.
2. As we are dealing with query-specific summarization, the user can easily get a news summary
according to his/her interest.
3. The accuracy of the result depends upon the initial edge threshold and cluster threshold, as
well as the result accuracy percentage, so the user can control the relevance.
1.4 Limitations:
1. Processing text files larger than 50 KB, or with more than 200 paragraphs, takes too long
due to heavy computational loops.
2. The text in the images or in the Flash contents cannot be parsed.
3. The text accessed through web services cannot be parsed.
Chapter 2
System Requirement and Specification
2.1 Scope of the Project
This project creates a query-dependent summary generated by a clustering algorithm; here
we have used the nearest-neighbor clustering algorithm. Since every file format can be converted
into a text file, the algorithm is applied to text files. Nodes in the text file, i.e., the contents of
every new line, are clustered, and a query-dependent summary is generated.
2.2 Requirement Specifications
Requirements are the desired characteristics of the software being developed. The first
activity in most projects is the identification and documentation of the requirements. Requirements
cover both requirements engineering (identification, analysis and capture) and requirements
management (managing change, creating and maintaining agreement with customers, traceability
and metrics).
The development of large, complex systems presents many challenges to systems
engineers. Foremost among these is the ability to ensure that the final system satisfies the needs of
users and provides for easy maintenance and enhancement of these systems during their deployed
lifetime. These systems often change and evolve throughout their life cycle. This makes it difficult
to track the implemented system against the original and evolving user requirements.
Requirements establish an understanding of users' needs and also provide the final yardstick
against which implementation success is measured. Various studies have shown that roughly half
of application errors can be traced to requirement errors and deficiencies. Thorough
documentation and proper management of requirements are the keys to developing quality
applications. By allowing project teams to define and document requirement data, including
user-defined attributes, priority, status, acceptance criteria and traceability, detection and
correction of missing, contradictory or inadequately defined requirements can be done. The
following requirements and constraints were considered during the requirement analysis phase.
For clustering we have to take a text file as input; a file in any other format is first converted into
text format. Then the nodes in the text document are clustered and query-dependent
summarization is performed.
2.2.1 Product performance requirements
The input file must be a text file. As the size of the text document increases, the performance of
the algorithms degrades, so larger files call for better hardware facilities.
2.2.2 Hardware Requirements
Processor : Pentium IV or higher.
Ram : Minimum 256 MB.
Hard Disk : 40 GB.
Input device : Standard Keyboard and Mouse.
Output device : VGA and High Resolution Monitor.
2.2.3 Software Requirements
Software components required for building the Project are:
Operating System : WINDOWS XP or above
Technique : Microsoft Visual Studio 2010 (.NET Framework 3.5)
Internet explorer.
2.3 Functional requirements
The project includes the following modules:
HTML to text parser: Processes the input HTML files, parsing the HTML contents and extracting
the text lines.
Uploading and processing: In this module a text file is uploaded and processed. Every
single line is considered a node, and the data within that node is displayed.
Building document graph: The weight between every node and every other node is
calculated.
Clustering and making clustered graph: The nearest-neighbor method and agglomerative
hierarchical clustering technique are used to cluster the document graph from the previous step,
and a clustered graph is prepared.
Query firing and getting minimal cluster: Here we fire the query and find the minimal
cluster. The minimal cluster is the cluster which contains part of the fired query. From it we get
the summary as the result.
2.4 Feasibility Study
Not everything imaginable is feasible; therefore it is necessary to evaluate the feasibility of a
project at the earliest stage.
The software feasibility has 3 solid dimensions:
2.4.1 Technology:
Technical feasibility is the study of functions, performance, and constraints that may affect the
ability to achieve an acceptable system. This project is technically feasible to implement. The user
does not require any extra hardware or any higher-end technology. The software can execute on a
single client machine operating on a WINDOWS XP or a higher version of Operating System.
2.4.2 Finance:
Financial feasibility is the evaluation of the development cost weighed against the ultimate
income or benefits derived from the developed system. The resources required for the system are
easily available. The system is developed basically for study purposes, so economic feasibility is
not a major issue.
This project is financially feasible because the software does not require any extra hardware or any
additional supporting technology, which adds no extra cost to the software; the only cost is for
development.
2.4.3 Resources:
The organization that wishes to implement this system requires only one or more machines.
Thus no additional resources are required to implement the system, and the software is also
resource-feasible.
2.5 .NET Framework :
.NET Framework is designed for cross-language compatibility. Cross-language
compatibility means .NET components can interact with each other irrespective of the languages
they are written in. An application written in VB .NET can reference a DLL file written in C# or a
C# application can refer to a resource written in VC++, etc. This language interoperability extends
to Object-Oriented inheritance. This cross-language compatibility is possible due to common
language runtime.
2.5.1 .NET Framework Advantages:
The .NET Framework offers a number of advantages to developers.
Different programming languages have different approaches for doing a task. For example,
accessing data with a VB 6.0 application and a VC++ application is totally different. When using
different programming languages to do a task, a disparity exists among the approaches developers
use to perform the task. The difference in techniques comes from how different languages interact
with the underlying system that applications rely on. With .NET, for example, accessing data with
a VB .NET application and a C# .NET application looks very similar apart from slight syntactical
differences. Both programs need to import the System.Data namespace, both establish a connection
with the database, and both run a query and display the data in a data grid.
.NET v/s Java :
Java is one of the greatest programming languages created. Java doesn't have a visual interface,
and it requires us to write heaps of code to develop applications. With .NET, on the other hand,
the Framework supports around 20 different programming languages, letting developers focus
only on business logic and leaving all other aspects to the Framework.
Visual Studio .NET comes with a rich visual interface and supports drag and drop. Many
applications were developed, tested and maintained to compare the differences between .NET and
Java, and the end result was that a given application developed using .NET requires fewer lines of
code, less development time and lower deployment costs, along with other important benefits.
Personally, I don't mean to say that Java is gone or that .NET-based applications are going to
dominate the Internet, but I think .NET definitely has an extra edge, as it is packed with features
that simplify application development.
2.5.2 Main features of C#:
C# was developed as a language that would combine the best features of previously
existing Web and Windows programming languages. Many of the features of the C# language
pre-existed in various languages such as C++, Java, Pascal, and Visual Basic.
Main features:
1. C# is a simple, modern, object-oriented language derived from C++ and Java.
2. It combines the high productivity of Visual Basic with the raw power of C++.
3. It is a part of Microsoft Visual Studio 7.0.
4. Visual Studio supports VB, VC++, C++, VBScript, and JScript. All of these languages
provide access to the Microsoft .NET platform.
5. .NET includes a common execution engine and a rich class library.
6. Microsoft's JVM equivalent is the Common Language Runtime (CLR).
7. The CLR accommodates more than one language, such as C#, VB.NET, JScript, ASP.NET,
and C++.
8. Source code ---> Intermediate Language code.
9. The classes and data types are common to all of the .NET languages.
10. We may develop console applications, Windows applications, and Web applications
using C#.
11. In C#, Microsoft has taken care of C++ problems such as memory management,
pointers, etc.
12. It supports garbage collection, automatic memory management and more.
Here is a list of some of the primary characteristics of C# language.
Modern and Object Oriented
Simple and Flexible
Type safety
Interoperability
Scalable and Updateable
2.6 Risk Management
The software development process is inherently subject to risks, the consequences of which
are manifested as financial failures (time-scale overrun, budget overrun) and technical failures
(failure to meet required functionality, reliability or maintainability). The objectives of risk
management are to identify, analyze and prioritize risk items before they become either
threats to successful operation or major sources of expensive software rework; to establish a
balanced and integrated strategy for eliminating or reducing the various sources of risk; and to
monitor and control the execution of that strategy.
2.7 Data Flow Diagrams
Data Flow Diagrams serve two purposes:
1. To provide an indication of how data is transformed as it moves through the system.
2. To depict the functions that transform the data flow.
The DFD provides additional information that is used during the analysis of the
information domain and serves as a basis for the modeling of function. A description for each
function presented in the DFD is contained in a process specification.
As information moves through software, it is modified by a series of transformations. A
data flow diagram is a graphical representation of information flow and transforms that are applied
as data moves from input to output. The basic form of data flow diagram is also known as data
flow graph or bubble chart.
The data flow diagram may be used to represent a system or software at any level of abstraction. In
fact, DFDs may be partitioned into levels that represent increasing information flow and functional
detail. Therefore, the DFD provides a mechanism for functional modeling as well as information
flow modeling.
Figure 2.1 DFD (Level 0)
[Figure 2.1 shows: an input HTML file and a query flow into the SYSTEM, which outputs a summary of the HTML file.]
The data flow diagram (Level 1) provides more detail than the Level 0 diagram. It represents the
information flow and the transforms that are applied as data moves from input to output.
[Figure 2.2 shows: input HTML file(s) → HTML-to-text converter → uploading and processing the input file → building the document graph; the clustering algorithm, threshold, and input query drive the later stages.]
Figure 2.2 DFD (Level 1)
Chapter 3
Design and analysis
3.1 Design Overview
A specialist has to check the data flow and manually trace where the data flows. Analysts
have shown that it would take considerable time even for an experienced specialist to trace the
data flow.
We accept HTML/text files only. The contents of each new line form a node, and hence a
single cluster. If there are no newline-separated contents, then only one node, and hence only one
cluster, will exist. This degrades the performance of the algorithms, as the cluster size is very big.
3.2 Software Architecture:
Figure 3.1 Architecture Diagram
Figure 3.1 shows the architecture diagram of the system. As shown in the figure, there are five
main blocks: a block for HTML-to-text conversion; a block for uploading and processing the
HTML file(s) by parsing the text from the HTML document and making the document graph; a
block for clustering and making the clustered graph; a block for making the weighted clustered
document graph; and a final block for generating the summary for the fired query.
Block 1: HTML to Text conversion:
This block accepts the input files in HTML form and converts them into text files. After
conversion from HTML to text, the text file is passed to the next block as input.
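Block 1 can be sketched with a standard-library HTML parser. The project implements this block in C#/.NET, so the Python below is only an illustrative equivalent: it strips tags, skips script and style content, and keeps the visible text lines.

```python
# Illustrative HTML-to-text converter using Python's standard library
# (the project's actual parser is written in C#/.NET). Visible text is
# collected line by line; <script> and <style> content is skipped.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.lines = []
        self._skip = 0          # depth inside <script>/<style> elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip:
            self.lines.append(text)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.lines)
```

Each extracted line can then become one node of the document graph in the next block.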
Block 2: Processing input file and generating document graph:
This block accepts text files only. It is responsible for uploading the text file and processing it,
i.e., forming a node for the contents of every new line. It is also responsible for generating the
weight from each node to every other node.
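Block 2's node-to-node weighting can be sketched as below; this is an illustrative Python sketch (the project is implemented in C#/.NET), and the Jaccard word-overlap weight is an assumption about how "weight from each node to every other node" might be computed.

```python
# Sketch of Block 2: each line of the text file becomes a node, and a
# weight is computed between every pair of nodes. Jaccard word overlap
# is used here as an illustrative weight; the project's actual
# weighting formula may differ.

def build_document_graph(lines):
    nodes = {i: line for i, line in enumerate(lines)}
    edges = {}
    for i in nodes:
        for j in nodes:
            if i < j:
                wi = set(nodes[i].lower().split())
                wj = set(nodes[j].lower().split())
                union = wi | wj
                edges[(i, j)] = len(wi & wj) / len(union) if union else 0.0
    return nodes, edges
```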
Block 3: Clustering node and building clustered graph:
This block is responsible for choosing one of two clustering algorithms. It also accepts the
threshold, so that the similarity between clusters can be checked up to that level. It is responsible
for forming the clusters.
Block 4: Creating weighted document clustered graph:
This block is responsible for accepting the fired query. It checks the similarity between the query
contents and the contents of the clusters, and then builds the weighted clustered document graph.
Block 5: Summary generation:
This block is responsible for generating the summary of the clusters we formed, as a
response to the fired query. It generates the minimal clusters and, after finding the weight of each
node for the fired query, returns the top summaries.
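Blocks 4 and 5 together can be sketched as below. Scoring each cluster by the number of query-term occurrences it contains is an assumption about how the weight for the fired query is computed; the function and parameter names are illustrative, and the project's own C#/.NET implementation may score differently.

```python
# Sketch of Blocks 4-5: weight each cluster against the fired query and
# return the contents of the best-matching ("minimal") clusters as the
# summary. Counting query-term hits is an illustrative scoring choice.

def query_summary(clusters, query, top_n=1):
    terms = [t.lower() for t in query.split()]

    def weight(cluster):
        text = " ".join(cluster).lower()
        return sum(text.count(t) for t in terms)

    ranked = sorted(clusters, key=weight, reverse=True)
    best = [c for c in ranked[:top_n] if weight(c) > 0]
    # Flatten the selected clusters into the summary lines.
    return [line for cluster in best for line in cluster]
```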
3.3 Team Work Graph:
Team Members: Mr. Athar Nawaz Khan
Mr. Nikhil Vilasrao Ubale
Miss. Shraddha B. Ahire.
Fig. 3.2 Division of work
3.4 Software Engineering Model used i.e. Incremental Model:
3.4.1 Communication:
The software development process starts with communication between the customer and the
developer. In this phase we followed the principles of the communication phase: we prepared
before communicating, i.e., we decided the agenda of the meeting, concentrating on news
summarization. Our leader directed the team and drew out all the requirements of the users, i.e.,
what they actually need and the input and output formats of the system.
3.4.2 Planning:
This phase includes complete estimation, scheduling and risk analysis. We planned when the
software would be released, estimated its cost, and analyzed the risks in the project regarding the
application. Finally, we estimated the cost of the project, including all software expenditure, and
planned to release the software by the user's deadline with his participation.
3.4.3 Modeling:
This phase includes detailed requirement analysis and project design. A flowchart shows the
complete pictorial flow of the program, while an algorithm is a step-by-step solution of the
problem. We analyzed the requirements of the user and accordingly drew the block diagrams of
the system, that is, the behavioral structure of the system using UML 2.0: class diagram, use case
report, component diagram, communication diagram, activity diagram, and state machine diagram.
3.4.4 Construction:
This phase includes the coding and testing steps.
Coding:
Design details are implemented using an appropriate programming language. For coding we
chose the platform, i.e., ASP.NET.
Testing:
Testing is carried out by analyzing the system, i.e., we first develop a prototype of the
system and step by step find input and output errors such as interface errors, data structure
errors, initialization errors, etc. The Black Box testing strategy is therefore useful here.
3.4.5 Deployment:
This phase includes software delivery, support and feedback from the customer. If the customer
suggests corrections or demands additional features, they are added to the software.
3.5 Analysis of Work:
The following table shows the schedule of work that we followed during the project period.
Table 3.1: Analysis of Work

Sr. No. | Name of Task | Subtask | Period
1 | Information Gathering | 1. Problem Definition: collecting detailed information about the system to be implemented. 2. Literature Survey: visiting different websites, studying the existing system and its limitations, going through journals and magazines, studying the reference books. | 17/07/2011 to 28/07/2011
2 | Analysis | Project Plan: preparing the complete project plan. Requirement Analysis: software requirements, hardware requirements. | 07/08/2011 to 24/08/2011
3 | Design | Architectural Design: describing relationships between modules and sub-modules. UML documentation: use case diagram, class diagram, sequence diagram, activity diagram, state machine diagram, component diagram. Form Design: showing relationships among different menus and sub-menus. | 04/09/2011 to 25/09/2011
4 | GUI | Output screens: preparing detailed output screens. Report Submission: submission of the Analysis and Design report. | 22/02/2012 to 29/02/2012
5 | Construction of System | Coding: implementation of design details using the programming language C#.NET. Testing: testing the system for expected results. | 12/03/2012 to 19/03/2012
6 | Deployment | System Deployment: delivery of the project, support, feedback, modification. | 25/03/2012 to 15/04/2012
7 | Final Document Preparation and Submission | Project Submission: preparing and submitting the final project report. | 15/04/2012 to 23/04/2012
3.6 Risk Assessment:
Risk always involves two characteristics:
Uncertainty: the risk may or may not occur; there are no 100% probable risks.
Loss: if the risk becomes a reality, unwanted consequences or losses will occur.
3.6.1 Risk projection:
Risk projection, also called risk estimation, attempts to rate each risk in two ways:
The likelihood or probability that the risk is real.
The consequences of the problems associated with the risk, should it occur.
3.6.2 Risk Identification:
Risk Identification is a systematic attempt to specify threats to the project plan.
Generic risk: These are potential threats to every software project.
Product Specific: These risks can be identified only by those with a clear understanding of
the technology and the environment that is specific to the project at hand.
Following are the Risks involved:
1. Technology to be built: Risks associated with the complexity of the system to be built and
the newness of the technology to be packaged by the system.
2. Development Environment: Risks associated with the availability and quality of the tools to be used to build this system.
3. Risk related to Time.
4. Risk related to Functionality of the system.
3.7 Requirement Analysis:
Requirement analysis results in the specification of the software's operational characteristics.
Requirements gathering comprises the following:
1. Elaboration
2. Negotiation
3. Specification
4. Validation
The software requirement specification is produced at the culmination of the analysis task. The functions
and performance allocated to software as part of system engineering are refined by establishing the
following:
A complete information description
A detailed functional description
A representation of system behavior
An indication of performance requirement and design constraints
Appropriate validation criteria
3.8 UML Documentation:
A UML diagram is a representation of the components or elements of a system or process
model and, depending on the type of diagram, of how those elements are connected or how they
interact from a particular perspective. We develop the following UML diagrams to show the
elements of the system and the connections between them.
There are two types of UML diagrams:
1. Structural Diagrams
2. Behavioral Diagrams
These two are the major groupings of UML diagrams. We develop those diagrams that are
sufficient to show the flow and elements of the working project. They are listed below:
Use case Diagram
Class Diagram
Sequence Diagram
Activity Diagram
State Machine Diagram
Component Diagram
Description of the UML diagrams:
3.8.1 Use case Model:
A Use Case diagram captures Use Cases and relationships between Actors and the subject
(system). It describes the functional requirements of the system, the manner in which outside
things (Actors) interact at the system boundary, and the response of the system.
Components of Use case diagram:
1. Actor
2. Use Cases
3. Association
4. Include
5. Extends
An Actor is a user of the system; "user" can mean a human user, a machine, or even another
system or subsystem in the model. Anything that interacts with the system from outside the
system boundary is termed an Actor. Actors are typically associated with Use Cases. In our
system we use two actors:
System
End User
Use Case Diagram
[Use case diagram: the End User browses news (HTML files) from the internet, provides the news to the system, enters the query, and enters the threshold for minimal clusters. The System performs parsing (extracting text files from the body tag), processing of the text files, splitting the documents into nodes (paragraphs), finding the similarity between nodes, building the document graph, document clustering using the Nearest Neighbour algorithm, creating the clustered document graph, adding weights to the nodes in the graph, finding the minimal clusters, and showing the result.]
Figure 3.3: Use Case Diagram
3.8.2 Sequence Model:
A Sequence diagram is a structured representation of behavior as a series of sequential
steps over time. It is used to depict work flow, message passing and how elements in general
cooperate over time to achieve a result.
Each sequence element is arranged in a horizontal sequence, with messages passing back
and forward between elements. An Actor element can be used to represent the user initiating the
flow of events. Stereotyped elements, such as Boundary, Control and Entity, can be used to
illustrate screens, controllers and database items, respectively. Each element has a dashed stem
called a lifeline, where that element exists and potentially takes part in the interactions.
Components of Sequence diagrams:
1. Actor
2. Lifeline
3. Message
4. Self Message
5. End point
An Actor is a user of the system; "user" can mean a human user, a machine, or even another
system or subsystem in the model. Anything that interacts with the system from outside the
system boundary is termed an Actor. Actors also represent the role of a user in sequence
diagrams. Enterprise Architect supports a stereotyped Actor element for business modeling. A
Lifeline is an individual participant in an interaction.
Sequence Diagram
[Sequence diagram: the User provides HTML documents and an edge threshold value. The Parser extracts the text file from the body tag and passes the text document to the Splitter, which splits it into nodes. Graph creation builds the document graph; the Relation manager calculates the weights of the edges and displays the nodes. The Cluster algorithm forms and displays the clusters. The Display screen asks the user for the threshold value and the query; the system then calculates the similarity between the query and each cluster (treated as a node), forms the weighted cluster graph, and the Minimal spanning tree component calculates the minimal clusters (the clusters having the maximum weight with the query) and displays the result.]
Figure 3.4: Sequence Diagram
3.8.3 Class Model:
Class diagrams capture the logical structure of the system: the Classes, including Active and
Parameterized (template) Classes, and the objects that make up the model. A class diagram is a
static model, describing what exists and what attributes and behavior it has, rather than how
something is done. Class diagrams are most useful to illustrate relationships between Classes and Interfaces.
Generalizations, Aggregations and Associations are all valuable in reflecting inheritance,
composition or usage, and connections, respectively.
Components of Class diagram:
1. Class
2. Associate
3. Compose
4. Realize
A Class is a representation of objects that reflects their structure and behavior within the
system. It is a template from which actual running instances are created, although a Class can be
defined either to control its own execution or as a template or parameterized Class that specifies
parameters that must be defined by any binding class. A Class can have attributes (data) and
methods (operations or behavior). Classes can inherit characteristics from parent Classes and
delegate behavior to other Classes. Class models usually describe the logical structure of the
system and are the building blocks from which components are built.
Class Diagram
[Class diagram: User (Password, User_id, User_name; Authentication(), Input(), Input_threshold()); Input (File_name, File_size; Conversion(), Firequery(), Inputthreshold()); Splitter (File_size, No. of stopwords; Extract(), Parsing(), Split()); RelationManager (File_size, no. of nodes; Builtgraph(), Calculateclusterweight(), Calculateweightofnodes(), Createnodes(), Displaynodes(), Displayweight()); Clustering algorithm (No. of clusters, Threshold; Createclustergraph(), Displaycluster(), Formcluster()); Input query (Keywords; Assignweight(), Comparenodes(), Firequery()); Output (Result; Displaycluster(), Displaynodes(), Displayresult()).]
Figure 3.5: Class Diagram
3.8.4. Activity Diagram
Activity diagrams are used to model the behaviors of a system, and the way in which these
behaviors are related in an overall flow of the system. The logical paths a process follows, based
on various conditions, concurrent processing, data access, interruptions and other logical path
distinctions, are all used to construct a process, system or procedure.
[Activity diagram: from the initial state, the parsing state (HTML documents to the system, parsing, extracting the text file from the body tag) produces a text file; the split state processes the text file and splits the document into nodes; the nodes are used to build the document graph; clustering finds the similarity between nodes and applies the nearest neighbour algorithm for document clustering; minimal cluster formation creates the clustered document graph, adds weights to the nodes in the graph based on the user's query, takes the threshold for clusters, calculates the similarity between the query and each cluster, and finds the minimal cluster; the result is shown and the flow reaches the final state.]
Figure 3.6: Activity Diagram
3.8.5. State machine Diagram
A State Machine diagram illustrates how an element can move between states, classifying
its behavior according to transition triggers and constraining guard conditions.
[State machine diagram: from the initial state, Input (HTML input files) leads to Parsing (text files), processing of the text file, and splitting of the document into nodes; a weight is assigned to each node to obtain a weighted graph of nodes; Clustering applies nearest neighbour clustering up to the threshold value, assigns a weight to each cluster, and creates the graph; Result compares the user query with the clusters and displays the topmost results above the threshold value.]
Figure 3.7: State Machine Diagram
3.8.6. Component Diagram
A Component diagram illustrates the pieces of software, embedded controllers and such
that make up a system, and their organization and dependencies. A Component diagram has a
higher level of abstraction than a Class diagram; usually a component is implemented by one or
more Classes (or Objects) at runtime. They are building blocks, built up so that eventually a
component can encompass a large portion of a system.
[Component diagram: the Parser takes the input file and feeds the Splitter; the Splitter feeds Graph creation and the Relation manager; the Clustering algorithm and the Minimal spanning tree components take the threshold and input file and produce the Result.]
Figure 3.8: Component Diagram
Chapter 4
Implementation
Implementation is the stage in the project where the theoretical design is turned into a
working system, giving users confidence that the new system will work efficiently and
effectively. It involves careful planning, investigation of the current system and its constraints
on implementation, design of methods to achieve the changeover, and evaluation of the
changeover methods. Apart from planning, a major task in preparing for implementation is the
education and training of users. The more complex the system being implemented, the more
involved the system analysis and design effort required for the implementation.
An implementation coordination committee, based on the policies of the individual
organization, has been appointed. The implementation process begins with preparing a plan for
implementing the system. According to this plan, the activities are carried out, discussions are
held regarding the equipment and resources, and any additional equipment needed to implement
the new system is acquired.
Implementation is a very important phase, the most critical stage in achieving a successful
new system and in giving users confidence that the new system will work effectively. After
the system is implemented, testing can be done. This method also offers the greatest security,
since the old system can take over if errors are found or if the new system is unable to handle
certain types of transactions.
The main functions implemented for the project are listed below in the project modules.
4.1 System Implementation
The input is a text file containing newline separators. The contents separated by newlines
become the contents of the nodes, which are the paragraphs. If there are no newlines in the file,
the whole file content becomes a single node and hence a single cluster, which can degrade the
quality of the result.
The total workflow is divided into following modules:
Module 1: HTML to text parser
Processing the input HTML files: parsing the HTML contents and extracting the text lines.
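The parser itself is written in C#/ASP.NET; purely as an illustration, the same body-text extraction can be sketched in Python with the standard library's HTMLParser (the class and function names here are hypothetical, not the project's own):

```python
from html.parser import HTMLParser

class BodyTextExtractor(HTMLParser):
    """Collect the text that appears inside the <body> tag,
    skipping <script> and <style> content."""
    def __init__(self):
        super().__init__()
        self.in_body = False   # are we inside <body>?
        self.skip = 0          # nesting depth of <script>/<style>
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        # Keep only non-empty text that belongs to the body.
        if self.in_body and not self.skip and data.strip():
            self.lines.append(data.strip())

def html_to_text(html):
    parser = BodyTextExtractor()
    parser.feed(html)
    return "\n".join(parser.lines)
```

Each extracted line (here, each paragraph's text) then feeds the splitting step of Module 2.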
Module 2: Processing the input text file and creating the document graph
Functions Used:
Split ()
The system accepts an input text file. The file is read and stored in a string, which is
then split by the newline keyword. The split result is assigned to a string array, as the split
function returns a string array. The array contains the paragraphs, which are further treated as nodes.
string[] nodeList = null;
nodeList = File.ReadAllLines(txtInputFile.Text);
The next stage is to find the similarity between the nodes that means finding the similarity
edges between nodes and finding their similarity or weight.
Each paragraph becomes a node in the document graph.
The document graph G(V, E) of a document d is defined as follows:
d is split into a set of non-overlapping text fragments t(v), one for each node v in V.
An edge e(u, v) in E is added between nodes u, v in V if there is an association between t(u) and t(v) in
d. Hence, we can view G as an equivalent representation of d, where the associations between text
fragments of d are depicted.
Module 3: Adding Weighted Edges to Document Graph
(Note: Adding weighted edge is query independent)
A weighted edge is added to the document graph between two nodes if they either
correspond to adjacent node or if they are semantically related, and the weight of an edge denotes
the degree of the relationship. Here two nodes are considered to be related if they share common
words (not stop words) and the degree of relationship is calculated by Semantic parsing. Also
notice that the edge weights are query-independent, so they can be pre-computed.
The following input parameters are required at the pre computation stage to create the
document graph:
4.1.1 Threshold for edge weights:
Only edges with weight not below the threshold will be created in the document graph. (The
threshold is a user-configurable value that controls the formation of edges.)
Adding weighted edges is the next step after generating the document graph. For each pair of
nodes u, v we compute the association degree between them, that is, the score (weight) EScore(e)
of the edge e(u, v). If EScore(e) >= threshold, then e is added to E. The score of edge e(u, v), where
nodes u, v have text fragments t(u), t(v) respectively, is:
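The edge-score formula itself appears to have been an image in the original report and did not survive extraction. A reconstruction consistent with the description that follows (for every word shared by the two fragments, add its tf.idf contribution, normalised by the fragment sizes) is:

```latex
EScore(e(u,v)) = \frac{\sum_{w \in t(u) \cap t(v)} \bigl(tf(t(u), w) + tf(t(v), w)\bigr) \cdot idf(w)}{size(t(u)) + size(t(v))}
```

This is the textbook tf.idf overlap form and may differ in detail from the figure that was lost.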
where tf(d, w) is the number of occurrences of w in d,
idf(w) is the inverse of the number of documents containing w, and
size(d) is the size of the document (in words). That is, for every word w appearing in both text
fragments we add a quantity equal to the tf.idf score of w. Notice that stop words are ignored.
Functions Used:
Remove Common Words ()
The common words are eliminated from the nodes, as they can degrade the performance of
calculating the similarity between two nodes; they can also degrade the system performance
because the number of computational loops increases. E.g. a, an, the, he, she, they, as, it, and, are,
were, there, etc.
The two filtered nodes are passed as parameters to the Relation Manager class for finding
the similarity between them.
Relation Manager ()
The relation manager function takes two nodes as parameters and returns the semantic
relation, in the form of a weight (EScore), between the two nodes, using the edge-weight formula
described above.
If EScore >= Threshold, the edge is added to the document graph.
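RemoveCommonWords() and RelationManager() are C# functions whose bodies are not reproduced in the report. A minimal Python sketch of the same idea, assuming a toy stop-word list and the simplified tf.idf-style overlap score described above (names are illustrative), is:

```python
STOP_WORDS = {"a", "an", "the", "he", "she", "they", "as", "it",
              "and", "are", "were", "there"}

def remove_common_words(text):
    """Drop stop words before any similarity computation."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def edge_score(node_u, node_v, idf):
    """Simplified EScore: for every word shared by the two fragments,
    add its combined term frequency times idf, normalised by the total
    fragment size in words. idf maps word -> idf value (default 1.0)."""
    u, v = remove_common_words(node_u), remove_common_words(node_v)
    shared = set(u) & set(v)
    score = sum((u.count(w) + v.count(w)) * idf.get(w, 1.0) for w in shared)
    return score / (len(u) + len(v)) if u or v else 0.0
```

An edge would then be added whenever `edge_score(...)` is at least the user's threshold.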
The graph is stored in tabular form, as shown below.
Table 4.1: Nodes and edge weights

  First Node   Second Node   Edge Weight
  1            2             0.5
  1            3             0.7
  ...          ...           ...
  30           31            0.8
  30           32            0.6
Module 4: Document Clustering
Clustering is the grouping of similar nodes (the nodes whose degree of closure is greater
than or equal to the cluster threshold specified by the user) into a group. The following
clustering approach is used: Nearest Neighbour.
Algorithm for Nearest Neighbour Clustering:
1. Set i = 1 and k = 1. Assign the first pattern to cluster C1.
2. Set i = i + 1. Find the nearest neighbour of pattern i among the patterns already assigned to
clusters. Suppose the nearest neighbour is in cluster m.
3. If the similarity to the nearest neighbour is greater than or equal to t, assign pattern i to
cluster m, where t is the threshold specified by the user. Otherwise set k = k + 1 and assign
pattern i to a new cluster Ck.
4. If every pattern has been considered, then stop; else go to step 2.
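The project implements this in C#; purely as an illustration, the four steps can be sketched in Python, with `similarity` standing in for the node-weight function (higher means more similar):

```python
def nearest_neighbour_clustering(patterns, similarity, t):
    """Nearest neighbour clustering as in the steps above: a pattern joins
    the cluster of its most similar already-clustered pattern when that
    similarity is at least t; otherwise it starts a new cluster."""
    clusters = [[patterns[0]]]                 # step 1: first pattern -> cluster 1
    for p in patterns[1:]:                     # step 2: remaining patterns in turn
        assigned = [q for c in clusters for q in c]
        nearest = max(assigned, key=lambda q: similarity(p, q))
        if similarity(p, nearest) >= t:        # step 3: similar enough to join
            next(c for c in clusters if nearest in c).append(p)
        else:                                  # otherwise open a new cluster
            clusters.append([p])
    return clusters                            # step 4: all patterns considered
```

For example, with similarity(a, b) = 1/(1 + |a - b|) on the numbers [1, 2, 10, 11] and t = 0.5, the sketch yields the clusters [[1, 2], [10, 11]].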
Functions Used:
FindMaxWeight ()
FindMaxWeight returns the pairs of nodes having the maximum edge weight, together with
their weights, from the document graph, e.g.:
Table 4.2: Nodes and the max weight

  First Node   Second Node   Max Weight
  1            22            2.5
  2            19            1.2
  3            31            3.5
  ...          ...           ...
  31           12            2.7
NearestNeighborCluster ()
The first pair of nodes in the above table is added to the first cluster because it has the
maximum weight. Here nodes 1 and 22 are closely related and hence added to the first cluster, so
Cluster_1 contains the two nodes 1 and 22. Cluster_1: 1, 22.
The next node, node 2, shows maximum weight with node 19, but neither node 2 nor node 19 is
in a previous cluster, so they form a new cluster, Cluster_2. Cluster_2: 2, 19.
Similarly, nodes 3 and 31 form a new cluster. Cluster_3: 3, 31.
The next pair (nodes 31 and 12) contains node 31, which is already in Cluster_3, hence node 12 is
added to Cluster_3, which now becomes Cluster_3: 3, 31, 12.
The above procedure is repeated until the end of the node pairs.
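The report's NearestNeighborCluster() is a C# routine; the walkthrough above can be sketched in Python as follows, with the node pairs taken from Table 4.2 (function and variable names are illustrative):

```python
def cluster_by_max_weight(pairs):
    """Walk (node, node, weight) pairs in decreasing max-weight order:
    if either node of the pair already belongs to a cluster, the other
    node joins that cluster; otherwise the pair starts a new cluster."""
    clusters = []
    for u, v, _weight in pairs:
        home = next((c for c in clusters if u in c or v in c), None)
        if home is None:
            clusters.append({u, v})     # neither node seen before: new cluster
        else:
            home.update((u, v))         # extend the existing cluster
    return clusters

# The pairs from Table 4.2, already ordered as in the walkthrough.
pairs = [(1, 22, 2.5), (2, 19, 1.2), (3, 31, 3.5), (31, 12, 2.7)]
```

Running the sketch on these pairs reproduces the walkthrough: Cluster_1 = {1, 22}, Cluster_2 = {2, 19}, Cluster_3 = {3, 31, 12}.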
Module 5: Creating Clustered Document Graph
After the clusters are formed, either by Nearest Neighbour or agglomerative hierarchical
clustering, the similarity edges between similar clusters are calculated. This is the same as
creating the document graph and adding similarity edges between similar nodes. Every cluster is
split into individual nodes, and this grouping of nodes is passed to the relation manager in order
to find the weight between two sets of nodes, i.e. clusters. [5, 7, 10]
Module 6: Adding Weight to Nodes In Clustered Document Graph
When a query Q arrives, the nodes in V are assigned query-dependent weights according to
their relevance to Q. In particular, we assign to each node v corresponding to a text fragment t(v)
a node score NScore(v), defined by the Okapi formula, in which:
tf is the term's frequency in the document,
qtf is the term's frequency in the query,
N is the total number of documents in the collection,
df is the number of documents that contain the term,
dl is the document length (in words),
avdl is the average document length, and
k1 (between 1.0 and 2.0), b (usually 0.75), and k3 (between 0 and 1000) are constants.
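The Okapi formula referenced above appears to have been an image in the original report and did not survive extraction. The standard Okapi (BM25) weighting that matches the terms defined here is:

```latex
NScore(v) = \sum_{t \in Q \cap t(v)}
  \ln\frac{N - df + 0.5}{df + 0.5}
  \cdot \frac{(k_1 + 1)\, tf}{k_1\bigl((1-b) + b \cdot \frac{dl}{avdl}\bigr) + tf}
  \cdot \frac{(k_3 + 1)\, qtf}{k_3 + qtf}
```

This reconstruction is the textbook form and may differ in detail from the lost figure.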
Functions Used:
CalculateClusterWeight ()
All the values mentioned above are computed and passed as parameters to the Okapi formula.
The returned node weight is stored in a table, e.g.:
Table 4.3: Cluster nodes and cluster weights

  Cluster No   Nodes        Cluster Weight
  Cluster_1    1, 22, 1, 32   2.4
  Cluster_2    9, 17, 24      2.5
  Cluster_3    34, 12, 10     0
  Cluster_4    4, 14, 23      0
Module 7: Generating Closure Graph and Finding Minimal Clusters
The closure graph contains the minimal clusters. Minimal clusters are the clusters that show
non-zero weight with the input query. In the above example (Table 4.3), only Cluster_1 and
Cluster_2 are minimal clusters. The minimal clusters are the clusters that appear in the result.
Module 8: Result
After getting the minimal clusters, the result can be displayed in two ways:
Top 1 Result Summary
Multi-Result Summary
In the top-1 result summary, the minimal cluster having the highest weight with the input query is
returned; in the multi-result summary, all the minimal clusters are returned as the result. Before a
cluster is displayed as a result, it is split into its nodes and the weight of every node with respect
to the input query is calculated. The nodes are displayed in decreasing order of weight with the
input query, i.e. the node having the highest weight is displayed at the top and the lowest at the
bottom.
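As an illustrative Python sketch (the project itself is C#/.NET, and these names are hypothetical), minimal-cluster selection and node ordering might look like this, where `cluster_weights` maps each cluster to its query weight and `node_weight` maps a node to its query weight:

```python
def summarize(cluster_weights, node_weight, nodes_of, top1=False):
    """Keep the minimal clusters (non-zero query weight), order them by
    weight, and order each cluster's nodes by their own query weight,
    highest first; top1=True returns only the best cluster."""
    minimal = [c for c, w in cluster_weights.items() if w > 0]
    minimal.sort(key=cluster_weights.get, reverse=True)
    if top1:
        minimal = minimal[:1]           # top-1 summary: best cluster only
    return [(c, sorted(nodes_of[c], key=node_weight, reverse=True))
            for c in minimal]
```

With the Table 4.3 weights, Cluster_3 and Cluster_4 (weight 0) are dropped and Cluster_2 (2.5) is ranked above Cluster_1 (2.4).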
Chapter 5
Testing
5.1 Testing Strategies
Testing is an important phase in the Software Development Life Cycle. Testing should be
planned and conducted systematically.
Generic aspects of a test strategy
1. Testing begins at the module level and works outwards.
2. Different testing techniques are used at different points of time.
3. Testing is done by developers and mainly for larger projects, by an independent test group.
4. Testing and debugging are two different activities, but debugging should be incorporated
into any testing strategy.
5.2 Testing Techniques
5.2.1 Black box testing
Black box testing focuses on the functional requirements of the software. That is, black box
testing enables the software engineer to derive sets of input conditions that will fully exercise all
functional requirements for a program. Black box testing is not an alternative to white box testing
techniques; rather it is a complementary approach that is likely to uncover a different class of
errors than white box methods.
Black box testing attempts to find errors in the following categories
1. Incorrect or missing functions.
2. Interface errors.
3. Errors in data structures or external database access.
4. Performance errors.
5. Initialization and termination errors
Unlike white box testing, which is performed early in the testing process, black box testing
tends to be applied during later stages of testing. Because black box testing purposely disregards
control structure, attention is focused on the information domain.
5.2.2 White box testing
Using white box testing methods, the software engineer can derive test cases that can
1. Guarantee that all independent paths within a module have been exercised at least once.
2. Exercise all logical decisions on their true and false sides.
3. Execute all loops at their boundaries and within their operational bounds
4. Exercise internal data structures to assure their validity.
The need for white box testing arises for several reasons:
1. Errors tend to creep into our work when we design and implement functions, conditions or
controls that are out of the mainstream.
2. We often believe that a logical path is not likely to be executed when, in fact, it may be
executed on a regular basis.
3. Typographical errors are random. When a program is translated into programming
language source code, it is likely that some typing errors will occur.
We used both white box testing and black box testing. Black box testing is also called behavioral
testing; it focuses on the functional requirements of the software. In this testing the software is
tested as a black box, without considering its internal details. The required sets of inputs were
supplied and the desired outputs were obtained.
5.3 Test Cases:
Table 5.1: Test cases

TC1: HTML file
  Description: validation of the input file.
  Steps: 1. Enter a correct text file containing at least 2 paragraphs. 2. Enter an incorrect folder in which the web pages are stored and click "set dataset folder".
  Expected result: the text file is accepted; nodes are created and the data of each node is shown. For the incorrect input, the error message "Other format file than text will not be uploaded" appears.
  Actual result: as expected.

TC2: Changing threshold for clustering
  Description: check how the threshold value affects the cluster size and the performance of the algorithm.
  Steps: select the threshold value for clustering.
  Expected result: the cluster size increases and the number of clusters decreases; due to the large cluster size, the looping in the nearest neighbour algorithm increases, so its performance decreases.
  Actual result: the cluster size increased, the number of clusters decreased, and the performance decreased as expected.

TC3: Topmost summary
  Description: check whether the first result in the output is the best result.
  Steps: after clustering, get the minimal clusters and produce the summary as the result.
  Expected result: the cluster containing the best result for the fired query appears at the top, and within that cluster the node containing the best result comes first.
  Actual result: as expected.

TC4: Similarity calculation
  Description: check the similarity between two clusters.
  Steps: provide two clusters as input.
  Expected result: similar clusters are grouped together.
  Actual result: as expected.

TC5: Weight calculation
  Description: calculate the weight between two nodes.
  Steps: provide two nodes as input.
  Expected result: the weight between the two nodes is returned.
  Actual result: as expected.

TC6: Removal of common words
  Description: check the effect of removing common words.
  Steps: provide a text file with and without common words.
  Expected result: after removing the common words, fewer words remain, so it is easier to build the document graph and find similar nodes; hence the performance of the system increases.
  Actual result: as expected.

TC7: GUI
  Description: alignment of controls; the colour of all buttons should be uniform, and all textboxes should be aligned in a straight line.
  Expected result: textboxes properly aligned; button colours uniform; controls aligned.
  Actual result: as expected.
5.4 User Interface (Screenshots)
The basic user interface consists of at least three windows. The first window is used to input
the text file; the user must give the input as a text file only.
The second interface displays the different clustering techniques; here the threshold for
clustering is also taken as input. The user has to select one of the two clustering algorithms and
give a threshold value for clustering.
The third interface displays the query and takes the percentage of correlation of a cluster with
the query.
5.4.1 Before uploading the HTML file(s).
Figure 5.1: Before uploading the HTML files
5.4.2 Uploading HTML file(s).
Figure 5.2: Uploading HTML file(s).
5.4.3: Browsing the HTML file(s).
Figure 5.3: Browsing the HTML file(s).
5.4.4: After browsing the HTML file(s).
Figure 5.4: After browsing the HTML file(s).
5.4.5: Processing HTML file(s) and Display node relations.
Figure 5.5: Processing HTML file(s) and displaying node relations.
5.4.6: Before clustering of nodes.
Figure 5.6: Before clustering of nodes.
5.4.7: Clusters formation and building clustered graph
Figure 5.7: Clusters formation and building clustered graph
5.4.8: Taking input query and thresholds for minimal combination of clusters in %.
Figure 5.8: Taking input query and thresholds for minimal combination of clusters in %.
5.4.9: Display minimal cluster as result along with link to actual web page(s).
Figure 5.9: Display minimal cluster as result along with link to actual web page(s).
5.4.10: Display actual web page(s) we are currently dealing with and highlighting the output
data.
Figure 5.10: Display actual web page(s) we are currently dealing with and highlighting the output
data.
Chapter 6
Future Scope
A sentence ordering module can be used to define an ordering among the selected topic sentences. Another important aspect is that the system can be tuned to generate a summary of a custom size specified by the user. The system can also generate summaries for non-English documents, provided some simple resources for the language are available. In future work, we will use a dictionary to add the synonyms of the query words, as well as of the keywords, as extra keywords when searching for relevant information, which will improve the quality of the summary.
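The proposed dictionary-based query expansion could be sketched as follows. This is only an illustrative sketch: the small SYNONYMS map and the expand_query function are hypothetical stand-ins, not part of the implemented system; in practice a full lexical resource such as WordNet would supply the synonyms.

```python
# Hypothetical synonym dictionary; a real system would query a lexical
# resource such as WordNet instead of this hand-built stand-in.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "attack": ["assault", "strike"],
}

def expand_query(query_words):
    """Return the original query words plus any known synonyms,
    preserving order and skipping duplicates."""
    expanded = []
    for word in query_words:
        if word not in expanded:
            expanded.append(word)
        for syn in SYNONYMS.get(word.lower(), []):
            if syn not in expanded:
                expanded.append(syn)
    return expanded
```

The expanded keyword list would then be fed to the existing search step, so that sentences mentioning "automobile" can match a query for "car".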
In the news domain, the update summary is a very important and useful concept. On the same news topic, new or updated articles arrive every day, or even every hour. A reader who has already read an earlier article on the topic will not be interested in reading the whole article again; he or she will want to know only the updates. With the help of an update summary, a reader can follow and track the news very easily. The system can be extended to produce such update summaries as well.
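The update-summary idea could work roughly as follows: keep only those sentences of the new article that are not already covered by the article the reader has seen. This is a minimal sketch under stated assumptions; the word-overlap (Jaccard) similarity and the threshold value are illustrative choices, not the system's actual measures.

```python
def update_summary(old_sentences, new_sentences, threshold=0.6):
    """Keep only sentences of the new article that are not already
    covered (by word overlap) in the previously read article."""
    def jaccard(a, b):
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

    updates = []
    for sent in new_sentences:
        # A sentence counts as "new" only if it is dissimilar to
        # every sentence the reader has already seen.
        if all(jaccard(sent, old) < threshold for old in old_sentences):
            updates.append(sent)
    return updates
```

Sentences repeated verbatim from the earlier article are filtered out, while genuinely new information survives into the update summary.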
Conclusion
This work deals specifically with generating summaries for the news domain, where a summary is a very important and useful concept. On the same news topic, new or updated articles arrive every day, or even every hour, so one cannot go through all the newspapers and every single article; the reader wants only the summary. With the help of a news summary, a reader can follow and track the news very easily.
As the system provides query-dependent news summaries, the user can easily obtain a news summary according to his or her interest.
Because we deal directly with HTML pages, the user can retrieve online news and directly get a summary of it. In this work we present a graph-based approach to query-dependent multi-document summarization, combined with a nearest-neighbor clustering technique, which works efficiently for news summarization.
Because of multi-document news summarization, there is no need to go through all the newspapers. As the summarization is query-specific, the user can easily obtain a news summary matching his or her interest. Furthermore, the accuracy of the result depends on the initial edge threshold, the cluster threshold, and the result accuracy percentage, so the user can control the relevance of the output.
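As a rough illustration of how a cluster threshold controls the grouping, the following sketch performs greedy nearest-neighbor clustering of sentences. The Jaccard word-overlap similarity and the threshold value here are illustrative assumptions standing in for the system's actual graph edge weights; a lower threshold merges sentences into fewer, broader clusters, while a higher threshold yields many small clusters.

```python
def nn_cluster(sentences, cluster_threshold=0.3):
    """Greedy nearest-neighbor clustering: each sentence joins the first
    existing cluster containing a sufficiently similar member, otherwise
    it starts a new cluster."""
    def jaccard(a, b):
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

    clusters = []
    for sent in sentences:
        placed = False
        for cluster in clusters:
            if max(jaccard(sent, m) for m in cluster) >= cluster_threshold:
                cluster.append(sent)
                placed = True
                break
        if not placed:
            clusters.append([sent])
    return clusters
```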
Bibliography
References:
[1] R. Varadarajan, V. Hristidis, T. Li: Beyond Single-Page Web Search Results. IEEE TKDE, 2008 (journal paper).
http://pages.cs.wisc.edu/~ramkris/Document_Summarization.pdf
[2] R. Varadarajan, V. Hristidis: A System for Query-Specific Document Summarization. CIKM '06, November 5-11, 2006, Arlington, Virginia, USA. Copyright 2006 ACM 1-59593-433-2/06/0011.
[3] R. Varadarajan, V. Hristidis: Structure-Based Query-Specific Document Summarization. Poster paper at CIKM 2005.
[4] G. Bhalotia, C. Nakhe, A. Hulgeri, S. Chakrabarti and S. Sudarshan: Keyword Searching and Browsing in Databases using BANKS. ICDE, 2002.
[5] M. White, T. Korelsky, C. Cardie, V. Ng, D. Pierce, and K. Wagstaff: Multidocument Summarization via Information Extraction. HLT, 2001.
[6] S. Paladhi, S. Bandyopadhyay: A Document Graph Based Query Focused Multi-Document Summarizer. 2008.
http://www.sivajibandyopadhyay.com/pinaki/papers/paclic24_SB_PB.pdf
[7] G. Bhalotia, C. Nakhe, A. Hulgeri, S. Chakrabarti and S. Sudarshan: Keyword Searching and Browsing in Databases using BANKS. ICDE, 2002.
[8] R. Mihalcea: Graph-based ranking algorithms for sentence extraction, applied to text summarization. In Proceedings of the ACL 2004 Interactive Poster and Demonstration Sessions, Morristown, NJ, USA, p. 20. Association for Computational Linguistics, 2004.
Books referred:
[1] Parul Agarwal, M. Afshar Alam, Ranjit Biswas: Analyzing the Hierarchical Clustering Algorithm for Categorical Attributes. International Journal of Innovation, Management and Technology, Vol. 1, No. 2, June 2010. ISSN: 2010-024.
http://www.ijimt.org/papers/34-K033.pdf
[2] Professional ASP.NET 3.5. Authors: Bill Evjen, Scott Hanselman, Devin Rader. Chapter 2: pp. 63-10; Chapter 20: pp. 929.
[3] ASP.NET Website Programming. Chapter 1: pp. 15-38.
[4] C# 2008 Programmer's Reference. Author: Wei-Meng Lee.
[5] H. Hardy, N. Shimizu, T. Strzalkowski, L. Ting, G. B. Wise, X. Zhang: Cross-document summarization by concept classification. SIGIR, 2002, pp. 65-69.
[6] R. Barzilay and M. Elhadad: Using lexical chains for text summarization. In Mani and Maybury (1999), pp. 111-121.