MDSN Report
Transcript of MDSN Report
-
7/31/2019 MDSN Report
1/59
Chapter 1
Introduction
Currently, the World Wide Web is the largest source of information. A huge amount of data is present on the Web, and more is added constantly. Users search for the required information using particular keywords. We deal specifically with news. Given the large number of news articles available on the Web, there is a corresponding need to provide high-quality summaries that allow the user to quickly locate the desired information, i.e., to get a summary of different news items from a variety of newspapers about the same topic, as per the query specification.
Summarization is the process of condensing a source text into a shorter version while preserving its
information content. It can serve several goals, from a survey analysis of a scientific field to quick
indicative notes on the general topic of a text. In other words, summarization is the process of
automatically creating a compressed version of a given text that provides useful information for the
user. The information content of a summary depends on the user's needs. Topic-oriented summaries
focus on a user's topic of interest, and extract the information in the text that is related to the
specified topic. Indicative summaries, which can be used to quickly decide whether a text is worth
reading, are naturally easier to produce.
Query-oriented summarization (QS) tries to extract a summary for a given query. It is a
common task in many text mining applications. For example, a user submits a query to a search
engine, and the search engine usually returns many result documents. Clicking and viewing each of
the returned documents is obviously tedious and in many cases infeasible. One challenging issue is
how to help the user digest the returned documents. Typically, the documents discuss different
perspectives of the query. An ideal solution might be a system that automatically generates a
concise and informative summary for each perspective of the query. Much work has been done on
document summarization. Generally, document summarization can be classified into three
categories:
1. Single document summarization (SDS)
2. Multi-document summarization (MDS)
3. Query oriented summarization (QS)
SDS extracts a summary from a single document, while MDS extracts a summary from multiple
documents. The two tasks have been intensively investigated and many methods have been
Multi-Document Extractive Summarization for News Page 1 of 59
proposed. The methods for document(s) summarization can be further categorized into two groups:
unsupervised and supervised. The unsupervised method is mainly based on scoring sentences in the
documents by combining a set of predefined features.
In the supervised method, summarization is treated as a classification or a sequential
labeling problem and the task is formalized as identifying whether a sentence should be included in
the summary or not. However, the method requires training examples. Query-oriented
summarization (QS) is different from the SDS and the MDS tasks. The document cluster denotes
the information source and the query denotes the information need. A document cluster is a subset
of the entire document collection. A compelling application of document summarization is the
snippets generated by Web search engines for each query result, which assist users in further
exploring individual results. The Information Retrieval (IR) community has largely viewed text
documents as linear sequences of words for the purpose of summarization. Although this model
has proven quite successful in efficiently answering keyword queries, it is clearly not optimal since
it ignores the inherent structure in documents.
Furthermore, most summarization techniques are query-independent and follow one of the
following two extreme approaches:
1. They simply extract relevant passages, viewing the document as an unstructured set of
passages; or
2. They employ Natural Language Processing techniques.
The former approach ignores the structural information of documents while the latter is too
expensive for large datasets (e.g., the Web) and sensitive to the writing style of the documents.
Here we discuss a method to add structure, in the form of a graph, to text documents in order to
allow effective query-specific summarization. That is, a document is viewed as a set of
interconnected text fragments.
Our main focus is on keyword queries, since keyword search is the most popular information
discovery method on documents because of its power and ease of use. This technique has the
following key steps: First, at the preprocessing stage, a structure is added to every document,
which can then be viewed as a labeled, weighted graph, called the document graph. Then, at query
time, given a set of keywords, a keyword proximity search is performed on the document graphs to
discover how the keywords are associated in the document graphs. For each document its summary
is the minimum spanning tree on the corresponding document graph that contains all the keywords
(or equivalent based on a thesaurus). So data from the minimum spanning tree nodes is collected
and presented as a summary of the document.
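The keyword-proximity step just described can be sketched in a few lines. The project itself is implemented in C#/.NET, so the Python below is only an illustrative sketch; the node texts, edge weights, and function name are assumptions. It builds a spanning tree with Prim's algorithm and then repeatedly prunes keyword-free leaves, a simple approximation of the minimum tree connecting all keyword-bearing nodes.

```python
# Illustrative sketch (not the project's C#/.NET code): build a spanning
# tree over the document graph with Prim's algorithm, then prune leaves
# that contain no query keyword, leaving the subtree that connects the
# keyword-bearing fragments. nodes: {id: text}, edges: {(u, v): weight}.

def mst_summary(nodes, edges, keywords):
    ids = list(nodes)
    in_tree = {ids[0]}
    adj = {u: set() for u in ids}      # adjacency of the tree built so far
    while len(in_tree) < len(ids):     # the graph is assumed connected
        # Cheapest edge with exactly one endpoint inside the tree.
        w, u, v = min((w, u, v) for (u, v), w in edges.items()
                      if (u in in_tree) != (v in in_tree))
        in_tree.add(v if u in in_tree else u)
        adj[u].add(v)
        adj[v].add(u)

    def has_keyword(i):
        text = nodes[i].lower()
        return any(k.lower() in text for k in keywords)

    # Repeatedly drop keyword-free leaves until only the connecting
    # subtree remains.
    keep, changed = set(ids), True
    while changed:
        changed = False
        for i in list(keep):
            degree = sum(1 for j in adj[i] if j in keep)
            if degree <= 1 and not has_keyword(i):
                keep.discard(i)
                changed = True
    return [nodes[i] for i in sorted(keep)]
```

The document's summary is then the text of the surviving nodes, in document order.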
Automatic summarization is the creation, by a computer program, of a shortened version of a text
that contains the important information of the original documents.
1.1 History:
1. 1950s: First systems, surface-level approaches
Term frequency (Luhn, Rath)
2. 1960s: First entity-level approaches
Syntactic analysis
Surface level: location features (Edmundson 1969)
3. 1970s:
Surface level: cue phrases (Pollock and Zamora)
Entity level
First discourse level: story grammars
4. 1980s:
Entity level (AI): use of scripts, logic and production rules, semantic networks (Dejong 1982,
Fum et al. 1985)
Hybrid (Aretoulaki 1994)
5. From the 1990s on: explosion of all approaches
1.2 Literature survey:
1.2.1 Aim:
Our aim is to achieve multi-document news summarization.
1. We parse the HTML document(s) and extract the text file(s) from them. As we are dealing
with text only, we have chosen the nearest-neighbor algorithm for clustering, since it is less
complex and sufficient for text.
2. We produce an extractive summary along with the query specification.
1.2.2 Extractive and Abstractive Summarization
Extractive Summarization:
Produces a summary by selecting indicative sentences, passages or paragraphs from an original
document according to a predefined target summarization ratio.
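As a toy illustration of extractive selection at a fixed ratio, the sketch below scores sentences by average word frequency and keeps the top fraction. The scoring scheme and function name are assumptions for illustration; the project's own selection criteria differ, and the project itself is implemented in C#/.NET.

```python
# Minimal extractive summarizer: score sentences by the frequency of
# their words across the whole document, then keep the top fraction
# given by the target summarization ratio. The frequency-based scoring
# is an illustrative choice, not this project's actual criterion.
from collections import Counter

def extractive_summary(sentences, ratio=0.3):
    words = [w.lower() for s in sentences for w in s.split()]
    freq = Counter(words)

    def score(s):
        ws = s.lower().split()
        return sum(freq[w] for w in ws) / max(len(ws), 1)

    k = max(1, round(len(sentences) * ratio))
    top = sorted(sentences, key=score, reverse=True)[:k]
    # Present the selected sentences in their original document order.
    return [s for s in sentences if s in top]
```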
Abstractive summarization:
Provides a fluent and concise abstract of a certain length that reflects the key concepts of the
document. This requires highly sophisticated techniques, including semantic representation and
inference, as well as natural language generation. In recent years, researchers have tended to focus
on the extractive end of the text summarization research spectrum.
1.2.3 A System for Query-Specific Document Summarization
Mr. Ramakrishna Varadarajan & Vangelis Hristidis presented a method to create query specific
summaries by identifying the most query-relevant fragments and combining them using the
semantic associations within the document. In particular, structure is added to the documents in the
preprocessing stage and they are converted to document graphs. Then, the best summaries are
computed by calculating the top spanning trees on the document graphs. This paper presents and
experimentally evaluates efficient algorithms that support computing summaries in interactive
time. Furthermore, the quality of the summarization method is compared to current approaches
using a user survey.
In this work a structure-based technique is presented to create query-specific summaries
for text documents. In particular, the document graph of a document is created to represent the
hidden semantic structure of the document, and keyword proximity search is then performed on this
graph. The paper shows, through a user survey, that the approach performs better than other
state-of-the-art approaches. Furthermore, the feasibility of the approach is demonstrated with a
performance evaluation.
In that approach a document graph was built and processing was done on text documents. We
implement a somewhat similar methodology, but in addition an HTML-to-text parser is added,
i.e., we process HTML files.
1.2.4 An Incremental Summary Generation System
Mr. C Ravindranath Chowdary & P Sreenivasa Kumar presented an algorithm that finds a
pair of sentences, one from the current summary and the other from a new document, that are
swapped to improve the quality of the summary. For a given query, the quality of a summary is
determined by its informativeness, coherence and completeness. A scoring function that captures
these features to calculate the quality of a summary is proposed. The process of
updating/improving the summary continues iteratively until the improvement in the quality measure
becomes negligible. Experimental results, both qualitative and quantitative, show that the
performance of the proposed approach for incremental summary generation is quite encouraging.
This paper deals with updating the available extractive summary in the scenario where the
initial documents used for summarization are not accessible. The proposed algorithm updates the
available summary as and when a new document is made available to the system.
In that approach extractive summarization is used, but the original documents are not
accessible. We also deal with extractive summarization, but the original documents are accessible;
moreover, a highlighting feature is added for the convenience of the user.
1.2.5 Automatic Text Summarization
Mr. Mohamed Abdel Fattah & Fuji Ren investigate the effect of each sentence feature on the
summarization task. They then used the combined feature score function to train genetic algorithm
(GA) and mathematical regression (MR) models to obtain a suitable combination of feature weights.
The performance of the proposed approach is measured at several compression rates on a data
corpus composed of 100 English religious articles. The results of the proposed approach are promising.
This paper investigates the use of a genetic algorithm (GA) and mathematical regression (MR)
for the automatic text summarization task. This new approach is applied to a sample of 100 English
religious articles, and its results outperform those of the baseline approach. The approach uses
feature extraction criteria that give researchers the opportunity to use many varieties of these
features, depending on the language and text type used.
In that approach the algorithm used for summarization was a genetic algorithm, while we
use the nearest-neighbor algorithm; moreover, it is an automatic summarization strategy, while we
use an extractive summarization strategy.
1.2.6 Multi-topic based Query-oriented Summarization
Mr. Jie Tang, Limin Yao & Dewei Chen try to break the limitations of the existing methods and
study a new setup of the problem of multi-topic based query-oriented summarization. More
specifically, the paper proposes two strategies to incorporate the query information into a
probabilistic model. Experimental results on two different genres of data show that the proposed
approach can effectively extract a multi-topic summary from a document collection, and its
summarization performance is better than that of baseline methods. The approach is quite general
and can be applied to many other mining tasks, for example product opinion analysis and question
answering.
This paper investigates the problem of multi-topic based query-oriented summarization.
The paper formalizes the major tasks and proposes a probabilistic approach to solve the tasks. Two
strategies are studied for simultaneously modeling document contents and the query information.
We also deal with query-oriented multi-document summarization, and we have specialized it
for news.
1.2.7 Proposed modules:
1. HTML to text parser
2. Processing the input text file and creating the document graph
3. Adding weighted edges to document graph
4. Document Clustering
5. Create clustered document graph
6. Adding weight to nodes in clustered document graph
7. Generate closure graph and find minimal clusters
8. Result
1.2.8 Clustering:
Clustering is one of the most important unsupervised learning processes; it organizes
objects into groups whose members are similar in some way. Clustering finds structure in a
collection of unlabeled data. A cluster is a collection of objects which are similar to one another
and dissimilar to the objects belonging to other clusters.
1.2.8.1 Uses of Clustering:
If a collection is well clustered, we can search only the cluster that will contain relevant
documents.
Searching a smaller collection should improve effectiveness and efficiency.
1.2.8.2 Nearest Neighbour Algorithm
1. Nearest Neighbor Algorithm is an agglomerative approach (bottom-up).
2. Starts with n nodes (n is the size of our sample), merges the 2 most similar nodes at each
step, and stops when the desired number of clusters is reached.
Step 1: Nearest Neighbor, Level 2, k = 7 clusters.
Step 2: Nearest Neighbor, Level 3, k = 6 clusters.
Step 3: Nearest Neighbor, Level 4, k = 5 clusters.
Step 4: Nearest Neighbor, Level 5, k = 4 clusters.
Step 5: Nearest Neighbor, Level 6, k = 3 clusters.
Step 6: Nearest Neighbor, Level 7, k = 2 clusters.
Step 7: Nearest Neighbor, Level 8, k = 1 cluster.
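The merge sequence illustrated in the steps above can be sketched as follows. This is an illustrative Python sketch (the project itself is implemented in C#/.NET), and the word-overlap similarity is an assumed stand-in for whatever similarity measure the nodes actually use: start with one cluster per node and repeatedly merge the two most similar clusters until k clusters remain.

```python
# Sketch of the nearest-neighbor (agglomerative, bottom-up) clustering
# described above. Word-overlap similarity and single-link merging are
# illustrative assumptions, not this project's exact measures.

def word_overlap(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def nearest_neighbor_clusters(texts, k):
    clusters = [[t] for t in texts]        # one cluster per node to start

    def sim(c1, c2):
        # single-link: similarity of the closest pair of members
        return max(word_overlap(a, b) for a in c1 for b in c2)

    while len(clusters) > k:
        # Find the two most similar clusters and merge them.
        _, i, j = max((sim(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```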
1.3 Advantages:
1. Because of multi-document news summarization, there is no need to go through all the
newspapers.
2. As we are dealing with query-specific summarization, the user can easily get a news summary
according to his/her interest.
3. The accuracy of the result depends upon the initial edge threshold and cluster threshold, as
well as the result accuracy percentage, so the user can control the relevance.
1.4 Limitations:
1. Processing text files larger than 50 KB, or with more than 200 paragraphs, takes too long
due to heavy computational loops.
2. The text in the images or in the Flash contents cannot be parsed.
3. The text accessed through web services cannot be parsed.
Chapter 2
System Requirement and Specification
2.1 Scope of the Project
This project creates a query-dependent summary generated by a clustering algorithm; here
we have used the nearest-neighbor clustering algorithm. Since every file format can be converted
into a text file, the algorithm is applied to text files. Nodes in the text file, i.e., the contents of
every new line, are clustered, and a query-dependent summary is generated.
2.2 Requirement Specifications
Requirements are the desired characteristics of the software being developed. The first
activity in most projects is the identification and documentation of the requirements. Requirements
cover both requirements engineering (identification, analysis and capture) and requirements
management (managing change, creating and maintaining agreement with customers, traceability
and metrics).
The development of large, complex systems presents many challenges to systems
engineers. Foremost among these is the ability to ensure that the final system satisfies the needs of
users and provides for easy maintenance and enhancement of these systems during their deployed
lifetime. These systems often change and evolve throughout their life cycle. This makes it difficult
to track the implemented system against the original and evolving user requirements.
Requirements establish an understanding of users' needs and also provide the final yardstick
against which implementation success is measured. Various studies have shown that roughly half
of application errors can be traced to requirement errors and deficiencies. Thorough
documentation and proper management of requirements are the keys to developing quality
applications. By allowing project teams to define and document requirement data, including
user-defined attributes, priority, status, acceptance criteria and traceability, detection and
correction of missing, contradictory or inadequately defined requirements can be done. The
following requirements and constraints were considered during the requirement analysis phase.
For clustering we have to take a text file as input; a file in any other format is first converted into
text format. Then the nodes in the text document are clustered and query-dependent
summarization is performed.
2.2.1 Product performance requirements
The input file must be a text file. As the size of the text document increases, the performance of
the algorithms degrades, so larger files call for better hardware facilities.
2.2.2 Hardware Requirements
Processor : Pentium IV or higher.
Ram : Minimum 256 MB.
Hard Disk : 40 GB.
Input device : Standard Keyboard and Mouse.
Output device : VGA and High Resolution Monitor.
2.2.3 Software Requirements
Software components required for building the Project are:
Operating System : WINDOWS XP or above
Technique : Microsoft Visual Studio 2010 (.NET Framework 3.5)
Internet explorer.
2.3 Functional requirements
The project includes the following modules:
HTML to text parser: Processes the input HTML files, parsing the HTML contents and extracting
the text lines.
Uploading and processing: In this module a text file is uploaded and processed. Every
single line is considered a node, and the data within that node is displayed.
Building document graph: The weight between every node and every other node is
calculated.
Clustering and making clustered graph: The nearest-neighbor method and agglomerative
hierarchical clustering technique are used to cluster the document graph from the previous step,
and a clustered graph is prepared.
Query firing and getting minimal cluster: Here we fire the query and find the minimal
cluster. The minimal cluster is the cluster which contains part of the fired query. From it we get
the summary as the result.
2.4 Feasibility Study
Not everything imaginable is feasible; therefore it is necessary to evaluate the feasibility of a
project at the earliest stage.
The software feasibility has 3 solid dimensions:
2.4.1 Technology:
Technical feasibility is the study of functions, performance, and constraints that may affect the
ability to achieve an acceptable system. This project is technically feasible to implement. The user
does not require any extra hardware or any higher-end technology. The software can execute on a
single client machine operating on a WINDOWS XP or a higher version of Operating System.
2.4.2 Finance:
Financial feasibility is the evaluation of the development cost weighed against the ultimate
income or benefits derived from the developed system. The resources required for the system are
easily available. The system is developed basically for study purposes, so economic feasibility is
not a major issue.
This project is financially feasible because the software does not require any extra hardware or any
additional supporting technology, which adds no extra cost to the software; the only cost is for
development.
2.4.3 Resources:
The organization that wishes to implement this system requires only one or more machines.
Thus no additional resources are required to implement the system, and the software is also
resource-feasible.
2.5 .NET Framework :
.NET Framework is designed for cross-language compatibility. Cross-language
compatibility means .NET components can interact with each other irrespective of the languages
they are written in. An application written in VB .NET can reference a DLL file written in C# or a
C# application can refer to a resource written in VC++, etc. This language interoperability extends
to Object-Oriented inheritance. This cross-language compatibility is possible due to common
language runtime.
2.5.1 .NET Framework Advantages:
The .NET Framework offers a number of advantages to developers.
Different programming languages have different approaches for doing a task. For example,
accessing data with a VB 6.0 application and a VC++ application is totally different. When using
different programming languages to do a task, a disparity exists among the approaches developers
use to perform the task. The difference in techniques comes from how different languages interact
with the underlying system that applications rely on. With .NET, for example, accessing data with
a VB .NET application and a C# .NET application looks very similar apart from slight syntactical
differences. Both programs need to import the System.Data namespace, both establish a connection
with the database, and both run a query and display the data in a data grid.
.NET v/s Java :
Java is one of the greatest programming languages created. Java doesn't have a visual interface,
and it requires us to write heaps of code to develop applications. With .NET, on the other hand,
the Framework supports around 20 different programming languages, letting developers focus
only on business logic and leaving all other aspects to the Framework.
Visual Studio .NET comes with a rich visual interface and supports drag and drop. Many
applications were developed, tested and maintained to compare the differences between .NET and
Java, and the end result was that a given application developed using .NET requires fewer lines of
code, less development time and lower deployment costs, along with other important benefits.
Personally, I don't mean to say that Java is gone or that .NET-based applications are going to
dominate the Internet, but I think .NET definitely has an extra edge, as it is packed with features
that simplify application development.
2.5.2 Main features of C#:
C# was developed as a language that would combine the best features of previously
existing Web and Windows programming languages. Many of the features of the C# language
pre-existed in various languages such as C++, Java, Pascal, and Visual Basic.
Main features:
1. C# is a simple, modern, object-oriented language derived from C++ and Java.
2. It combines the high productivity of Visual Basic with the raw power of C++.
3. It is a part of Microsoft Visual Studio 7.0.
4. Visual Studio supports VB, VC++, C++, VBScript, and JScript. All of these languages
provide access to the Microsoft .NET platform.
5. .NET includes a common execution engine and a rich class library.
6. Microsoft's JVM equivalent is the Common Language Runtime (CLR).
7. The CLR accommodates more than one language, such as C#, VB.NET, JScript, ASP.NET,
and C++.
8. Source code ---> Intermediate Language code.
9. The classes and data types are common to all of the .NET languages.
10. We may develop console applications, Windows applications, and Web applications
using C#.
11. In C#, Microsoft has taken care of C++ problems such as memory management,
pointers, etc.
12. It supports garbage collection, automatic memory management and more.
Here is a list of some of the primary characteristics of C# language.
Modern and Object Oriented
Simple and Flexible
Type safety
Interoperability
Scalable and Updateable
2.6 Risk Management
The software development process is inherently subject to risks, the consequences of which
are manifested as financial failures (time-scale overrun, budget overrun) and technical failures
(failure to meet required functionality, reliability or maintainability). The objectives of risk
management are to identify, analyze and prioritize risk items before they become either
threats to successful operation or major sources of expensive software rework; to establish a
balanced and integrated strategy for eliminating or reducing the various sources of risk; and to
monitor and control the execution of that strategy.
2.7 Data Flow Diagrams
Data Flow Diagrams serve two purposes:
1. To provide an indication of how data is transformed as it moves through the system.
2. To depict the functions that transform the data flow.
The DFD provides additional information that is used during the analysis of the
information domain and serves as a basis for the modeling of function. A description for each
function presented in the DFD is contained in a process specification.
As information moves through software, it is modified by a series of transformations. A
data flow diagram is a graphical representation of information flow and transforms that are applied
as data moves from input to output. The basic form of data flow diagram is also known as data
flow graph or bubble chart.
The data flow diagram may be used to represent a system or software at any level of abstraction. In
fact, DFDs may be partitioned into levels that represent increasing information flow and functional
detail. Therefore, the DFD provides a mechanism for functional modeling as well as information
flow modeling.
Figure 2.1 DFD (Level 0)
[Figure 2.1 shows: an input HTML file and a query flow into the SYSTEM, which outputs a summary of the HTML file.]
The data flow diagram (Level 1) provides more detail than the Level 0 diagram. It represents the
information flow and the transforms that are applied as data moves from input to output.
[Figure 2.2 shows: input HTML file(s) → HTML-to-text converter → uploading and processing the input file → building the document graph; the clustering algorithm, threshold, and input query drive the later stages.]
Figure 2.2 DFD (Level 1)
Chapter 3
Design and analysis
3.1 Design Overview
A specialist has to check the data flow and manually trace where the data flows. Analysts
have shown that it would take considerable time even for an experienced specialist to trace the
data flow.
We accept HTML/text files only. The contents of each new line form a node, and hence a
single cluster. If there are no newline-separated contents, then only one node, and hence only one
cluster, will exist. This degrades the performance of the algorithms, as the cluster size is very big.
3.2 Software Architecture:
Figure 3.1 Architecture Diagram
Figure 3.1 shows the architecture diagram of the system. As shown in the figure, there are five
main blocks: a block for HTML-to-text conversion; a block for uploading and processing the
HTML file(s) by parsing the text from the HTML document and making the document graph; a
block for clustering and making the clustered graph; a block for making the weighted clustered
document graph; and a final block for generating the summary for the fired query.
Block 1: HTML to Text conversion:
This block accepts the input files in HTML form and converts them into text files. After
conversion from HTML to text, the text file is passed to the next block as input.
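Block 1 can be sketched with a standard-library HTML parser. The project implements this block in C#/.NET, so the Python below is only an illustrative equivalent: it strips tags, skips script and style content, and keeps the visible text lines.

```python
# Illustrative HTML-to-text converter using Python's standard library
# (the project's actual parser is written in C#/.NET). Visible text is
# collected line by line; <script> and <style> content is skipped.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.lines = []
        self._skip = 0          # depth inside <script>/<style> elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip:
            self.lines.append(text)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.lines)
```

Each extracted line can then become one node of the document graph in the next block.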
Block 2: Processing input file and generating document graph:
This block accepts text files only. It is responsible for uploading the text file and processing it,
i.e., forming a node for the contents of every new line. It is also responsible for generating the
weight from each node to every other node.
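Block 2's node-to-node weighting can be sketched as below; this is an illustrative Python sketch (the project is implemented in C#/.NET), and the Jaccard word-overlap weight is an assumption about how "weight from each node to every other node" might be computed.

```python
# Sketch of Block 2: each line of the text file becomes a node, and a
# weight is computed between every pair of nodes. Jaccard word overlap
# is used here as an illustrative weight; the project's actual
# weighting formula may differ.

def build_document_graph(lines):
    nodes = {i: line for i, line in enumerate(lines)}
    edges = {}
    for i in nodes:
        for j in nodes:
            if i < j:
                wi = set(nodes[i].lower().split())
                wj = set(nodes[j].lower().split())
                union = wi | wj
                edges[(i, j)] = len(wi & wj) / len(union) if union else 0.0
    return nodes, edges
```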
Block 3: Clustering node and building clustered graph:
This block is responsible for choosing one of two clustering algorithms. It also accepts the
threshold, so that the similarity between clusters can be checked up to that level. It is responsible
for forming the clusters.
Block 4: Creating weighted document clustered graph:
This block is responsible for accepting the fired query. It checks the similarity between the query
contents and the contents of the clusters, and then builds the weighted clustered document graph.
Block 5: Summary generation:
This block is responsible for generating the summary of the clusters we formed, as a
response to the fired query. It generates the minimal clusters and, after finding the weight of each
node for the fired query, returns the top summaries.
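Blocks 4 and 5 together can be sketched as below. Scoring each cluster by the number of query-term occurrences it contains is an assumption about how the weight for the fired query is computed; the function and parameter names are illustrative, and the project's own C#/.NET implementation may score differently.

```python
# Sketch of Blocks 4-5: weight each cluster against the fired query and
# return the contents of the best-matching ("minimal") clusters as the
# summary. Counting query-term hits is an illustrative scoring choice.

def query_summary(clusters, query, top_n=1):
    terms = [t.lower() for t in query.split()]

    def weight(cluster):
        text = " ".join(cluster).lower()
        return sum(text.count(t) for t in terms)

    ranked = sorted(clusters, key=weight, reverse=True)
    best = [c for c in ranked[:top_n] if weight(c) > 0]
    # Flatten the selected clusters into the summary lines.
    return [line for cluster in best for line in cluster]
```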
3.3 Team Work Graph:
Team Members: Mr. Athar Nawaz Khan
Mr. Nikhil Vilasrao Ubale
Miss. Shraddha B. Ahire.
Fig. 3.2 Division of work
3.4 Software Engineering Model used i.e. Incremental Model:
3.4.1 Communication:
The software development process starts with communication between the customer and the
developer. In this phase we followed the principles of the communication phase: we prepared
before communicating, i.e., we decided the agenda of the meeting, concentrating on news
summarization. Our leader directed the team and drew out all the requirements of the users, i.e.,
what they actually need and the input and output formats of the system.
3.4.2 Planning:
This phase includes complete estimation, scheduling and risk analysis. We planned when the
software would be released, estimated its cost, and analyzed the risks in the project regarding the
application. Finally, we estimated the cost of the project, including all software expenditure, and
planned to release the software by the user's deadline with his participation.
3.4.3 Modeling:
This phase includes detailed requirement analysis and project design. A flowchart shows the
complete pictorial flow of the program, while an algorithm is a step-by-step solution of the
problem. We analyzed the requirements of the user and accordingly drew the block diagrams of
the system, that is, the behavioral structure of the system using UML 2.0: class diagram, use case
report, component diagram, communication diagram, activity diagram, and state machine diagram.
3.4.4 Construction:
This phase includes the coding and testing steps.
Coding:
Design details are implemented using an appropriate programming language. For coding we
chose the platform, i.e., ASP.NET.
Testing:
Testing is carried out by analyzing the system, i.e., we first develop a prototype of the
system and step by step find input and output errors such as interface errors, data structure
errors, initialization errors, etc. The Black Box testing strategy is therefore useful here.
3.4.5 Deployment:
This phase includes software delivery, support and feedback from the customer. If the customer
suggests corrections or demands additional features, they are added to the software.
3.5 Analysis of Work:
The following table shows the schedule of work that we followed during the project period.
Table 3.1: Analysis of Work

Sr. No. | Name of Task | Subtask | Period
1 | Information Gathering | 1. Problem Definition: collecting detailed information about the system to be implemented. 2. Literature Survey: visiting different websites, studying the existing system and its limitations, going through journals and magazines, studying the reference books. | 17/07/2011 to 28/07/2011
2 | Analysis | Project Plan: preparing the complete project plan. Requirement Analysis: software requirements, hardware requirements. | 07/08/2011 to 24/08/2011
3 | Design | Architectural Design: describing relationships between modules and sub-modules. UML documentation: use case diagram, class diagram, sequence diagram, activity diagram, state machine diagram, component diagram. Form Design: showing relationships among different menus and sub-menus. | 04/09/2011 to 25/09/2011
4 | GUI | Output screens: preparing detailed output screens. Report Submission: submission of the Analysis and Design report. | 22/02/2012 to 29/02/2012
5 | Construction of System | Coding: implementation of design details using the programming language C#.NET. Testing: testing the system for expected results. | 12/03/2012 to 19/03/2012
6 | Deployment | System Deployment: delivery of the project, support, feedback, modification. | 25/03/2012 to 15/04/2012
7 | Final Document Preparation and Submission | Project Submission: preparing and submitting the final project report. | 15/04/2012 to 23/04/2012
3.6 Risk Assessment:
Risk always involves two characteristics:
Uncertainty: the risk may or may not occur; there are no 100% probable risks.
Loss: if the risk becomes a reality, unwanted consequences or losses will occur.
3.6.1 Risk projection:
Risk projection, also called risk estimation, attempts to rate each risk in two ways:
The likelihood or probability that the risk is real.
The consequences of the problems associated with the risk, should it occur.
3.6.2 Risk Identification:
Risk Identification is a systematic attempt to specify threats to the project plan.
Generic risk: These are potential threats to every software project.
Product Specific: These risks can be identified only by those with a clear understanding of
the technology and the environment that is specific to the project at hand.
Following are the Risks involved:
1. Technology to be built: Risks associated with the complexity of the system to be built and
the newness of the technology to be packaged by the system.
2. Development Environment: Risks associated with the availability and quality of the tools to be used to build this system.
3. Risk related to Time.
4. Risk related to Functionality of the system.
3.7 Requirement Analysis:
Requirement analysis results in the specification of the software's operational characteristics.
Requirements gathering comprises the following:
1. Elaboration
2. Negotiation
3. Specification
4. Validation
The software requirement specification is produced at the culmination of the analysis task. The functions
and performance allocated to software as part of system engineering are refined by establishing the
following:
A complete information description
A detailed functional description
A representation of system behavior
An indication of performance requirement and design constraints
Appropriate validation criteria
3.8 UML Documentation:
A UML diagram is a representation of the components or elements of a system or process
model and, depending on the type of diagram, of how those elements are connected or how they
interact from a particular perspective. We develop the following UML diagrams to show the
elements of the system and the connections between them.
There are two types of UML diagrams:
1. Structural Diagrams
2. Behavioral Diagrams
These two are the major groupings of UML diagrams. We develop those diagrams that are
sufficient to show the flow and elements of the working project. They are listed below:
Use case Diagram
Class Diagram
Sequence Diagram
Activity Diagram
State Machine Diagram
Component Diagram
Description of the UML diagrams:
3.8.1 Use case Model:
A Use Case diagram captures Use Cases and relationships between Actors and the subject
(system). It describes the functional requirements of the system, the manner in which outside
things (Actors) interact at the system boundary, and the response of the system.
Components of Use case diagram:
1. Actor
2. Use Cases
3. Association
4. Include
5. Extends
An Actor is a user of the system; "user" can mean a human user, a machine, or even another
system or subsystem in the model. Anything that interacts with the system from outside the
system boundary is termed an Actor. Actors are typically associated with Use Cases. In our
system we use two actors:
System
End User
Use Case Diagram
[Use case diagram: the End User browses news (HTML files) from the internet, provides the news to the system, enters the query, and enters the threshold for minimal clusters. The System performs parsing (extracting text files from the body tag), processing of the text files, splitting the documents into nodes (paragraphs), finding the similarity between nodes, building the document graph, document clustering using the Nearest Neighbour algorithm, creating the clustered document graph, adding weights to the nodes in the graph, finding the minimal clusters, and showing the result.]
Figure 3.3: Use Case Diagram
3.8.2 Sequence Model:
A Sequence diagram is a structured representation of behavior as a series of sequential
steps over time. It is used to depict work flow, message passing and how elements in general
cooperate over time to achieve a result.
Each sequence element is arranged in a horizontal sequence, with messages passing back
and forward between elements. An Actor element can be used to represent the user initiating the
flow of events. Stereotyped elements, such as Boundary, Control and Entity, can be used to
illustrate screens, controllers and database items, respectively. Each element has a dashed stem
called a lifeline, where that element exists and potentially takes part in the interactions.
Components of Sequence diagrams:
1. Actor
2. Lifeline
3. Message
4. Self Message
5. End point
An Actor is a user of the system; "user" can mean a human user, a machine, or even another
system or subsystem in the model. Anything that interacts with the system from outside the
system boundary is termed an Actor. Actors also represent the role of a user in sequence
diagrams. Enterprise Architect supports a stereotyped Actor element for business modeling. A
Lifeline is an individual participant in an interaction.
Sequence Diagram
[Sequence diagram: the User provides HTML documents and an edge threshold value. The Parser extracts the text file from the body tag and passes the text document to the Splitter, which splits it into nodes. Graph creation builds the document graph; the Relation manager calculates the weights of the edges and displays the nodes. The Cluster algorithm forms and displays the clusters. The Display screen asks the user for the threshold value and the query; the system then calculates the similarity between the query and each cluster (treated as a node), forms the weighted cluster graph, and the Minimal spanning tree component calculates the minimal clusters (the clusters having the maximum weight with the query) and displays the result.]
Figure 3.4: Sequence Diagram
3.8.3 Class Model:
Class diagrams capture the logical structure of the system: the Classes, including Active and
Parameterized (template) Classes, and the objects that make up the model. A class diagram is a
static model, describing what exists and what attributes and behavior it has, rather than how
something is done. Class diagrams are most useful to illustrate relationships between Classes and Interfaces.
Generalizations, Aggregations and Associations are all valuable in reflecting inheritance,
composition or usage, and connections, respectively.
Components of Class diagram:
1. Class
2. Associate
3. Compose
4. Realize
A Class is a representation of objects that reflects their structure and behavior within the
system. It is a template from which actual running instances are created, although a Class can be
defined either to control its own execution or as a template or parameterized Class that specifies
parameters that must be defined by any binding class. A Class can have attributes (data) and
methods (operations or behavior). Classes can inherit characteristics from parent Classes and
delegate behavior to other Classes. Class models usually describe the logical structure of the
system and are the building blocks from which components are built.
Class Diagram
[Class diagram: User (Password, User_id, User_name; Authentication(), Input(), Input_threshold()); Input (File_name, File_size; Conversion(), Firequery(), Inputthreshold()); Splitter (File_size, No. of stopwords; Extract(), Parsing(), Split()); RelationManager (File_size, no. of nodes; Builtgraph(), Calculateclusterweight(), Calculateweightofnodes(), Createnodes(), Displaynodes(), Displayweight()); Clustering algorithm (No. of clusters, Threshold; Createclustergraph(), Displaycluster(), Formcluster()); Input query (Keywords; Assignweight(), Comparenodes(), Firequery()); Output (Result; Displaycluster(), Displaynodes(), Displayresult()).]
Figure 3.5: Class Diagram
3.8.4. Activity Diagram
Activity diagrams are used to model the behaviors of a system, and the way in which these
behaviors are related in an overall flow of the system. The logical paths a process follows, based
on various conditions, concurrent processing, data access, interruptions and other logical path
distinctions, are all used to construct a process, system or procedure.
[Activity diagram: from the initial state, the parsing state (HTML documents to the system, parsing, extracting the text file from the body tag) produces a text file; the split state processes the text file and splits the document into nodes; the nodes are used to build the document graph; clustering finds the similarity between nodes and applies the nearest neighbour algorithm for document clustering; minimal cluster formation creates the clustered document graph, adds weights to the nodes in the graph based on the user's query, takes the threshold for clusters, calculates the similarity between the query and each cluster, and finds the minimal cluster; the result is shown and the flow reaches the final state.]
Figure 3.6: Activity Diagram
3.8.5. State machine Diagram
A State Machine diagram illustrates how an element can move between states, classifying
its behavior according to transition triggers and constraining guard conditions.
[State machine diagram: from the initial state, Input (HTML input files) leads to Parsing (text files), processing of the text file, and splitting of the document into nodes; a weight is assigned to each node to obtain a weighted graph of nodes; Clustering applies nearest neighbour clustering up to the threshold value, assigns a weight to each cluster, and creates the graph; Result compares the user query with the clusters and displays the topmost results above the threshold value.]
Figure 3.7: State Machine Diagram
3.8.6. Component Diagram
A Component diagram illustrates the pieces of software, embedded controllers and such
that make up a system, and their organization and dependencies. A Component diagram has a
higher level of abstraction than a Class diagram; usually a component is implemented by one or
more Classes (or Objects) at runtime. They are building blocks, built up so that eventually a
component can encompass a large portion of a system.
[Component diagram: the Parser takes the input file and feeds the Splitter; the Splitter feeds Graph creation and the Relation manager; the Clustering algorithm and the Minimal spanning tree components take the threshold and input file and produce the Result.]
Figure 3.8: Component Diagram
Chapter 4
Implementation
Implementation is the stage in the project where the theoretical design is turned into a
working system, giving users confidence that the new system will work efficiently and
effectively. It involves careful planning, investigation of the current system and its constraints
on implementation, design of methods to achieve the changeover, and evaluation of the
changeover methods. Apart from planning, a major task in preparing for implementation is the
education and training of users. The more complex the system being implemented, the more
involved the system analysis and design effort required for the implementation.
An implementation coordination committee, based on the policies of the individual
organization, has been appointed. The implementation process begins with preparing a plan for
implementing the system. According to this plan, the activities are carried out, discussions are
held regarding the equipment and resources, and any additional equipment needed to implement
the new system is acquired.
Implementation is a very important phase, the most critical stage in achieving a successful
new system and in giving users confidence that the new system will work effectively. After
the system is implemented, testing can be done. This method also offers the greatest security,
since the old system can take over if errors are found or if the new system is unable to handle
certain types of transactions.
The main functions implemented for the project are listed below in the project modules.
4.1 System Implementation
The input is a text file containing newline separators. The contents separated by newlines
become the contents of the nodes, which are the paragraphs. If there are no newlines in the file,
the whole file content becomes a single node and hence a single cluster, which can degrade the
quality of the result.
The total workflow is divided into following modules:
Module 1: HTML to text parser
Processing the input HTML files: parsing the HTML contents and extracting the text lines.
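The parser itself is written in C#/ASP.NET; purely as an illustration, the same body-text extraction can be sketched in Python with the standard library's HTMLParser (the class and function names here are hypothetical, not the project's own):

```python
from html.parser import HTMLParser

class BodyTextExtractor(HTMLParser):
    """Collect the text that appears inside the <body> tag,
    skipping <script> and <style> content."""
    def __init__(self):
        super().__init__()
        self.in_body = False   # are we inside <body>?
        self.skip = 0          # nesting depth of <script>/<style>
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        # Keep only non-empty text that belongs to the body.
        if self.in_body and not self.skip and data.strip():
            self.lines.append(data.strip())

def html_to_text(html):
    parser = BodyTextExtractor()
    parser.feed(html)
    return "\n".join(parser.lines)
```

Each extracted line (here, each paragraph's text) then feeds the splitting step of Module 2.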
Module 2: Processing the input text file and creating the document graph
Functions Used:
Split ()
The system accepts an input text file. The file is read and stored in a string, which is
then split by the newline keyword. The split result is assigned to a string array, as the split
function returns a string array. The array contains the paragraphs, which are further treated as nodes.
string[] nodeList = null;
nodeList = File.ReadAllLines(txtInputFile.Text);
The next stage is to find the similarity between the nodes that means finding the similarity
edges between nodes and finding their similarity or weight.
Each paragraph becomes a node in the document graph.
The document graph G(V, E) of a document d is defined as follows:
d is split into a set of non-overlapping text fragments t(v), one for each node v in V.
An edge e(u, v) in E is added between nodes u, v in V if there is an association between t(u) and t(v) in
d. Hence, we can view G as an equivalent representation of d, where the associations between text
fragments of d are depicted.
Module 3: Adding Weighted Edges to Document Graph
(Note: Adding weighted edge is query independent)
A weighted edge is added to the document graph between two nodes if they either
correspond to adjacent node or if they are semantically related, and the weight of an edge denotes
the degree of the relationship. Here two nodes are considered to be related if they share common
words (not stop words) and the degree of relationship is calculated by Semantic parsing. Also
notice that the edge weights are query-independent, so they can be pre-computed.
The following input parameters are required at the pre computation stage to create the
document graph:
4.1.1 Threshold for edge weights:
Only edges with weight not below the threshold will be created in the document graph. (The
threshold is a user-configurable value that controls the formation of edges.)
Adding weighted edges is the next step after generating the document graph. For each pair of
nodes u, v we compute the association degree between them, that is, the score (weight) EScore(e)
of the edge e(u, v). If EScore(e) >= threshold, then e is added to E. The score of edge e(u, v), where
nodes u, v have text fragments t(u), t(v) respectively, is:
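The edge-score formula itself appears to have been an image in the original report and did not survive extraction. A reconstruction consistent with the description that follows (for every word shared by the two fragments, add its tf.idf contribution, normalised by the fragment sizes) is:

```latex
EScore(e(u,v)) = \frac{\sum_{w \in t(u) \cap t(v)} \bigl(tf(t(u), w) + tf(t(v), w)\bigr) \cdot idf(w)}{size(t(u)) + size(t(v))}
```

This is the textbook tf.idf overlap form and may differ in detail from the figure that was lost.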
where tf(d, w) is the number of occurrences of w in d,
idf(w) is the inverse of the number of documents containing w, and
size(d) is the size of the document (in words). That is, for every word w appearing in both text
fragments we add a quantity equal to the tf.idf score of w. Notice that stop words are ignored.
Functions Used:
Remove Common Words ()
The common words are eliminated from the nodes, as they can degrade the performance of
calculating the similarity between two nodes; they can also degrade the system performance
because the number of computational loops increases. E.g. a, an, the, he, she, they, as, it, and, are,
were, there, etc.
The two filtered nodes are passed as parameters to the Relation Manager class for finding
the similarity between them.
Relation Manager ()
The relation manager function takes two nodes as parameters and returns the semantic
relation, in the form of a weight (EScore), between the two nodes, using the edge-weight formula
described above.
If EScore >= Threshold, the edge is added to the document graph.
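RemoveCommonWords() and RelationManager() are C# functions whose bodies are not reproduced in the report. A minimal Python sketch of the same idea, assuming a toy stop-word list and the simplified tf.idf-style overlap score described above (names are illustrative), is:

```python
STOP_WORDS = {"a", "an", "the", "he", "she", "they", "as", "it",
              "and", "are", "were", "there"}

def remove_common_words(text):
    """Drop stop words before any similarity computation."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def edge_score(node_u, node_v, idf):
    """Simplified EScore: for every word shared by the two fragments,
    add its combined term frequency times idf, normalised by the total
    fragment size in words. idf maps word -> idf value (default 1.0)."""
    u, v = remove_common_words(node_u), remove_common_words(node_v)
    shared = set(u) & set(v)
    score = sum((u.count(w) + v.count(w)) * idf.get(w, 1.0) for w in shared)
    return score / (len(u) + len(v)) if u or v else 0.0
```

An edge would then be added whenever `edge_score(...)` is at least the user's threshold.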
The graph is stored in tabular form, as shown below.
Table 4.1: Nodes and edge weights

  First Node   Second Node   Edge Weight
  1            2             0.5
  1            3             0.7
  ...          ...           ...
  30           31            0.8
  30           32            0.6
Module 4: Document Clustering
Clustering is the grouping of similar nodes (the nodes whose degree of closure is greater
than or equal to the cluster threshold specified by the user) into a group. The following
clustering approach is used: Nearest Neighbour.
Algorithm for Nearest Neighbour Clustering:
1. Set i = 1 and k = 1. Assign the first pattern to cluster C1.
2. Set i = i + 1. Find the nearest neighbour of pattern i among the patterns already assigned to
clusters. Suppose the nearest neighbour is in cluster m.
3. If the similarity to the nearest neighbour is greater than or equal to t, assign pattern i to
cluster m, where t is the threshold specified by the user. Otherwise set k = k + 1 and assign
pattern i to a new cluster Ck.
4. If every pattern has been considered, then stop; else go to step 2.
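The project implements this in C#; purely as an illustration, the four steps can be sketched in Python, with `similarity` standing in for the node-weight function (higher means more similar):

```python
def nearest_neighbour_clustering(patterns, similarity, t):
    """Nearest neighbour clustering as in the steps above: a pattern joins
    the cluster of its most similar already-clustered pattern when that
    similarity is at least t; otherwise it starts a new cluster."""
    clusters = [[patterns[0]]]                 # step 1: first pattern -> cluster 1
    for p in patterns[1:]:                     # step 2: remaining patterns in turn
        assigned = [q for c in clusters for q in c]
        nearest = max(assigned, key=lambda q: similarity(p, q))
        if similarity(p, nearest) >= t:        # step 3: similar enough to join
            next(c for c in clusters if nearest in c).append(p)
        else:                                  # otherwise open a new cluster
            clusters.append([p])
    return clusters                            # step 4: all patterns considered
```

For example, with similarity(a, b) = 1/(1 + |a - b|) on the numbers [1, 2, 10, 11] and t = 0.5, the sketch yields the clusters [[1, 2], [10, 11]].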
Functions Used:
FindMaxWeight ()
FindMaxWeight returns the pairs of nodes having the maximum edge weight, together with
their weights, from the document graph, e.g.:
Table 4.2: Nodes and the max weight

  First Node   Second Node   Max Weight
  1            22            2.5
  2            19            1.2
  3            31            3.5
  ...          ...           ...
  31           12            2.7
NearestNeighborCluster ()
The first pair of nodes in the above table is added to the first cluster because it has the
maximum weight. Here nodes 1 and 22 are closely related and hence added to the first cluster, so
Cluster_1 contains the two nodes 1 and 22. Cluster_1: 1, 22.
The next node, node 2, shows maximum weight with node 19, but neither node 2 nor node 19 is
in a previous cluster, so they form a new cluster, Cluster_2. Cluster_2: 2, 19.
Similarly, nodes 3 and 31 form a new cluster. Cluster_3: 3, 31.
The next pair (nodes 31 and 12) contains node 31, which is already in Cluster_3, hence node 12 is
added to Cluster_3, which now becomes Cluster_3: 3, 31, 12.
The above procedure is repeated until the end of the node pairs.
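The report's NearestNeighborCluster() is a C# routine; the walkthrough above can be sketched in Python as follows, with the node pairs taken from Table 4.2 (function and variable names are illustrative):

```python
def cluster_by_max_weight(pairs):
    """Walk (node, node, weight) pairs in decreasing max-weight order:
    if either node of the pair already belongs to a cluster, the other
    node joins that cluster; otherwise the pair starts a new cluster."""
    clusters = []
    for u, v, _weight in pairs:
        home = next((c for c in clusters if u in c or v in c), None)
        if home is None:
            clusters.append({u, v})     # neither node seen before: new cluster
        else:
            home.update((u, v))         # extend the existing cluster
    return clusters

# The pairs from Table 4.2, already ordered as in the walkthrough.
pairs = [(1, 22, 2.5), (2, 19, 1.2), (3, 31, 3.5), (31, 12, 2.7)]
```

Running the sketch on these pairs reproduces the walkthrough: Cluster_1 = {1, 22}, Cluster_2 = {2, 19}, Cluster_3 = {3, 31, 12}.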
Module 5: Creating Clustered Document Graph
After the clusters are formed, either by Nearest Neighbour or agglomerative hierarchical
clustering, the similarity edges between similar clusters are calculated. This is the same as
creating the document graph and adding similarity edges between similar nodes. Every cluster is
split into individual nodes, and this grouping of nodes is passed to the relation manager in order
to find the weight between two sets of nodes, i.e. clusters. [5, 7, 10]
Module 6: Adding Weight to Nodes In Clustered Document Graph
When a query Q arrives, the nodes in V are assigned query-dependent weights according to
their relevance to Q. In particular, we assign to each node v corresponding to a text fragment t(v)
a node score NScore(v), defined by the Okapi formula, in which:
tf is the term's frequency in the document,
qtf is the term's frequency in the query,
N is the total number of documents in the collection,
df is the number of documents that contain the term,
dl is the document length (in words),
avdl is the average document length, and
k1 (between 1.0 and 2.0), b (usually 0.75), and k3 (between 0 and 1000) are constants.
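The Okapi formula referenced above appears to have been an image in the original report and did not survive extraction. The standard Okapi (BM25) weighting that matches the terms defined here is:

```latex
NScore(v) = \sum_{t \in Q \cap t(v)}
  \ln\frac{N - df + 0.5}{df + 0.5}
  \cdot \frac{(k_1 + 1)\, tf}{k_1\bigl((1-b) + b \cdot \frac{dl}{avdl}\bigr) + tf}
  \cdot \frac{(k_3 + 1)\, qtf}{k_3 + qtf}
```

This reconstruction is the textbook form and may differ in detail from the lost figure.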
Functions Used:
CalculateClusterWeight ()
All the values mentioned above are computed and passed as parameters to the Okapi formula.
The returned node weight is stored in a table, e.g.:
Table 4.3: Cluster nodes and cluster weights

  Cluster No   Nodes        Cluster Weight
  Cluster_1    1, 22, 1, 32   2.4
  Cluster_2    9, 17, 24      2.5
  Cluster_3    34, 12, 10     0
  Cluster_4    4, 14, 23      0
Module 7: Generating Closure Graph and Finding Minimal Clusters
The closure graph contains the minimal clusters. Minimal clusters are the clusters that show
non-zero weight with the input query. In the above example (Table 4.3), only Cluster_1 and
Cluster_2 are minimal clusters. The minimal clusters are the clusters that appear in the result.
Module 8: Result
After getting the minimal clusters, the result can be displayed in two ways:
Top 1 Result Summary
Multi-Result Summary
In the top-1 result summary, the minimal cluster having the highest weight with the input query is
returned; in the multi-result summary, all the minimal clusters are returned as the result. Before a
cluster is displayed as a result, it is split into its nodes and the weight of every node with respect
to the input query is calculated. The nodes are displayed in decreasing order of weight with the
input query, i.e. the node having the highest weight is displayed at the top and the lowest at the
bottom.
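As an illustrative Python sketch (the project itself is C#/.NET, and these names are hypothetical), minimal-cluster selection and node ordering might look like this, where `cluster_weights` maps each cluster to its query weight and `node_weight` maps a node to its query weight:

```python
def summarize(cluster_weights, node_weight, nodes_of, top1=False):
    """Keep the minimal clusters (non-zero query weight), order them by
    weight, and order each cluster's nodes by their own query weight,
    highest first; top1=True returns only the best cluster."""
    minimal = [c for c, w in cluster_weights.items() if w > 0]
    minimal.sort(key=cluster_weights.get, reverse=True)
    if top1:
        minimal = minimal[:1]           # top-1 summary: best cluster only
    return [(c, sorted(nodes_of[c], key=node_weight, reverse=True))
            for c in minimal]
```

With the Table 4.3 weights, Cluster_3 and Cluster_4 (weight 0) are dropped and Cluster_2 (2.5) is ranked above Cluster_1 (2.4).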
Chapter 5
Testing
5.1 Testing Strategies
Testing is an important phase in the Software Development Life Cycle. Testing should be
planned and conducted systematically.
Generic aspects of a test strategy
1. Testing begins at the module level and works outwards.
2. Different testing techniques are used at different points of time.
3. Testing is done by developers and mainly for larger projects, by an independent test group.
4. Testing and debugging are two different activities, but debugging should be incorporated
into any testing strategy.
5.2 Testing Techniques
5.2.1 Black box testing
Black box testing focuses on the functional requirements of the software. That is, black box
testing enables the software engineer to derive sets of input conditions that will fully exercise all
functional requirements for a program. Black box testing is not an alternative to white box testing
techniques; rather it is a complementary approach that is likely to uncover a different class of
errors than white box methods.
Black box testing attempts to find errors in the following categories
1. Incorrect or missing functions.
2. Interface errors.
3. Errors in data structures or external database access.
4. Performance errors.
5. Initialization and termination errors
Unlike white box testing, which is performed early in the testing process, black box testing
tends to be applied during later stages of testing. Because black box testing purposely disregards
control structure, attention is focused on the information domain.
5.2.2 White box testing
Using white box testing methods, the software engineer can derive test cases that can
1. Guarantee that all independent paths within a module have been exercised at least once.
2. Exercise all logical decisions on their true and false sides.
3. Execute all loops at their boundaries and within their operational bounds
4. Exercise internal data structures to assure their validity.
The need for white box testing arises for several reasons:
1. Errors tend to creep into our work when we design and implement functions, conditions or
controls that are out of the mainstream.
2. We often believe that a logical path is not likely to be executed when, in fact, it may be
executed on a regular basis.
3. Typographical errors are random. When a program is translated into programming
language source code, it is likely that some typing errors will occur.
We used both white box testing and black box testing. Black box testing is also called behavioral
testing; it focuses on the functional requirements of the software. In this testing the software is
tested as a black box, without considering its internal details. The required sets of inputs were
supplied and the desired outputs were obtained.
5.3 Test Cases:
Table 5.1: Test cases

TC1: HTML file
  Description: validation of the input file.
  Steps: 1. Enter a correct text file containing at least 2 paragraphs. 2. Enter an incorrect folder in which the web pages are stored and click "set dataset folder".
  Expected result: the text file is accepted; nodes are created and the data of each node is shown. For the incorrect input, the error message "Other format file than text will not be uploaded" appears.
  Actual result: as expected.

TC2: Changing threshold for clustering
  Description: check how the threshold value affects the cluster size and the performance of the algorithm.
  Steps: select the threshold value for clustering.
  Expected result: the cluster size increases and the number of clusters decreases; due to the large cluster size, the looping in the nearest neighbour algorithm increases, so its performance decreases.
  Actual result: the cluster size increased, the number of clusters decreased, and the performance decreased as expected.

TC3: Topmost summary
  Description: check whether the first result in the output is the best result.
  Steps: after clustering, get the minimal clusters and produce the summary as the result.
  Expected result: the cluster containing the best result for the fired query appears at the top, and within that cluster the node containing the best result comes first.
  Actual result: as expected.

TC4: Similarity calculation
  Description: check the similarity between two clusters.
  Steps: provide two clusters as input.
  Expected result: similar clusters are grouped together.
  Actual result: as expected.

TC5: Weight calculation
  Description: calculate the weight between two nodes.
  Steps: provide two nodes as input.
  Expected result: the weight between the two nodes is returned.
  Actual result: as expected.

TC6: Removal of common words
  Description: check the effect of removing common words.
  Steps: provide a text file with and without common words.
  Expected result: after removing the common words, fewer words remain, so it is easier to build the document graph and find similar nodes; hence the performance of the system increases.
  Actual result: as expected.

TC7: GUI
  Description: alignment of controls; the colour of all buttons should be uniform, and all textboxes should be aligned in a straight line.
  Expected result: textboxes properly aligned; button colours uniform; controls aligned.
  Actual result: as expected.
5.4 User Interface (Screenshots)
The basic user interface consists of at least three windows. The first window is used to input
the text file; the user must give the input as a text file only.
The second interface displays the different clustering techniques; here the threshold for
clustering is also taken as input. The user has to select one of the two clustering algorithms and
give a threshold value for clustering.
The third interface displays the query and takes the percentage of correlation of a cluster with
the query.
5.4.1 Before uploading the HTML file(s).
Figure 5.1: Before uploading the HTML files
5.4.2 Uploading HTML file(s).
Figure 5.2: Uploading HTML file(s).
5.4.3: Browsing the HTML file(s).
Figure 5.3: Browsing the HTML file(s).
5.4.4: After browsing the HTML file(s).
Figure 5.4: After browsing the HTML file(s).
5.4.5: Processing HTML file(s) and Display node relations.
Figure 5.5: Processing HTML file(s) and displaying node relations.
5.4.6: Before clustering of nodes.
Figure 5.6: Before clustering of nodes.
5.4.7: Clusters formation and building clustered graph
Figure 5.7: Clusters formation and building clustered graph
5.4.8: Taking input query and thresholds for minimal combination of clusters in %.
Figure 5.8: Taking input query and thresholds for minimal combination of clusters in %.
5.4.9: Display minimal cluster as result along with link to actual web page(s).
Figure 5.9: Display minimal cluster as result along with link to actual web page(s).
5.4.10: Display actual web page(s) we are currently dealing with and highlighting the output
data.
Figure 5.10: Display actual web page(s) we are currently dealing with and highlighting the output
data.
Chapter 6
Future Scope
A sentence ordering module can be used to define an ordering among the selected topic sentences. Another important aspect is that the system can be tuned to generate a summary of a custom size specified by the user. The system can also generate summaries for non-English documents, provided some simple resources for the language are available. In future work, we will use a dictionary to add the synonyms of the query words, as well as of the keywords, as extra keywords when searching for relevant information, which will improve the quality of the summary.
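The proposed dictionary-based query expansion could be sketched as follows. This is only an illustrative sketch: the small SYNONYMS map and the expand_query function are hypothetical stand-ins, not part of the implemented system; in practice a full lexical resource such as WordNet would supply the synonyms.

```python
# Hypothetical synonym dictionary; a real system would query a lexical
# resource such as WordNet instead of this hand-built stand-in.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "attack": ["assault", "strike"],
}

def expand_query(query_words):
    """Return the original query words plus any known synonyms,
    preserving order and skipping duplicates."""
    expanded = []
    for word in query_words:
        if word not in expanded:
            expanded.append(word)
        for syn in SYNONYMS.get(word.lower(), []):
            if syn not in expanded:
                expanded.append(syn)
    return expanded
```

The expanded keyword list would then be fed to the existing search step, so that sentences mentioning "automobile" can match a query for "car".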
In the news domain, the update summary is a very important and useful concept. On the same news topic, new or updated articles arrive every day, or even every hour. A reader who has already read an earlier article on the topic will not be interested in reading the whole article again; he or she will want to know only the updates. With the help of an update summary, a reader can follow and track the news very easily. The system can be extended to produce such update summaries as well.
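The update-summary idea could work roughly as follows: keep only those sentences of the new article that are not already covered by the article the reader has seen. This is a minimal sketch under stated assumptions; the word-overlap (Jaccard) similarity and the threshold value are illustrative choices, not the system's actual measures.

```python
def update_summary(old_sentences, new_sentences, threshold=0.6):
    """Keep only sentences of the new article that are not already
    covered (by word overlap) in the previously read article."""
    def jaccard(a, b):
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

    updates = []
    for sent in new_sentences:
        # A sentence counts as "new" only if it is dissimilar to
        # every sentence the reader has already seen.
        if all(jaccard(sent, old) < threshold for old in old_sentences):
            updates.append(sent)
    return updates
```

Sentences repeated verbatim from the earlier article are filtered out, while genuinely new information survives into the update summary.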
Conclusion
This work deals specifically with generating summaries for the news domain, where a summary is a very important and useful concept. On the same news topic, new or updated articles arrive every day, or even every hour, so one cannot go through all the newspapers and every single article; the reader wants only the summary. With the help of a news summary, a reader can follow and track the news very easily.
As the system provides query-dependent news summaries, the user can easily obtain a news summary according to his or her interest.
Because we deal directly with HTML pages, the user can retrieve online news and directly get a summary of it. In this work we present a graph-based approach to query-dependent multi-document summarization, combined with a nearest-neighbor clustering technique, which works efficiently for news summarization.
Because of multi-document news summarization, there is no need to go through all the newspapers. As the summarization is query-specific, the user can easily obtain a news summary matching his or her interest. Furthermore, the accuracy of the result depends on the initial edge threshold, the cluster threshold, and the result accuracy percentage, so the user can control the relevance of the output.
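As a rough illustration of how a cluster threshold controls the grouping, the following sketch performs greedy nearest-neighbor clustering of sentences. The Jaccard word-overlap similarity and the threshold value here are illustrative assumptions standing in for the system's actual graph edge weights; a lower threshold merges sentences into fewer, broader clusters, while a higher threshold yields many small clusters.

```python
def nn_cluster(sentences, cluster_threshold=0.3):
    """Greedy nearest-neighbor clustering: each sentence joins the first
    existing cluster containing a sufficiently similar member, otherwise
    it starts a new cluster."""
    def jaccard(a, b):
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

    clusters = []
    for sent in sentences:
        placed = False
        for cluster in clusters:
            if max(jaccard(sent, m) for m in cluster) >= cluster_threshold:
                cluster.append(sent)
                placed = True
                break
        if not placed:
            clusters.append([sent])
    return clusters
```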
Bibliography
References:
[1] R. Varadarajan, V. Hristidis, T. Li: Beyond Single-Page Web Search Results. IEEE TKDE, 2008 (journal paper).
http://pages.cs.wisc.edu/~ramkris/Document_Summarization.pdf
[2] R. Varadarajan, V. Hristidis: A System for Query-Specific Document Summarization. CIKM '06, November 5-11, 2006, Arlington, Virginia, USA. Copyright 2006 ACM 1-59593-433-2/06/0011.
[3] R. Varadarajan, V. Hristidis: Structure-Based Query-Specific Document Summarization. Poster paper at CIKM 2005.
[4] G. Bhalotia, C. Nakhe, A. Hulgeri, S. Chakrabarti and S. Sudarshan: Keyword Searching and Browsing in Databases using BANKS. ICDE, 2002.
[5] M. White, T. Korelsky, C. Cardie, V. Ng, D. Pierce, and K. Wagstaff: Multidocument Summarization via Information Extraction. HLT, 2001.
[6] S. Paladhi, S. Bandyopadhyay: A Document Graph Based Query Focused Multi-Document Summarizer. 2008.
http://www.sivajibandyopadhyay.com/pinaki/papers/paclic24_SB_PB.pdf
[7] G. Bhalotia, C. Nakhe, A. Hulgeri, S. Chakrabarti and S. Sudarshan: Keyword Searching and Browsing in Databases using BANKS. ICDE, 2002.
[8] R. Mihalcea: Graph-based ranking algorithms for sentence extraction, applied to text summarization. In Proceedings of the ACL 2004 Interactive Poster and Demonstration Sessions, Morristown, NJ, USA, p. 20. Association for Computational Linguistics, 2004.
Books referred:
[1] Parul Agarwal, M. Afshar Alam, Ranjit Biswas: Analyzing the Hierarchical Clustering Algorithm for Categorical Attributes. International Journal of Innovation, Management and Technology, Vol. 1, No. 2, June 2010. ISSN: 2010-024.
http://www.ijimt.org/papers/34-K033.pdf
[2] Professional ASP.NET 3.5. Authors: Bill Evjen, Scott Hanselman, Devin Rader. Chapter 2: pp. 63-10; Chapter 20: pp. 929.
[3] ASP.NET Website Programming. Chapter 1: pp. 15-38.
[4] C# 2008 Programmer's Reference. Author: Wei-Meng Lee.
[5] H. Hardy, N. Shimizu, T. Strzalkowski, L. Ting, G. B. Wise, X. Zhang: Cross-document summarization by concept classification. SIGIR, 2002, pp. 65-69.
[6] R. Barzilay and M. Elhadad: Using lexical chains for text summarization. In Mani and Maybury (1999), pp. 111-121.