Automatic Text Summarization Based on Sentences Clustering and Extraction


ZHANG Pei-ying
College of Computer & Communication Engineering, China University of Petroleum
Dongying, Shandong
e-mail: [email protected]

LI Cun-he
College of Computer & Communication Engineering, China University of Petroleum
Dongying, Shandong
e-mail: [email protected]

Abstract—The technology of automatic text summarization plays an important role in information retrieval and text classification, and may provide a solution to the information overload problem. Text summarization is a process of reducing the size of a text while preserving its information content. This paper proposes a sentence-clustering-based summarization approach. The proposed approach consists of three steps: first cluster the sentences based on the semantic distance among the sentences of the document; then, for each cluster, calculate the accumulative sentence similarity based on a multi-feature combination method; finally, choose the topic sentences by extraction rules. The purpose of the present paper is to show that the summarization result depends not only on the sentence features but also on the sentence similarity measure. Experimental results on the DUC 2003 dataset show that our proposed approach can improve performance compared to other summarization methods.

Keywords—text summarization; similarity measure; sentences clustering; sentence extractive technique

I. INTRODUCTION

The continuing growth of the World Wide Web and of on-line text collections makes a large volume of information available to users. This information overload either wastes significant time in browsing all the information, or causes useful information to be missed. The technology of automatic text summarization is maturing and may provide a solution to the information overload problem. Text summarization is the process of automatically creating a compressed version of a given text that provides useful information to users; multi-document summarization produces a summary delivering the majority of the information content from a set of documents about an explicit or implicit main topic (Wan, 2008).

Text summarization is a complex task which ideally would involve deep natural language processing capabilities. To simplify the problem, current research focuses on extractive-summary generation. Sentence-based extractive techniques are commonly used in automatic text summarization to produce extractive summaries. Traditional summarization methods use sentence features to evaluate the importance of the sentences of a document; their limitation is that they do not involve sentence semantic similarity. This paper proposes a sentence similarity computing method based on three sentence features: by analyzing the word form feature, the word order feature and the semantic feature, and using weights to describe the contribution of each feature, it describes sentence similarity more precisely. We determine the number of clusters, use the K-means method to cluster the sentences of the document, and extract the topic sentences to generate an extractive summary for the document. Experiments show that our method outperforms other summarization methods under the ROUGE-1, ROUGE-2 and F1-Measure evaluation metrics.

The rest of this paper is organized as follows: Section 2 introduces related work. The proposed sentence-clustering-based summarization algorithm is presented in Section 3. Section 4 presents evaluation results on the DUC2003 dataset. The last section gives the conclusions.

II. RELATED WORK

In the past, extractive summarizers have mostly been based on scoring sentences in the source document. The most common recent text summarization techniques use either statistical approaches, for example (Zechner, 1996), (Carbonell, 1998), (Strzalkowski, 1998), (Berger and Mittal, 2000), (Nomoto, 2001); or linguistic techniques, for example (Klavans, 1995), (Radev, 1998), (Nakao, 2000); or some kind of linear combination of these: (Goldstein et al., 1999), (Mani, 2002) and (Barzilay, 1997). Our algorithm is markedly different from each of these and tries to capture the semantic distance between the sentences of the document.

We find that none of the above approaches to text summarization selects sentences based on the semantic content of the sentence and the relative importance of that content to the semantics of the text. Unlike almost all previous approaches, our algorithm for automatic text summarization is based on identifying semantic relations among sentences.

III. SENTENCES CLUSTERING BASED SUMMARIZATION

A. Similarity measure between sentences

Definition 1: Word Form Similarity. The word form similarity describes the form similarity between two sentences and is measured by the number of words the two sentences share; stop words are removed before the computation. If S1 and S2 are two sentences, the word form similarity is calculated by formula (1):

Sim1(S1,S2) = 2*SameWord(S1,S2) / (Len(S1)+Len(S2))    (1)
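The word form similarity of formula (1) can be sketched as follows. The stop-word list and the whitespace tokenizer here are simplified stand-ins, not the tools the paper used:

```python
# Toy stop-word list for illustration only; the paper does not specify one.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "to"}

def tokenize(sentence):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]

def sim1(s1, s2):
    """Formula (1): Sim1 = 2*SameWord(S1,S2) / (Len(S1)+Len(S2))."""
    w1, w2 = tokenize(s1), tokenize(s2)
    same = len(set(w1) & set(w2))   # number of shared (distinct) words
    total = len(w1) + len(w2)
    return 2.0 * same / total if total else 0.0
```

For example, `sim1("the cat sat", "the cat ran")` shares one of four content words on each side, giving 0.5.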

978-1-4244-4520-2/09/$25.00 ©2009 IEEE

Here SameWord(S1,S2) is the number of words the two sentences share, and Len(S) is the number of words in sentence S.

Definition 2: Word Order Similarity. The word order similarity describes the sequence similarity between two sentences. A Chinese sentence can be expressed in many styles, and different word orders carry different meanings. Here we describe a sentence by three vectors:

V1 = {d11, d12, …, d1n1}
V2 = {d21, d22, …, d2n2}
V3 = {d31, d32, …, d3n3}

The weight d1i in vector V1 is the tf-idf value of the i-th word; the weight d2i in vector V2 indicates whether the i-th bi-gram occurs in the sentence (0 for non-occurring, 1 for occurring); the weight d3i in vector V3 likewise indicates whether the i-th tri-gram occurs. The word order similarity between S1 and S2 is:

Sim2(S1,S2) = λ1*Cos(V1(S1),V1(S2)) + λ2*Cos(V2(S1),V2(S2)) + λ3*Cos(V3(S1),V3(S2))    (2)
Here λ1+λ2+λ3 = 1, and λi stands for the weight of each part.

Definition 3: Word Semantic Similarity. The word semantic similarity describes the semantic similarity between two sentences. The word-to-word similarity computation (Jiang Min, 2008) used here is based on HowNet [1]. Based on the semantic similarity among words, we define the Word-Sentence Similarity WSSim(w,S) to be the maximum similarity between the word w and the words of the sentence S. We estimate WSSim(w,S) with the following formula:

WSSim(w,S) = max{ Sim(w,Wi) | Wi ∈ S, where w and Wi are words }    (3)

Here Sim(w,Wi) is the word similarity between w and Wi.

With WSSim(w,S), we define the sentence semantic similarity as follows:

Sim3(S1,S2) = ( Σ w∈S1 WSSim(w,S2) + Σ w∈S2 WSSim(w,S1) ) / ( |S1| + |S2| )    (4)

Here S1 and S2 are sentences, and |S| is the number of words in sentence S.
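Formulas (3) and (4) can be sketched as below. The HowNet-based word similarity of (Jiang Min, 2008) is not reproduced here; `word_sim` is a placeholder that only recognizes exact matches, so real use would swap in a proper lexical-resource measure:

```python
def word_sim(w1, w2):
    """Placeholder word similarity; NOT the HowNet measure."""
    return 1.0 if w1 == w2 else 0.0

def wssim(w, sentence_words):
    """Formula (3): max similarity between word w and any word of S."""
    return max((word_sim(w, wi) for wi in sentence_words), default=0.0)

def sim3(s1, s2):
    """Formula (4): best word matches in both directions, averaged over
    the total number of words in the two sentences."""
    w1, w2 = s1.lower().split(), s2.lower().split()
    if not w1 or not w2:
        return 0.0
    total = sum(wssim(w, w2) for w in w1) + sum(wssim(w, w1) for w in w2)
    return total / (len(w1) + len(w2))
```

With the exact-match placeholder, `sim3("a b c", "a b d")` matches two words out of three in each direction, giving 4/6 = 2/3.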

Definition 4: Sentence Similarity. The sentence similarity is usually described as a number between zero and one: zero stands for non-similar, one stands for totally similar, and the larger the number, the more similar the sentences. The sentence similarity between S1 and S2 is defined as follows:

Sim(S1,S2) = α1*Sim1(S1,S2) + α2*Sim2(S1,S2) + α3*Sim3(S1,S2)    (5)

Here α1, α2, α3 are constants satisfying α1+α2+α3 = 1. In this paper, α1 = 0.2, α2 = 0.1, α3 = 0.7.

B. Estimating the number of clusters

Determining the optimal number of sentence clusters in a text document is a difficult issue: it depends on the compression ratio of the summary and on the chosen similarity measure, as well as on the document topics. For sentence clustering, users cannot predict the latent number of topics in the document, so it is impossible to supply k directly. The strategy we use to determine the optimal number of clusters (the number of topics in a document) is based on the distribution of words across the sentences:

k = n * |D| / Σ i=1..n |Si|    (6)

Where |D| is the number of terms in the document D, |Si| is the number of terms in the sentence Si, and n is the number of sentences in document D. Here we analyze the properties of this estimate in two extreme cases; see (Ramiz M. Aliguliyev, 2008) for a more detailed proof.
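Formula (6) can be sketched directly. Rounding the ratio up to an integer is my assumption; the paper leaves the rounding implicit:

```python
import math

def estimate_k(sentences):
    """Formula (6): k = n*|D| / sum(|Si|), where |D| and |Si| count
    distinct terms; the result is rounded up to an integer (assumed)."""
    term_sets = [set(s.lower().split()) for s in sentences]
    doc_terms = set().union(*term_sets)        # terms of the whole document
    n = len(sentences)
    return math.ceil(n * len(doc_terms) / sum(len(t) for t in term_sets))
```

The two extreme cases below fall out immediately: identical sentences give k = 1, pairwise-disjoint sentences give k = n.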

(1) The document consists of n sentences that have the same set of terms. Then the set of terms of the document coincides with the set of terms of each sentence: D = {t1, t2, …, tm} = Si = S. From definition (6) it follows that

k = n*|D| / Σ i=1..n |Si| = n*|S| / (n*|S|) = 1.

(2) The document consists of n sentences that have no term in common, that is, Si ∩ Sj = ∅ for i ≠ j. This means that each term of D = ∪ i=1..n Si belongs to exactly one sentence Si, therefore |D| = Σ i=1..n |Si|, from which it follows that k = n.

C. Sentences Clustering

Once the number of sentence clusters is determined, we can use the K-means method to cluster the sentences of the document. The algorithm can be described as follows:

Input: n sentences; K, the number of clusters
Output: the sentence clusters
Step 1: Randomly select K sentences, one per cluster; these sentences are the initial cluster central sentences.
Step 2: Assign each sentence to the cluster that has the closest central sentence.
Step 3: When all sentences have been assigned, recalculate the central sentence of each cluster. The central sentence is the one with the highest accumulative similarity to the other sentences of its cluster.
Step 4: Repeat Steps 2 and 3 until the central sentences no longer move. This produces a separation of the sentences into K clusters from which the clustering objective can be calculated.
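Steps 1-4 can be sketched as below. The `sim` argument stands for the combined sentence similarity of formula (5) (any symmetric similarity works for the sketch), and the fixed random seed is my choice for reproducibility, not part of the paper's method:

```python
import random

def kmeans_sentences(sentences, k, sim, max_iter=20):
    """K-means over sentences where each cluster's 'centroid' is a real
    sentence: the member with the highest accumulated similarity."""
    random.seed(0)                                    # reproducible sketch
    centers = random.sample(range(len(sentences)), k)  # Step 1
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        # Step 2: assign each sentence to the closest central sentence
        for i, s in enumerate(sentences):
            best = max(range(k), key=lambda c: sim(s, sentences[centers[c]]))
            clusters[best].append(i)
        # Step 3: recompute each cluster's central sentence
        new_centers = []
        for c, members in enumerate(clusters):
            if not members:                    # keep old center if empty
                new_centers.append(centers[c])
                continue
            new_centers.append(max(
                members,
                key=lambda i: sum(sim(sentences[i], sentences[j])
                                  for j in members)))
        if new_centers == centers:             # Step 4: centers stabilized
            break
        centers = new_centers
    return clusters, centers
```

On a toy document with two obvious topics, the sentences separate into the two expected clusters after a few iterations.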

D. Topic Sentences Extraction

Based on the result of Section C, assume the sentence clusters are D = {C1, C2, …, Ck}. First, determine the central sentence of each cluster based on the accumulative similarity between each sentence Si and the other sentences of the cluster, then calculate the similarity between each sentence Si and the central sentence. Taking the similarity of the central sentence itself as 1, sort the sentences by their similarity weight and choose the high-weight sentences as the topic sentences. At the same time, considering the recall rate of the summarization, the summary should include sentences from every cluster, so clusters are extracted in priority order in the process of extracting sentences.
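The extraction step can be sketched as below. The round-robin pass over clusters is one reasonable reading of "extract clusters in priority order so every cluster is represented"; the paper does not pin down the exact rule:

```python
def extract_summary(sentences, clusters, centers, sim, max_sents=3):
    """Rank each cluster's members by similarity to its central sentence
    (the center itself scores 1), then take sentences cluster by cluster
    so every cluster contributes before any contributes twice."""
    ranked = []
    for c, members in zip(centers, clusters):
        scores = [(1.0 if i == c else sim(sentences[i], sentences[c]), i)
                  for i in members]
        ranked.append([i for _, i in sorted(scores, reverse=True)])
    summary, rnd = [], 0
    while len(summary) < max_sents and any(rnd < len(r) for r in ranked):
        for r in ranked:                       # one sentence per cluster
            if rnd < len(r) and len(summary) < max_sents:
                summary.append(r[rnd])
        rnd += 1
    return [sentences[i] for i in sorted(summary)]  # document order
```

With two clusters and a two-sentence budget, each cluster contributes its central sentence.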

IV. EXPERIMENTS AND RESULTS

In this section we conduct experiments to evaluate the performance of the sentence-clustering-based automatic text summarization system.

A. Evaluation metrics

Evaluating summaries and automatic text summarization systems is not a straightforward process. Many measures can calculate the topical similarity between two summaries. We use two evaluation methods. The first is precision (P), recall (R) and F1-measure, which are widely used in information retrieval. For each document, the manually extracted sentences are considered the reference summary (denoted Summref). This approach compares the candidate summary (denoted Summcand) with the reference summary and computes the P, R and F1-measure values as shown in formula (7) (Shen et al., 2007):

P = |Summref ∩ Summcand| / |Summcand|
R = |Summref ∩ Summcand| / |Summref|
F1 = 2*P*R / (P + R)    (7)

The second measure we use is ROUGE, which was adopted by DUC for automatic summarization evaluation. ROUGE has been shown to be very effective for measuring document summarization. It measures summary quality by counting overlapping units such as N-grams, word sequences and word pairs between the candidate summary and the reference summary. The ROUGE-N measure compares the N-grams of two summaries and counts the number of matches. The measure is defined by formula (8):

ROUGE-N = ( Σ S∈Summref Σ N-gram∈S Countmatch(N-gram) ) / ( Σ S∈Summref Σ N-gram∈S Count(N-gram) )    (8)

Where N stands for the length of the N-gram, Countmatch(N-gram) is the maximum number of N-grams co-occurring in the candidate summary and the set of reference summaries, and Count(N-gram) is the number of N-grams in the reference summaries.
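The two evaluation measures of formulas (7) and (8) can be sketched as follows. The inputs are simplified: extracted sentences are treated as set elements for P/R/F1, and summaries as whitespace-tokenized strings for ROUGE-N:

```python
from collections import Counter

def prf1(ref_sents, cand_sents):
    """Formula (7): precision, recall and F1 over sets of extracted
    sentences (any hashable sentence identifiers)."""
    ref, cand = set(ref_sents), set(cand_sents)
    overlap = len(ref & cand)
    p = overlap / len(cand) if cand else 0.0
    r = overlap / len(ref) if ref else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def rouge_n(ref_summaries, cand_summary, n=1):
    """Formula (8): matched n-gram count over the reference n-gram count,
    with each gram's match clipped by its count in the candidate."""
    def grams(text):
        words = text.lower().split()
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    cand = grams(cand_summary)
    match = total = 0
    for ref in ref_summaries:
        r = grams(ref)
        total += sum(r.values())
        match += sum(min(cnt, cand[g]) for g, cnt in r.items())
    return match / total if total else 0.0
```

For example, a candidate sharing two of three reference unigrams scores ROUGE-1 = 2/3.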

B. Runs and Evaluation Results

To evaluate the performance of our method, we conduct experiments on the DUC2003 document dataset and compare our method with the MMR (Carbonell, 1998) and WAA (Zhang Qi, 2004) methods. As shown in Table 1, on the DUC2003 dataset the ROUGE-1, ROUGE-2 and F1 values of our method are better than those of the other summarization methods.

TABLE 1. THE VALUES OF EVALUATION METRICS FOR SUMMARIZATION METHODS (DUC2003 DATASET)

Methods      ROUGE-1   ROUGE-2   F1-Measure
MMR          0.34813   0.07917   0.43245
WAA          0.38023   0.09121   0.45335
Our Method   0.43512   0.10142   0.47576

V. CONCLUSION

We have presented an approach to automatic text summarization based on sentence clustering and extraction. Our approach consists of three steps: first cluster the sentences of the document; then, for each cluster, calculate the accumulative sentence similarity based on the multi-feature combination; finally choose the topic sentences by the extraction rules. When comparing our method with other existing summarization methods on the open DUC2003 dataset, we found that our method improves the summarization results significantly under the ROUGE-1, ROUGE-2 and F1-Measure evaluation metrics.

The main contributions of this study are as follows:
(1) It proposes a sentence similarity computing method based on three sentence features: by analyzing the word form feature, the word order feature and the semantic feature, and using weights to describe the contribution of each feature, it describes sentence similarity more precisely.
(2) It gives a method for determining the number of sentence clusters.
(3) It gives an approach to text summarization based on sentence clustering.

    REFERENCES

[1] Dong Zhen-dong. HowNet [OL]. http://www.keenage.com
[2] Barzilay, Elhadad, 1997. Using lexical chains for text summarization. Proceedings of the ACL'97/EACL'97 Workshop on Intelligent Scalable Text Summarization.
[3] Berger and Mittal, 2000. Query-relevant summarization using FAQs. Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics.
[4] Carbonell, Goldstein, 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. Proceedings of the 21st ACM-SIGIR International Conference on Research and Development in Information Retrieval, Melbourne, Australia.
[5] Jiang Min, Xiao Shi-bin, Wang Hong-wei et al., 2008. An improved word similarity computing method based on HowNet. Journal of Chinese Information Processing.
[6] Goldstein, Kantrowitz, Mittal, and Carbonell, 1999. Summarizing text documents: Sentence selection and evaluation metrics. Proceedings of SIGIR.
[7] Katz, S. M., 1996. Distribution of content words and phrases in text and language modeling. Natural Language Engineering.
[8] Klavans, Shaw, 1995. Lexical semantics in summarization. Proceedings of the First Annual Workshop of the IFIP Working Group for Natural Language Processing and Knowledge Representation.
[9] Mani, 2002. Automatic summarization. A tutorial presented at ICON.
[10] Nakao, 2000. An algorithm for one-page summarization of a long text based on thematic hierarchy detection. Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics.
[11] Nomoto, Matsumoto, 2001. A new approach to unsupervised text summarization. Proceedings of the 24th ACM SIGIR.
[12] Radev, McKeown, 1998. Generating natural language summaries from multiple online sources. Computational Linguistics.
[13] Ramiz M. Aliguliyev, 2008. A new sentence similarity measure and sentence based extractive technique for automatic text summarization. Expert Systems with Applications.
[14] Shen, D., Sun, J.-T., Li, H., Yang, Q., & Chen, Z., 2007. Document summarization using conditional random fields. Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI 2007), January 6-12, pp. 2862-2867, Hyderabad, India.


[15] Strzalkowski, Wise, Wang, 1998. A robust practical text summarization system. Proceedings of the Fifteenth National Conference on AI.
[16] Wan, X., 2008. Using only cross-document relationships for both generic and topic-focused multi-document summarizations. Information Retrieval.
[17] Zechner, 1996. Fast generation of abstracts from general domain text corpora by extracting relevant sentences. COLING.
[18] Zhang Qi, Huang Xuan-jing, Wu Li-de, 2004. A new method for calculating similarity between sentences and application on automatic text summarization. Journal of Chinese Information Processing.
