Knowledge discovery in Great Textual Data Bases

Plekhanov Russian Academy of Economics. Doctor V. Romanov, student E. Pantileeva. Knowledge discovery in large text data bases using the MST algorithm. Data Mining 2005, 25 – 27 May 2005, Skiathos.

description

Text Processing & Knowledge Discovery System for Digest Preparing and Decision Making

Transcript of Knowledge discovery in Great Textual Data Bases

Page 1: Knowledge discovery in Great Textual Data Bases

Plekhanov Russian Academy of Economics

Doctor V. Romanov, student E. Pantileeva

Knowledge discovery in large text data bases using the MST algorithm

Data Mining 2005

25 – 27 May 2005

Skiathos

Page 2: Knowledge discovery in Great Textual Data Bases

DISCOVERING THE HIDDEN PROBLEM SITUATION STRUCTURE FROM A DOCUMENT SET

Manager (problem situation)

Document collection

Maximum Spanning Tree

Attribute Names and Values table

Relevant data collection

Word dictionary with the word frequencies

Pairs of words from the dictionary and pair frequencies

Maximum spanning tree forming and interpretation
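
The dictionary and pair-counting steps listed above can be sketched in a few lines. This is only an illustration under stated assumptions: the slides do not say how text is split into words or what counts as a pair, so the sketch assumes whitespace tokenization and counts a pair once for every document in which both words occur. The function name and the sample documents are hypothetical.

```python
from collections import Counter
from itertools import combinations

def term_and_pair_frequencies(documents):
    """Count in how many documents each word and each unordered word pair occurs."""
    word_freq = Counter()
    pair_freq = Counter()
    for text in documents:
        # Assumption: lower-cased whitespace tokens stand in for dictionary words.
        words = sorted(set(text.lower().split()))
        word_freq.update(words)
        # Pairs are stored in alphabetical order so (a, b) and (b, a) coincide.
        pair_freq.update(combinations(words, 2))
    return word_freq, pair_freq

# Hypothetical documents, for illustration only.
docs = [
    "blast furnace reconstruction",
    "blast furnace building",
    "cold rolling shop building",
]
word_freq, pair_freq = term_and_pair_frequencies(docs)
print(word_freq["blast"], pair_freq[("blast", "furnace")])
```

The pair counts produced this way are the raw material for the edge weights of the tree built in the last step.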

Page 3: Knowledge discovery in Great Textual Data Bases

Information system supporting decision making, with embedded adaptation function

User (problem situation)

KNOWLEDGE BASE

Documents text processing & loading

OUTPUT DATA FOR QUERIES PROCESSED

METADATA:

THESAURUS

THEMATIC CLASSIFIER

SUBJECT CLASSIFIERS

NAVIGATOR

TEXT DATABASE

Queries processing

DATA BASE RECORDS

STATISTICAL DATA

ONE TERM FREQUENCIES COUNTING & NORMALISATION

ATTRIBUTES NAMES & VALUES DETECTION

TWO TERMS FREQUENCIES COUNTING & COVARIANCE MATRIX FORMATION

MAXIMUM SPANNING TREE DEVELOPMENT & EXPLICATION AS NAVIGATOR

FORMAL CONTEXT TABLE FORMING & FILLING

FORMAL CONCEPTS & RULES DISCOVERING

Page 4: Knowledge discovery in Great Textual Data Bases

Three Steps of MST Construction

ONE TERM FREQUENCIES COUNTING & NORMALISATION

TWO TERMS FREQUENCIES COUNTING & COVARIANCE MATRIX FORMATION

MAXIMUM SPANNING TREE DEVELOPMENT

Maximum Spanning Tree
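
The first two steps can be illustrated with a small sketch. The slides do not define the normalisation or the covariance estimator, so the following is an assumption: each term is treated as a 0/1 occurrence indicator per document, the normalised frequency is the share of documents containing the term, and the covariance of two terms is the share of documents containing both minus the product of the two shares. The function name and the sample data are hypothetical.

```python
def frequencies_and_covariance(doc_terms, terms):
    """doc_terms: list of term sets, one per document.

    Returns (freq, cov): normalised one-term frequencies and the covariance
    of the 0/1 occurrence indicators for every ordered pair of terms.
    """
    n = len(doc_terms)
    # Step 1: one-term frequencies, normalised by the number of documents.
    freq = {t: sum(t in doc for doc in doc_terms) / n for t in terms}
    # Step 2: two-term frequencies and the covariance matrix.
    cov = {}
    for a in terms:
        for b in terms:
            joint = sum(a in doc and b in doc for doc in doc_terms) / n
            cov[(a, b)] = joint - freq[a] * freq[b]
    return freq, cov

# Hypothetical term sets, for illustration only.
doc_terms = [
    {"blast", "furnace"},
    {"blast", "furnace", "building"},
    {"building", "shop"},
    {"shop"},
]
freq, cov = frequencies_and_covariance(doc_terms, ["blast", "furnace", "building", "shop"])
print(freq["blast"], cov[("blast", "furnace")], cov[("blast", "shop")])
```

Terms that tend to appear together get a positive covariance and terms that avoid each other get a negative one, which is what makes the matrix usable as edge weights in the third step.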

Page 5: Knowledge discovery in Great Textual Data Bases

The Construction of Maximum Spanning Tree

The Term Connectedness Graphs

The Pairs of Terms Frequencies Matrix

Page 6: Knowledge discovery in Great Textual Data Bases

The Construction of Maximum Spanning Tree

Maximum Spanning Tree

Maximum Spanning Tree Matrix
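
The slides do not name the algorithm used to build the tree. One standard choice is Kruskal's algorithm with edges taken in descending weight order, sketched below. The function name, the term labels and the weights are hypothetical; in the system described here the weights would come from the pair frequency or covariance matrix.

```python
def maximum_spanning_tree(nodes, weighted_edges):
    """Kruskal's algorithm with edges processed from heaviest to lightest.

    weighted_edges: iterable of (weight, node_a, node_b).
    Returns the chosen edges as (node_a, node_b, weight).
    """
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for weight, a, b in sorted(weighted_edges, reverse=True):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:          # the edge joins two separate components
            parent[root_a] = root_b
            tree.append((a, b, weight))
    return tree

# Hypothetical weights, for illustration only.
edges = [
    (0.25, "blast", "furnace"),
    (0.10, "furnace", "reconstruction"),
    (0.05, "blast", "reconstruction"),
    (0.08, "reconstruction", "building"),
]
mst = maximum_spanning_tree(["blast", "furnace", "reconstruction", "building"], edges)
print(mst)
```

In this example the weakest link, between "blast" and "reconstruction", is dropped because the two terms are already connected through "furnace".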

Page 7: Knowledge discovery in Great Textual Data Bases

A dynamic picture is formed as the maximum spanning tree of a graph that represents the covariance matrix for pairs of words/lemmas or concepts. The MST serves as a dynamically changing thesaurus or semantic net for query navigation.

Page 8: Knowledge discovery in Great Textual Data Bases

Above each word there is a sign of one of two kinds: leaf or branch. A word marked with the leaf sign has no connections further down the tree. Words marked with the branch sign permit further navigation along the route.

The user begins a retrieval session by browsing the dictionary with word frequencies and choosing a word to include in the query.
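
The leaf and branch signs can be derived once the tree is oriented away from the word the user picked. The following is a minimal sketch under that assumption; the function names and the three tree edges are hypothetical and do not reproduce the actual tree from the slides.

```python
from collections import defaultdict, deque

def root_tree(tree_edges, root):
    """Orient an undirected tree away from root; returns {word: [children]}."""
    neighbours = defaultdict(list)
    for a, b in tree_edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    children = {root: []}
    queue = deque([root])
    while queue:                          # breadth-first walk from the chosen word
        word = queue.popleft()
        for nxt in neighbours[word]:
            if nxt not in children:       # not visited yet
                children[nxt] = []
                children[word].append(nxt)
                queue.append(nxt)
    return children

def sign(children, word):
    """'branch' if navigation can continue below the word, otherwise 'leaf'."""
    return "branch" if children[word] else "leaf"

# Hypothetical tree edges, for illustration only.
tree_edges = [
    ("reconstruction", "blast furnace"),
    ("reconstruction", "cold rolling shop"),
    ("blast furnace", "Lipetsk city"),
]
children = root_tree(tree_edges, "reconstruction")
print(sign(children, "blast furnace"), sign(children, "cold rolling shop"))
```

Starting from "reconstruction", the user could continue through "blast furnace" (a branch) but not through "cold rolling shop" (a leaf).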

Page 9: Knowledge discovery in Great Textual Data Bases

An example of MST for thematic class “Production reconstruction”

metallurgical plants

quarters (of year)

Ferrous metallurgy department

blast furnaces

I

Lipetsk city

works

distribution

Main planning department

putting into operation

Chemical machinery construction department

production

Devices industry department

Building (process)

Electrical engineering department

devices

basic funds of industry

Main supplying department

equipment (subject)

Magnitogorsk city

cold rolling shop

group of metallurgy enterprises

ensuring

objects

properties apportionment

steel

Mounting & building department

reconstruction

fulfilment

elaboration

Heavy engineering works department

Lipetsk steel

half-year

utilization

scrap metal

Main management board department

enterprises

Page 10: Knowledge discovery in Great Textual Data Bases

THE SEMANTICS OF THE PROBLEM SITUATION-1

DATA BASE DOMAINS: RG – REGIONS, SA – STATE AGENCIES, EN – ENTERPRISES, KM – KIND OF MAKE, SP – SUPPLIER, OB – OBJECT, DT – DATE, KR – KIND OF RESOURCE, EX – EXECUTOR, PR – PURPOSE, RN – RECIPIENT…

Page 11: Knowledge discovery in Great Textual Data Bases

THE SEMANTICS OF THE PROBLEM SITUATION-2

Relations:

"EQUIPMENT_SUPPLY" ES("KIND_OF_MAKE"/KM, "SUPPLIER"/SP, "OBJECT"/OB, "ENTERPRISE"/EN, "DATE"/DT),

"RESOURCE_APPORTION" RA("KIND_OF_RESOURCE"/KR, "EXECUTOR"/EX, "PURPOSE"/PR, "RECIPIENT"/RN, "DATE"/DT),

"RECONSTRUCTION" RC("OBJECT"/OB, "ENTERPRISE"/EN, "REGION"/RG, "STATE_AGENCY"/SA, "DATE"/DT, "PURPOSE"/PR),

Page 12: Knowledge discovery in Great Textual Data Bases

THE SEMANTICS OF THE PROBLEM SITUATION-3

QUESTIONS: WHO IS SUPPLYING EQUIPMENT FOR (OBJECT, REGION, KIND_OF_EQUIPMENT)? etc.
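
A question of this kind can be read as a query over the ES and RC relations. The sketch below is an interpretation, not the system's query mechanism: it assumes that KIND_OF_MAKE in ES plays the role of the kind of equipment, and that the region of an object is found by matching ES and RC rows on OBJECT and ENTERPRISE. The class names, the function name and all rows are hypothetical.

```python
from typing import NamedTuple

class EquipmentSupply(NamedTuple):   # relation ES
    kind_of_make: str
    supplier: str
    obj: str
    enterprise: str
    date: str

class Reconstruction(NamedTuple):    # relation RC
    obj: str
    enterprise: str
    region: str
    state_agency: str
    date: str
    purpose: str

def suppliers_for(es_rows, rc_rows, obj, region, kind):
    """Suppliers of `kind` for `obj`, restricted to enterprises that RC places in `region`."""
    # Assumption: ES and RC rows are linked through (OBJECT, ENTERPRISE).
    located = {(r.obj, r.enterprise) for r in rc_rows if r.region == region}
    return sorted({
        e.supplier
        for e in es_rows
        if e.obj == obj and e.kind_of_make == kind and (e.obj, e.enterprise) in located
    })

# Hypothetical rows, for illustration only.
es_rows = [
    EquipmentSupply("rolling mill", "Supplier A", "cold rolling shop", "Plant 1", "date 1"),
    EquipmentSupply("rolling mill", "Supplier B", "cold rolling shop", "Plant 2", "date 2"),
]
rc_rows = [
    Reconstruction("cold rolling shop", "Plant 1", "Region X", "Agency Y", "date 3", "purpose 1"),
    Reconstruction("cold rolling shop", "Plant 2", "Region Z", "Agency Y", "date 4", "purpose 2"),
]
answer = suppliers_for(es_rows, rc_rows, "cold rolling shop", "Region X", "rolling mill")
print(answer)
```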

RULES: "STATEMENT_#111_EXECUTION"

SE("DATE_1"/DT, "DATE_2"/DT, "PRODUCTION_VOLUME_1"/VL, "PRODUCTION_VOLUME_2"/VL, "OBJECT"/OB, "ENTERPRISE"/EN) :-

RA("KIND_OF_RESOURCE"/KR, "EXECUTOR"/EX, "PURPOSE"/PR, "RECIPIENT"/RN, "DATE"/DT),

MN("KIND_OF_EQUIPMENT"/KE, "OBJECT"/OB, "ENTERPRISE"/EN, "DATE"/DT, "PURPOSE"/PR),...

Page 13: Knowledge discovery in Great Textual Data Bases

CONTEXT FORMING FOR THE PROBLEM SITUATION

Documents

table

Context

1. Recognizing attribute names and values in documents.

2. Filling the “documents-attributes” table.

Let the result of lexical and categorical analysis be a set of terms d_i, extracted via the mapping: value – domain – term.

For the texts q_k ∈ Q we can compose a matrix M, named the context, whose elements m_ki indicate whether term d_i occurs in document q_k.
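
As a small illustration of the definition above, the context can be built as one row of 0/1 values per document. This sketch assumes the terms have already been extracted from each text; the function name and the contents of the two sample documents are hypothetical, while the term names are taken from the "Reconstruction" example of this presentation.

```python
def build_context(documents, terms):
    """documents: {document name: set of extracted terms}.

    Returns {document name: row}, where row[i] is 1 if terms[i] occurs
    in the document and 0 otherwise (the element m_ki of the context M).
    """
    return {
        name: [int(term in found) for term in terms]
        for name, found in documents.items()
    }

terms = [
    "main management board",
    "building",
    "cold rolling shop",
    "reconstruction",
    "blast furnace",
]
# Hypothetical extraction results, for illustration only.
documents = {
    "q1": {"main management board", "reconstruction", "blast furnace"},
    "q2": {"main management board", "building", "cold rolling shop"},
}
M = build_context(documents, terms)
print(M["q1"], M["q2"])
```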

Page 14: Knowledge discovery in Great Textual Data Bases

STAGES OF DB SCHEMA EXTRACTION FROM SET OF TEXTS

Context

Formal Concept Analysis

Concepts

The problem situation data base schema
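
The step from a context to its concepts can be sketched as follows. The slides do not say which Formal Concept Analysis algorithm is used, so this is a simple illustrative one: concept intents are obtained as all intersections of object attribute sets (plus the full attribute set), and each extent is the set of objects that have every attribute of the intent. The function name and the three-object context are hypothetical.

```python
def formal_concepts(context):
    """context: {object: set of attributes}. Returns a list of (extent, intent) pairs."""
    all_attrs = frozenset().union(*context.values())
    intents = {all_attrs}
    for attrs in context.values():
        # Close the family of intents under intersection with this object's attributes.
        intents |= {frozenset(attrs) & known for known in intents}
    concepts = []
    for intent in intents:
        extent = frozenset(o for o, attrs in context.items() if intent <= attrs)
        concepts.append((extent, intent))
    # Most general concepts (smallest intents) first.
    return sorted(concepts, key=lambda c: (len(c[1]), sorted(c[1])))

# Hypothetical context, for illustration only.
context = {
    "O1": {"main management board", "reconstruction", "blast furnace"},
    "O2": {"main management board", "building", "cold rolling shop"},
    "O3": {"main management board", "cold rolling shop"},
}
concepts = formal_concepts(context)
for extent, intent in concepts:
    print(sorted(extent), sorted(intent))
```

This brute-force closure is adequate for small contexts like the one on the slides; larger contexts would normally call for a dedicated algorithm.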

Page 15: Knowledge discovery in Great Textual Data Bases

The problem situation description in the data base

Document

Tokenization

Morphological analysis

Semantical analysis

Syntactical analysis

User interface forms for data entering

Data base records

Data base loading

Page 16: Knowledge discovery in Great Textual Data Bases

Concepts of the situation “Reconstruction”

f1:=main management board, f2:=building, f3:=cold rolling shop, f4:=reconstruction, f5:=blast furnace

Context table: objects O1 to O5 (rows), attributes f1 to f5 (columns)

C1:=(main management board)

C2:=(main management board, reconstruction, blast furnace)

C3:=(main management board, cold rolling shop)

C4:=(main management board, building, cold rolling shop)

C5:=(main management board, cold rolling shop, reconstruction, blast furnace)

C6:=(main management board, building, reconstruction, blast furnace)

C7:=(main management board, building)

Page 17: Knowledge discovery in Great Textual Data Bases

The Hasse diagram of concept lattice

C1:=(main management board)

C2:=(main management board, reconstruction, blast furnace)

C3:=(main management board, cold rolling shop)

C4:=(main management board, building, cold rolling shop)

C5:=(main management board, cold rolling shop, reconstruction, blast furnace)

C6:=(main management board, building, reconstruction, blast furnace)

C7:=(main management board, building)
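
The edges of such a diagram can be derived from the concept intents alone. The sketch below orders the seven listed concepts by intent inclusion and keeps only the covering pairs, that is, pairs with no third listed concept strictly between them. The function name is hypothetical, and because the drawing itself did not survive in this transcript, the computed edges are a reconstruction that may differ from the original figure.

```python
intents = {
    "C1": {"main management board"},
    "C2": {"main management board", "reconstruction", "blast furnace"},
    "C3": {"main management board", "cold rolling shop"},
    "C4": {"main management board", "building", "cold rolling shop"},
    "C5": {"main management board", "cold rolling shop", "reconstruction", "blast furnace"},
    "C6": {"main management board", "building", "reconstruction", "blast furnace"},
    "C7": {"main management board", "building"},
}

def hasse_edges(intents):
    """Pairs (general, specific): the specific intent strictly contains the
    general one and no other listed intent lies strictly between them."""
    names = sorted(intents)
    edges = []
    for general in names:
        for specific in names:
            if intents[general] < intents[specific] and not any(
                intents[general] < intents[mid] < intents[specific] for mid in names
            ):
                edges.append((general, specific))
    return edges

edges = hasse_edges(intents)
print(edges)
```

C1, whose intent is contained in every other intent, comes out as the most general concept, with C2, C3 and C7 directly below it.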

Page 18: Knowledge discovery in Great Textual Data Bases

Digest of situation

User (problem situation)

Document collection

Official structure | Object name | Actions | Region | Action’s effect | Time
Main Management board | Blast furnace | reconstruction | Lipetsk city | Scrap metal utilization melioration | IV quarter of xxx-year
Main Management board | Cold rolling shop | building | Magnitogorsk city | Steel production increment | I half of xxx-year
Mounting & building department | Blast furnace | Putting into operation | Lipetsk city | Equipment mounted | II quarter of xxx-year