A Knowledge-Based Information Extraction Prototype for Data-Rich Documents in the Information Technology Domain


  • 7/31/2019 A Knowledge-Based information Extraction Prototype for Data-Rich Documents in the Information Technology Domain

    1/171


    O = {S,C,h,g}

    S

    Cp

    Ct

    h h : (Cp S) (Cp Ct)

    h hg

    g : Cp (Cp Ct)g g

    O OL = {S,C,h,g,L,f}

    L

    f f : (c1, c2, . . . , cn) L n 1ci C c1, c2, . . . , cn h

    g Lf

    h g f

    OO

    S C h g


    [Figure: fragment of the laptop ontology graph. Node labels: Laptop, ProductID, Brand, Family, Model, Processor, Speed, FrequencyMeasurement, FrequencyMagnitude, FrequencyUnits, MHz, GHz, Cache, Size, Level, FSB, Memory, Installed, Max.Installable, MemorySize, MemoryMagnitude, MemoryUnits, KB, MB, GB, Magnitude, Decimal, Real, Integer, HardDisk. Edge types: has-part, has-attribute, (is-a)^-1. Node types: initial node, internal node, terminal node.]


    [Figure: two histograms of the ontology graph. First panel: count of nodes by node out-degree (0 to 23), with the labels Laptop, TimeUnits, DistanceUnits, Battery, HardDisk and terminal nodes. Second panel: count of nodes by node in-degree (1 to 45), with the labels Boolean, Brand, Integer, Family, Magnitude, Technology, Version, Model, Digit and TimeMeasurement.]


    [Figure: lexicon entry with three levels: CONCEPT (REF_HardDisk), TERMS ([HardDisk, HD, HDD, HardDiskDrive, Disk, HardDrive]) and TOKENS.]


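    For two strings $s_m$ and $s_n$ of lengths $m$ and $n$, the edit distance can be computed in $O(mn)$ time and $O(\min(m, n))$ space.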


    Edit-distance matrix for SATURDAY (rows) and SUNDAY (columns):

              S  U  N  D  A  Y
           0  1  2  3  4  5  6
        S  1  0  1  2  3  4  5
        A  2  1  1  2  3  3  4
        T  3  2  2  2  3  4  4
        U  4  3  2  3  3  4  5
        R  5  4  3  3  4  4  5
        D  6  5  4  4  3  4  5
        A  7  6  5  5  4  3  4
        Y  8  7  6  6  5  4  3
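The SATURDAY/SUNDAY matrix can be reproduced with a short dynamic-programming routine. The following is a minimal sketch, assuming unit costs for insertion, deletion and substitution (the usual Levenshtein setting); the function name `edit_distance` is illustrative and not taken from the source. It keeps only two rows of the matrix, so memory stays proportional to the shorter string.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance with unit costs, keeping two matrix rows."""
    # Put the shorter string along the row so memory is O(min(m, n)).
    if len(s) < len(t):
        s, t = t, s
    previous = list(range(len(t) + 1))  # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        current = [i]  # distance from s[:i] to the empty prefix of t
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # delete cs
                current[j - 1] + 1,      # insert ct
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(edit_distance("SATURDAY", "SUNDAY"))
```

The value returned for this pair corresponds to the bottom-right cell of the matrix.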

    k

    G

    d(c1, c2)

    a b

    c || ||a ca,

    c,a ca,b


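    Bag distance between two strings $s_m$ and $s_n$ of lengths $m$ and $n$, where each string is treated as a bag (multiset) of characters and $-$ denotes bag difference:

    $$d_{bag}(s_m, s_n) = \max\{|s_m - s_n|,\ |s_n - s_m|\}$$

    Example: for the bags {a,a,a,b} and {a,a,b,c,c}, the difference {a,a,a,b} $-$ {a,a,b,c,c} is {a}.

    Complexity: $O(mn)$ for edit distance versus $O(m + n)$ for bag distance.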
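A minimal sketch of the bag distance, assuming strings are compared as multisets of characters. It relies on `collections.Counter`, whose subtraction operator drops non-positive counts and therefore behaves like bag difference. The function name `bag_distance` is illustrative.

```python
from collections import Counter


def bag_distance(x: str, y: str) -> int:
    """max(|x - y|, |y - x|) where '-' is multiset (bag) difference."""
    bag_x, bag_y = Counter(x), Counter(y)
    # Counter subtraction keeps only positive counts, i.e. bag difference.
    only_in_x = sum((bag_x - bag_y).values())
    only_in_y = sum((bag_y - bag_x).values())
    return max(only_in_x, only_in_y)


if __name__ == "__main__":
    print(bag_distance("aaab", "aabcc"))
```

Each string is scanned once to build its bag, which is where the linear cost comes from.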

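    Normalized compression distance between two strings $x, y$:

    $$NCD(x, y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}}$$

    where $C(x)$ is the compressed size of $x$ and $C(xy)$ is the compressed size of the concatenation of $x$ and $y$ under a compressor $C$ with the following properties:

    $C(xx) = C(x)$ and $C(\lambda) = 0$, with $\lambda$ the empty string
    $C(xy) \ge C(x)$
    $C(xy) = C(yx)$
    $C(xy) + C(z) \le C(xz) + C(yz)$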
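The NCD formula can be tried with any off-the-shelf compressor. The sketch below assumes `zlib` as the compressor $C$, which is a choice made here for illustration and not necessarily the one used in the source. Real compressors satisfy the listed properties only approximately, so the NCD of a string with itself is small but not exactly zero.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """C(x): size in bytes of the zlib-compressed input."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance using zlib as the compressor."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # C(xy): compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    x = b"hard disk drive 120 GB 5400 rpm " * 20
    y = b"the quick brown fox jumps over the lazy dog " * 20
    print(ncd(x, x), ncd(x, y))
```

The two sample byte strings are hypothetical; the first call should give a value near the low end of the range and the second a clearly larger one.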


    |AB||AB| R9

    max(|A\B|c,|B\A|c)n

    2|AB||A|+|B| R10

    min(|A\B,|B\A|)max(|A\B|,|B\A|)

    |AB||A||B|

    S17|AB|

    max(|A\B|c,|B\A|c)

    2|AB||AB|+|AB| R12

    min(|A\B,|cB\A|c)max(|A\B|c,|B\A|c)

    |AB|min(|A|,|B|)max(|A|,|B|) R13

    min(|A\B,|B\A|)|AB|

    |AB|min(|A|,|B|) R14

    min(|A|,|B|)|AB|

    |A B| R15 min(|A\B|c,|B\A|c)

    n|AB|C

    nRc3

    |AB|c

    min(|A|c,|B|c)|AB|

    nRc5

    |AB|c

    |AB|c

    R1 |AB|max(|A|,|B|) Rc8 max(|A|c

    ,|B|c

    )|AB|c

    R2|AB|c

    max(|A\B|c,|B\A|c) Rc14

    min(|A|c,|B|c)|AB|c

    R4|AB|c

    min(|A\B|c,|B\A|c) Sc17

    |AB|c

    max(|A\B|c,|B\A|c)

    R7max(|A\B|,|B\A|)

    |AB| Sc18

    |AB|c

    n

    R8max(|A|,|B|)

    |AB|

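    Monge-Elkan similarity between two multi-token strings $a$ and $b$, given a token-level similarity measure $sim$:

    $$sim_{MongeElkan}(a, b) = \frac{1}{|a|} \sum_{i=1}^{|a|} \max_{j=1}^{|b|} sim(a_i, b_j)$$

    Complexity terms: $O(|a| \times |b|)$ and $O(\min(|a|, |b|)^3)$.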
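A minimal sketch of the Monge-Elkan measure, assuming both strings are already split into tokens and that `sim` is any token-level similarity returning values in [0, 1]. The `exact` function is a hypothetical stand-in for such a measure.

```python
from typing import Callable, Sequence


def monge_elkan(a: Sequence[str], b: Sequence[str],
                sim: Callable[[str, str], float]) -> float:
    """Average, over the tokens of a, of the best match found in b."""
    if not a or not b:
        return 0.0  # assumption: empty input has no similarity
    return sum(max(sim(ai, bj) for bj in b) for ai in a) / len(a)


def exact(x: str, y: str) -> float:
    """Hypothetical token similarity: 1 for identical tokens, else 0."""
    return 1.0 if x == y else 0.0


if __name__ == "__main__":
    print(monge_elkan(["hard", "disk", "drive"], ["hard", "drive"], exact))
    print(monge_elkan(["hard", "drive"], ["hard", "disk", "drive"], exact))
```

The two calls give different values because the average is taken over the tokens of the first argument only, so the measure is not symmetric.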


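    Generalized Monge-Elkan similarity with exponent $m$:

    $$sim_{MongeElkan_m}(a, b) = \left( \frac{1}{|a|} \sum_{i=1}^{|a|} \left( \max_{j=1}^{|b|} sim(a_i, b_j) \right)^{m} \right)^{\frac{1}{m}}$$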
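The role of the exponent $m$ is easier to see numerically. The sketch below assumes tokenized inputs and a token-level similarity passed as `sim`; the lookup table of scores is made up for illustration. With $m = 1$ the measure reduces to the arithmetic mean of the best matches, and larger values of $m$ give more weight to the highest-scoring tokens.

```python
from typing import Callable, Sequence


def generalized_monge_elkan(a: Sequence[str], b: Sequence[str],
                            sim: Callable[[str, str], float],
                            m: float = 2.0) -> float:
    """Generalized mean (exponent m, m != 0) of the best match per token of a."""
    if not a or not b:
        return 0.0  # assumption: empty input has no similarity
    total = sum(max(sim(ai, bj) for bj in b) ** m for ai in a)
    return (total / len(a)) ** (1.0 / m)


# Hypothetical token similarities, used only to make the example concrete.
SCORES = {("x1", "y"): 1.0, ("x2", "y"): 0.5}


def table_sim(x: str, y: str) -> float:
    return SCORES.get((x, y), 0.0)


if __name__ == "__main__":
    for m in (1.0, 2.0, 10.0):
        print(m, generalized_monge_elkan(["x1", "x2"], ["y"], table_sim, m))
```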

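    $$Jaccard(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

    where $|\cdot|$ denotes set cardinality. Since $|A \cap B| = |A| + |B| - |A \cup B|$, the coefficient can be written without the intersection:

    $$Jaccard(A, B) = \frac{|A| + |B| - |A \cup B|}{|A \cup B|}$$

    Examples: $\frac{2}{3} = 0.6666$ and $\frac{0}{5} = 0$.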
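Both forms of the Jaccard coefficient can be checked against each other in a few lines. This is a minimal sketch using Python sets; the example sets are hypothetical and were chosen only so that the results match the two values 2/3 and 0/5. Returning 1.0 for two empty sets is a convention assumed here.

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|."""
    union = a | b
    if not union:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(union)


def jaccard_union_only(a: set, b: set) -> float:
    """Same coefficient written with |A|, |B| and |A ∪ B| only."""
    union_size = len(a | b)
    if union_size == 0:
        return 1.0  # assumed convention for two empty sets
    return (len(a) + len(b) - union_size) / union_size


if __name__ == "__main__":
    print(jaccard({"hard", "disk", "drive"}, {"hard", "disk"}))
    print(jaccard_union_only({"hard", "disk"}, {"cache", "size", "level"}))
```

The union-only form matters when the cardinality function is replaced by one that is defined for a collection of tokens but has no natural notion of intersection.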

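    Let $A = \{a_1, a_2, \ldots, a_n\}$ and $B = \{b_1, b_2, \ldots, b_m\}$, and let $card(\cdot)$ be a cardinality function. Then:

    $$Jaccard(A, B) = \frac{card(A) + card(B) - card(A \cup B)}{card(A \cup B)}$$

    The function $card(\cdot)$ satisfies:

    $card(a_1, a_2, \ldots, a_n) = 1$, if $(a_1 = a_2 = \ldots = a_n)$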


    rel1, rel2

    $1 - \frac{editDistance(A, B)}{\max(|A|, |B|)}$

    $\frac{\#commonBigrams(A, B)}{\max(|A|, |B|)}$

    $sim(a, b)$

    $m = 0.00001$, $m = 0.5$, $m = 1.5$, $m = 2$


    card(AB)card(AB)

    2card(AB)card(A)+card(B)

    card(AB)card(A)card(B)2card(AB)

    card(AB)+card(AB)card(AB)min(card(A),card(B))

    max(card(A),card(B))card(AB)

    min(card(A),card(B))card(AB)

    max(card(A),card(B))max(card(A),card(B))

    card(AB)

    min(card(A),card(B))card(AB)

    m = 5

    m = 10

    $m$; $sim_{max}(a, b) = \max_{i=1}^{|a|} \max_{j=1}^{|b|} sim(a_i, b_j)$

    sim(a, b)

    cardA(.)

    cardA(.)

    cardA(.)


    [Figure: recall, precision and F-measure versus similarity threshold; both axes range from 0 to 1.]


    [Figure: two precision-recall plots; both axes range from 0 to 1.]


    WilcoxonsRate =test

    75

    sim(a, b)

    WW n = 12


    = 0.75


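    $IC(c)$: information content of concept $c$.

    $$SIM_{Leacock\&Chodorow}(a, b) = -\log \frac{pathLength(a, b)}{2D}$$

    $depth(x)$: depth of concept $x$ in the taxonomy.

    $$SIM_{Wu\&Palmer}(a, b) = \frac{2 \cdot depth(lcs(a, b))}{pathLength(a, b) + 2 \cdot depth(lcs(a, b))}$$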
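The two path-based measures can be illustrated on a toy is-a tree. The sketch below rests on several assumptions that are not stated in the source: the `PARENT` map is a hypothetical taxonomy that only borrows a few concept names, the root has depth 1, $D$ is taken as the maximum depth of the tree, the path length is counted in edges through the lowest common subsumer (lcs), and the Leacock and Chodorow score of a concept with itself is reported as infinity because the path length is zero.

```python
import math

# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "Laptop": None,
    "Processor": "Laptop",
    "Memory": "Laptop",
    "Speed": "Processor",
    "Cache": "Processor",
    "Installed": "Memory",
}


def ancestors(c):
    """The concept itself followed by its ancestors up to the root."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT[c]
    return chain


def depth(c):
    return len(ancestors(c))  # assumption: the root has depth 1


MAX_DEPTH = max(depth(c) for c in PARENT)  # D in the formulas


def lcs(a, b):
    """Lowest common subsumer: deepest concept that is above both a and b."""
    above_a = set(ancestors(a))
    for c in ancestors(b):  # walks upward, so the first hit is the deepest
        if c in above_a:
            return c
    raise ValueError("concepts do not share a root")


def path_length(a, b):
    return depth(a) + depth(b) - 2 * depth(lcs(a, b))


def wu_palmer(a, b):
    d = depth(lcs(a, b))
    return 2 * d / (path_length(a, b) + 2 * d)


def leacock_chodorow(a, b):
    n = path_length(a, b)
    return math.inf if n == 0 else -math.log(n / (2 * MAX_DEPTH))


if __name__ == "__main__":
    print(wu_palmer("Speed", "Cache"), wu_palmer("Speed", "Installed"))
    print(leacock_chodorow("Speed", "Cache"))
```

Siblings under Processor score higher than concepts whose only shared ancestor is the root, which is the behavior both formulas are meant to capture.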


    [Figure: fragment of the laptop ontology. Node labels: Laptop, ProductID, Brand, Family, Model, Processor, Speed, Frequency, FrequencyMagnitude, FrequencyUnits, MHz, GHz, Decimal, Real, Cache, FSB, BusWidth, Memory, HardDisk.]

    $$SIM_{pathLength}(a, b) = depth(a) + depth(b) - 2 \cdot depth(lcs(a, b))$$

    $$LenFactor(a, b) = \frac{pathLength(a, b)}{2D}$$

    $$Spec(x) = \frac{depth(x)}{clusterDepth(x)}$$

    $$SpecFactor(a, b) = |Spec(a) - Spec(b)|$$

    $$SIM_{Altinas}(a, b) = \frac{1}{1 + LenFactor(a, b) + SpecFactor(a, b)}$$

    $clusterDepth(x)$
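The formulas above combine a length factor and a specificity factor. The following sketch shows one way to compute them on a toy tree. Everything about the data is hypothetical: the `PARENT` map only borrows concept names, the root has depth 1, and $D$ is the maximum depth of the tree. The surviving text does not define `clusterDepth`, so the code assumes that the cluster of a concept is the subtree under its top-level ancestor (the child of the root above it) and that `clusterDepth` is the depth of the deepest concept in that subtree.

```python
# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "Laptop": None,
    "Processor": "Laptop",
    "Memory": "Laptop",
    "Speed": "Processor",
    "Cache": "Processor",
    "Level": "Cache",
    "Installed": "Memory",
}


def ancestors(c):
    """The concept itself followed by its ancestors up to the root."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT[c]
    return chain


def depth(c):
    return len(ancestors(c))  # assumption: the root has depth 1


MAX_DEPTH = max(depth(c) for c in PARENT)  # D in the formulas


def lcs(a, b):
    above_a = set(ancestors(a))
    return next(c for c in ancestors(b) if c in above_a)


def path_length(a, b):
    return depth(a) + depth(b) - 2 * depth(lcs(a, b))


def cluster_depth(c):
    """Assumed definition: deepest level in the subtree of c's top-level ancestor."""
    chain = ancestors(c)
    top = chain[-2] if len(chain) >= 2 else chain[-1]
    return max(depth(x) for x in PARENT if top in ancestors(x))


def spec(c):
    return depth(c) / cluster_depth(c)


def sim_altinas(a, b):
    len_factor = path_length(a, b) / (2 * MAX_DEPTH)
    spec_factor = abs(spec(a) - spec(b))
    return 1 / (1 + len_factor + spec_factor)


if __name__ == "__main__":
    print(sim_altinas("Speed", "Cache"), sim_altinas("Speed", "Installed"))
```

A different definition of the cluster would change `spec` and therefore the scores, so the numbers printed here only illustrate the mechanics.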


    [Figure: the laptop ontology fragment (Laptop, ProductID, Brand, Family, Model, Processor, Speed, Frequency, FrequencyMagnitude, FrequencyUnits, MHz, GHz, Decimal, Real, Cache, FSB, BusWidth, Memory, HardDisk) with edge weights w=1, w=2, w=3 and w=4.]

    WeightedPathLength(Brand, GHz) = 3 + 4 + 4 + 3 + 2 + 1 = 17


    [Figure: words w1 to w5 and their senses; labels t3 and t5 also appear.]


    [Figure: a term, its tokens t1 to t5, and their senses.]


    $distance = 1 - normalizedSimilarity$


    [Figure: semantic paths in the ontology for the tokens 512 and MB, and for the term Cache. Path nodes: Laptop, HardDisk, VideoAdapter, Memory, Processor, Cache, CacheSize, Installed, MaxInstallable, MemorySize, MemoryMagnitude, MemoryUnits, Integer, Megabyte, REF_Cache.]


    [Figure: the tokens 512, MB and Cache with their candidate semantic paths (Semantic Path #1 to #6, ..., #n). Other labels: w, Semantic Relatedness Metric, P.]


    [Figure: target labels (A to E) and selected labels (A, B, X, Y) over token#1 to token#8, panels (A) and (B), marking true positives, false positives, false negatives and true negatives.]

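    Precision, recall and F-measure take values in $[0, 1]$:

    $$precision = \frac{TP}{TP + FP}$$

    $$recall = \frac{TP}{TP + FN}$$

    $$F\text{-}measure = \frac{2 \cdot precision \cdot recall}{precision + recall}$$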
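A minimal sketch of the three evaluation measures, computed from counts of true positives (TP), false positives (FP) and false negatives (FN). Returning 0.0 when a denominator is zero is a convention assumed here, and the counts in the usage line are made up.

```python
def precision_recall_f(tp: int, fp: int, fn: int):
    """Return (precision, recall, F-measure) from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0  # assumed convention when both are 0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    # Hypothetical counts: 3 correct labels, 1 spurious label, 2 missed labels.
    print(precision_recall_f(tp=3, fp=1, fn=2))
```

True negatives do not appear in any of the three formulas, which is why the function does not take them as input.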


    [Figure: two panels with axes from 0 to 1. First panel: Recall, Precision and F-measure versus threshold, together with Recall-Baseline, Precision-Baseline and F-measure-Baseline. Second panel: precision-recall plot.]


    [Figure: two panels of F1-score (0 to 1) versus noisy lexicon level (0, 25, 50, 75, 100) for ExactMatch-SimpleStr, ExactMatch-MongeElkan, EditDistance-SimpleStr and EditDistance-MongeElkan.]


    [Figure: grid of F-measure plots (horizontal axis 0.5 to 1, vertical axis 0 to 0.9) for EditDistance-MongeElkan, 2grams(Dice)-MongeElkan0 and EditDistance-cosine(A), at five lexicon conditions: noise-free lexicon and noisy lexicon 25, 50, 75 and 100.]


    [Figure: number of graph nodes (0 to 250,000) versus threshold (0 to 1).]


    $$Score = \frac{\#\ \text{of words in the match}}{\#\ \text{of characters in the acronym}}$$

    $Score$


    $$\frac{\text{bits}_{\text{acronym model}}}{\text{bits}_{\text{text compression model}}}$$

    = 0.2


    [Figure: a pattern aligned against a document fragment ("Hard Disk: 120 GB ... Serial ... 5400 rpm"), with positions labeled S, T and A.]


    [Figure: F-Measure versus threshold (both axes from 0 to 1) for three configurations: Baseline (ExactMatch-SimpleStr.), Best Configuration, and Best Conf. + Acronym Matcher.]


    154/171

  • 7/31/2019 A Knowledge-Based information Extraction Prototype for Data-Rich Documents in the Information Technology Domain

    155/171

  • 7/31/2019 A Knowledge-Based information Extraction Prototype for Data-Rich Documents in the Information Technology Domain

    156/171

  • 7/31/2019 A Knowledge-Based information Extraction Prototype for Data-Rich Documents in the Information Technology Domain

    157/171

  • 7/31/2019 A Knowledge-Based information Extraction Prototype for Data-Rich Documents in the Information Technology Domain

    158/171

  • 7/31/2019 A Knowledge-Based information Extraction Prototype for Data-Rich Documents in the Information Technology Domain

    159/171

  • 7/31/2019 A Knowledge-Based information Extraction Prototype for Data-Rich Documents in the Information Technology Domain

    160/171

  • 7/31/2019 A Knowledge-Based information Extraction Prototype for Data-Rich Documents in the Information Technology Domain

    161/171

  • 7/31/2019 A Knowledge-Based information Extraction Prototype for Data-Rich Documents in the Information Technology Domain

    162/171

  • 7/31/2019 A Knowledge-Based information Extraction Prototype for Data-Rich Documents in the Information Technology Domain

    163/171

  • 7/31/2019 A Knowledge-Based information Extraction Prototype for Data-Rich Documents in the Information Technology Domain

    164/171

  • 7/31/2019 A Knowledge-Based information Extraction Prototype for Data-Rich Documents in the Information Technology Domain

    165/171

  • 7/31/2019 A Knowledge-Based information Extraction Prototype for Data-Rich Documents in the Information Technology Domain

    166/171

  • 7/31/2019 A Knowledge-Based information Extraction Prototype for Data-Rich Documents in the Information Technology Domain

    167/171

  • 7/31/2019 A Knowledge-Based information Extraction Prototype for Data-Rich Documents in the Information Technology Domain

    168/171

  • 7/31/2019 A Knowledge-Based information Extraction Prototype for Data-Rich Documents in the Information Technology Domain

    169/171

  • 7/31/2019 A Knowledge-Based information Extraction Prototype for Data-Rich Documents in the Information Technology Domain

    170/171

  • 7/31/2019 A Knowledge-Based information Extraction Prototype for Data-Rich Documents in the Information Technology Domain

    171/171