Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference)...

207
Copyright © 2019 KNIME AG Text Mining Course for KNIME Analytics Platform KNIME AG

Transcript of Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference)...

Page 1: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Copyright © 2019 KNIME AG

Text Mining Coursefor KNIME Analytics PlatformKNIME AG

Page 2: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Table of Contents

1

1. The Open Analytics Platform

2. The Text Processing Extension

3. Importing Text

4. Enrichment

5. Preprocessing

6. Transformation

7. Classification

8. Visualization

9. Clustering

10. Supplementary Workflows

Copyright © 2019 KNIME AG®

2

Page 3: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

OverviewKNIME Analytics Platform

1Copyright © 2019 KNIME AG®

3

Page 4: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

What is KNIME Analytics Platform?

• A tool for data analysis, manipulation, visualization, and reporting

• Based on the graphical programming paradigm

• Provides a diverse array of extensions:

– Text Mining

– Network Mining

– Cheminformatics

– Many integrations, such as Java, R, Python, Weka, Keras, H2O, etc.

Copyright © 2019 KNIME AG®

4

Page 5: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Visual KNIME Workflows

NODES perform tasks on data

Nodes are combined to createWORKFLOWS

3

Status

Inputs Outputs

Not Configured

Configured

Executed

Error

Copyright © 2019 KNIME AG®

5

Page 6: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Data Access

• Databases– MySQL, PostgreSQL– any JDBC (Oracle, DB2, MS SQL

Server)

• Files– CSV, txt– Excel, Word, PDF– SAS, SPSS– XML– PMML– Images, texts, networks, chem

• Web, Cloud– REST, Web services– Twitter, Google

4Copyright © 2019 KNIME AG®

6

Page 7: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Big Data

• Spark

• HDFS support

• Hive

• Impala

• Vertica

• In-database processing

5Copyright © 2019 KNIME AG®

7

Page 8: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Transformation

• Preprocessing

– Row, column, matrix based

• Data blending

– Join, concatenate, append

• Aggregation

– Grouping, pivoting, binning

• Feature Creation and Selection

6Copyright © 2019 KNIME AG®

8

Page 9: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Analysis & Data Mining

• Regression– Linear, logistic

• Classification– Decision tree, ensembles, SVM,

MLP, Naïve Bayes

• Clustering– k-means, DBSCAN, hierarchical

• Validation– Cross-validation, scoring, ROC

• Deep Learning– Keras, DL4J

• External– R, Python, Weka, H2O, Keras

7Copyright © 2019 KNIME AG®

9

Page 10: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Visualization

• Interactive Visualizations

• JavaScript-based nodes

– Scatter Plot, Box Plot, Line Plot

– Networks, ROC Curve, Decision Tree

– Adding more with each release!

• Misc

– Tag cloud, open street map, molecules

• Script-based visualizations

– R, Python

8Copyright © 2019 KNIME AG®

10

Page 11: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Deployment

• Database

• Files

– Excel, CSV, txt

– XML

– PMML

– to: local, KNIME Server, SSH-, FTP-Server

• BIRT Reporting

9Copyright © 2019 KNIME AG®

11

Page 12: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Analysis & MiningStatisticsData MiningMachine LearningWeb AnalyticsText MiningNetwork AnalysisSocial Media AnalysisR, Weka, PythonCommunity / 3rd

Data AccessMySQL, Oracle, ...SAS, SPSS, ...Excel, Flat, ...Hive, Impala, ...XML, JSON, PMMLText, Doc, Image, ...Web CrawlersIndustry SpecificCommunity / 3rd

TransformationRowColumnMatrixText, ImageTime SeriesJavaPythonCommunity / 3rd

VisualizationRJFreeChartJavaScriptCommunity / 3rd

Deploymentvia BIRTPMMLXML, JSONDatabasesExcel, Flat, etc.Text, Doc, ImageIndustry SpecificCommunity / 3rd

Over 2000 Native and Embedded Nodes Included:

10Copyright © 2019 KNIME AG®

12

Page 13: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Overview

• Installing KNIME Analytics Platform

• The KNIME Workspace

• The KNIME File Extensions

• The KNIME Workbench

– Workflow editor

– Explorer

– Node Repository

– Node Description

• Installing new features

Copyright © 2019 KNIME AG®

13

Page 14: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Install KNIME Analytics Platform

• Select the KNIME version for your computer:

– Mac

– Windows – 32 or 64 bit

– Linux

• Download archive and extract the file, or download installer package and run it

Copyright © 2019 KNIME AG®

14

Page 15: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Start KNIME Analytics Platform

• Use the shortcut created by the installer

• Or go to the installation directory and launch KNIME via the knime.exe

Copyright © 2019 KNIME AG®

15

Page 16: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

The KNIME Workspace

• The workspace is the folder/directory in which workflows (and potentially data files) are stored for the current KNIME session.

• Workspaces are portable (just like KNIME)

14Copyright © 2019 KNIME AG®

16

Page 17: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

The KNIME Workbench

15

KNIME Explorer

Workflow Coach

Node Repository

Workflow Editor

Outline

Console

Node Description

Copyright © 2019 KNIME AG®

17

Page 18: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

KNIME Explorer

• In LOCAL you can access your own workflow projects.

• The Explorer toolbar on the top has a search box and buttons to– select the workflow displayed in

the active editor

– refresh the view

• The KNIME Explorer can contain 4 types of content:– Workflows

– Workflow groups

– Data files

– Metanode templates

Copyright © 2019 KNIME AG®

18

Page 19: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Creating New Workflows, Importing and Exporting

• Right-click in KNIME Explorer to create new workflow or workflow group or to import workflow

• Right-click on workflow or workflow group to export

Copyright © 2019 KNIME AG®

19

Page 20: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Node Repository

• The Node Repository lists all KNIME nodes

• The search box has 2 modes– Standard Search – exact match

of node name

– Fuzzy Search – finds the most similar node name

• Nodes can be added by drag and drop from the Node Repository to the Workflow Editor.

Copyright © 2019 KNIME AG®

20

Page 21: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Console and Other Views

• Console view prints out error and warning messages about what is going on under the hood

• Click on View and select Other… to add different views

– Node Monitor, Licenses, etc.

• KNIME Hub Search View: search for nodes and workflows on the Hub

Copyright © 2019 KNIME AG®

21

Page 22: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Node Description

• The Node Description window gives information about:

– Node Functionality

– Input & Output

– Node Settings

– Ports

– References to literature

Copyright © 2019 KNIME AG®

22

Page 23: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Workflow Coach

• Node recommendation engine

– Gives hints about which node use next in the workflow

– Based on KNIME communities' usage statistics

– Based on own KNIME workflows

Copyright © 2019 KNIME AG®

23

Page 24: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Tool Bar

The buttons in the toolbar can be used for the active workflow. The most important buttons:

– Execute selected and executable nodes (F7)

– Execute all executable nodes

– Execute selected nodes and open first view

– Cancel all selected, running nodes (F9)

– Cancel all running nodes

Copyright © 2019 KNIME AG®

24

Page 25: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

KNIME File Extensions

• Dedicated file extensions for Workflows and Workflow groups associated with KNIME Analytics Platform

• *.knwf for KNIME Workflow Files

• *.knar for KNIME Archive Files

23Copyright © 2019 KNIME AG®

25

Page 26: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

More on Nodes…

A node can have 3 states:

24

Not Configured: The node is waiting for configuration or incoming data.

Configured:The node has been configured correctly, and can be executed.

Executed: The node has been successfully executed. Results may be viewed and used in downstream nodes.

Copyright © 2019 KNIME AG®

26

Page 27: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Inserting and Connecting Nodes

• Insert nodes into workspace by dragging them from Node Repository orby double-clicking in Node Repository

• Connect nodes by left-clicking output port of Node A and dragging thecursor to (matching) input port of Node B

• Common port types:

Data

Model

Image

Flow Variable

Database Connection

Database Query

Copyright © 2019 KNIME AG®

27

Page 28: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Node Configuration

• Most nodes require configuration

• To access a node configuration window:

– Double-click the node

– Right-click -> Configure

Copyright © 2019 KNIME AG®

28

Page 29: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Node Execution

• Right-click node

• Select Execute in context menu

• If execution is successful, statusshows green light

• If execution encounters errors, statusshows red light

Copyright © 2019 KNIME AG®

29

Page 30: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Node Views

• Right-click node

• Select Views in context menu

• Select output port to inspect execution results

28

Plot View

Data View

Copyright © 2019 KNIME AG®

30

Page 31: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Curved Connections!

29Copyright © 2019 KNIME AG®

31

Page 32: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Getting Started: KNIME Example Server

• Public repository with large selection of example workflows for many, many applications

• Connect via KNIME Explorer

30Copyright © 2019 KNIME AG®

32

Page 33: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

KNIME Community Workflow Hub

A place to share knowledge about Workflows and Nodes https://hub.knime.com

Copyright © 2019 KNIME AG®

33

Page 34: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Hot Keys (for Future Reference)

32

Task Hot key Description

Node Configuration F6 opens the configuration window of the selected node

Node View Shift + F6 opens first out-port view

Node Execution

F7 executes selected configured nodes

Shift + F7 executes all configured nodes

Shift + F10 executes all configured nodes and opens all views

F9 cancels selected running nodes

Shift + F9 cancels all running nodes

Move Nodes and Annotations

Ctrl + Shift + Arrow moves the selected node in the arrow direction

Ctrl + Shift + PgUp/PgDown

moves the selected annotation in the front or in the back of all overlapping annotations

Workflow Operations

F8 resets selected nodes

Ctrl + S saves the workflow

Ctrl + Shift + S saves all open workflows

Ctrl + Shift + W closes all open workflows

Meta-node Shift + F12 opens meta-node wizard

Copyright © 2019 KNIME AG®

34

Page 35: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Stay connected with KNIME

33

Blog: knime.com/blog

Forum: forum.knime.com

KNIME Hub: hub.knime.com

Follow us on social media:

KNIME E-Learning Course:www.knime.com/e-learning-course

Copyright © 2019 KNIME AG®

35

Page 36: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

1

Today’s Example

Copyright © 2019 KNIME AG®

36

Page 37: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Today’s Example

• Classification of free-text documents is a common task in the field of text mining.

• It is used to categorize documents, i.e. assign pre-defined topics, or it can be used for sentiment analysis.

• Today we want to construct a workflow that reads and preprocesses text documents, transforms them into a numerical representation and builds a predictive model to assign pre-defined labels to documents.

• Additional tasks:– Sentiment analysis– Visualization of documents– Document clustering

2Copyright © 2019 KNIME AG®

37

Page 38: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Today’s Example

3

RatingTitle

FulltextAuthor

Copyright © 2019 KNIME AG®

38

Page 39: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Today’s Example

4

Goal:

• Build a classifier to distinguish between reviews about Italian or Chinese restaurants.

Review aboutan Italian or a Chinese restaurant?

Copyright © 2019 KNIME AG®

39

Page 40: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Today’s Example

5Copyright © 2019 KNIME AG®

40

Page 41: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Bonus Examples

6Copyright © 2019 KNIME AG®

41

Page 42: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

1

The KNIME Text Processing Extension

Copyright © 2019 KNIME AG®

42

Page 43: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Installation

2

1.) 2.) KNIME & Extensions -> KNIME Textprocessing

Copyright © 2019 KNIME AG®

43

Page 44: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Tip

• Increase maximum memory for KNIME

• Edit knime.ini

– Add “-Xmx3G” as last line of knime.ini file

– Replace 3 by the amount of gigabytes allocated for KNIME

• Useful additional extensions

– XML-Processing (KNIME extension)• Parsing and processing of XML documents

– KNIME JavaScript Views (Labs)• Tagged Document Viewer

3Copyright © 2019 KNIME AG®

44

Page 45: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Philosophy

4

… perhaps your nameis

Rumpelstiltskin[Person] ? …

… perhaps your nameis

Rumpelstiltskin[Person] ? …

Visualization

Cluster-ing

Classifi-cation

1 1 1 0 1 0 0 1 10 1 1 0 0 1 0 0 00 0 1 1 1 0 1 1 0

Copyright © 2019 KNIME AG®

45

Page 46: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Additional Data Types

• Document Cell

– Encapsulates a document• Title, sentences, terms, words

• Authors, category, source

• Generic meta data (key, value pairs)

• Term Cell

– Encapsulates a term• Words, tags

5Copyright © 2019 KNIME AG®

46

Page 47: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Data Table Structures

• Document table– List of documents

• Bag of words– Tuples of documents

and terms

• Document vectors– Numerical

representations of documents

6Copyright © 2019 KNIME AG®

47

Page 48: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Section Exercise

• Open KNIME

• Import workflows from USB stick

7Copyright © 2019 KNIME AG®

48

Page 49: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

1

Importing Text

Copyright © 2019 KNIME AG®

49

Page 50: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Data Source Nodes

• Typically characterized by:

– Orange color

– No input ports, 1 output port

2

Status

Node annotation

Output port

Copyright © 2019 KNIME AG®

50

Page 51: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: File Reader

• Workhorse of the KNIME Source nodes

– Reads text based files

– Many advanced features allow it to read most ‘weird’ files

3Copyright © 2019 KNIME AG®

51

Page 52: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

File Reader: Configuration

4

Preview

Basic Settings Advanced

Settings

File path

Node description

Copyright © 2019 KNIME AG®

52

Page 53: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: Excel Reader (XLS)

• Reads .xls and .xlsx file from Microsoft Excel

– Supports reading from multiple sheets

5Copyright © 2019 KNIME AG®

53

Page 54: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Excel Reader Configuration

Preview

Sheet specificsettings

File path

6Copyright © 2019 KNIME AG®

54

Page 55: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: Table Reader

• Reads tables from the native KNIME Format

• Maximum performance

• Minimum configuration

7

v

File path

Copyright © 2019 KNIME AG®

55

Page 56: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: Database Reader

• Connectors for Common DB types

(MySQL, Postgres, SQLite)

• Also works with any jdbc driver

• Common nodes for SQL Query Building

(Groupby, Join, Filter, Sort)

8Copyright © 2019 KNIME AG®

56

Page 57: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Other Interesting Nodes

• PMML Reader – reads standard predictive models

• XML Reader with XPATH support

• REST/SOAP, and many more

9Copyright © 2019 KNIME AG®

57

Page 58: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Parser Nodes

• Node Repository: Other Data Types/Text Processing/IO

• Available Parser Nodes

– Flat File Document Parser

– PDF Parser

– Word Parser

– Document Grabber

– …

10Copyright © 2019 KNIME AG®

58

Page 59: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: Strings To Document

• Creation of document cells from strings

– Converts string cells to document cells

– Useful in combination with e.g. File Reader, XLS Reader, database nodes

11Copyright © 2019 KNIME AG®

59

Page 60: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Strings To Document: Configuration

12

TitleText

Authors

Category

Tokenizer

Copyright © 2019 KNIME AG®

60

Page 61: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Tokenizers

• Different tokenizers are available leading to slightly different terms extracted from the document

Example: “I’m enjoying the tutorial”

“I”, “’m”, “enjoying”, “the”, “tutorial” “I’m”, “enjoying”, “the”, “tutorial”

WhitespaceTokenizerEnglish WordTokenizer

Copyright © 2019 KNIME AG®

61

Page 62: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: Meta Info Inserter & Extractor

• Inserter allows adding document meta info

– Adds meta info to documents as key value pairs

– Helpful if more meta info available than covered by Strings to Documents node

• Extractor brings data back from document cell into table columns

– Each key results in a column, containing the specific values for each document related to that key.

14Copyright © 2019 KNIME AG®

62

Page 63: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: Tika Parser

• Reads files of various formats from directory

– Searches for all files with specified extension in directory

– Creates one document for each file

– Extracts specified (meta) information

15Copyright © 2019 KNIME AG®

63

Page 64: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Tika Parser: Configuration

16

Directory

File extensions

Recursivesearch

Meta data toextract

Extraction ofattachments

Copyright © 2019 KNIME AG®

64

Page 65: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Section Exercise

• Start with “Exercise: Importing text”

– Import string data from:

• TripadvisorReviews-SanFranciscoRestaurants-ItalianChineseFood.table

– Filter rows with missing titles

– Convert strings to documents

– Filter all columns except the document column

17

You can download the training workflows from the KNIME Hub:https://hub.knime.com/knime/space/Education/02%20KNIME%20Text%20Mining%20Course/

Copyright © 2019 KNIME AG®

65

Page 66: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Section Solution

Import text

• Table Reader

• Row Filter

• Strings to Documents

• Column Filter

18Copyright © 2019 KNIME AG®

66

Page 67: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

1

Enrichment

Copyright © 2019 KNIME AG®

67

Page 68: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Enrichment

• Semantic information is indicated by a tag assignment

– Part of speech, named entities (persons, organizations, genes, …), sentiments

• A tag consists of a type and a value

– Type represents the class or set of tags• e.g. POS (part of speech), NE (named entity)

– Value represents the actual tag value• e.g. NN (noun), PERSON

2

Column containing terms

with tags

Term “food” with tag value “NN” and type

“POS”

Copyright © 2019 KNIME AG®

68

Page 69: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Tagger Nodes

• Typically characterized by:

– Yellow color

– 1 to 2 input ports (requiring one document column), 1 output port

– Assignment of semantic information (tags) to terms

3Copyright © 2019 KNIME AG®

69

Page 70: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Tagger Nodes

• Node Repository:

Other Data Types/Text Processing/Enrichment

• Available Tagger Nodes

– Stanford tagger

– Dictionary (& Wildcard) tagger

– OpenNLP tagger

– Abner tagger

– Amazon Comprehend

– …

4Copyright © 2019 KNIME AG®

70

Page 71: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Tagger Nodes

• Allows to specify the number of parallel threads.

• Note: each thread will load a separate model into memory!

• Tagged terms are set unmodifiable.

5

Number ofparallel threads

Copyright © 2019 KNIME AG®

71

Page 72: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: Stanford tagger

• Assigns part of speech tags to terms

– Models for English, German, French (from Stanford NLP Group)

– Alternative node: POS tagger

• Model only for English (from OpenNLP)

6Copyright © 2019 KNIME AG®

72

Page 73: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Stanford Tagger: Configuration

7

Model touse

Number ofparallel threads

Copyright © 2019 KNIME AG®

73

Page 74: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

• Assigns selected tag to matching terms

– Matches terms in documents against terms in dictionary

– Tag to be assigned to matching terms is specified in the dialog

– Alternative node: Wildcard tagger

• Terms in dictionary may contain wild cards and regular expressions

New Node: Dictionary Tagger

8Copyright © 2019 KNIME AG®

74

Page 75: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Dictionary Tagger: Configuration

9

Dictionarycolumn

Tag value tobe assignedType of tag

to be assigned

Exact matchor “contains”

Copyright © 2019 KNIME AG®

75

Page 76: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: Tagged Document Viewer

• Displays documents with tags highlighted:

– Takes a column with documents as input

– Allows to inspect tags assigned to documents

Documentcolumn

Document with tags highlighted

Copyright © 2019 KNIME AG®

76

Page 77: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Tagged Document Viewer: Configuration

Document column

Enable display of tags

Number of documents to

display

View and interactivity

configuration

Copyright © 2019 KNIME AG®

77

Page 78: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Section Exercise

• Start with “Exercise: Enrichment”

– Assign (English) POS tags

– View tagged documents

12Copyright © 2019 KNIME AG®

78

Page 79: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Section Solution

Enrichment

• POS tagger

• Tagged Document Viewer

13Copyright © 2019 KNIME AG®

79

Page 80: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Section Exercise (Bonus)

• Start with “Exercise: Enrichment II”

– Read files that contain positive and negative words

• MPQA-OpinionCorpus-PositiveList.csv

• MPQA-OpinionCorpus-NegativeList.csv

– Assign positive and negative sentiment tags based on positive and negative word lists

– View tagged documents

– Tip: Dictionary Tagger node

14Copyright © 2019 KNIME AG®

80

Page 81: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Section Solution (Bonus)

Enrichment

• File Reader

• Dictionary Tagger

• Tagged Document Viewer

15Copyright © 2019 KNIME AG®

81

Page 82: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Custom NER models

16

• The provided NER models of OpenNLP NE tagger and StandfordNLP NE tagger are trained for a few types of entities and English language only.

• For more specific applications and other languages custom models are needed.

Copyright © 2019 KNIME AG®

82

Page 83: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

• Trains a NER model based on the input dictionary and corpus

– Tag type and value can be set in the dialog

– Creates tagged corpus based on input documents and dictionary. Trains model with tagged corpus.

New Node: StanfordNLP NE Learner

17

Dictionary

Documentcorpus StanfordNLP

NE model

Copyright © 2019 KNIME AG®

83

Page 84: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Stanford Tagger: Configuration

18

Dictionarycolumn

Documentcorpus

Tag type and value

Copyright © 2019 KNIME AG®

84

Page 85: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

• Tags documents based on input NER model.

– NER model can be specified in dialog, built-in or model from input port

New Node: StanfordNLP NE tagger

19Copyright © 2019 KNIME AG®

85

Page 86: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

StanfordNLP NE tagger: Configuration

20

Parameters for builtin

model

Use modelfrom input

port or built-in models

Copyright © 2019 KNIME AG®

86

Page 87: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Tagging Conflicts

21

• In case of tag intersections the last node overwrites.

• “Serbian-American inventor Nikola Tesla developed the …”1. POS tagger: “Serbian-American\NNP inventor\NNP Nikola\NNP Tesla\NNP developed\VBD

the\DT…”

2. NE tagger: “Serbian-American\NNP inventor\NNP Nikola Tesla\Person developed\VBD the\DT …”

Overwrite!

Copyright © 2019 KNIME AG®

87

Page 88: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

• Tagged terms can be set unmodifiable

• Unmodifiable terms are not affected by any preprocessing node

• Preprocessing nodes can explicitly ignore unmodifiability

Unmodifiable Terms

22

Set unmodifiablein tagger nodes

Ignoreunmodifiability in

preprocessingnodes

Copyright © 2019 KNIME AG®

88

Page 89: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Supplementary Workflows: NER Tagger Model Training

• Trains NER model for latin and gallic names based on “De Bello Gallico” from Julius Caesar.

23Copyright © 2019 KNIME AG®

89

Page 90: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

1

Preprocessing

Copyright © 2019 KNIME AG®

90

Page 91: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Preprocessing

• Reduction of feature space (terms)

• Filtering of unnecessary terms

– Stop words, based on POS tags, dictionaries, regex, …

• Normalization of terms

– Stemming, case conversion

2Copyright © 2019 KNIME AG®

91

Page 92: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

• Typically characterized by:

– Yellow color

– 1 to 2 input ports (requiring one document column), 1 output port

– For filtering and normalizing terms of documents and bags of words

Preprocessing Nodes

3Copyright © 2019 KNIME AG®

92

Page 93: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Preprocessing Nodes

• Node Repository:

Other Data Types/Text Processing/Preprocessing

• Available Preprocessing Nodes

– Stop Word Filter

– Snowball Stemmer

– Tag Filter

– Case Converter

– RegEx Filter

– …

4Copyright © 2019 KNIME AG®

93

Page 94: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Preprocessing Nodes

• Preprocessing tab in node dialog to specify:

– Append original documents

– Ignore term unmodifiability(set by tagger nodes).

5

Appendoriginal

document

Ignore termunmodifiability

Copyright © 2019 KNIME AG®

94

Page 95: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: Stop Word Filter

• Filters stop words

– Built-in stop word lists: English, French, German, Italian, …

– Alternatively load custom stop word list

6Copyright © 2019 KNIME AG®

95

Page 96: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Stop Word Filter: Configuration

7

Built-in stopword lists

Custom stopword list

Copyright © 2019 KNIME AG®

96

Page 97: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Stemming

Stemming reduces different forms of a word to it’s common base by sequentialapplication of stemming rules.

Original text:Light caresses colours, sets them aglow, plays with nuances, shadows and structures

Porter stemmer:Light caress colour, set them aglow, plai with nuanc, shadowand structure.

Rule Example

SSES → SS caresses → caress

IES → I ponies → poni

SS → SS caress → caress

S → cats → cat

Copyright © 2019 KNIME AG®

97

Page 98: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: Snowball Stemmer

• Reduces terms to word stem

– For various languages: English, German, French, Italian, …

– Integration of Snowball stemming library

– Alternative nodes: Porter Stemmer, Kuhlen Stemmer

• For English only

9Copyright © 2019 KNIME AG®

98

Page 99: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Snowball Stemmer: Configuration

10

Language selection

Copyright © 2019 KNIME AG®

99

Page 100: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: Tag Filter

• Filters terms based on specified tag values

– For all tag types and values

11Copyright © 2019 KNIME AG®

100

Page 101: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Tag Filter: Configuration

12

Tag typeselection

Tag valueselection

Copyright © 2019 KNIME AG®

101

Page 102: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Section Exercise

• Start with “Exercise: Preprocessing”

– Filtering:

• Numbers

• Punctuation marks

• Stop words

• All terms except: nouns, verbs, adjectives

– Stemming

– To lower case

13Copyright © 2019 KNIME AG®

102

Page 103: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Section Solution

Preprocessing

• Number Filter

• Punctuation Erasure

• Stop Word Filter

• Case Converter

• Snowball Stemmer

• POS Filter

14Copyright © 2019 KNIME AG®

103

Page 104: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

1

Transformation

Copyright © 2019 KNIME AG®

104

Page 105: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Transformation

• Transformation of data table structures

– List of documents ➔ bag of words

– Bag of words ➔ document / term vectors

– Extraction of document fields to string columns

– Conversion of terms to strings

2Copyright © 2019 KNIME AG®

105

Page 106: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

• Typically characterized by:

– Yellow color

– 1 input port, 1 output port

Transformation Nodes

3Copyright © 2019 KNIME AG®

106

Page 107: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Transformation Nodes

• Node Repository:

Other Data Types/Text Processing/Transformation

• Available Transformation Nodes

– Bag of Words Creator

– Document Vector

– Strings to Document

– Sentence Extractor

– Document Data Extractor

– Unique Term Extractor

– …

4Copyright © 2019 KNIME AG®

107

Page 108: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: Bag of Words Creator

• Transforms list of documents into bag of words

– Original documents can be appended in a column

5

Documentlist

Bag of words

Copyright © 2019 KNIME AG®

108

Page 109: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Bag of Words Creator: Configuration

6

Documentsused to createbag of words

Original documents

can beappended

Copyright © 2019 KNIME AG®

109

Page 110: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: Term to String

• Transforms term cells into string cells

– Tag information will get lost

7

Bag of words

Bag of wordswith string

column

Copyright © 2019 KNIME AG®

110

Page 111: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Term to String: Configuration

8

Terms totransform to

strings

Copyright © 2019 KNIME AG®

111

Page 112: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Section Exercise

• Start with “Exercise: Preprocessing II”

– Create bag of words

– Filter terms that occur in less than 5 documents

– Tip: Bag of Words, GroupBy, and Reference Row Filter

9Copyright © 2019 KNIME AG®

112

Page 113: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Section Solution

Preprocessing II

• Bow Creator

• Term to String

• GroupBy

• Row Filter

• Reference Row Filter

10Copyright © 2019 KNIME AG®

113

Page 114: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: Document Vector

• Transforms bag of words into document vectors

– Creates bit or numerical vectors

11

Bag of words withfrequency column

Documentvector

Copyright © 2019 KNIME AG®

114

Page 115: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Document Vector: Configuration

12

Documents toappend to leftof the created

vector columns

Create bit ornumerical

vector

Copyright © 2019 KNIME AG®

115

Page 116: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: Document Vector Applier

• Transforms bag of words into document vectors

– Creates feature space of reference document vectors

– Creates bit or numerical vectors

13

Reference document vectors

Documentvector

Copyright © 2019 KNIME AG®

116

Page 117: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Document Vector Applier: Configuration

14

Include andexclude lists offeatures of the

referencevectors

Use settings from model

input

Copyright © 2019 KNIME AG®

117

Page 118: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: Document Vector Hashing

• Transforms documents into document vectors

– Vector indices of terms are determined by term hashing

– Creates bit or numerical vectors

– Is streamable

15

Hasheddocument

vector

Copyright © 2019 KNIME AG®

118

Page 119: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Document Vector Hashing: Configuration

16

Dimensions ofdocument vectors

Hashingfunction

Copyright © 2019 KNIME AG®

119

Page 120: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: Document Data Extractor

• Extracts document fields as strings

– Title, text, categories, …

17

Documentcolumn

Extracted field as string column

Reminder: we stored restaurant type into Category field in String to

Document conversion

Copyright © 2019 KNIME AG®

120

Page 121: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Document Data Extractor: Configuration

18

Fields toextract

Copyright © 2019 KNIME AG®

121

Page 122: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Frequencies

• Frequencies are based on the number of occurrences of terms

– Locally (in documents): term frequency (TF) absolute or relative

– Globally (in corpus): inverse document frequency (IDF)

• Frequencies can also be used for term filtering

19Copyright © 2019 KNIME AG®

122

Page 123: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

• Typically characterized by:

– Green color

– 1 input port, 1 output port

– Require bag of words

Frequency Nodes

20

Append column withrelative TF values

Copyright © 2019 KNIME AG®

123

Page 124: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Frequency Nodes

• Node Repository:

Other Data Types/Text Processing/Frequencies

• Available Frequency Nodes

– TF

– IDF

– Ngram creator

– …

21Copyright © 2019 KNIME AG®

124

Page 125: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: TF

• Computes the relative orabsolute term frequency (tf) of each term within a document

Appended columnwith TF values

Copyright © 2019 KNIME AG®

125

Page 126: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: DF

• Computes the number of documents that contain each term

Appended columnwith DF values

Copyright © 2019 KNIME AG®

126

Page 127: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: IDF

• Computes three variants of inverse document frequency (IDF) for each term within the documents

– Smooth, normalized, and probabilistic

Appended columnwith IDF values

Copyright © 2019 KNIME AG®

127

Page 128: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: Term Co-Occurence Counter

• Counts the number ofpairwise co-occurences ofterms in bag of wordswithin selected parts of document (e.g. sentence, paragraph, title)

Copyright © 2019 KNIME AG®

128

Page 129: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: Ngram Creator

• Creates ngrams from documents of input table and counts their frequencies

• Both word and character ngrams are possible

Copyright © 2019 KNIME AG®

129

Page 130: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Section Exercise

• Start with “Exercise: Transformation”

– Compute relative term frequencies

– Create document vectors

– Extract class label / category

27Copyright © 2019 KNIME AG®

130

Page 131: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Section Solution

Transformation

• TF

• Document Vector

• Document Data Extractor

28Copyright © 2019 KNIME AG®

131

Page 132: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

1

Classification

Copyright © 2019 KNIME AG®

132

Page 133: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Classification

• Assigning pre-defined labels to documents– Categorization

– Sentiment analysis

– Topic assignment

• Supervised learning

• In the last section we transformed textual documents into a numerical representation (document vectors).

• We can use standard KNIME nodes to classify / analyze these vectors.

2Copyright © 2019 KNIME AG®

133

Page 134: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Classification

Methods:

• Decision Trees

• Neural Networks

• Naïve Bayes

• Logistic Regression

• Support Vector Machine

• Tree Ensembles

3Copyright © 2019 KNIME AG®

134

Page 135: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Predictive Modeling Overview

4

Training

Set

Test

Set

Original

Data Set

Train

Model

Apply

Model

Score

Model

Data PartitioningTraining and

Applying ModelsModel Evaluation

Copyright © 2019 KNIME AG®

135

Page 136: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: Partitioning

• Use it to split data into training and evaluation sets

• Partition by count (e.g. 10 rows) or fraction (e.g. 10%)

• Sample by a variety of methods; random, linear, stratified

5Copyright © 2019 KNIME AG®

136

Page 137: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Predictive Modeling Overview

6

Training

Set

Test

Set

Original

Data Set

Train

Model

Apply

Model

Score

Model

Data Partitioning Training and

Applying Models

Scoring

Strategies

Copyright © 2019 KNIME AG®

137

Page 138: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

• All data mining models use a Learner-Predictor motif.

• The Learner node trains the model with its input data.

• The Predictor node applies the model to a different subset of data.

The Learner-Predictor Motif

7

Training set

Test set

Trained Model

Copyright © 2019 KNIME AG®

138

Page 139: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Decision Tree

• C4.5 builds a tree from a set of training data using the concept of information entropy.

• At each node of the tree, the attribute of the data with the highest normalized information gain (difference in entropy) is chosen to split the data.

• The C4.5 algorithm then recourses on the smaller sub lists.

8

J.R. Quinlan, “C4.5 Programs for machine learning”

J. Shafer, R. Agrawal, M. Mehta, “SPRINT: A Scalable Parallel Classifier for Data

Mining”

Copyright © 2019 KNIME AG®

139

Page 140: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: Decision Tree Learner

9Copyright © 2019 KNIME AG®

140

Page 141: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Decision Tree: View

10

If the word “Italian” occurs in a review, the restaurant is

very likely an Italian restaurant.

Copyright © 2019 KNIME AG®

141

Page 142: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: Decision Tree Predictor

• Consumes a Decision Tree model and new data to classify

• Check the box to append class probabilities

11Copyright © 2019 KNIME AG®

142

Page 143: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Predictive Modeling Overview

12

Training

Set

Test

Set

Original

Data Set

Train

Model

Apply

Model

Score

Model

Data Partitioning Training and

Applying Models

Scoring

Strategies

Copyright © 2019 KNIME AG®

143

Page 144: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: Scorer

• Compare predicted results to known truth to evaluate model quality

• Confusion matrix shows the distribution of model errors

• An accuracy statistics table provides additional info

13Copyright © 2019 KNIME AG®

144

Page 145: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Scorer: Confusion Matrix

14

This is the difference

between the confusion

matrix data table and the

confusion matrix view

True Positives

False Negatives

False Positives

True Negatives

Copyright © 2019 KNIME AG®

145

Page 146: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Scorer: Accuracy Measures

15

From the confusion matrix

Copyright © 2019 KNIME AG®

146

Page 147: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Section Exercise

• Start with “Exercise: Classification”

– Append color information based on class labels

– Split data into training and test set

– Train decision tree classifier on training set

– Apply trained model on test set

– Score model

16Copyright © 2019 KNIME AG®

147

Page 148: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Section Solution

Classification

• Color Manager

• Column Filter

• Partitioning

• Decision Tree Learner

• Decision Tree Predictor

• Scorer

17Copyright © 2019 KNIME AG®

148

Page 149: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Classification (Bonus)

• Usually the documents used to train a model are read from a different source than that of the documents to which the model is applied afterwards

• To apply a trained model on a second set of documents we need to ensure that all features of the training set exist as features of the second set.

• This means that all document vector columns of the training set must exist as document vector columns in the second set.

18Copyright © 2019 KNIME AG®

149

Page 150: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Classification (Bonus)

19

All features of the trainingset must exist as features

in the second set.

Copyright © 2019 KNIME AG®

150

Page 151: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Section Exercise (Bonus)

• Start with “Exercise: Classification II”

– Create document vectors for the second set of documents “Boston Tripadvisor Reviews”

– The feature space of the second set has to contain all features of the training set!

– Apply the trained model on the second set of documents

20Copyright © 2019 KNIME AG®

151

Page 152: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Section Solution (Bonus)

Classification II

21Copyright © 2019 KNIME AG®

152

Page 153: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Sentiment Analysis (Bonus)

• In sentiment analysis predefined sentiment labels, such as "positive" or "negative“, are assigned to texts.

Methods:

• Predictive modeling

• Dictionary based

• Deep parsing

• …

22Copyright © 2019 KNIME AG®

153

Page 154: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Sentiment Analysis Example (Bonus)

• The Large Movie Review Dataset v1.0

– 50.000 English movie reviews

– Associated sentiment labels “positive” and “negative”

– http://ai.stanford.edu/~amaas/data/sentiment/

• Subset contains 2000 documents

– 1000 positive reviews

– 1000 negative reviews

– …/data/IMDb-sample.csv

23Copyright © 2019 KNIME AG®

154

Page 155: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Sentiment Analysis Example (Bonus)

24

Predictive modeling:

• Build classifier to distinguish between positive and negative reviews.

– “Ah, Moonwalker, I'm a huge Michael Jackson fan, I grew up with his music, Thriller was actually the first music video I ever saw apparently. …”

– “This film has a very simple but somehow very bad plot. …”

Positive ornegative?

Copyright © 2019 KNIME AG®

155

Page 156: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Section Exercise (Bonus)

• Start with “Exercise: Classification III”

– Create document cells

– Preprocess documents

• Punctuation Erasure, N Chars Filter, Stop Word Filter, Case converter, Snowball Stemmer

• Filter all terms that occur in less than 20 documents

– Create document vectors

– Extract sentiment label and assign colors

– Partition into training and test set

– Train decision tree model and score it

25Copyright © 2019 KNIME AG®

156

Page 157: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Section Solution (Bonus)

Classification

• Strings to document

• Preprocessing nodes

• Bag of words creation, grouping, counting, and filtering

• Vector creation

• Model training and scoring

26Copyright © 2019 KNIME AG®

157

Page 158: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Sentiment Analysis Example (Bonus)

27

Dictionary based:

• Use a custom dictionary to count positive andnegative words.

• Compute sentiment score to predict sentimentlabel.

Copyright © 2019 KNIME AG®

158

Page 159: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Section Exercise (Bonus)

• Start with “Exercise: Classification IV”

– Create document cells

– Tag terms based on sentiment dictionaries

• Tip: Dictionary Tagger

– Extract and count positive and negative terms

– Compute sentiment score based on the number of positive and negative terms

– Predict sentiment labels based on score

– Score predictions

28Copyright © 2019 KNIME AG®

159

Page 160: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Section Solution (Bonus)

Classification• Strings to

Documents• Dictionary Tagger• Bag of words, TF,

and GroupBy for counting

• Pivoting• Math Formula• Rule Engine• Scorer

29Copyright © 2019 KNIME AG®

160

Page 161: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

1

Visualization

Copyright © 2019 KNIME AG®

161

Page 162: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

• Typically characterized by:

– Blue color

– 1 input port, 1-2 output port (image port)

Visualization Nodes

2Copyright © 2019 KNIME AG®

162

Page 163: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Visualization Nodes

• Node Repository:

Other Data Types/Text Processing/Misc

• Available Visualization Nodes

– Document Viewer

– Tagged Document Viewer (in JS Views (Labs))

– Tag Cloud

• KNIME Text Processing provides only two dedicated viz. nodes

• Various other nodes can be used for viz. too.

3Copyright © 2019 KNIME AG®

163

Page 164: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: Tag Cloud (JavaScript)

• Shows terms visualized in a cloud

– Colors are specified via the Color Manager

– Requires a term and a numerical column (usually tf)

– Creates image, available at image out port

4

List of termsand

frequencies

Size of wordscorrespondsto frequency

Copyright © 2019 KNIME AG®

164

Page 165: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Tag Cloud: Configuration

5

Display onlytop N terms

(rows)

Term columnand frequency

column

Copyright © 2019 KNIME AG®

165

Page 166: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Tag Cloud: View

6

Min and max fontsize, angle, …

Scaling of fontsize: linear,

log, exp

Copyright © 2019 KNIME AG®

166

Page 167: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Additional Visualizations

• Decision Tree View

– Inspect trained model

– See which terms are discriminative

7Copyright © 2019 KNIME AG®

167

Page 168: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Section Exercise

• Start with “Exercise: Visualization”

– Inspect decision tree via its view

– Visualize bag of words using a tag cloud

– Assign colors to terms in tag cloud (Optional)

• Green if term occurs mostly in Chinese reviews, blue if terms occurs mostly in Italian reviews

8Copyright © 2019 KNIME AG®

168

Page 169: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Section Solution

Visualization

• Decision Tree Learner

• Tag Cloud

• (Optional Coloring)

– TF, Document Data Extractor, Group By, Pivoting, Math Formula, Color Manager

9Copyright © 2019 KNIME AG®

169

Page 170: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: Document Viewer

• Shows details of documents

– Title, Full text

– Meta information

– Tagged terms can be hilited and linked

10

Documentcolumn

Copyright © 2019 KNIME AG®

170

Page 171: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Document Viewer: View

11

List of all documents.

Double click fordetails

Details view withtitle andfull text

Tagged termscan be hilited

Author, category, meta information,

Tagset tohilite

Copyright © 2019 KNIME AG®

171

Page 172: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Reminder: Tagged Document Viewer

• Displays documents with tags highlighted:

– Takes a column with documents as input

– Allows to inspect tags assigned to documents

Documentcolumn

Document with tags highlighted

Copyright © 2019 KNIME AG®

172

Page 173: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Section Exercise

• Start with “Exercise: Visualization II”

– View document content

– View document content and highlight tagged terms

– View tagged documents

13Copyright © 2019 KNIME AG®

173

Page 174: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Section Solution

Visualization

• Document Viewer

• Tagged Document Viewer

14Copyright © 2019 KNIME AG®

174

Page 175: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Bonus Visualizations

• Supplementary Workflows/

– R Theme River (R plot)

– Twitter Word Tree (JavaScript view)

15Copyright © 2019 KNIME AG®

175

Page 176: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

1

Clustering

Copyright © 2019 KNIME AG®

176

Page 177: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Clustering

• Find groups (clusters) of similar documents

– Topic detection

– Exploration

• Unsupervised learning

• We can use standard KNIME nodes to cluster the numerical document vectors.

2Copyright © 2019 KNIME AG®

177

Page 178: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Clustering

Methods:

• Hierarchical clustering

• K-Means / Medoids

• Density based

• …

3Copyright © 2019 KNIME AG®

178

Page 179: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Hierarchical Clustering

• Creates hierarchy for all data points– Agglomerative, bottom-up– Combine the “closest” data points/clusters, one at a time

• Hierarchy can be illustrated by dendrogram• Applicable only on small data sets (<5000)

• Complete linkage: combine data object/cluster with minimal maximum distance– Finds compact, convex clusters

• Single linkage: combine data object/cluster with minimal minimum distance– Also finds concave clusters

• Average linkage: distance between two clusters c1 and c2 = mean distance between all points in c1 and c2

4Copyright © 2019 KNIME AG®

179

Page 180: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Prototype-based Clustering

• K-Medoids, K-Means, Fuzzy C-Means, …

• Data are condensed to a small fixed number of prototypical data points

• Each prototype represents a subset of data points

• Applicable on large data sets

• Number of prototypes (k) must be specified in advance

5Copyright © 2019 KNIME AG®

180

Page 181: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: Distance Matrix Calculate

• Computes all pairwise distances

• Different distance measures available

– Euclidean, Manhattan, Cosine, Dice, Tanimoto, …

• Optional distance model input port

6

Documentvectors

Distancecolumn

Copyright © 2019 KNIME AG®

181

Page 182: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Distance Matrix Calculate: Configuration

7

Distancemeasure

Columns to usefor distancecomputation

Name ofdistancecolumn

Copyright © 2019 KNIME AG®

182

Page 183: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: Hierarchical Clustering (DistMatrix)

• Creates hierarchy of input data points

– Complete Linkage, Average Linkage, Single Linkage

• Requires distance column or model

8

Distancecolumn

Clustering model

Distancefunction

(optional)

Copyright © 2019 KNIME AG®

183

Page 184: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Hierarchical Clustering (DistMatrix): Configuration

9

Distancecolumn Linkage

type

Copyright © 2019 KNIME AG®

184

Page 185: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: Hierarchical Cluster View

• Shows:

– Dendrogram of clustering

– Distance curve

– Colors

10

Data points, e.g.document

vectors

Hierarchicalclustering

model

Dendrogramor distance

Copyright © 2019 KNIME AG®

185

Page 186: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: Hierarchical Cluster Assigner

• Assigns data points to clusters based on

– Distance threshold

– Number of clusters

11

Data points, e.g.document

vectors

Hierarchicalclustering

model

Cluster assignment

Copyright © 2019 KNIME AG®

186

Page 187: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Hierarchical Cluster Assigner: Configuration

12

Threshold orcluster count

based assignment

Copyright © 2019 KNIME AG®

187

Page 188: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Hierarchical Clustering: Example Workflow

13

Data e.g.:document

vectors

Hierarchy ofdata points Illustration of

dendrogram

Assignment ofclusters

Copyright © 2019 KNIME AG®

188

Page 189: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

New Node: k-Medoids

• Computes k prototypes (medoids)

• Requires distance column or model

• Requires specification of k

• Similar nodes:

– k-Means

– Fuzzy c-Means

14

Data points anddistance column

Cluster assignment

Copyright © 2019 KNIME AG®

189

Page 190: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

k-Medoids: Configuration

15

Cluster count k

Distance matrixcolumn

Random seedfor reproducible

results

Copyright © 2019 KNIME AG®

190

Page 191: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

k-Medoids Clustering: Example Workflow

16

Data e.g.:document

vectors

Assignment ofclusters

Copyright © 2019 KNIME AG®

191

Page 192: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Section Exercise

• Start with “Exercise: Clustering”

– What groups of documents are in the data?

– Compute pairwise cosine distances

– Apply hierarchical clustering

• View dendrogram to find out the number of clusters (k)

• Assign k clusters

– Apply k-Medoids with k as number of clusters

– Select documents of one cluster in dendrogram, hilite them, and inspect data in a table view

17Copyright © 2019 KNIME AG®

192

Page 193: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Section Solution

Clustering

• Distance Matrix Calculate

• Hierarchical Clustering

– Cluster View

– Cluster Assigner

• k-Medoids

18Copyright © 2019 KNIME AG®

193

Page 194: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

1

Supplementary Workflows

Copyright © 2019 KNIME AG®

194

Page 195: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

R Theme River

Creates theme river using ggplot2.

• ggplot2 has to be installed!

• Change lib path

2Copyright © 2019 KNIME AG®

195

Page 196: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Twitter Word Tree

Creates a word tree using the JavaScript Google charting library.

3Copyright © 2019 KNIME AG®

196

Page 197: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Term Co-occurrences

Term co-occurrences of all term pairs are counted on sentence and document level.

4Copyright © 2019 KNIME AG®

197

Page 198: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Topic Extraction

Extracts two topics from the input documents and 10 words to represent each topic.

5Copyright © 2019 KNIME AG®

198

Page 199: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

RESTful Geolocation

6

Try Catch Block

REST call to get lat long for IPs

Copyright © 2019 KNIME AG®

199

Page 200: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

RESTful Geolocation

• Translates IPs to geo coordinates via RESTful service

• GET Resource: access RESTful API via GET

• IP to geo coordinates (lat/lon)

• Read REST Representation: parse REST result

– JSON, XML, CSV, …

• Try Catch nodes to log errors gracefully

7Copyright © 2019 KNIME AG®

200

Page 201: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Geographic Analysis

8Copyright © 2019 KNIME AG®

201

Page 202: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Geographic Analysis

• Reads IPs from download weblog and related geo coordinates

• Aggregates downloads by city, country, and US states

• OSM Map View to visualize geo coordinates

• OSM Map to Image to create image of map view

9Copyright © 2019 KNIME AG®

202

Page 203: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Social Media Analysis

10

Sentiment analysis of users

Leader / Follower analysis of users

Copyright © 2019 KNIME AG®

203

Page 204: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Social Media Analysis

• Slashdot forum data

• Text Mining: sentiment analysis of users

• Network Mining: leader and follower scoring of users

11Copyright © 2019 KNIME AG®

204

Page 205: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Romeo and Juliet

12

Load JPEG and convert to PNG

Read epub file

Insert PNG images and visualize network

Tag character names and count frequencies

Copyright © 2019 KNIME AG®

205

Page 206: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Romeo and Juliet

• Interaction network of characters.

• Border color indicates family assignment

• Node size is related to TF of character names

13Copyright © 2019 KNIME AG®

206

Page 207: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window

Copyright © 2019 KNIME AG

Thank You!

[email protected]

The KNIME® trademark and logo and OPEN FOR INNOVATION® trademark are used by KNIME AG under license from KNIME GmbH, and are registered in the United States.

KNIME® is also registered in Germany.