As news analysis tool SNATZ TECHNOLOGY. Main terms used in presentation Term – a phrase, which...


Page 1: As news analysis tool SNATZ TECHNOLOGY. Main terms used in presentation Term – a phrase, which system uses for training NLP algorithms. Summary – a phrase,

as news analysis tool

SNATZ TECHNOLOGY

Page 2:

Main terms used in presentation

Term – a phrase that the system uses for training NLP algorithms.
Summary – a phrase that the system automatically detects while analyzing news content.
Trend – a unique chain containing one or more Summaries. These chains are created as a result of cluster analysis.
Tag – a term created by a moderator for detecting a user's interest category.
User interests – a cloud of Tags that the system recognizes from the user's social accounts and OPML files.
Segments – groups of 'similar' Trends whose search results intersect by more than 30%.
Semantic network – a network that represents semantic relations between keywords.
Data warehouse – a database used for reporting and data analysis.

Page 3:

The main goal of SNATZ

SNATZ is a data mining tool.
• It can recognize the semantics of news content using an NLP algorithm.
• On the basis of the acquired Summaries, SNATZ can derive new knowledge:
  - detecting new Summary sets
  - gathering Trend statistics
  - building Segments
  - using new Summaries as Terms for training the NLP algorithm
  - recommending news from different Segments

Our solution makes it possible to change the 'Collaborative Filtering' paradigm.

Page 4:

Snatz platform architecture

The SNATZ platform architecture consists of:

• SNATZ Recommender System – personal recommendations based on the users' interests

• SNATZ Data Mining Tool – a semantic network of Trends, created by sending recognized metadata to analysis processing

Page 5:

SNATZ Recommender
• Crawler. The crawler interacts with Web sites by receiving RSS feeds and tweets. The content of RSS feeds and tweets is the main resource.
• Blogosphere. All resources detected by the web crawler are saved in the data warehouse, which forms the internal SNATZ "blogosphere".
• Data Processing. Exports resources to the SNATZ DM Tool. Together with the news articles, it also sends sets of labels, terms and summaries.
• Users
• Tags/Posts:
  - Posts. The system recognizes users' posts from Facebook (Fb), Twitter (Tw) and OPML files.
  - Tags. Using the NLP algorithm, the system defines the user's interests.
• Recommendations. This component contains the rules for forming news recommendations.
• News archive. News items that were recommended to users.

Page 6:

SNATZ Data Mining Tool
• Documenter. Imports resources from the Data Processing component and sends them to the NLP.
• NLP. Semantic analysis:
  - POS tagging
  - Defining article attributes: labels, terms and summary
• Meta-Docs. Data warehouse of articles with semantic analysis.
• Analysis:
  - Multi-clusterization
  - Defining Trends
• Semantic Network of Trends:
  - Segments
• Reporter:
  - Trend statistics

Page 7:

Data workflow

[Diagram: data workflow across two subsystems linked by IMPORT/EXPORT. Recommender System: WEB, Crawler, Blogs, Data Processing, Users, Tags/Posts, Recommendation, News Archive. Data Mining Tool: Docs, NLP, Meta Docs, Analysis, Semantic Network of Trends, Reporter.]

Page 8:

Building segments

[Diagram: Documents → Meta-Docs (Labels, Terms, Summary) → Tree of Trends → Segments]

Page 9:

Building segments

• The Documenter imports resources and sets of labels, terms and summaries from the Data Processing component and sends them to the NLP.

• The NLP recognizes attributes in the resources: labels, terms, summary. These resources become Meta-Docs and are saved in the data warehouse.

• Meta-Docs are sent to Analysis, and the system forms the current Trend Tree.

• Trends identify related summaries, i.e. the main direction of their topics and sub-topics. Through such relations between Trends, SNATZ finds similar/related topics and groups them into Segments:
  - If Trends intersect by more than 30%, they form a new Segment.
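
The 30% rule can be sketched in a few lines. This is only an illustration under stated assumptions: the slides do not say how the intersection is measured, so the sketch assumes each Trend comes with a set of matching document IDs and uses Jaccard overlap. The names `overlap`, `build_segments` and the sample data are hypothetical, and treating an unmatched Trend as its own Segment is also an assumption.

```python
from itertools import combinations

THRESHOLD = 0.30  # "more than 30%" from the slides


def overlap(a: set, b: set) -> float:
    """Jaccard overlap of two result sets (an assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_segments(trend_results: dict) -> list:
    """Group Trends whose result sets overlap by more than THRESHOLD."""
    parent = {name: name for name in trend_results}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(trend_results, 2):
        if overlap(trend_results[a], trend_results[b]) > THRESHOLD:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for name in trend_results:
        groups.setdefault(find(name), set()).add(name)
    return sorted(groups.values(), key=lambda g: sorted(g))


# Hypothetical data: Trend name -> IDs of documents found for that Trend.
trends = {
    "Network+Tumblr": {1, 2, 3, 4},
    "Network+Tumblr+Instagram": {2, 3, 4, 5},
    "Election+Poll": {10, 11},
}
segments = build_segments(trends)
print(segments)
```

Union-find makes the grouping transitive: if A overlaps B and B overlaps C, all three land in one Segment even when A and C do not overlap directly. Whether SNATZ behaves this way is not stated in the slides.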

Page 10:

Recommendations

[Diagram: Users' Posts, Meta-Docs, Interests, Tags, Related Tags, Daily Review]

Page 11:

Recommendations

• The system parses users' posts from Facebook, Twitter and the uploaded OPML file of subscriptions.
• Interests are updated every 4 hours.
• The NLP recognizes interests from the posts, resulting in a set of Tags.
• If the number of Tags is less than 12, the system tries to find related Tags.
• The system takes the Trends received from Meta-Docs and defines related Trends for the user's interests. If a Trend contains a user's interest, it becomes connected with the user.
• Summaries in a related Trend become the Related Tags.
• The system takes Trends from the user's Trend Tree and makes the Daily Review.
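
A minimal sketch of the tag expansion described above, under stated assumptions: a Trend is modelled as a tuple of Summaries, a Trend is "connected" when it shares at least one item with the user's Tags, and Related Tags are only collected when the user has fewer than 12 Tags. The function name `expand_tags` and the sample data are hypothetical.

```python
MIN_TAGS = 12  # threshold stated on the slide


def expand_tags(user_tags, trends, min_tags=MIN_TAGS):
    """Return (connected_trends, related_tags) for one user.

    A Trend is a tuple of Summaries. It is connected to the user when it
    contains at least one of the user's Tags (assumed matching rule).
    """
    tags = set(user_tags)
    connected = [t for t in trends if tags & set(t)]
    related = []
    if len(tags) < min_tags:
        # Summaries of connected Trends become Related Tags.
        for trend in connected:
            for summary in trend:
                if summary not in tags and summary not in related:
                    related.append(summary)
    return connected, related


trends = [
    ("network", "tumblr"),
    ("network", "tumblr", "instagram"),
    ("election", "poll"),
]
connected, related = expand_tags(["tumblr"], trends)
print(connected)
print(related)
```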

Page 12:

Personal Recommendations

[Diagram: User Interests, Segments, User's Trends, User's Tree of Trends, 'Diversity' Filtering (Check Trends, Get Trends), List of 12 News]

Page 13:

Personal Recommendations

• The system takes the 'Last Trends' that contain the user's interests and forms the User's Trends.

• The User's Trends are checked against Segments to form the User's Trends Tree.

'Diversity' filtering:
• The system does not take more than 2 interests from one category.
• No more than one news article per Trend.
• The system gets only news with new keywords (i.e. compared with previous sets of news).
• Only 1 news item from the same Segment.
• Only 2 news items from one category.
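
The filtering rules above can be illustrated with a short greedy pass over candidate news items. This is a sketch under assumptions, not the SNATZ implementation: the `Candidate` fields are hypothetical, "new keywords" is read as "at least one keyword not seen before", candidates are assumed to arrive already ranked, and the limit of 12 comes from the "List of 12 News" label on the previous slide. The rule about 2 interests per category applies to interest selection and is not modelled here.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    title: str
    trend: str
    segment: str
    category: str
    keywords: frozenset


def diversity_filter(candidates, seen_keywords=(), limit=12):
    """Greedily pick news items that satisfy the diversity rules."""
    picked = []
    used_trends, used_segments = set(), set()
    per_category = Counter()
    seen = set(seen_keywords)
    for c in candidates:
        if len(picked) == limit:
            break
        if c.trend in used_trends:          # one article per Trend
            continue
        if c.segment in used_segments:      # one article per Segment
            continue
        if per_category[c.category] >= 2:   # two articles per category
            continue
        if not (c.keywords - seen):         # must bring a new keyword
            continue
        picked.append(c)
        used_trends.add(c.trend)
        used_segments.add(c.segment)
        per_category[c.category] += 1
        seen |= c.keywords
    return picked


candidates = [
    Candidate("A", "t1", "s1", "tech", frozenset({"tumblr"})),
    Candidate("B", "t1", "s1", "tech", frozenset({"instagram"})),  # same Trend
    Candidate("C", "t2", "s2", "tech", frozenset({"tumblr"})),     # no new keyword
    Candidate("D", "t3", "s3", "tech", frozenset({"android"})),
    Candidate("E", "t4", "s4", "tech", frozenset({"ios"})),        # category full
    Candidate("F", "t5", "s5", "politics", frozenset({"poll"})),
]
picked = diversity_filter(candidates)
print([c.title for c in picked])
```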

Page 14:

SNATZ server architecture

Page 15:

SNATZ server architecture
The high-availability cluster provides the following services:

1. a virtual IP for the cluster
2. DRBD storage for the cluster
3. an ext4 file system on top of DRBD
4. OpenVZ containers on ext4 over DRBD

• Each cluster is assembled from two nodes.
• Corosync is used for managing the cluster.
• Pacemaker is the resource manager.
• The system consists of five two-node clusters.

Page 16:

SNATZ server architecture
Redundant services run in OpenVZ containers and start together with the container. Redundant services in different containers communicate over the local network, which is connected through a separate switch to the second network interface of each node.

A start sequence for the redundant services is defined for each two-node cluster:
1. switch the DRBD active/passive roles
2. mount the ext4 file system at the mount point on the active node
3. start the OpenVZ containers placed on DRBD
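
To make the ordering concrete, the following dry-run sketch builds the three steps as command lines without executing them. The resource name `r0`, the device `/dev/drbd0`, the mount point `/vz` and the container IDs are hypothetical placeholders. Since the slides name Pacemaker as the resource manager, the real cluster presumably expresses this order through its resource configuration rather than a script; the sketch only shows the dependency order.

```python
import shlex


def start_sequence(resource, device, mount_point, container_ids):
    """Return the start commands in the order listed on the slide."""
    commands = [
        # 1. Promote the DRBD resource on the node that becomes active.
        ["drbdadm", "primary", resource],
        # 2. Mount the ext4 file system that lives on the DRBD device.
        ["mount", "-t", "ext4", device, mount_point],
    ]
    # 3. Start every OpenVZ container stored on that file system.
    commands += [["vzctl", "start", str(ctid)] for ctid in container_ids]
    return commands


# Dry run: print the plan instead of executing it.
plan = start_sequence("r0", "/dev/drbd0", "/vz", [101, 102])
for cmd in plan:
    print(shlex.join(cmd))
```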

Page 17:

SNATZ + Elasticsearch engine

Elasticsearch is a search server that provides a distributed, multitenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents.

Advantages of Elasticsearch for SNATZ:
• Elasticsearch is a stable, working project
• AWS Cloud Plugin (allows use of the Amazon EC2 API)
• Real-time data search and analysis
• Index versioning support
• Search capabilities: fuzzy queries, etc.
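
As an illustration of the fuzzy search capability, the sketch below builds the JSON body of an Elasticsearch `match` query with the `fuzziness` option. The field name `title` and the misspelled search text are hypothetical; the slides do not describe the SNATZ index layout. The body would be sent to the `_search` endpoint of an index, and no request is made here.

```python
import json


def fuzzy_match_query(field, text, fuzziness="AUTO"):
    """Build a search body that tolerates small spelling differences."""
    return {
        "query": {
            "match": {
                field: {"query": text, "fuzziness": fuzziness}
            }
        }
    }


# "tumbler" is a deliberate misspelling that a fuzzy match can still
# resolve to documents mentioning "tumblr".
body = fuzzy_match_query("title", "tumbler")
print(json.dumps(body, indent=2))
```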

Page 18:

Elasticsearch + Amazon EC2

Page 19:

Elasticsearch + Amazon EC2

Features:

• ability to maintain a high-performance cluster designed for I/O-intensive operations
• new instances are started and stopped when required
• no need to pay for long-term servers and their administration
• pricing is per instance-hour consumed for each instance
• ability to create images from a working machine (configured and set up) and start other instances from these images

Page 20:

SNATZ + NLP

Features:
• Part-of-speech tagging
• Summary extraction
• User-defined Terms and Labels
• Synonym handling
• Supervised text classification using user-defined datasets for training and evaluating performance

Language support:
• English
• Japanese (using third-party tools like MeCab)

Page 21:

Challenges of SNATZ

• Filter Bubble (user's interests)
• Diversity and 'Long Tail'
• Data sparsity ('the cold start problem')
• Scalability
• Segmentation ('related topics')

Page 22:

How does SNATZ solve these problems?

Using TRENDs

Page 23:

What is the Filter Bubble?

The user sees only popular news selected by the TOP-Tags from his or her interest categories.

But the user doesn't see related Tags outside the Filter Bubble.

Page 24:

What is a TREND?

All summaries and terms of articles are closely connected.

The task of SNATZ is to identify the significant connections.

Page 25:

How are Trends detected?

[Diagram: News with Terms, Clustering by Labels, Clustering by Terms, Clustering by Summary, System-detected Trends]

Page 26:

Abstraction of the algorithm

The multilevel clustering algorithm has 3 levels of abstraction:

• Labels
• Terms
• Summary
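
The slides do not specify the clustering method, so the following is only a stand-in that shows the three levels nested inside each other. It groups documents by exact label, then term, then summary, and treats the summaries shared by at least two documents within a (label, term) group as one Trend chain. The document layout, the function names and the `min_docs` threshold are all hypothetical, and a real system would use a proper clustering algorithm rather than exact-match grouping.

```python
from collections import defaultdict


def multilevel_cluster(docs):
    """Nest document IDs as {label: {term: {summary: [ids]}}}."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for doc in docs:
        for label in doc["labels"]:            # level 1: Labels
            for term in doc["terms"]:          # level 2: Terms
                for summary in doc["summary"]:  # level 3: Summary
                    tree[label][term][summary].append(doc["id"])
    return tree


def trends_from(tree, min_docs=2):
    """Collect chains of summaries supported by at least min_docs docs."""
    trends = set()
    for terms in tree.values():
        for summaries in terms.values():
            chain = tuple(sorted(
                s for s, ids in summaries.items() if len(ids) >= min_docs
            ))
            if chain:
                trends.add(chain)
    return sorted(trends)


docs = [
    {"id": 1, "labels": ["tech"], "terms": ["social network"],
     "summary": ["network", "tumblr"]},
    {"id": 2, "labels": ["tech"], "terms": ["social network"],
     "summary": ["network", "tumblr", "instagram"]},
    {"id": 3, "labels": ["politics"], "terms": ["election"],
     "summary": ["poll"]},
]
detected = trends_from(multilevel_cluster(docs))
print(detected)
```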

Page 27:

SNATZ outside Filter bubble

SNATZ tries to show news beyond the user's filter bubble in order to cover more Trends.

Trends identify related summaries, i.e. the main direction of their topics and sub-topics.

Page 28:

Long Tail problem

Users usually don't see most news items because their Popularity Rank is too low.

SNATZ solves this problem:
• For user recommendations, SNATZ selects Trends only from different Segments.
• In order to provide users with *new* content, SNATZ does NOT make recommendations based on Summaries that were already picked for previous recommendations. This way the user sees news based on the latest Trends.
• SNATZ does NOT use TOP-Tags from the user's interest categories.

Page 29:

Collaborative Filtering

The first and most common way to determine the significance of an article is its social rating. This is determined through an advanced technique called Collaborative Filtering, which collects taste preferences or personal information (such as language, country, etc.) from many users and uses that data to make automatic predictions.

SNATZ recommends news solely on the basis of user interests. Every step of recommendations is unique and depends on the previous step. Recommendations are made only on the basis of the individual user's experience.

Page 30:

Effective Content Personalization

One approach to effective content personalization is called 'the classification of trends'. It is based on identifying the most significant relationships between summaries, creating a unique chain of summaries called a trend.

A trend contains one or more summaries from Web content and determines specific subtopics. The main characteristic of a trend is the dynamics of its chain of summaries, which can be positive (growing) or negative (fading) over a specific period of time.
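
One simple way to label a trend as growing or fading is to compare its average daily article count in the earlier and later halves of the period. This is an assumed heuristic for illustration; the slides do not say how SNATZ measures dynamics, and the function name and the extra "stable" and "unknown" outcomes are hypothetical.

```python
def trend_dynamics(daily_counts):
    """Classify a trend from its per-day article counts (oldest first)."""
    if len(daily_counts) < 2:
        return "unknown"  # not enough data to compare two periods
    half = len(daily_counts) // 2
    earlier = sum(daily_counts[:half]) / half
    later = sum(daily_counts[half:]) / (len(daily_counts) - half)
    if later > earlier:
        return "growing"
    if later < earlier:
        return "fading"
    return "stable"


print(trend_dynamics([1, 2, 5, 8]))
print(trend_dynamics([9, 6, 2, 1]))
```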

Page 31:

Automatic segmentation of blogosphere

Through such relations between chains, SNATZ finds similar/related topics and groups them into Segments.

For example: 'Network+Tumblr' intersects with 'Network+Tumblr+Instagram' by more than 30%. These chains create a new Segment.

Trends are determined by analyzing content in the current state of the daily Blogosphere in its most basic form: relevant daily news topics. If a recommendation engine calculates the thematic proximity of trends, it can auto-classify them into trend segments, so that similar sub-topics are put in the same segment. This auto-classification splits Web content into various major topics.
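
The 'Network+Tumblr' example can be checked with a small calculation. The sketch below is one possible reading, not the documented method: it treats each chain as a set of summaries and uses Jaccard similarity as the thematic proximity. The function name `chain_proximity` is hypothetical, and the glossary suggests the real measure may instead be based on overlapping search results.

```python
def chain_proximity(a: str, b: str) -> float:
    """Jaccard similarity of two '+'-separated summary chains."""
    x, y = set(a.split("+")), set(b.split("+"))
    return len(x & y) / len(x | y)


p = chain_proximity("Network+Tumblr", "Network+Tumblr+Instagram")
same_segment = p > 0.30  # threshold from the slides
print(round(p, 2), same_segment)
```

Under this reading the two chains share 2 of 3 distinct summaries, which is well above the 30% threshold.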

Page 32:

Automatic segmentation

A recommendation engine that applies this classification process on trends (and not tags) solves two major personalization problems:

• Addresses the Long Tail problem, making news recommendations from different segments possible

• Solves the problem of thematic proximity, making sure that similar or duplicate news is filtered out

Page 33:

Data mining result. Infographics

The system can:
• detect more current Trends for any given topic
• detect 'Related' Tags for any given topic
• detect the dynamics of Trends
• detect the sentiment of news

Page 34:

Findings

As information becomes increasingly dense, consumers deserve to get the news that they want to read, not the news an algorithm thinks they want.

SNATZ provides a personalization algorithm that can solve the challenges of the filter bubble and the long tail.

Page 35:

Thanks for your attention!

SNATZ Team