Automation & Analysis of RSS & ATOM newsfeeds using Hadoop Ecosystem
by
MURUGAPANDIAN RAMAIAH
as part of the certification course on
Big Data Analytics using Hadoop
Submitted to
on
20th November 2016
Table of Contents

Summary
Background and Motivation
Scope
Project structure
Technology stack
Building blocks
    Integration of jatomrss parser
    Importing the jatomrss parser output to HDFS
    Geo categorizing the news feeds
    Exporting the output back to database
    Wiring the tasks 1 to 4 with Hadoop workflow
    Providing a browser based interface to the project
Extensibility of the project
Summary
References
To Tamilmanam Kaasi
Summary

This project has been carried out to extract, analyse and display RSS and ATOM news feeds. It has been submitted as the curriculum project of 'Big Data Analytics using Hadoop', a certification course offered by Manipal ProLearn.
Background and Motivation

Several websites and applications are available to aggregate news feeds. The Tamil blog aggregator 'tamilmanam' (http://tamilmanam.net/) by Mr Kasilingam Arumugam was released in 2004 and remains one of the popular tools of the Tamil blogosphere to date. However, it has not been updated recently to broaden its scope, for example by integrating feeds of global news agencies or social media. It also lacks in-depth analytics features for news agencies and bloggers.
This project aims at providing a framework for the following -
1. Interface to aggregate newsfeeds of global news agencies and blogs
2. Auto-geo-categorizing the feeds
3. Providing Technorati-like aggregation features for Tamil blogs
4. Providing deep analytics features for the bloggers
Scope

The final goals of this project are as follows.
1. Providing an automated workflow for feed extraction and analysis
2. Providing a browser based user interface for analysis and reporting

Along with these main goals, the framework is designed to be scalable, so that other feed mechanisms (such as social media streaming) and machine learning analytics can be added later.
Project structure

The Hadoop ecosystem used in this project is responsible for invoking the feed parser and analysing the parsed feeds. It also updates the backend database with the results. All these tasks are wired with an Oozie workflow for automated execution.
The results are then presented to the end user by a front-end built using web technologies.
This would be completed in four stages.
1. Extraction
2. Analysis
3. Workflow
4. Reporting
The detailed list of the tasks for each stage is given below.
# Detail Stage
1 Integration of jatomrss parser Extraction
2 Importing the jatomrss parser output to HDFS Analysis
3 Geo categorizing the news feeds using Hadoop ecosystem Analysis
4 Exporting the output back to database Analysis
5 Wiring the tasks 1 to 4 with Hadoop workflow Workflow
6 Providing a browser based interface to the project Reporting
Table 1: Task List
Illustration 1: Project structure
Technology stack

Feed parsing   : jatomrss-0.0.1-SNAPSHOT, JDK 1.8.0_101
Data Analysis  : Hadoop 2.6.4, Sqoop 1.4.6, Oozie 4.0.0-cdh5.1.0
User Interface : Spring 4.1.6.RELEASE, Spring Security 4.0.0.RELEASE, Hibernate 4.3.8.Final, Tiles 3.0.5, Tomcat 8.5.6
Database       : MySQL 5.7.13-0ubuntu0.16.04.2 (Ubuntu)
Geo Mapping    : Google API
Dependency     : Maven 3.3.9
IDE            : Spring Tool Suite 3.8.1
OS             : Ubuntu 16.04.1 LTS
Machine configuration :
Memory : 12415396 kB
CPU : 4
Disk :
Filesystem Size Used Avail Use% Mounted on
udev 6.0G 0 6.0G 0% /dev
tmpfs 1.2G 9.7M 1.2G 1% /run
/dev/sda1 91G 19G 68G 22% /
tmpfs 6.0G 188K 6.0G 1% /dev/shm
tmpfs 5.0M 4.0K 5.0M 1% /run/lock
tmpfs 6.0G 0 6.0G 0% /sys/fs/cgroup
tmpfs 1.2G 36K 1.2G 1% /run/user/108
tmpfs 1.2G 0 1.2G 0% /run/user/599
Building blocks

Integration of jatomrss parser

The parsing of RSS and ATOM feeds is performed by the jatomrss parser, an open-source project maintained by Murugapandian Ramaiah (the author) since 2008. It is freely available for use and development at https://sourceforge.net/projects/jatomrss/
Scripting work has been undertaken in the parsing logic to store the feed list and the parser output.
The feeds are classified using the categories in the table below.
FeedCategories
# Name Type Collation Attributes Null Default Extra
1 id bigint(20) No None AUTO_INCREMENT
2 categoryName varchar(200) utf8_bin No None
Table 2: List of categories under which the blogs would be classified
This table has the following values.
id categoryName
1 Entertainment
2 Tech & IT
3 Commercial
4 Regional
5 Literature
6 Personal Blog
7 News
Table 3: Contents of feedcategory table
The list of feeds is stored in the following table. The jAtomRss feed parser parses all feeds whose 'enabled' flag is 1.
feedlist

#   Name               Type            Collation        Null  Default  Extra
1   id                 bigint(20)                       No    None     AUTO_INCREMENT
2   feedXmlUrl         varchar(2000)   utf8_bin         No    None     Index
3   enabled            tinyint(4)                       No    1
4   feedCategoryIndex  bigint(20)                       No    None
5   generator          varchar(2000)   utf8_general_ci  Yes   NULL
6   title              varchar(2000)   utf8_general_ci  Yes   NULL
7   description        varchar(2500)   utf8_general_ci  Yes   NULL
8   author             varchar(2000)   utf8_general_ci  Yes   NULL
9   pubDate            timestamp                        Yes   NULL
10  lastParseTime      timestamp                        Yes   NULL
Table 4: Structure of the table feedlist
A sample of the values stored in the above table is given below.
id 324: feedXmlUrl=http://www.straitstimes.com/news/asia/rss.xml, enabled=1, feedCategory=7, generator=application/rss+xml, title=The Straits Times Asia News, description=NULL, author=NULL, pubDate=2016-11-20 06:49:04, lastParseTime=2016-11-20 07:45:46
id 325: feedXmlUrl=http://www.thehindu.com/news/international/?servic..., enabled=1, feedCategory=7, generator=application/rss+xml, title=The Hindu - International, description=RSS feed, author=NULL, pubDate=NULL, lastParseTime=2016-11-20 07:45:47
id 326: feedXmlUrl=http://www.thehindu.com/news/cities/Tiruchirapalli..., enabled=1, feedCategory=7, generator=application/rss+xml, title=The Hindu - Tiruchirapalli, description=RSS feed, author=NULL, pubDate=NULL, lastParseTime=2016-11-20 07:45:47
id 327: feedXmlUrl=http://www.thehindu.com/news/international/south-a..., enabled=1, feedCategory=7, generator=application/rss+xml, title=The Hindu - South Asia, description=RSS feed, author=NULL, pubDate=NULL, lastParseTime=2016-11-20 07:45:47
id 330: feedXmlUrl=http://www.straitstimes.com/news/singapore/rss.xml, enabled=1, feedCategory=7, generator=application/rss+xml, title=The Straits Times Singapore News, description=NULL, author=NULL, pubDate=2016-11-20 07:33:50, lastParseTime=2016-11-20 07:45:46
Table 5: Sample values of the table feedlist
The jAtomRss feed parser has been modified to extract any news items that appeared after the lastParseTime recorded in the feedlist table.
The new articles are written into the feed_record table.
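The incremental check described above can be sketched as follows. This is an illustrative sketch only: IncrementalFilter and isNewEntry are hypothetical names, not the actual jAtomRss code.

```java
import java.time.Instant;

// Sketch of the incremental-parse rule: an entry is stored in feed_record
// only if it appeared after the feed's lastParseTime.
public class IncrementalFilter {

    /**
     * Returns true when the entry should be stored. A null lastParseTime
     * (feed never parsed before) accepts every dated entry.
     */
    public static boolean isNewEntry(Instant entryDate, Instant lastParseTime) {
        if (entryDate == null) {
            return false;          // undated entries are skipped
        }
        if (lastParseTime == null) {
            return true;           // first parse: take everything
        }
        return entryDate.isAfter(lastParseTime);
    }

    public static void main(String[] args) {
        Instant lastParse = Instant.parse("2016-11-20T07:45:46Z");
        System.out.println(isNewEntry(Instant.parse("2016-11-20T08:00:00Z"), lastParse)); // true
        System.out.println(isNewEntry(Instant.parse("2016-11-19T10:00:00Z"), lastParse)); // false
        System.out.println(isNewEntry(Instant.parse("2016-11-20T08:00:00Z"), null));      // true
    }
}
```

After a successful run, the parser updates lastParseTime, so the next run only picks up entries published since.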
feed_record

#  Name              Type           Collation  Null  Default  Extra
1  id                bigint(20)                No    None     AUTO_INCREMENT
2  feedIdIndex       bigint(20)                No    None
3  entrySubject      varchar(1000)  utf8_bin   Yes   NULL
4  entryAuthor       varchar(1000)  utf8_bin   Yes   NULL
5  entryUrl          varchar(1000)  utf8_bin   Yes   NULL
6  entryDate         timestamp                 Yes   NULL
7  categorySet       varchar(1000)  utf8_bin   Yes   NULL
8  entryDescription  longblob                  No    None
A sample of the values stored in the feed_record table is given below.

id 76355: feedId=381, entrySubject=UPDATE 1-Soccer-Italy, Germany play out 0-0 draw, ..., entryAuthor=NULL, entryUrl=http://economictimes.indiatimes.com/news/sports/up..., entryDate=2016-11-16 06:29:15, categorySet=[Germany,International], entryDescription=[BLOB - 120 B]
id 76356: feedId=381, entrySubject=UPDATE 1-Soccer-France held 0-0 by Ivory Coast in ..., entryAuthor=NULL, entryUrl=http://economictimes.indiatimes.com/news/sports/up..., entryDate=2016-11-16 06:24:15, categorySet=[Sports,Soccer], entryDescription=[BLOB - 110 B]
id 76357: feedId=381, entrySubject=Dennis forced out of McLaren after 35 years, entryAuthor=NULL, entryUrl=http://economictimes.indiatimes.com/news/sports/de..., entryDate=2016-11-16 06:19:20, categorySet=[Dennis,McLaren], entryDescription=[BLOB - 71 B]
id 76358: feedId=381, entrySubject=UPDATE 2-Motor racing-Dennis forced out of McLaren..., entryAuthor=NULL, entryUrl=http://economictimes.indiatimes.com/news/sports/up..., entryDate=2016-11-16 06:14:20, categorySet=[Sports,Racing], entryDescription=[BLOB - 104 B]
Hence the scope of the 'Integration of jatomrss parser' task is complete once the feed_record table has been updated with the latest article information.
Importing the jatomrss parser output to HDFS

After parsing completes and the database table feed_record is populated, the data is imported into HDFS for further analysis. Only one analysis, geo-categorizing, is performed in this project.
Each location, with its latitude and longitude, is stored in the table below.
lat_long

#  Name        Type          Collation        Null  Default  Extra
1  id          bigint(20)                     No    None     AUTO_INCREMENT
2  eng_name    varchar(500)  utf8_bin         No    None
3  tamil_name  varchar(500)  utf8_bin         No    None
4  longitude   varchar(20)   utf8_general_ci  No    None
5  latitude    varchar(20)   utf8_general_ci  No    None
6  place       varchar(500)  utf8_bin         No    None
Table 6: Structure of lat_long table
A sample of the contents of the lat_long table is given below.

id  eng_name       tamil_name        longitude  latitude   place
6   Kanchipuram    காஞ்சிபுரம்        12.834174  79.703644  Kanchipuram, Tamil Nadu, India
8   Bengaluru      பெங்களுரு         12.971599  77.594566  Bengaluru, Karnataka, India
20  Mangaluru      மங்களுரு          12.915605  74.855965  Mangaluru, Karnataka, India
30  Uttar Pradesh  உத்திரப் பிரதேசம்  27.94908   80.782402  Lakhimpur, Uttar Pradesh, India
31  Bihar          பீஹார்            25.35128   85.031143  Masaurhi, Bihar, India
32  Gujarat        குஜராத்           23.00795   72.553757  Bhatta, Paldi, Ahmedabad, Gujarat, India
34  West Bengal    மேற்கு வங்கம்      23.012794  87.593948  Kotulpur, West Bengal, India
Later, each article of feed_record is mapped to the geo records of lat_long in a many-to-many relationship. The mapping is stored in the mapping_geo_article table. If the system is unable to find a geo-mapping, the article's geo location is marked as 671.
mapping_geo_article
# Name Type Collation Attributes Null Default Extra
1 feedRecordid bigint(20) No None
2 geoIdPrimaryIndex bigint(20) No None
Table 7: The structure of mapping_geo_article
A sample of the above table is given below.
feedRecordId geoId
76546 8
76568 8
77259 8
77405 8
77259 20
78894 20
77244 30
77444 30
A view, vw_feed_record_geoid, is created by combining the feed_record and lat_long tables, with the following structure.
vw_feed_record_geoid

#   Name              Type           Collation  Null  Default
1   id                bigint(20)                No    0
2   feedId            bigint(20)                No    None
3   entrySubject      varchar(1000)  utf8_bin   Yes   NULL
4   entryAuthor       varchar(1000)  utf8_bin   Yes   NULL
5   entryUrl          varchar(1000)  utf8_bin   Yes   NULL
6   entryDate         timestamp                 Yes   NULL
7   categorySet       varchar(1000)  utf8_bin   Yes   NULL
8   entryDescription  longblob                  No    None
9   feedRecordId      bigint(20)                Yes   NULL
10  geoId             bigint(20)                Yes   NULL
All articles without a geo mapping are imported into HDFS using Sqoop with the following command.

sqoop import --driver com.mysql.jdbc.Driver --connect jdbc:mysql://localhost:3306/feed_analytics --username feed_analytics --password P@ssw0rd --table vw_feed_record_geoid --where "feedRecordId is null" --m 1 --target-dir /user/hadoop/feed/scheduler/sqoop/output/ --fields-terminated-by ×
All the latitude and longitude records are imported by Sqoop as given below.

sqoop import --driver com.mysql.jdbc.Driver --connect jdbc:mysql://localhost:3306/feed_analytics --username feed_analytics --password P@ssw0rd --table lat_long --target-dir /user/hadoop/feed/scheduler/sqoop/geo --fields-terminated-by $ --m 1

The imported records are saved at /user/hadoop/feed/scheduler/sqoop/geo, with each field delimited by $. One mapper process is launched.
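For illustration, splitting one such $-delimited lat_long record back into its six columns (id, eng_name, tamil_name, longitude, latitude, place, following Table 6) can be sketched as below. GeoRecordParser and the sample record are hypothetical; note that '$' must be escaped, since Java's String.split takes a regular expression.

```java
// Sketch: recover the six lat_long columns from a $-delimited line
// as produced by the Sqoop import above.
public class GeoRecordParser {

    public static String[] split(String line) {
        // "\\$" escapes the regex metacharacter; -1 keeps trailing empty fields
        return line.split("\\$", -1);
    }

    public static void main(String[] args) {
        String line = "8$Bengaluru$பெங்களுரு$12.971599$77.594566$Bengaluru, Karnataka, India";
        String[] fields = split(line);
        System.out.println(fields.length);   // 6
        System.out.println(fields[1]);       // Bengaluru
    }
}
```

A delimiter such as $ works here only because the lat_long values themselves never contain it; the article import above uses a different delimiter (×) for the same reason.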
Geo categorizing the news feeds

A Mapper program, org.grassfield.nandu.geo.GeoArticleMapper, is written to find the related geo records. It reads the category and the content of each article to find the location.
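The mapper source is packaged in FeedCategoryCount-32.jar rather than reproduced here. As a rough, hypothetical sketch of the matching idea — known location names from lat_long looked up in the article text, with unmatched articles assigned the fallback id 671 mentioned earlier — GeoMatcher below is illustrative only, not the actual GeoArticleMapper code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of keyword-based geo matching: scan article text for known
// location names and collect the matching lat_long ids.
public class GeoMatcher {

    static final long NO_GEO_MATCH = 671L;   // fallback id from the report

    public static List<Long> match(String text, Map<String, Long> locations) {
        List<Long> hits = new ArrayList<>();
        String lower = text.toLowerCase();
        for (Map.Entry<String, Long> e : locations.entrySet()) {
            if (lower.contains(e.getKey().toLowerCase())) {
                hits.add(e.getValue());
            }
        }
        if (hits.isEmpty()) {
            hits.add(NO_GEO_MATCH);          // mark unmapped articles as 671
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, Long> locations = new LinkedHashMap<>();
        locations.put("Bengaluru", 8L);
        locations.put("Mangaluru", 20L);
        System.out.println(match("Tech meetup held in Bengaluru", locations)); // [8]
        System.out.println(match("Soccer update from Berlin", locations));     // [671]
    }
}
```

In the real job this logic runs inside a Hadoop Mapper, emitting one (feedRecordId, geoId) pair per match, which matches the many-to-many sample of mapping_geo_article shown earlier.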
Map Reduce
Parameter Value
Map Output Key Class LongWritable.class
Map Output Value Class LongWritable.class
Output Key Class LongWritable.class
Output Value Class LongWritable.class
Output Format Class TextOutputFormat.class
Mapper Class GeoArticleMapper.class
Reducer Class GeoArticleReducer.class
Number of Reduce Tasks 0
Input Paths (HDFS) /user/hadoop/feed/scheduler/sqoop/output/
Output Path (HDFS) /user/hadoop/feed/scheduler/sqoop/mr
Once the MapReduce job is executed with the command below, the output is saved in the Output Path of HDFS.

hadoop jar FeedCategoryCount-32.jar org.grassfield.nandu.geo.GeoArticleDriver /user/hadoop/feed/scheduler/sqoop/output/part-m-00000 /user/hadoop/feed/scheduler/mr/

Output:
78630  655
78631  655
78632  655
78633  671
78650  336
78650  339
78650  373
78650  630
78650  631
78650  632
Exporting the output back to database

The output of the MapReduce job must be written back to the database table mapping_geo_article so that it can be used by the user interface. This is performed by Sqoop export as given below.

sqoop export --connect jdbc:mysql://localhost:3306/feed_analytics --username feed_analytics --password P@ssw0rd --table mapping_geo_article --export-dir /user/hadoop/feed/scheduler/sqoop/mr/ --input-fields-terminated-by '\t'
Wiring the tasks 1 to 4 with Hadoop workflow

The individual building blocks are now complete. To automate the Hadoop ecosystem tasks, I have used the Oozie workflow engine.
The workflow is defined in XML; its parameters are defined in the job.properties file given below.
nameNode=hdfs://gandhari:9000
jobTracker=gandhari:8032
queueName=default
examplesRoot=examples
oozie.use.system.libpath=true
oozie.wf.application.path=/user/hadoop/feed/scheduler

#extract feeds
feedparserProperties=/opt/hadoop/feed/database.properties

#extract feed records
feedsOutputFolder=/user/hadoop/feed/scheduler/sqoop/output/

#extract geo
command=import --driver com.mysql.jdbc.Driver --connect jdbc:mysql://localhost:3306/feed_analytics --username feed_analytics --password P@ssw0rd --table lat_long --target-dir /user/hadoop/feed/scheduler/sqoop/geo --fields-terminated-by $ --m 1
geoOutputFolder=/user/hadoop/feed/scheduler/sqoop/geo/

#mapreduce geo
mrInputFolder=/user/hadoop/feed/scheduler/sqoop/output/
mrOutputFolder=/user/hadoop/feed/scheduler/sqoop/mr

#export mr output to mysql
mrExportSqoopCommand=export --connect jdbc:mysql://localhost:3306/feed_analytics --username feed_analytics --password P@ssw0rd --table mapping_geo_article --export-dir /user/hadoop/feed/scheduler/sqoop/mr/ --input-fields-terminated-by "\t"
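The workflow.xml itself is deployed at the oozie.wf.application.path above and packaged with the submitted artifacts. As a hedged sketch of what it could look like, assuming standard Oozie sqoop and java actions and reusing the property names from job.properties (action names here are illustrative, not the actual file):

```xml
<workflow-app name="feed-analytics-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="sqoop-import-geo"/>
    <action name="sqoop-import-geo">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>${command}</command>
        </sqoop>
        <ok to="mr-geo-categorize"/>
        <error to="fail"/>
    </action>
    <action name="mr-geo-categorize">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <main-class>org.grassfield.nandu.geo.GeoArticleDriver</main-class>
            <arg>${mrInputFolder}</arg>
            <arg>${mrOutputFolder}</arg>
        </java>
        <ok to="sqoop-export-mapping"/>
        <error to="fail"/>
    </action>
    <action name="sqoop-export-mapping">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>${mrExportSqoopCommand}</command>
        </sqoop>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Workflow failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>
```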
The Oozie job is first dry-run and then started as given below.

$ oozie job -oozie http://gandhari:11000/oozie -config job.properties -dryrun
OK
$ oozie job -oozie http://gandhari:11000/oozie -config job.properties -run
job: 0000012-161119193909833-oozie-hado-W
The job ran for 20 minutes (for a small data set) and completed successfully, as shown below.
Illustration 2: Status of Oozie job
Illustration 3: Oozie DAG
Providing a browser based interface to the project

Once the data analytics is completed, the data is plotted in a user-friendly way in a browser based interface.
Illustration 4: Last 1 hour news
Illustration 5: Last 5 hour news
Hence the scope of the project is fully covered and demonstrated by the above verifications. The following has been submitted for the reviewers' reference.
1. FeedCategoryCount-32.jar – Archive of Map Reduce code, Oozie workflow
2. jessica-0.0.1-SNAPSHOT.war – Archive of the browser based interface
3. feed_analytics.sql.zip – database schema
4. Project report
5. README.txt – meta data for the files
Extensibility of the project

The framework can be extended to provide the following analyses.
1. Applying machine learning to find trending topics
2. Extending the functionality to social media streams
3. Providing analysis over multiple years with the help of NoSQL databases
Summary

The detailed list of the tasks for each stage, with their status, is given below.
# Detail Stage Status
1 Integration of jatomrss parser Extraction COMPLETED
2 Importing the jatomrss parser output to HDFS Analysis COMPLETED
3 Geo categorizing the news feeds using Hadoop ecosystem Analysis COMPLETED
4 Exporting the output back to database Analysis COMPLETED
5 Wiring the tasks 1 to 4 with Hadoop workflow Workflow COMPLETED
6 Providing a browser based interface to the project Reporting COMPLETED
Table 8: Task List
Illustration 6: Text version of 1 day news
References

1. Manipal ProLearn – prerecorded course demos
2. Apache Oozie documentation
3. Apache Sqoop documentation
4. Community forums of Cloudera and Hortonworks