Automation & Analysis of RSS & ATOM newsfeeds using Hadoop Ecosystem
by
MURUGAPANDIAN RAMAIAH
as part of the certification course on
Big Data Analytics using Hadoop
Submitted to
on
20th November 2016
Table of Contents

Summary
Background and Motivation
Scope
Project structure
Technology stack
Building blocks
    Integration of jatomrss parser
    Importing the jatomrss parser output to HDFS
    Geo categorizing the news feeds
    Exporting the output back to database
    Wiring the tasks 1 to 4 with Hadoop workflow
    Providing a browser based interface to the project
Extensibility of the project
Summary
References
To Tamilmanam Kaasi
Summary

This project has been carried out to extract, analyse and display RSS and ATOM news feeds. It has been submitted as the curriculum project of 'Big Data Analytics using Hadoop', a certification course offered by Manipal ProLearn.
Background and Motivation

Several websites and applications are available to aggregate news feeds. The Tamil blog aggregator 'tamilmanam' (http://tamilmanam.net/) by Mr Kasilingam Arumugam was released in 2004 and remains one of the popular tools of the Tamil blogosphere to date. However, it has not been updated recently to broaden its scope, for example by integrating feeds of global news agencies or social media. It also lacks in-depth analytics features for news agencies and bloggers.
This project aims at providing a framework for the following -
1. Interface to aggregate newsfeeds of global news agencies and blogs
2. Auto-geo-categorizing the feeds
3. Providing Technorati-like aggregation features for Tamil blogs
4. Providing deep analytics features for the bloggers
Scope

The final goals of this project are as follows.
1. Providing an automated workflow for feed extraction and analysis
2. Providing a browser based user interface for analysis and reporting

Along with these main goals, the framework is designed to be scalable, so that other feed mechanisms (such as social media streaming) and machine learning analytics can be added later.
Project structure

The Hadoop ecosystem used in this project is responsible for invoking the feed parser and analysing the parsed feeds. It also updates the backend database with the results. All these tasks are wired with an Oozie workflow for automated execution.
The results are then presented to the end user by a front-end built using web technologies.
This would be completed in four stages.
1. Extraction
2. Analysis
3. Workflow
4. Reporting
The detailed list of the tasks for each stage is given below.
# Detail Stage
1 Integration of jatomrss parser Extraction
2 Importing the jatomrss parser output to HDFS Analysis
3 Geo categorizing the news feeds using Hadoop ecosystem Analysis
4 Exporting the output back to database Analysis
5 Wiring the tasks 1 to 4 with Hadoop workflow Workflow
6 Providing a browser based interface to the project Reporting
Table 1: Task List
Illustration 1: Project structure
Technology stack

Feed parsing   : jatomrss-0.0.1-SNAPSHOT, JDK 1.8.0_101
Data Analysis  : Hadoop 2.6.4, Sqoop 1.4.6, Oozie 4.0.0-cdh5.1.0
User Interface : Spring 4.1.6.RELEASE, Spring Security 4.0.0.RELEASE, Hibernate 4.3.8.Final, Tiles 3.0.5, Tomcat 8.5.6
Database       : MySQL 5.7.13-0ubuntu0.16.04.2 (Ubuntu)
Geo Mapping    : Google API
Dependency     : Maven 3.3.9
IDE            : Spring Tool Suite 3.8.1
OS             : Ubuntu 16.04.1 LTS
Machine configuration :
Memory : 12415396 kB
CPU : 4
Disk :
Filesystem Size Used Avail Use% Mounted on
udev 6.0G 0 6.0G 0% /dev
tmpfs 1.2G 9.7M 1.2G 1% /run
/dev/sda1 91G 19G 68G 22% /
tmpfs 6.0G 188K 6.0G 1% /dev/shm
tmpfs 5.0M 4.0K 5.0M 1% /run/lock
tmpfs 6.0G 0 6.0G 0% /sys/fs/cgroup
tmpfs 1.2G 36K 1.2G 1% /run/user/108
tmpfs 1.2G 0 1.2G 0% /run/user/599
Building blocks

Integration of jatomrss parser

The parsing of RSS and ATOM feeds is performed by the jatomrss parser, an open-source project maintained by Murugapandian Ramaiah (the author) since 2008. It is freely available for use and development at https://sourceforge.net/projects/jatomrss/
Scripting work has been undertaken in the parsing logic to store the feed list and the parser output.
The feeds are classified using the categories in the table below.
FeedCategories
# Name Type Collation Attributes Null Default Extra
1 id bigint(20) No None AUTO_INCREMENT
2 categoryName varchar(200) utf8_bin No None
Table 2: List of categories under which the blogs would be classified
This table has the following values.
id categoryName
1 Entertainment
2 Tech & IT
3 Commercial
4 Regional
5 Literature
6 Personal Blog
7 News
Table 3: Contents of feedcategory table
The list of feeds is stored in the following table. The jAtomRss feed parser parses all feeds whose 'enabled' flag is 1.
feedlist

#   Name               Type            Collation        Null  Default  Extra
1   id                 bigint(20)                       No    None     AUTO_INCREMENT
2   feedXmlUrl         varchar(2000)   utf8_bin         No    None     Index
3   enabled            tinyint(4)                       No    1
4   feedCategoryIndex  bigint(20)                       No    None
5   generator          varchar(2000)   utf8_general_ci  Yes   NULL
6   title              varchar(2000)   utf8_general_ci  Yes   NULL
7   description        varchar(2500)   utf8_general_ci  Yes   NULL
8   author             varchar(2000)   utf8_general_ci  Yes   NULL
9   pubDate            timestamp                        Yes   NULL
10  lastParseTime      timestamp                        Yes   NULL
Table 4: Structure of the table feedlist
A sample of the values stored in the above table is given below.
id 324: feedXmlUrl=http://www.straitstimes.com/news/asia/rss.xml, enabled=1, feedCategory=7, generator=application/rss+xml, title=The Straits Times Asia News, description=NULL, author=NULL, pubDate=2016-11-20 06:49:04, lastParseTime=2016-11-20 07:45:46
id 325: feedXmlUrl=http://www.thehindu.com/news/international/?servic..., enabled=1, feedCategory=7, generator=application/rss+xml, title=The Hindu - International, description=RSS feed, author=NULL, pubDate=NULL, lastParseTime=2016-11-20 07:45:47
id 326: feedXmlUrl=http://www.thehindu.com/news/cities/Tiruchirapalli..., enabled=1, feedCategory=7, generator=application/rss+xml, title=The Hindu - Tiruchirapalli, description=RSS feed, author=NULL, pubDate=NULL, lastParseTime=2016-11-20 07:45:47
id 327: feedXmlUrl=http://www.thehindu.com/news/international/south-a..., enabled=1, feedCategory=7, generator=application/rss+xml, title=The Hindu - South Asia, description=RSS feed, author=NULL, pubDate=NULL, lastParseTime=2016-11-20 07:45:47
id 330: feedXmlUrl=http://www.straitstimes.com/news/singapore/rss.xml, enabled=1, feedCategory=7, generator=application/rss+xml, title=The Straits Times Singapore News, description=NULL, author=NULL, pubDate=2016-11-20 07:33:50, lastParseTime=2016-11-20 07:45:46
Table 5: Sample values of the table feedlist
The jAtomRss feed parser has been modified to extract any news items that appeared after the lastParseTime recorded in the feedlist table.
The new articles are written into the feed_record table.
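The incremental check described above can be sketched as follows. This is an illustrative sketch only: IncrementalFilter and isNewEntry are hypothetical names, not the actual jAtomRss code.

```java
import java.time.Instant;

// Sketch of the incremental-parse rule: an entry is stored in feed_record
// only if it appeared after the feed's lastParseTime.
public class IncrementalFilter {

    /**
     * Returns true when the entry should be stored. A null lastParseTime
     * (feed never parsed before) accepts every dated entry.
     */
    public static boolean isNewEntry(Instant entryDate, Instant lastParseTime) {
        if (entryDate == null) {
            return false;          // undated entries are skipped
        }
        if (lastParseTime == null) {
            return true;           // first parse: take everything
        }
        return entryDate.isAfter(lastParseTime);
    }

    public static void main(String[] args) {
        Instant lastParse = Instant.parse("2016-11-20T07:45:46Z");
        System.out.println(isNewEntry(Instant.parse("2016-11-20T08:00:00Z"), lastParse)); // true
        System.out.println(isNewEntry(Instant.parse("2016-11-19T10:00:00Z"), lastParse)); // false
        System.out.println(isNewEntry(Instant.parse("2016-11-20T08:00:00Z"), null));      // true
    }
}
```

After a successful run, the parser updates lastParseTime, so the next run only picks up entries published since.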
feed_record

#  Name              Type           Collation  Null  Default  Extra
1  id                bigint(20)                No    None     AUTO_INCREMENT
2  feedIdIndex       bigint(20)                No    None
3  entrySubject      varchar(1000)  utf8_bin   Yes   NULL
4  entryAuthor       varchar(1000)  utf8_bin   Yes   NULL
5  entryUrl          varchar(1000)  utf8_bin   Yes   NULL
6  entryDate         timestamp                 Yes   NULL
7  categorySet       varchar(1000)  utf8_bin   Yes   NULL
8  entryDescription  longblob                  No    None
A sample of the values stored in the feed_record table is given below.

id 76355: feedId=381, entrySubject=UPDATE 1-Soccer-Italy, Germany play out 0-0 draw, ..., entryAuthor=NULL, entryUrl=http://economictimes.indiatimes.com/news/sports/up..., entryDate=2016-11-16 06:29:15, categorySet=[Germany,International], entryDescription=[BLOB - 120 B]
id 76356: feedId=381, entrySubject=UPDATE 1-Soccer-France held 0-0 by Ivory Coast in ..., entryAuthor=NULL, entryUrl=http://economictimes.indiatimes.com/news/sports/up..., entryDate=2016-11-16 06:24:15, categorySet=[Sports,Soccer], entryDescription=[BLOB - 110 B]
id 76357: feedId=381, entrySubject=Dennis forced out of McLaren after 35 years, entryAuthor=NULL, entryUrl=http://economictimes.indiatimes.com/news/sports/de..., entryDate=2016-11-16 06:19:20, categorySet=[Dennis,McLaren], entryDescription=[BLOB - 71 B]
id 76358: feedId=381, entrySubject=UPDATE 2-Motor racing-Dennis forced out of McLaren..., entryAuthor=NULL, entryUrl=http://economictimes.indiatimes.com/news/sports/up..., entryDate=2016-11-16 06:14:20, categorySet=[Sports,Racing], entryDescription=[BLOB - 104 B]
Hence the scope of the 'Integration of jatomrss parser' task is complete once the feed_record table has been updated with the latest article information.
Importing the jatomrss parser output to HDFS

After parsing completes and the database table feed_record is populated, the data is imported into HDFS for further analysis. Only one analysis, geo-categorizing, is performed in this project.
Each location, with its latitude and longitude, is stored in the table below.
lat_long

#  Name        Type          Collation        Null  Default  Extra
1  id          bigint(20)                     No    None     AUTO_INCREMENT
2  eng_name    varchar(500)  utf8_bin         No    None
3  tamil_name  varchar(500)  utf8_bin         No    None
4  longitude   varchar(20)   utf8_general_ci  No    None
5  latitude    varchar(20)   utf8_general_ci  No    None
6  place       varchar(500)  utf8_bin         No    None
Table 6: Structure of lat_long table
A sample of the contents of the lat_long table is given below.

id  eng_name       tamil_name        longitude  latitude   place
6   Kanchipuram    காஞ்சிபுரம்        12.834174  79.703644  Kanchipuram, Tamil Nadu, India
8   Bengaluru      பெங்களுரு         12.971599  77.594566  Bengaluru, Karnataka, India
20  Mangaluru      மங்களுரு          12.915605  74.855965  Mangaluru, Karnataka, India
30  Uttar Pradesh  உத்திரப் பிரதேசம்  27.94908   80.782402  Lakhimpur, Uttar Pradesh, India
31  Bihar          பீஹார்            25.35128   85.031143  Masaurhi, Bihar, India
32  Gujarat        குஜராத்           23.00795   72.553757  Bhatta, Paldi, Ahmedabad, Gujarat, India
34  West Bengal    மேற்கு வங்கம்      23.012794  87.593948  Kotulpur, West Bengal, India
Later, each article of feed_record is mapped to the geo records of lat_long in a many-to-many relationship. The mapping is stored in the mapping_geo_article table. If the system is unable to find a geo-mapping, the article's geo location is marked as 671.
mapping_geo_article
# Name Type Collation Attributes Null Default Extra
1 feedRecordid bigint(20) No None
2 geoIdPrimaryIndex bigint(20) No None
Table 7: The structure of mapping_geo_article
A sample of the above table is given below.
feedRecordId geoId
76546 8
76568 8
77259 8
77405 8
77259 20
78894 20
77244 30
77444 30
A view, vw_feed_record_geoid, is created by combining the feed_record and lat_long tables, with the following structure.
vw_feed_record_geoid

#   Name              Type           Collation  Null  Default
1   id                bigint(20)                No    0
2   feedId            bigint(20)                No    None
3   entrySubject      varchar(1000)  utf8_bin   Yes   NULL
4   entryAuthor       varchar(1000)  utf8_bin   Yes   NULL
5   entryUrl          varchar(1000)  utf8_bin   Yes   NULL
6   entryDate         timestamp                 Yes   NULL
7   categorySet       varchar(1000)  utf8_bin   Yes   NULL
8   entryDescription  longblob                  No    None
9   feedRecordId      bigint(20)                Yes   NULL
10  geoId             bigint(20)                Yes   NULL
All articles without a geo mapping are imported into HDFS using Sqoop with the following command.

sqoop import --driver com.mysql.jdbc.Driver --connect jdbc:mysql://localhost:3306/feed_analytics --username feed_analytics --password P@ssw0rd --table vw_feed_record_geoid --where "feedRecordId is null" --m 1 --target-dir /user/hadoop/feed/scheduler/sqoop/output/ --fields-terminated-by ×
All the latitude and longitude records are imported by Sqoop as given below.

sqoop import --driver com.mysql.jdbc.Driver --connect jdbc:mysql://localhost:3306/feed_analytics --username feed_analytics --password P@ssw0rd --table lat_long --target-dir /user/hadoop/feed/scheduler/sqoop/geo --fields-terminated-by $ --m 1

The imported records are saved at /user/hadoop/feed/scheduler/sqoop/geo, with each field delimited by $. One mapper process is launched.
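For illustration, splitting one such $-delimited lat_long record back into its six columns (id, eng_name, tamil_name, longitude, latitude, place, following Table 6) can be sketched as below. GeoRecordParser and the sample record are hypothetical; note that '$' must be escaped, since Java's String.split takes a regular expression.

```java
// Sketch: recover the six lat_long columns from a $-delimited line
// as produced by the Sqoop import above.
public class GeoRecordParser {

    public static String[] split(String line) {
        // "\\$" escapes the regex metacharacter; -1 keeps trailing empty fields
        return line.split("\\$", -1);
    }

    public static void main(String[] args) {
        String line = "8$Bengaluru$பெங்களுரு$12.971599$77.594566$Bengaluru, Karnataka, India";
        String[] fields = split(line);
        System.out.println(fields.length);   // 6
        System.out.println(fields[1]);       // Bengaluru
    }
}
```

A delimiter such as $ works here only because the lat_long values themselves never contain it; the article import above uses a different delimiter (×) for the same reason.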
Geo categorizing the news feeds

A Mapper program, org.grassfield.nandu.geo.GeoArticleMapper, is written to find the related geo records. It reads the category and the content of each article to find the location.
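The mapper source is packaged in FeedCategoryCount-32.jar rather than reproduced here. As a rough, hypothetical sketch of the matching idea — known location names from lat_long looked up in the article text, with unmatched articles assigned the fallback id 671 mentioned earlier — GeoMatcher below is illustrative only, not the actual GeoArticleMapper code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of keyword-based geo matching: scan article text for known
// location names and collect the matching lat_long ids.
public class GeoMatcher {

    static final long NO_GEO_MATCH = 671L;   // fallback id from the report

    public static List<Long> match(String text, Map<String, Long> locations) {
        List<Long> hits = new ArrayList<>();
        String lower = text.toLowerCase();
        for (Map.Entry<String, Long> e : locations.entrySet()) {
            if (lower.contains(e.getKey().toLowerCase())) {
                hits.add(e.getValue());
            }
        }
        if (hits.isEmpty()) {
            hits.add(NO_GEO_MATCH);          // mark unmapped articles as 671
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, Long> locations = new LinkedHashMap<>();
        locations.put("Bengaluru", 8L);
        locations.put("Mangaluru", 20L);
        System.out.println(match("Tech meetup held in Bengaluru", locations)); // [8]
        System.out.println(match("Soccer update from Berlin", locations));     // [671]
    }
}
```

In the real job this logic runs inside a Hadoop Mapper, emitting one (feedRecordId, geoId) pair per match, which matches the many-to-many sample of mapping_geo_article shown earlier.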
Map Reduce
Parameter Value
Map Output Key Class LongWritable.class
Map Output Value Class LongWritable.class
Output Key Class LongWritable.class
Output Value Class LongWritable.class
Output Format Class TextOutputFormat.class
Mapper Class GeoArticleMapper.class
Reducer Class GeoArticleReducer.class
Number of Reduce Tasks 0
Input Paths (HDFS) /user/hadoop/feed/scheduler/sqoop/output/
Output Path (HDFS) /user/hadoop/feed/scheduler/sqoop/mr
Once the MapReduce job is executed with the command below, the output is saved in the Output Path of HDFS.

hadoop jar FeedCategoryCount-32.jar org.grassfield.nandu.geo.GeoArticleDriver /user/hadoop/feed/scheduler/sqoop/output/part-m-00000 /user/hadoop/feed/scheduler/mr/

Output:
78630  655
78631  655
78632  655
78633  671
78650  336
78650  339
78650  373
78650  630
78650  631
78650  632
Exporting the output back to database

The output of the MapReduce job must be written back to the database table mapping_geo_article so that it can be used by the user interface. This is performed by Sqoop export as given below.

sqoop export --connect jdbc:mysql://localhost:3306/feed_analytics --username feed_analytics --password P@ssw0rd --table mapping_geo_article --export-dir /user/hadoop/feed/scheduler/sqoop/mr/ --input-fields-terminated-by '\t'
Wiring the tasks 1 to 4 with Hadoop workflow

The individual building blocks are now complete. To automate the Hadoop ecosystem tasks, I have used the Oozie workflow engine.
The workflow is defined in XML; its parameters are defined in the job.properties file given below.
nameNode=hdfs://gandhari:9000
jobTracker=gandhari:8032
queueName=default
examplesRoot=examples
oozie.use.system.libpath=true
oozie.wf.application.path=/user/hadoop/feed/scheduler

#extract feeds
feedparserProperties=/opt/hadoop/feed/database.properties

#extract feed records
feedsOutputFolder=/user/hadoop/feed/scheduler/sqoop/output/

#extract geo
command=import --driver com.mysql.jdbc.Driver --connect jdbc:mysql://localhost:3306/feed_analytics --username feed_analytics --password P@ssw0rd --table lat_long --target-dir /user/hadoop/feed/scheduler/sqoop/geo --fields-terminated-by $ --m 1
geoOutputFolder=/user/hadoop/feed/scheduler/sqoop/geo/

#mapreduce geo
mrInputFolder=/user/hadoop/feed/scheduler/sqoop/output/
mrOutputFolder=/user/hadoop/feed/scheduler/sqoop/mr

#export mr output to mysql
mrExportSqoopCommand=export --connect jdbc:mysql://localhost:3306/feed_analytics --username feed_analytics --password P@ssw0rd --table mapping_geo_article --export-dir /user/hadoop/feed/scheduler/sqoop/mr/ --input-fields-terminated-by "\t"
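The workflow.xml itself is deployed at the oozie.wf.application.path above and packaged with the submitted artifacts. As a hedged sketch of what it could look like, assuming standard Oozie sqoop and java actions and reusing the property names from job.properties (action names here are illustrative, not the actual file):

```xml
<workflow-app name="feed-analytics-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="sqoop-import-geo"/>
    <action name="sqoop-import-geo">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>${command}</command>
        </sqoop>
        <ok to="mr-geo-categorize"/>
        <error to="fail"/>
    </action>
    <action name="mr-geo-categorize">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <main-class>org.grassfield.nandu.geo.GeoArticleDriver</main-class>
            <arg>${mrInputFolder}</arg>
            <arg>${mrOutputFolder}</arg>
        </java>
        <ok to="sqoop-export-mapping"/>
        <error to="fail"/>
    </action>
    <action name="sqoop-export-mapping">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>${mrExportSqoopCommand}</command>
        </sqoop>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Workflow failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>
```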
The Oozie job is first dry-run and then started as given below.

$ oozie job -oozie http://gandhari:11000/oozie -config job.properties -dryrun
OK
$ oozie job -oozie http://gandhari:11000/oozie -config job.properties -run
job: 0000012-161119193909833-oozie-hado-W
The job ran for 20 minutes (for a small data set) and completed successfully, as shown below.
Illustration 2: Status of Oozie job
Illustration 3: Oozie DAG
Providing a browser based interface to the project

Once the data analytics is completed, the data is plotted in a user-friendly way in a browser based interface.
Illustration 4: Last 1 hour news
Illustration 5: Last 5 hour news
Hence the scope of the project is fully covered and demonstrated by the above verifications. The following has been submitted for the reviewers' reference.
1. FeedCategoryCount-32.jar – Archive of Map Reduce code, Oozie workflow
2. jessica-0.0.1-SNAPSHOT.war – Archive of the browser based interface
3. feed_analytics.sql.zip – database schema
4. Project report
5. README.txt – meta data for the files
Extensibility of the project

The framework can be extended to provide the following analyses.
1. Applying machine learning to find trending topics
2. Extending the functionality to social media streams
3. Providing analysis over multiple years with the help of NoSQL databases
Summary

The detailed list of the tasks for each stage, with their status, is given below.
# Detail Stage Status
1 Integration of jatomrss parser Extraction COMPLETED
2 Importing the jatomrss parser output to HDFS Analysis COMPLETED
3 Geo categorizing the news feeds using Hadoop ecosystem Analysis COMPLETED
4 Exporting the output back to database Analysis COMPLETED
5 Wiring the tasks 1 to 4 with Hadoop workflow Workflow COMPLETED
6 Providing a browser based interface to the project Reporting COMPLETED
Table 8: Task List
Illustration 6: Text version of 1 day news
References

1. Manipal ProLearn – prerecorded course demos
2. Apache Oozie documentation
3. Apache Sqoop documentation
4. Community forums of Cloudera and Hortonworks