Analysis & Review of Product Using Twitter Tweets as a Dataset


A Project Report

On

Analysis & Review of Product using Twitter

Tweets as a Dataset Submitted in partial fulfillment of the requirements

of University of Mumbai for the degree of

Bachelor of Engineering

By

Gadade Dnyaneshwar (111CP1193)

Khavase Namrata (111CP1727)

Rao Vidya (111CP1726)

Project Guide: Prof. Chandrashekhar Badgujar

Department of Computer Engineering

Mahatma Gandhi Mission’s College of Engineering & Technology

Kamothe, Navi Mumbai – 400 209

Academic Year: 2015-16

CERTIFICATE

This is to certify that the project entitled “Analysis & Review using Twitter Text

Mining” is a bonafide work of

1. Mr. Gadade Dnyaneshwar (111CP1193)

2. Ms. Khavase Namrata (111CP1727)

3. Ms. Rao Vidya (111CP1726)

Submitted to the University of Mumbai in partial fulfillment of

the requirement for the award of the degree of “Undergraduate” in “Computer

Engineering”.

(Prof. Chandrasekhar Badgujar) (Prof. Jagrati Shekhawat)

Project Guide Project Coordinator

(Dr. Anitha Patil) (Dr. Santosh K. Narayankhedkar)

Head of Department Principal

Project Report Approval for B. E.

This project report entitled Analysis & Review using Twitter Text Mining by

BE Students is approved for the degree of Computer Engineering

Examiners

1.--------------------------------------------

2.---------------------------------------------

Date:

Place:

Declaration

We declare that this written submission represents our ideas in our own words and where others' ideas or words have been included, we have adequately cited and referenced the original sources. We also declare that we have adhered to all principles of academic honesty and integrity and have not misrepresented or fabricated or falsified any idea/data/fact/source in our submission. We understand that any violation of the above will be cause for disciplinary action by the Institute and can also evoke penal action from the sources which have thus not been properly cited or from whom proper permission has not been taken when needed.

-----------------------------------------

(Signature)

Gadade Dnyaneshwar (111CP1193)

-----------------------------------------

(Signature)

Khavase Namrata (111CP1727)

-----------------------------------------

(Signature)

Rao Vidya (111CP1726)

Date:

Chapter 1. INTRODUCTION

Recently, the number of online shopping customers has increased dramatically due to the rapid growth of e-commerce and the growing number of online merchants. To enhance customer satisfaction, merchants and product manufacturers allow customers to review or express their opinions on products and services. Customers can now post product reviews at merchant sites, e.g., amazon.com, cnet.com, and epinions.com. These online customer reviews have become a source of information that is very useful for both potential customers and product manufacturers. Customers use this information to support their decision on whether to purchase a product. From the product manufacturer's perspective, understanding customer preferences is highly valuable for product development, marketing, and customer relationship management.

Since customer feedback influences other customers' decisions, review documents have become an important source of information for business organizations to take into account when developing marketing and product development plans.

How Does an Opinion Mining System Work?

Of the two main types of textual information, facts and opinions, a major portion of current information processing methods, such as web search and text mining, work with the former. Opinion mining refers to the broad area of natural language processing, computational linguistics and text mining involving the computational study of opinions, sentiments and emotions expressed in text. A thought, view, or attitude based on emotion instead of reason is often referred to as a sentiment; hence the alternate term for opinion mining, sentiment analysis. This field finds critical use in areas where organizations or individuals wish to know the general sentiment associated with a particular entity, be it a product, person, public policy, movie or even an institution. Opinion mining has many application domains, including science and technology, entertainment, education, politics, marketing, accounting, law, and research and development. In earlier days, with limited access to user-generated opinions, research in this field was minimal. With the tremendous growth of the World Wide Web, however, huge volumes of opinionated text in the form of blogs, reviews, discussion groups and forums are available for analysis, making the Web the fastest, most comprehensive and most easily accessible medium for sentiment analysis. Finding opinion sources and monitoring them over the Web can nevertheless be a formidable task, because a large number of diverse sources exist and each source contains a huge volume of information. From a human's perspective, it is both difficult and tiresome to find relevant sources, extract pertinent sentences, read them, summarize them and organize them into a usable form. An automated and faster opinion mining and summarizing system is thus needed.

1.1 Scope of the Project

Opinion mining generally refers to the process of extracting product features and opinions from review documents and summarizing them using a graphical representation. Document-level opinion mining systems generally fail to reveal which product features users liked or disliked; they only classify reviews as positive or negative. A positive review does not mean that the opinion holder has a positive opinion on all aspects or features of the product. Similarly, a negative review does not mean that the opinion holder dislikes everything about the product. With these facts in mind, feature-based opinion mining is proposed. Since a complete opinion is usually expressed in one sentence along with its relevant feature, feature and opinion pair extraction can be performed at the sentence level to avoid false associations.

Chapter 2. MOTIVATION & APPROACH

2.1 Problem Statement

As the number of reviews that a product receives may grow rapidly, and the reviews themselves are often quite lengthy, it is hard for customers to analyze them through manual reading in order to make an informed purchase decision. A large number of reviews for a single product may also make it harder for individuals to evaluate the true underlying quality of the product. In these cases, customers naturally gravitate to reading only a few reviews before forming a decision, and may therefore get only a biased view of the product. Similarly, manufacturers want to read the reviews to identify which elements of a product affect sales most, and a large number of reviews makes it hard for product manufacturers or business organizations to keep track of customers' opinions and sentiments on their products and services. Since most reviews are stored in either unstructured or semi-structured format, distilling knowledge from this huge repository becomes a challenging task. It would be a great help for both customers and manufacturers if the reviews could be processed automatically and presented in a summarized form highlighting the product features and the user opinions expressed about them. We therefore propose a text mining approach to mine product features and opinions from review documents.

2.2 Motivation

In recent years, we have witnessed that opinionated postings in social media have helped reshape businesses and sway public sentiments and emotions, which has profoundly impacted our social and political systems. Such postings have also mobilized masses for political change, such as the events in some Arab countries in 2011. It has thus become a necessity to collect and study opinions on the Web. Of course, opinionated documents do not exist only on the Web (called external data); many organizations also have internal data, e.g., customer feedback collected from emails and call centers, or results from surveys conducted by the organizations.

Due to these applications, industrial activities have flourished in recent years. Sentiment analysis applications have spread to almost every possible domain, from consumer products, services, healthcare, and financial services to social events and political elections. Many big corporations have also built their own in-house capabilities, e.g., Microsoft, Google, Hewlett-Packard, SAP, and SAS. These practical applications and industrial interests have provided strong motivations for research in sentiment analysis.

The proposed system deals with the English language. English was selected because it is widely used and universally known, and most customers express their opinions in English. The problem is that the content available on the web, that is, the recommendations and views expressed by internet users, is unstructured or semi-structured. In simple terms, the English used can be grammatically incorrect, with spelling mistakes and wide use of shortcuts. We refer to these two kinds of content as formal opinions and informal opinions: formal opinions use proper English with no grammatical mistakes, whereas informal opinions use improper English with grammatical mistakes and non-standard spellings. UK English is often said to be formal English and US English is considered to be informal English. Most customers use US English, so it is very important to consider these opinions too; otherwise they might be discarded as noise. Hence the proposed system deals with this aspect of informal reviews.

Chapter 3. Related Work

3.1 Overview

Our work is partly based on and closely related to opinion mining and sentence sentiment classification. Extensive research has been done on sentiment analysis of review text and subjectivity analysis (determining whether a sentence is subjective or objective). Another related area is feature/topic-based sentiment analysis, in which opinions on particular attributes of a product are determined. Most of this work concentrates on finding the sentiment associated with a sentence (and in some cases, the entire review). There has also been some research on automatically extracting product features from review text. Though there has been some work in review summarization, and assigning summary scores to products based on customer reviews, there has been relatively little work on ranking products using customer reviews.

3.2 Existing System

Existing systems for feature-based opinion mining have applied various methods for feature extraction and refinement, including NLP and statistical methods. However, analyses of these systems revealed two main problems. First, most systems select the feature from a sentence by considering only information about the term itself, for example term frequency, without considering the relationship between the term and the related opinion phrases in the sentence. As a result, there is a high probability that the wrong terms will be chosen as features. Second, words like ‘photo,’ ‘picture,’ and ‘image’ that have the same or similar meanings are treated as different features, since most methods employ only surface or grammatical analysis for feature differentiation. This results in the extraction of too many features from the review data, often causing incorrect opinion analysis and an inappropriate summary of the review analysis.

3.3 Level of Opinion Mining

Opinion mining tasks can be broadly classified by the level at which they are performed:

The document level,

The sentence level,

The feature level.

At the document level, documents are classified into positive, negative, and neutral polarities under the assumption that each document focuses on a single object O (although this is not necessarily the case in many realistic situations, such as discussion forum posts) and contains opinions from a single opinion holder. At the sentence level, subjective or opinionated sentences in the corpus are first identified by classifying the text as objective (lacking opinion) or subjective (opinionated). Subsequently, each of these subjective sentences is classified as positive, negative or neutral. At this level as well, we assume that a sentence contains only one opinion, which, as at the document level, is not true in many cases. An optional task is to consider clauses.

At the feature level, the various tasks that are looked at are:

Task 1: Identifying and extracting object features that have been commented on in each review/text.

Task 2: Determining whether the opinions on the features are positive, negative or neutral.

Task 3: Grouping feature synonyms and producing a feature-based opinion summary of multiple reviews/text.

When both F (the set of features) and W (the synonyms of each feature) are unknown, all three tasks need to be performed. If F is known but W is unknown, all three tasks are still needed, but Task 3 is easier: it reduces to the problem of matching discovered features with the given feature set F. When both W and F are known, only Task 2 is needed.
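The simplest case above, where both F and W are known, can be illustrated with a minimal Python sketch. The feature synonym table, the opinion lexicon, and the function name `summarize` are hypothetical and chosen only for illustration. The sketch assigns a polarity to each sentence (Task 2) and aggregates counts per canonical feature (the summary part of Task 3); it is not the method used by the proposed system.

```python
from collections import defaultdict

# Hypothetical inputs: known features F with their synonyms W,
# and a toy opinion lexicon (+1 positive, -1 negative).
FEATURE_SYNONYMS = {
    "picture": {"picture", "photo", "image"},
    "battery": {"battery"},
}
OPINION_WORDS = {"great": 1, "excellent": 1, "good": 1,
                 "poor": -1, "bad": -1, "slow": -1}


def summarize(sentences):
    """Count positive and negative sentences per canonical feature."""
    # Reverse map: synonym -> canonical feature name.
    canonical = {syn: feat
                 for feat, syns in FEATURE_SYNONYMS.items()
                 for syn in syns}
    summary = defaultdict(lambda: {"positive": 0, "negative": 0})
    for sentence in sentences:
        words = [w.strip(".,!?").lower() for w in sentence.split()]
        features = {canonical[w] for w in words if w in canonical}
        # Sentence polarity as the sum of known opinion-word scores.
        score = sum(OPINION_WORDS.get(w, 0) for w in words)
        if score == 0:
            continue  # neutral, or no known opinion word
        label = "positive" if score > 0 else "negative"
        for feature in features:
            summary[feature][label] += 1
    return {feature: dict(counts) for feature, counts in summary.items()}


reviews = ["The photo quality is excellent.",
           "Image is great!",
           "The battery is bad."]
result = summarize(reviews)
```

Because "photo" and "image" both map to the canonical feature "picture", their opinions are counted together rather than as separate features.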

3.3.1 Document Level Sentiment Analysis

The binary classification task of labeling an opinionated document as expressing either an overall positive or an overall negative opinion is called sentiment polarity classification or polarity classification. This binary decision task is alternatively termed as sentiment classification. Sentiment polarity forms one of the core problems in opinion mining in the sense that it forms the general character for a set of problems in this field: given an opinionated piece of text, where we shall assume that the entire text bears an overall opinion on a single issue or item, classify the opinion as one of two opposing sentiment polarities and locate its position on the continuum between these two polarities.

3.3.1.1 Approach

1. Classification based on sentiment phrases

2. Classification using text classification methods

3. Classification using a score function

Example (positive document)

Canon’s PowerShot SX10 IS is a 10 Megapixel super-zoom camera with a 20x optically-stabilized lens and a 2.5in flip-out screen. It has an excellent resolution. Digital zoom images are surprisingly good. Announced in September 2008 alongside the higher-end SX1 IS, it’s the successor to the best-selling PowerShot S5 IS and retains its main body shape, articulated screen, battery power and movies with stereo sound, but within the camera there have been some major changes. Now, it has better noise control than Canon's previous "S" models. However, autofocus is very slow.

1) Classify documents (e.g., reviews) based on the overall sentiments expressed by opinion holders (authors): positive, negative, and (possibly) neutral. Since in our model an object O itself is also a feature, sentiment classification essentially determines the opinion expressed on O in each document (e.g., review).

2) Similar to, but different from, topic-based text classification. In topic-based text classification, topic words are important. In sentiment classification, sentiment words are more important, e.g., great, excellent, horrible, bad, worst, etc.

3) Mainly at the document level, but also extendable to the sentence level.
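The score-function approach listed above can be sketched in a few lines of Python. The word lists, the score formula (positive count minus negative count, divided by their total) and the function names are assumptions made for illustration; the report does not specify a particular score function.

```python
# Hypothetical sentiment word lists, for illustration only.
POSITIVE = {"great", "excellent", "good", "better"}
NEGATIVE = {"horrible", "bad", "worst", "slow"}


def sentiment_score(text):
    """Return a score in [-1, 1]: (pos - neg) / (pos + neg)."""
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


def classify_document(text, threshold=0.0):
    """Label a whole document as positive, negative or neutral."""
    score = sentiment_score(text)
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


# Abridged version of the Canon example above.
doc = ("It has an excellent resolution. "
       "Digital zoom images are surprisingly good. "
       "Now, it has better noise control. "
       "However, autofocus is very slow.")
label = classify_document(doc)
```

With these word lists the abridged review contains three positive words and one negative word, so it is labeled positive as a whole, and the complaint about the slow autofocus is lost in the document-level label.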

Disadvantages

It does not give details on what people liked or disliked

Specific features of an object that the author likes or dislikes cannot be identified

It is not easily applicable to non-reviews, e.g. forum and blog postings

Main focus may not be evaluation or review, but still contain a few opinion sentences

3.3.2 Sentence-level Sentiment Analysis

Sentiment classification at the document level is the most important field of web opinion mining. However, for most applications the document level is too coarse, so a finer analysis at the sentence level is needed. Research in this area mostly focuses on classifying sentences as objective or subjective; the aim is to recognize subjective sentences in news articles, not to extract them. Sentiment classification as described for the document level still applies at the sentence level, where the same approaches, such as Turney's algorithm based on likelihood ratios, are used. This part therefore focuses on objective/subjective sentence classification and presents two methods for it. The first method is a bootstrapping approach using learned patterns: it is self-improving and relies on phrase patterns that are learned automatically.

The input of this method is a known subjective vocabulary and a collection of annotated texts.

• The high-precision (HP) classifiers decide whether sentences are objective or subjective based on the input vocabulary. High-precision means that their behavior is stable and reproducible. They are not able to classify all the sentences, but they make almost no errors.

• Phrase patterns that are expected to represent subjective sentences are then extracted and applied to the sentences that the HP classifiers have left unlabeled.

• The system is self-improving, as the new subjective sentences or patterns are used in a loop on the unlabeled data.

This algorithm recognized 40% of the subjective sentences in a test set of 2197 sentences (59% of which are subjective) with 90% precision. For comparison, the HP subjective classifier alone recognizes 33% of the subjective sentences with 91% precision. Alongside this original method, more classical data mining algorithms are used, such as the naïve Bayes classifier.

The general concept is to split each sentence into features, such as the presence of words, the presence of n-grams, and heuristics from other studies in the field, and to use the statistics of the training data about those features to classify new sentences. The reported results show that the more features, the better. At best, they achieved 80-90% recall and precision for subjective/opinion sentences and 50% recall and precision for objective/fact sentences. Sentence-level sentiment classification methods are improving; these results, from research studies in 2003, show that the methods were already quite effective then and that the task is feasible.
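As a rough illustration of the naïve Bayes idea described above, the following Python sketch classifies sentences as subjective or objective using only word-presence features and add-one (Laplace) smoothing. The class name, the tiny training set, and the choice of smoothing are assumptions for illustration; the studies cited above used much larger corpora and richer features such as n-grams.

```python
import math
from collections import Counter


def tokenize(sentence):
    """Word-presence features: the set of lowercased words in a sentence."""
    return {w.strip(".,!?").lower() for w in sentence.split()} - {""}


class NaiveBayes:
    def fit(self, sentences, labels):
        self.labels = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.labels}
        self.vocab = set()
        for sentence, label in zip(sentences, labels):
            words = tokenize(sentence)
            self.word_counts[label].update(words)
            self.vocab |= words
        return self

    def predict(self, sentence):
        # Words never seen in training carry no evidence and are ignored.
        words = tokenize(sentence) & self.vocab
        total_docs = sum(self.doc_counts.values())
        best_label, best_logp = None, -math.inf
        for label in self.labels:
            logp = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for w in words:
                # Add-one smoothing avoids zero probabilities.
                logp += math.log((self.word_counts[label][w] + 1)
                                 / (total_words + len(self.vocab)))
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label


# Hypothetical training data, for illustration only.
train = [
    ("I love this phone", "subjective"),
    ("The screen looks amazing", "subjective"),
    ("I think the battery is terrible", "subjective"),
    ("The phone weighs 120 grams", "objective"),
    ("The battery capacity is 3000 mAh", "objective"),
    ("It was released in 2015", "objective"),
]
model = NaiveBayes().fit([s for s, _ in train], [l for _, l in train])
```

Calling `model.predict("I love the screen")` picks the class whose training sentences share the most words with the input, weighted by how often each word occurred in that class.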

3.3.3 Feature-based sentiment analysis

Sentiment classifications at both document and sentence (or clause) level are useful, however they do not find what the opinion holder liked and disliked. A negative sentiment on an object does not mean that the opinion holder dislikes everything about the object. Similarly, a positive sentiment on an object does not mean that the opinion holder likes everything about the object. Thus, sentiment analysis at the feature level is necessary.

Considering user reviews as our primary source of opinionated data in this model, we look at different review formats that the system is expected to handle (most of which are commonly used in websites and forums).

Format 1 - Pros, Cons and detailed review: The reviewer is asked to describe Pros and Cons separately and also write a detailed review. Epinions.com uses this format.

Format 2 - Pros and Cons: The reviewer is asked to describe Pros and Cons separately. C|net.com used to use this format.

Format 3 - free format: The reviewer can write freely, i.e., no separation of Pros and Cons. Amazon.com uses this format.

3.4 System structure

The proposed system aims to identify the semantic orientation of movie reviews. This task can be conducted in eight main steps (shown in Figure 1):

Step 1. Parse the movie reviews with the Stanford parser.

Step 2. Extract adjectives and features from the reviews together with their grammatical relationships, i.e., perform a grammatical analysis of the reviews. In this step, all adjectives and the features modified or described by them are found using grammatical knowledge.

Step 3. Predict the polarity of the adjectives. Based on WordNet, all adjectives are divided into five groups. The output of this step is the polarity knowledge library of adjectives.

Step 4. Compute the similarity between the features and the topic.

Step 5. Based on the polarity library and the similarity between features and topic, produce the original score of every noun.

Step 6. Produce the weighted score.

Step 7. To predict the semantic orientation of a review, it must be represented as a vector. This step constructs the review vector from the weighted scores.

Step 8. Produce polarity labels for the reviews. We use two methods, machine learning and total weighted score computing, to generate the polarity labels, and compare their precision with that of a simple term counting method.

Chapter 4. System Analysis and Design

4.1 ACTIVITY DIAGRAM

[Figure: activity diagram. Recovered node labels: Extraction of +ve and -ve keywords; Admin Login; Input Keyword; Tweets Retrieval; Classification using NLP; Feature Set Extraction; Data Pre-processing; Results & Graphical Representation; QT clustering.]

4.2 USE CASE DIAGRAM

[Figure: use case diagram. Actor: Admin. Recovered use cases: Input Keyword; Retrieve Tweets; Feature Extraction of words; QT clustering; Result Generation.]

4.3 SEQUENCE DIAGRAM

4.4 COLLABORATION DIAGRAM

4.5 CLASS DIAGRAM

Chapter 5. Proposed System

5.1 Overview

The proposed system takes care of informal reviews. The idea of formal and informal words has been revised slightly in the proposed system. The English used in reviews can be grammatically incorrect, with spelling mistakes and wide use of shortcuts. We refer to these two kinds of content as formal opinions and informal opinions: formal opinions use proper English with no grammatical mistakes, whereas informal opinions use improper English with grammatical mistakes and non-standard spellings, e.g., “4get” instead of “forget”. UK English is often said to be formal English and US English is considered to be informal English. Most customers use US English, so it is very important to consider these opinions too; otherwise they might be discarded as noise. Hence the proposed system deals with this aspect of informal reviews.

The purpose of the analysis is to extract, organize, and classify the information contained in the required documents. Each document is first cleaned of unwanted stop-words and words not listed in the dictionary. It is then classified as formal or informal depending on the words used, after which POS tags are applied. The feature and its accompanying opinion are thus identified and extracted. The output of the proposed system is a set of feature-opinion pairs from the review documents.
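The cleaning and formal/informal classification steps described above could look roughly like the following Python sketch. The stop-word list, the informal marker list, the rule that any contraction or informal marker makes a review informal, and the function name `preprocess` are all assumptions for illustration; dictionary lookup and POS tagging are left out.

```python
import re

# Hypothetical word lists, for illustration only.
STOP_WORDS = {"the", "is", "a", "an", "to", "of", "and", "it"}
INFORMAL_MARKERS = {"4get", "wanna", "gonna", "asap"}

# A token is a run of letters/digits, optionally followed by an
# apostrophe part, so that "don't" stays a single token.
TOKEN = re.compile(r"[a-z0-9]+(?:['’][a-z]+)?")


def preprocess(review):
    """Return (cleaned tokens, style), where style is 'formal' or 'informal'."""
    tokens = TOKEN.findall(review.lower())
    # Assumed rule: an informal marker or any apostrophe form
    # (e.g. a contraction) makes the whole review informal.
    informal = any(t in INFORMAL_MARKERS or "'" in t or "’" in t
                   for t in tokens)
    cleaned = [t for t in tokens if t not in STOP_WORDS]
    return cleaned, ("informal" if informal else "formal")


tokens, style = preprocess("Don't 4get the charger!")
```

One limitation of this assumed rule is that possessives such as "phone's" are also treated as informal, so a real system would need a more careful test.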

The complete architecture of the proposed opinion mining system consists of the following functional components:

review documents crawler,

document pre-processor,

document parser,

feature and opinion learner.

Figure: Architecture of the proposed opinion mining system

5.1.1 Review Documents Crawler and Document Pre-processor

For a target review site, the crawler retrieves review documents and stores them locally after filtering out markup language tags. The filtered review documents are divided into manageable record-size chunks whose boundaries are decided heuristically based on the presence of special characters. It has been found that the granularity of words, word stems, and word synonyms may cause problems when extracting real features and opinions. We have applied rigorous pre-processing to the review documents to filter out noisy reviews that are introduced either without any purpose or to increase or decrease the popularity of the product.

5.1.2 Document Parser

The function of this module is to facilitate the linguistic and semantic analysis of text for information component extraction. The module accepts the record-size chunks generated by the document pre-processor as input and assigns Parts-Of-Speech (POS) tags to each word. It also converts each sentence into a set of dependency relations between pairs of words. For POS analysis and dependency relation generation we have used the Stanford parser, which is a statistical parser. As observed in earlier work, noun phrases generally correspond to product features, adjectives refer to opinions, and adverbs are generally used as modifiers that represent the degree of expressiveness of opinions. We have therefore applied a POS-based filtering mechanism to exclude unwanted text from further processing.

5.1.3 Feature and Opinion Learner

This module is responsible for analyzing the dependency relations generated by the document parser and generating all possible information components from them. A dependency relation between a pair of words w1 and w2 is represented as relation_type(w1; w2), in which w1 is called the head or governor and w2 is called the dependent or modifier. The relationship between w1 and w2 can be of two types: i) direct and ii) indirect. In a direct relationship, one word depends on the other, or both of them depend directly on a third word, whereas in an indirect relationship one word depends on the other through other words, or both of them depend indirectly on a third word. An information component is defined as a triplet < f; m; o >, where f represents a feature, generally expressed as a noun phrase, o refers to an opinion, generally expressed as an adjective, and m is an adverb that acts as a modifier to represent the degree of expressiveness of the opinion. As pointed out in earlier work, opinion words and features are generally associated with each other, and consequently there exist inherent as well as semantic relations between them. Therefore, the feature and opinion learner module is implemented as a rule-based system that analyzes the dependency relations to identify information components in review documents. For example, consider the following opinion sentences related to the Nokia N95:

(i) The screen is very attractive and bright.

(ii) The sound sometimes comes out very clear.

(iii) Nokia N95 has a pretty screen.

(iv) Yes, the push email is the “Best” in the business.

In example (i), the screen is a noun phrase which represents a feature of Nokia N95, and the adjective word attractive can be extracted using nominal subject nsubj relation (a dependency relationship type used by Stanford parser) as an opinion. Further, using advmod relation the adverb very can be identified as a modifier to represent the degree of expressiveness of the opinion word attractive. In example (ii), the noun sound is a nominal subject of the verb comes, and the adjective word clear is adjectival complement of it. Therefore, clear can be extracted as opinion word for the feature sound. In example (iii), the adjective pretty is parsed as directly depending on the noun screen through amod relationship. If pretty is identified as an opinion word, then the word screen can be extracted as a feature; likewise, if screen is identified as a feature, the adjective word pretty can be extracted as an opinion. Similarly in example (iv), the noun email is a nominal subject of the verb is, and the word Best is direct object of it. Therefore, Best can be identified as opinion word for the feature word email.

Based on these and other observations, we have defined different rules to tackle different types of sentence structures to identify information components embedded within them.

Rule-1: In a dependency relation R, if there exist relationships nn(w1; w2) and nsubj(w3; w1) such that POS(w1) = POS(w2) = NN*, POS(w3) = JJ* and w1, w2 are not stop-words, or if there exists a relationship nsubj(w3; w4) such that POS(w3) = JJ*, POS(w4) = NN* and w3, w4 are not stop-words, then either (w1; w2) or w4 is considered as a feature and w3 as an opinion.

Rule-2: In a dependency relation R, if there exist relationships nn(w1; w2) and nsubj(w3; w1) such that POS(w1) = POS(w2) = NN*, POS(w3) = JJ* and w1, w2 are not stop-words, or if there exists a relationship nsubj(w3; w4) such that POS(w3) = JJ*, POS(w4) = NN* and w3, w4 are not stop-words, then either (w1; w2) or w4 is considered as the feature and w3 as an opinion. Thereafter, the relationship advmod(w3; w5) relating w3 with some adverbial word w5 is searched. If an advmod relationship is present, the information component is identified as < (w1; w2) or w4; w5; w3 >, otherwise as < (w1; w2) or w4; -; w3 >.

Rule-3: In a dependency relation R, if there exist relationships nn(w1; w2) and nsubj(w3; w1) such that POS(w1) = POS(w2) = NN*, POS(w3) = VB* and w1, w2 are not stop-words, or if there exists a relationship nsubj(w3; w4) such that POS(w3) = VB*, POS(w4) = NN* and w4 is not a stop-word, then we search for an acomp(w3; w5) relation. If an acomp relationship exists such that POS(w5) = JJ* and w5 is not a stop-word, then either (w1; w2) or w4 is assumed as the feature and w5 as an opinion. Thereafter, the modifier is searched and the information component is generated in the same way as in Rule-2.
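The three rules can be approximated by a short Python sketch. It assumes that each sentence has already been parsed and that its dependency relations are available as (relation, head, dependent) tuples with a separate POS dictionary. The relations in the examples are written by hand for sentences (i) and (ii) above, not produced by the Stanford parser, and the function name `extract_components` is hypothetical.

```python
def extract_components(deps, pos, stop_words=frozenset()):
    """Return <feature, modifier, opinion> triplets for one sentence.

    deps: list of (relation, head, dependent) tuples.
    pos:  dict mapping each word to its POS tag.
    """
    def dependents(rel, head):
        return [d for r, h, d in deps if r == rel and h == head]

    triplets = []
    for rel, head, noun in deps:
        if rel != "nsubj" or not pos[noun].startswith("NN") or noun in stop_words:
            continue
        # nn(w1; w2): a compound noun, so w2 is joined with w1 into one feature.
        parts = [w for w in dependents("nn", noun)
                 if pos[w].startswith("NN") and w not in stop_words]
        feature = " ".join(parts + [noun])
        if pos[head].startswith("JJ") and head not in stop_words:
            # Rules 1 and 2: an adjective governs the noun through nsubj.
            opinions = [head]
        elif pos[head].startswith("VB"):
            # Rule 3: a verb governs the noun; the opinion is its acomp.
            opinions = [w for w in dependents("acomp", head)
                        if pos[w].startswith("JJ") and w not in stop_words]
        else:
            opinions = []
        for opinion in opinions:
            # Optional adverbial modifier of the opinion word (advmod).
            modifiers = dependents("advmod", opinion)
            triplets.append((feature, modifiers[0] if modifiers else None, opinion))
    return triplets


# (i) "The screen is very attractive and bright." (hand-written relations)
deps_i = [("nsubj", "attractive", "screen"), ("advmod", "attractive", "very")]
pos_i = {"screen": "NN", "attractive": "JJ", "very": "RB"}

# (ii) "The sound sometimes comes out very clear." (hand-written relations)
deps_ii = [("nsubj", "comes", "sound"), ("acomp", "comes", "clear"),
           ("advmod", "clear", "very")]
pos_ii = {"sound": "NN", "comes": "VBZ", "clear": "JJ", "very": "RB"}

result_i = extract_components(deps_i, pos_i)
result_ii = extract_components(deps_ii, pos_ii)
```

A missing modifier is represented here as `None`, corresponding to the "-" placeholder in Rule-2.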

The need to identify and interpret possible differences in the linguistic style of texts, such as formal or informal, is increasing as more people use the Internet as their main research resource. Different factors affect style, including the words and expressions used and syntactical features. Vocabulary choice is likely the biggest style marker. In general, longer words and verbs of Latin origin are formal, while phrasal verbs and idioms are informal (Park, 2007). There are also many formal/informal style equivalents that can be used in writing.

The formal style is used in most writing and business situations, and when speaking to people with whom we do not have close relationships. Some characteristics of this style are long words and use of the passive voice. The informal style is mainly used for casual conversation, such as at home between family members, and is used in writing only when there is a personal or close relationship, such as that of friends and family. Some characteristics of this style are word contractions such as “won’t”, abbreviations like “phone”, and short words. We discuss the main characteristics of both styles below.

5.2 Characteristics of Informal Style Text

The informal style has the following characteristics:

1. It uses a personal style: the first and second person (“I” and “you”) and the active voice (e.g., “I have noticed that...”).

2. It uses short simple words and sentences (e.g., “latest”).

3. It uses contractions (e.g., “won’t”).

4. It uses many abbreviations (e.g., “TV”).

5. It uses many phrasal verbs in the text (e.g., “find out”).

6. Words that express rapport and familiarity are often used in speech, such as “brother”, “buddy” and “man”.

7. It uses a subjective style, expressing opinions and feelings (e.g., “pretty”, “I feel”).

8. It uses vague expressions, personal vocabulary and colloquialisms (slang words are accepted in spoken text, but not in written text, e.g., “wanna” = “want to”).

5.3 Characteristics of Formal Style Text

The formal style has the following characteristics:

1. It uses an impersonal style: the third person (“it”, “he” and “she”) and often the passive voice (e.g., “It has been noticed that...”).

2. It uses complex words and sentences to express complex points (e.g., “state-of-the-art”).

3. It does not use contractions.

4. It does not use many abbreviations, though there are some abbreviations used in formal texts, such as titles with proper names (e.g., “Mr.”) or short names of methods in scientific papers (e.g., “SVM”).

5. It uses appropriate and clear expressions, and precise educational, business and technical vocabulary (often of Latin origin).

6. It uses polite words and formulae, such as “Please”, “Thank you”, “Madam” and “Sir”.

7. It uses an objective style, citing facts and references to support an argument.

8. It does not use vague expressions and slang words.

Feature | Informal Style | Formal Style
Contractions | Use contractions, e.g., “Many patients don’t listen to their doctors.” | Avoid contractions, e.g., “Many patients do not listen to their doctors.”
Phrasal verbs | Use phrasal verbs, e.g., “I looked up information about nursing positions.” | Avoid phrasal verbs, e.g., “I researched information about nursing positions.”
Personal/impersonal pronouns | Use personal pronouns, e.g., “I think this is an effective plan.” | Use impersonal pronouns, e.g., “This could be an effective plan.”
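The contrasts in the table above can be approximated with a few surface checks. The following Python sketch is illustrative only: the marker lists and the function names `style_markers` and `looks_informal` are assumptions, not part of the project, and real formality detection would need far richer features than these counts.

```python
import re

# Small illustrative marker lists (assumptions, not exhaustive).
PERSONAL_PRONOUNS = {"i", "you", "we", "me", "my", "your"}
PHRASAL_VERBS = {"look up", "looked up", "find out", "go up", "ask for"}


def style_markers(sentence):
    """Count simple informal-style markers in one sentence."""
    text = sentence.lower().replace("’", "'")
    words = re.findall(r"[a-z']+", text)
    return {
        # Words with an inner apostrophe, e.g. "don't" or "i'm".
        # Note: possessives such as "doctor's" are counted too.
        "contractions": sum(1 for w in words if re.fullmatch(r"[a-z]+'[a-z]+", w)),
        "personal_pronouns": sum(1 for w in words if w in PERSONAL_PRONOUNS),
        "phrasal_verbs": sum(1 for p in PHRASAL_VERBS if re.search(rf"\b{p}\b", text)),
    }


def looks_informal(sentence):
    """Treat a sentence as informal if at least one marker is present."""
    return any(style_markers(sentence).values())
```

With these lists, “Many patients don’t listen to their doctors.” is flagged because of its contraction, while “Many patients do not listen to their doctors.” is not. A sentence can still use a first-person pronoun in otherwise formal writing, so these counts are better read as hints than as a classification.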

List of Informal and Formal Words

Informal | Formal
about | approximately
and | in addition
anybody | anyone
ask for | request
boss | employer
but | however
buy | purchase
end | finish
enough | sufficient
get | obtain
go up | increase
have to | must

Contractions List

Informal | Formal
aren’t | are not
can’t | cannot
didn’t | did not
hadn’t | had not
hasn’t | has not
I’m | I am

Abbreviations List

Informal | Formal
asap | as soon as possible
grad. | graduate
HR | Human Resources
Feb. | February
lab | laboratory
temp | temperature

Short Code List

Msg | Message
Luv | Love
Wlcm | Welcome
Plzz | Please
Whn | When
Wld | Would
Btw | By the way
Bcoz | Because
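The lists above lend themselves to a simple lookup-based normalisation step. The Python sketch below is only an illustration under stated assumptions: it copies a handful of entries from the lists, the names `FORMAL_EQUIVALENTS` and `normalize` are hypothetical, and it does not reproduce the project's Java implementation.

```python
import re

# A few entries copied from the lists above; keys are lower-case.
FORMAL_EQUIVALENTS = {
    "aren't": "are not", "can't": "cannot", "didn't": "did not",
    "hasn't": "has not", "i'm": "I am",
    "asap": "as soon as possible", "temp": "temperature",
    "msg": "message", "luv": "love", "plzz": "please",
    "btw": "by the way", "bcoz": "because",
}

# A token is a run of letters and apostrophes; everything else is left untouched.
TOKEN = re.compile(r"[A-Za-z']+")


def normalize(text):
    """Replace informal tokens with their formal equivalents."""
    # Treat typographic and ASCII apostrophes alike.
    text = text.replace("’", "'")

    def swap(match):
        word = match.group(0)
        # Unknown words are returned unchanged.
        return FORMAL_EQUIVALENTS.get(word.lower(), word)

    return TOKEN.sub(swap, text)
```

For example, `normalize("plzz send msg asap")` gives "please send message as soon as possible". The sketch ignores capitalisation of the replacement and word-sense ambiguity (for instance "temp" meaning a temporary worker), which a fuller implementation would have to handle.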

Chapter 6. Implementation

6.1 Hardware Resources Required

Hardware Requirements

1 | RAM | 1 GB or more
2 | Processor | Intel dual core or later
3 | OS | Microsoft Windows XP or Microsoft Windows 7

6.2 Software Resources Required

Software Requirements

1 | Software for development | NetBeans, NLP (Stanford Parser), JDK
2 | Platform | Windows
3 | Dataset | Review Website

6.3 Selection of a platform

A computing platform includes a hardware architecture and a software framework (including application frameworks); the combination allows software to run. Typical platforms include a computer architecture, an operating system and runtime libraries. A platform is a crucial element in software development and can be defined simply as a place to launch software. The platform provider gives the software developer an assurance that code will run consistently as long as the platform is in place.

The platform used for this project is Windows. It was selected because it is user-friendly, available in several versions, and widely known and used all over the world. Microsoft has made several improvements that make it much easier to use, and although it may arguably not be the easiest operating system, it is still easier to use than Linux. Windows comes with built-in Internet Connection Firewall software that provides a defence against security threats when connected to the Internet, particularly over always-on connections such as cable modems and DSL. Windows also offers thousands of security-related settings that can be configured individually.

Windows versions based on Windows NT use a file system permission scheme referred to as AGLP (Accounts, Global, Local, Permissions), so creating files and using them dynamically can be simpler on Windows than on other platforms. The other main reason for selecting Windows as the computing platform is the availability of free and open-source software, which gives the developer a good opportunity to explore and experiment. Because of the large number of Microsoft Windows users, there is also a much larger selection of software programs and utilities available for Windows.

6.4 Software used

6.4.1 NetBeans

NetBeans is an integrated development environment (IDE) for developing primarily with Java, but also with other languages, in particular PHP, C/C++ and HTML5. It is also an application framework for Java desktop applications and others. The NetBeans IDE is written in Java and can run on Windows, OS X, Linux, Solaris and other platforms supporting a compatible JVM.

The NetBeans Platform allows applications to be developed from a set of modular software components called modules. Applications based on the NetBeans Platform (including the NetBeans IDE itself) can be extended by third-party developers.

NetBeans IDE provides first-class, comprehensive support for the newest Java technologies and the latest Java specification enhancements before other IDEs. It is the first free IDE providing support for JDK 8 previews, JDK 7, Java EE 7 (including its related HTML5 enhancements) and JavaFX.

With its constantly improving Java Editor, many rich features and an extensive range of tools, templates and samples, NetBeans IDE sets the standard for developing with cutting edge technologies out of the box.

Figure: NetBeans Code Editor

6.4.2 NLP Stanford Parser

A natural language parser is a program that works out the grammatical structure of sentences, for instance, which groups of words go together (as "phrases") and which words are the subject or object of a verb. Probabilistic parsers use knowledge of language gained from hand-parsed sentences to try to produce the most likely analysis of new sentences. These statistical parsers still make some mistakes, but commonly work rather well. Their development was one of the biggest breakthroughs in natural language processing.

The parser runs under Windows and Unix/Linux/MacOSX and requires a Java Runtime Environment (JRE) (Java 1.5 or higher).

Parsers for different languages such as Chinese, Arabic, English and German are provided. In most cases, the probabilistic context-free grammar (PCFG) parser will be sufficient, since it is fast, reasonably accurate and uses a moderate amount of memory. A PCFG model is not provided for parsing German texts. The parser model called FACTORED is more complex and requires more memory because it contains two grammars and leads the system to run three parsers.
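To illustrate what "most likely analysis" means for a PCFG, the following Python sketch scores candidate parse trees by multiplying the probabilities of the rules they use and keeps the higher-scoring one. The toy grammar, its probabilities and the function names are invented for the example; this is not how the Stanford Parser is invoked, and a real parser searches over all possible trees rather than scoring a given list.

```python
from math import prod

# Toy PCFG: probability of each rule, keyed by (parent, children).
# The values are made up for illustration.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("VB", "NP")): 0.6,
    ("VP", ("VP", "PP")): 0.4,
    ("NP", ("NP", "PP")): 0.3,
    ("NP", ("NN",)): 0.7,
    ("PP", ("IN", "NP")): 1.0,
}


def rules_in(tree):
    """Yield (parent, children) for every internal node.

    A tree is a tuple (label, child, child, ...); leaves are POS-tag strings.
    """
    if isinstance(tree, str):
        return
    parent, *children = tree
    yield parent, tuple(c if isinstance(c, str) else c[0] for c in children)
    for child in children:
        yield from rules_in(child)


def tree_probability(tree):
    """Probability of a tree: the product of the probabilities of its rules."""
    return prod(RULE_PROB[rule] for rule in rules_in(tree))


def most_likely(trees):
    """Pick the candidate tree with the highest probability."""
    return max(trees, key=tree_probability)


# Two analyses of the tag sequence NN VB NN IN NN:
# the prepositional phrase attaches to the verb phrase (tree_a) or to the noun (tree_b).
tree_a = ("S", ("NP", "NN"),
          ("VP", ("VP", "VB", ("NP", "NN")), ("PP", "IN", ("NP", "NN"))))
tree_b = ("S", ("NP", "NN"),
          ("VP", "VB", ("NP", ("NP", "NN"), ("PP", "IN", ("NP", "NN")))))
best = most_likely([tree_a, tree_b])
```

Under this toy grammar the verb-phrase attachment (`tree_a`, probability about 0.082) is preferred over the noun attachment (`tree_b`, about 0.062).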

POS Tag | Description | Example
CC | coordinating conjunction | and
CD | cardinal number | 1, third
DT | determiner | the
EX | existential there | there is
FW | foreign word | d’oeuvre
IN | preposition/subordinating conjunction | in, of, like
JJ | adjective | big
JJR | adjective, comparative | bigger
JJS | adjective, superlative | biggest
LS | list marker | 1)
MD | modal | could, will
NN | noun, singular or mass | door
NNS | noun, plural | doors
NNP | proper noun, singular | John
NNPS | proper noun, plural | Vikings
PDT | predeterminer | both the boys
POS | possessive ending | friend’s
PRP | personal pronoun | I, he, it
PRP$ | possessive pronoun | my, his
RB | adverb | however, usually, naturally, here, good
RBR | adverb, comparative | better
RBS | adverb, superlative | best
RP | particle | give up
TO | to | to go, to him
UH | interjection | uhhuhhuhh
VB | verb, base form | take
VBD | verb, past tense | took
VBG | verb, gerund/present participle | taking
VBN | verb, past participle | taken
VBP | verb, sing. present, non-3rd person | take
VBZ | verb, 3rd person sing. present | takes
WDT | wh-determiner | which
WP | wh-pronoun | who, what
WP$ | possessive wh-pronoun | whose
WRB | wh-adverb | where, when

Chapter 7. Conclusion

Emotion and mood often affect a user's behavior and his or her interaction with other users. Positivity and negativity are two important attributes of emotion and mood. Based on our analysis, we explored several key differences between positive and negative users of online social networks. We introduced a methodology for tweet partitioning and user partitioning, which led to seven groups of users ranging from highly negative to highly positive. Our findings show that, compared with neutral and positive users, negative users are not interested in sharing their negativity on social media. In addition, positive users are more likely to make friendships with negative users, based on the number of followers and followees. An interesting result is that negative users retweet much more than positive users. They use Twitter as a tool for social awareness and they need to gain emotional support. Retweeting positive tweets makes them more positive. Also, negative users try to avoid interacting through replies to tweets, since they believe it can make them more negative. In addition, positive users do not interact much with negative users either, which means everyone tries to avoid interacting with negative users, perhaps because such interactions make them more negative. Our findings can be used in developing tools for automatically identifying positive and negative users based on user activities on Twitter. The key distinguishing criteria are how many followers and followees a user has, the volume and type (positive or negative) of tweets a user sends out, and the volume and type of tweets a user retweets and replies to. Our future work involves building a classifier based on the findings of this paper.
