[IEEE 2013 International Conference on Cloud and Green Computing (CGC) - Karlsruhe, Germany...

Identifying Relevant Users and Groups in the Context of Credit Analysis Based on Data from Twitter

Danyllo W. A., Alisson V. B., Alexandre N. D., Moacir L. M. J., Jansepetrus B. P.

Center of Informatics – Federal University of Paraiba. João Pessoa, PB – Brazil.

{danyllowagneralbuquerque, alisson.brito, alexandrend, moacir.lopes.jr, jansebp}@gmail.com

Roberto Felício Oliveira

Department of Informatics – State University of Goias. Posse, GO – Brazil.

[email protected]

Abstract—In recent years several online social networks have emerged with very different purposes. Their popularity is associated with a common functionality: they give users new ways to interact, produce content and comment on various subjects and interests. This makes social networks well suited to research on the organization and management of large amounts of data, and an ideal environment for knowledge extraction and the application of data mining techniques. In this sense, this study collected data from the social network Twitter and compared it with data from a financial institution in order to model the network and analyze their similarities. Three thousand Twitter users were analyzed and 504 were matched with the database of a credit analysis company. The results showed that most of those users have more credit restrictions than their neighbors, and that users with no restrictions normally have neighborhoods with no credit restrictions as well.

Keywords—Social Network; Analysis; financial; credit.

I. INTRODUCTION

Different networks have emerged from social interactions carried out through Internet tools such as Twitter, MySpace and Facebook. By using the spaces created by these social networks, individuals who have interests or goals in common can share ideas and information about many subjects.

The concept of social network has attracted considerable attention from the social and behavioral sciences community, which uses it to analyze the relationships between social entities and their implications [1]. According to Chakrabarti [2], Social Network Analysis (SNA) makes it possible to understand and monitor more effectively the spread of information and the interaction between the individuals and groups that form a particular network.

In contrast, credit analysis is an organized process of analyzing data in order to grant or deny a person access to credit. This process covers a broad framework and is more complex than simply analyzing a customer's creditworthiness and financial data for decision-making [3].

This study uses a methodology supported by open-source tools to search for, collect and compare Twitter users with their respective credit profiles and restrictions. The collected data was then compared with the database of a partner company specialized in credit analysis. Finally, concepts from graph theory and complex networks were used to identify the structural characteristics of the network and the interaction of groups and users, in order to improve the comparison with the credit profile and assist the credit granting process.

II. RELATED WORKS

Since its creation, microblogging has become an increasingly used tool for communication on the Internet. Several works use Twitter data to address a wide variety of issues.

Lampos and Cristianini [4] and Cheng et al. [5] investigated Twitter as a tool to monitor and capture emerging trends and patterns during critical periods. Both works used the context of the initial outbreak of H1N1 influenza.

Bruns and Liang [6] and Cheong and Cheong [7] analyzed Twitter-based communication in the context of natural disasters and floods in Australia. Other studies [8][9] examined the use of Twitter during the major civil unrest in the UK and the Arab Spring riots.

In the commercial area, Stieglitz and Krüger [10] investigated the crisis of the automotive manufacturer Toyota, and Park et al. [11] investigated the crisis concerning the brand Domino's Pizza. Both studies used a methodology based on analyzing the spread of bad news through Twitter.

Regarding the use of Twitter in elections, we mention works from different parts of the world, such as those on elections in America [12], Sweden [13] and Brazil.

Directly related to this research, the work of Filho [14] used questionnaires to investigate how various enterprises deal with the new reality of virtual interaction, including how they monitor social networks and how organizations interact in these virtual environments. The work of Borrasca et al. [15] investigated the importance of social network relationships in the bank credit market. Sixteen interviews were conducted with bank managers to analyze how personal information about customers influences the decision to grant credit.

Many other works not mentioned here use Twitter as a data source, mainly because its data is easily accessible through an API. In the specific case of our study in Brazil, we identified that the users of this social network are mostly from the higher social classes. This profile is similar to that of the group of people who have credit restrictions.

2013 IEEE Third International Conference on Cloud and Green Computing

978-0-7695-5114-2/13 $26.00 © 2013 IEEE

DOI 10.1109/CGC.2013.102

III. THEORETICAL GROUNDING

A. Social Networks

A network is a set of elements, which we call vertices or nodes, with connections between them, called edges [16]. The topological structure of a network can be modeled as a graph. In our study, which focuses on social networks, we can conveniently represent people as nodes and their possible relationships or common interests as edges, characterizing each of these substructures with attributes as convenient [17].

To analyze the relationships within a set of data arranged as a network, we can make use of graph theory. Social Network Analysis (SNA) is based on the formal description of networks through the various measures used to characterize complex networks [16][18]. It provides models and vocabularies to describe social structures, as well as methods to analyze them. This allows concepts such as "social role" or "social group" to be defined and therefore enables a standardization that makes the studies based on it more consistent and easily comparable [2][18].

Regarding the measures for complex networks, we use the metrics presented by Guimarães and Melo [19]: in-degree [20], out-degree [21], closeness [22], betweenness [23] and PageRank [24]. Table I presents the description of these metrics.

TABLE I. DESCRIPTION OF SNA METRICS

| Metric | Description | Calculation |
|---|---|---|
| In-degree centrality | The number of arcs arriving at a node in a given type of network. | Degree(v) is the degree of node v (in or out) and n is the number of nodes in the network. |
| Out-degree centrality | The number of arcs leaving a node in a given type of network. | Same as for in-degree centrality. |
| Closeness centrality | Measures how close a node is to all other nodes in the network. | $C_C(n_i) = 1 / \sum_{j \neq i} d(n_i, n_j)$, where $d(n_i, n_j)$ is the shortest path between $n_i$ and $n_j$. The closeness of a node is the inverse of the sum of the distances from $n_i$ to all other nodes, with $i \neq j$. |
| Betweenness centrality | An indicator of how central a particular node is in relation to the other nodes in the network. | $\sigma(u,v)$ is the number of paths between u and v, and $\sigma_e(u,v)$ is the number of paths between u and v that include e. |
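As an illustration of the metrics in Table I, the following Python sketch computes in-degree, out-degree and closeness on a small directed graph. The toy arcs, node names and helper functions are hypothetical; the study itself relied on JUNG for these calculations. Closeness follows the verbal definition in the table (inverse of the sum of shortest-path distances), and ignoring unreachable nodes is an assumption of this sketch.

```python
from collections import deque

# Hypothetical toy network: an arc (u, v) means "u points to v".
ARCS = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]


def build_successors(arcs):
    """Map every node to the set of nodes it points to."""
    succ = {}
    for u, v in arcs:
        succ.setdefault(u, set()).add(v)
        succ.setdefault(v, set())
    return succ


def in_degree(succ, node):
    """Number of arcs arriving at `node`."""
    return sum(1 for targets in succ.values() if node in targets)


def out_degree(succ, node):
    """Number of arcs leaving `node`."""
    return len(succ[node])


def distances_from(succ, source):
    """Breadth-first search: shortest-path length to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in succ[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def closeness(succ, node):
    """Inverse of the sum of distances from `node` to the nodes it can reach.

    Assumption of this sketch: unreachable nodes are ignored, and a node
    that reaches nobody gets closeness 0.
    """
    dist = distances_from(succ, node)
    total = sum(d for target, d in dist.items() if target != node)
    return 1.0 / total if total else 0.0


graph = build_successors(ARCS)
for n in sorted(graph):
    print(n, in_degree(graph, n), out_degree(graph, n), round(closeness(graph, n), 3))
```

In this toy graph node c receives the most arcs, while node a is the closest to the nodes it can reach.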

B. Concepts of financial credit

According to Schrickel [25], credit is every act by which someone is willing to give up, even temporarily, part of their equity to another person, with the expectation that this portion will return to their full possession after a stipulated time. It can take the form of money (a money loan) or of an asset (a loan of material with monetary value).

Credit can enable companies to increase their level of activity; stimulate consumption and thereby influence demand; fulfill the social function of helping people obtain housing, property and even food; and facilitate the implementation of projects for which companies do not have sufficient resources of their own. Finally, credit can also leave companies or individuals highly leveraged and can be a component of an inflationary process [26].

Some segments of the economy, such as banks and credit institutions in general, whose main product is money, become mediators of other economic activities, which makes analyzing and granting credit their main business [27][28]. This task involves the ability to make a credit decision in a context of uncertainty, constant change and incomplete information. This ability depends on the capacity to perform a logical, albeit complex, analysis and to reach a conclusion that is clear, precise and easy to implement [25].

In general, credit analysis is a process that involves gathering the information available about a credit applicant, which can be an individual or an entity, in order to decide whether or not to grant credit. The decision is based on an analysis of the applicant, including balance sheets, billings, evaluation questionnaires, the sector of activity in which it operates and so on.

IV. METHODOLOGY OF ANALYSIS

This section presents the methodological process as a systematic sequence of the activities illustrated in Figure 1: Planning, Monitoring, Graph Generation, SNA Metrics and Analysis.

Fig. 1. Proposed SNA Methodology.

The five stages of the SNA methodology cover the fundamental activities needed to obtain data for the experiments. They are detailed below:

(1) Planning - In this step, the criteria used in monitoring are defined, such as the groups to be searched.
(2) Monitoring - This phase collects the data that will form the basis of the experiment.
(3) Graph Generation - From the data obtained in the previous step, a graph is built that represents the relationships and interactions among the monitored individuals.
(4) SNA Metrics - The main objective of this stage is to apply social network analysis metrics in order to recognize patterns and obtain individual and overall results for the network members.
(5) Analysis - The generated results are studied in detail, checking whether the hypotheses and objectives proposed at the beginning of the process were actually met.

As Figure 1 shows, the steps occur in a cyclic process. After every pass over the cycle, we check whether all proposed targets were met. If not, another cycle starts, adapting the steps of the previous one.

V. DEVELOPMENT

A. Software for data collection

During this study we developed an application called AppSNA. Its tools were used in a chain, so that the output of each tool becomes the input of the next. Table II lists the technologies used and their versions. To facilitate the replication of this study, we prioritized the use of open-source tools.

TABLE II. TECHNOLOGIES USED IN THE CONSTRUCTION OF THE SOFTWARE

| Layer | Description | Technology / Tool |
|---|---|---|
| (all layers) | Programming language | Java JDK 1.6_22 |
| Data Layer | Database | MySQL 5.1 |
| Data Layer | Connector | MySQL Connector Java 5.1.6 |
| Data Layer | Framework | JDBC 4 |
| Logical Layer | Twitter API | Twitter API 1.0 |
| Logical Layer | Java Twitter API | Twitter4J 3.0.0 |
| Logical Layer | Graph analysis | JUNG 2.0.1 |
| Presentation Layer | Graph visualization | JUNG 2.0.1, Prefuse Beta |

As Table II shows, the software is built in three separate logical layers. The Data Layer basically uses a relational database along with its connector for the Java language. For persistence we used JDBC, aiming at higher performance.

The Logical Layer defines the use cases and business logic using Java language features. To build it, we used the services of the Twitter API (aided by Twitter4J) to capture the data from the microblog. We also used the JUNG API, which provides the means necessary to calculate the SNA metrics.

The Presentation Layer works with GraphML¹ files. The heuristics needed to generate these files are run in the previous layer, and the files are then consumed in this layer by visualization APIs (e.g. JUNG² and Prefuse³) to display the graph with the layouts best suited to a given structure. This solution makes our application independent of presentation tools: once we have these files, we can use other APIs or even standalone tools (such as Gephi⁴ and PAJEK⁵), since this structured data format is widely used in SNA tools. The tool is described in more detail in [29].

¹ http://graphml.graphdrawing.org/
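To make the role of the GraphML file concrete, the sketch below serializes a tiny undirected graph to GraphML using only the Python standard library. It is not the AppSNA exporter (which is written in Java and not listed in the paper). The `identified` node attribute, the function name and the node id 9001 are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def to_graphml(nodes, edges):
    """Serialize an undirected graph to a GraphML string.

    nodes: dict mapping node id -> bool (hypothetical "identified" flag)
    edges: iterable of (source id, target id) pairs
    """
    # Emit GraphML as the default namespace instead of an "ns0:" prefix.
    ET.register_namespace("", GRAPHML_NS)
    root = ET.Element(f"{{{GRAPHML_NS}}}graphml")
    # Declare one boolean node attribute named "identified".
    ET.SubElement(root, f"{{{GRAPHML_NS}}}key", {
        "id": "d0",
        "for": "node",
        "attr.name": "identified",
        "attr.type": "boolean",
    })
    graph = ET.SubElement(root, f"{{{GRAPHML_NS}}}graph",
                          {"id": "G", "edgedefault": "undirected"})
    for node_id, identified in nodes.items():
        node = ET.SubElement(graph, f"{{{GRAPHML_NS}}}node", {"id": str(node_id)})
        data = ET.SubElement(node, f"{{{GRAPHML_NS}}}data", {"key": "d0"})
        data.text = "true" if identified else "false"
    for i, (u, v) in enumerate(edges):
        ET.SubElement(graph, f"{{{GRAPHML_NS}}}edge",
                      {"id": f"e{i}", "source": str(u), "target": str(v)})
    return ET.tostring(root, encoding="unicode")


# Two node ids taken from the figures plus one made-up follower (9001).
xml_text = to_graphml({"3246": True, "4412": True, "9001": False},
                      [("3246", "9001"), ("4412", "9001")])
print(xml_text)
```

Because the output is plain GraphML, the same string can be written to a file and opened in any tool that reads the format.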

B. Application of the methodology

Twitter was chosen for two reasons: (1) its data is easily accessible through an API; (2) in Brazil, the users of this network have a social profile (high social standard) similar to that of the people who make up the universe of debtors. This hypothesis is supported by the surveys conducted by CETIC (profile of social network users) [30] and CNI (profile of debtors) [31].

The experiment began with a random collection of people registered on Twitter whose location attribute contained terms related to the city of João Pessoa in Brazil. In this way, AppSNA identified approximately three thousand members.

We then correlated the collected data with the data in the database of the credit company that participated in this research. For example, from the Twitter screen name "@Danyllo_Wagner" we could query the database for people named "Danyllo Wagner" who lived in João Pessoa. The biography on Twitter was also used. For instance, if the account had the value "Software Engineer and Systems Analyst", this information was also used to refine the search in the database of the credit analysis company.
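The name-based lookup in this example can be sketched as follows. This is an assumption-laden illustration, not the query used in the study: the table `credit_records`, its columns, the sample rows and both function names are hypothetical, and SQLite stands in for the MySQL database so that the snippet is self-contained. The biography-based refinement is not covered.

```python
import re
import sqlite3


def screen_name_to_query(screen_name):
    """Turn a screen name such as '@Danyllo_Wagner' into 'Danyllo Wagner'."""
    name = screen_name.lstrip("@").replace("_", " ")
    # Also split at lower-to-upper case boundaries, e.g. 'DanylloWagner'.
    name = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)
    return " ".join(name.split())


def find_candidates(conn, screen_name, city):
    """Return full names in `city` that start with the derived name."""
    pattern = screen_name_to_query(screen_name) + "%"
    rows = conn.execute(
        "SELECT full_name FROM credit_records "
        "WHERE full_name LIKE ? AND city = ? ORDER BY full_name",
        (pattern, city),
    )
    return [row[0] for row in rows]


# Hypothetical in-memory table with made-up rows; the real schema is not
# described in the paper.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE credit_records (full_name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO credit_records VALUES (?, ?)",
    [
        ("Danyllo Wagner Example", "João Pessoa"),
        ("Danyllo Wagner Example", "Recife"),
        ("Maria Example", "João Pessoa"),
    ],
)
print(find_candidates(conn, "@Danyllo_Wagner", "João Pessoa"))
```

A prefix match on the name alone can return several people, so extra fields such as the biography are needed to narrow the candidates.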

The data returned from the credit company's database consisted of personal data such as full name, date of birth, CPF (a personal document similar to the Social Security card in the USA), postal code, address, amount of debts and so on.

From the initial 3000 users, it was possible to match 26 people, i.e., only 0.87% of the users were correctly correlated with the database of the financial institution. With these 26 users defined, AppSNA collected data from them and their followers in order to create a complete relationship network. An undirected graph was then used to model the friendships (Figure 2); it contains 26 communities (nodes in shades of blue, with size proportional to the size of the community) and about 6518 nodes. After that, these users were matched again, and a total of 504 Twitter users were found in the credit database. They served as the basis for the experiments presented below.
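A minimal sketch of this step, under stated assumptions: the follower lists below are invented, and `build_friendship_graph` and `match_rate` are hypothetical helpers rather than AppSNA functions. The sketch builds an undirected adjacency map from seed users and their followers and reproduces the 0.87% match rate from the counts given in the text.

```python
def build_friendship_graph(followers_of):
    """Build an undirected adjacency map from {seed user: [follower, ...]}."""
    adjacency = {}
    for seed, followers in followers_of.items():
        adjacency.setdefault(seed, set())
        for follower in followers:
            # Undirected edge: store it in both directions.
            adjacency[seed].add(follower)
            adjacency.setdefault(follower, set()).add(seed)
    return adjacency


def match_rate(matched, total):
    """Share of collected users found in the credit database, in percent."""
    return 100.0 * matched / total


# Hypothetical follower lists for two seed users.
followers_of = {"seed_1": ["u1", "u2", "u3"], "seed_2": ["u3", "u4"]}
graph = build_friendship_graph(followers_of)

print(len(graph), sorted(graph["u3"]))
print(round(match_rate(26, 3000), 2))  # 26 matches out of 3000 users
```

In this toy input, user u3 follows both seeds and therefore links their two communities.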

² http://jung.sourceforge.net/
³ http://prefuse.org/
⁴ http://gephi.org/
⁵ http://pajek.imfm.si


Fig. 2. Communities formed from the 26 initial users.

VI. RESULTS AND DISCUSSION

Figure 3 shows which users were identified in the database of the credit institution and which were not. This network provides an overview of the graph with respect to the identified users, who are the basis for the subsequent analyses.

Fig. 3. Users identified (blue) and not identified (red) in the database.

Most of the nodes are red, representing users who were not identified. Spread among them are blue nodes, about 500 people corresponding to 7.74% of the initial data, representing users who were identified in the database.

We emphasize that the database of the credit institution contains both people who have credit restrictions and people who do not. Based on this information, we sought to identify, in each of the 26 communities, the number of people registered in the database and the relations among them.

Fig. 4. Users with the respective number of followers identified in the database.

The graph in Figure 4 represents, for each user, the number of followers with identified credit restrictions. With this information it is possible to analyze each person in relation to the group. For instance, node (3246), circled in red at the top of the graph, has 13 followers with credit constraints, while node (4412), circled in red at the bottom of the graph, has only one. It is also visible that most of the nodes in this graph are blue, representing the absence of outgoing links to other network users.

This analysis shows the number of followers of each user that were identified in the database of the credit company. This individual information about each node can lead to important conclusions. Further individual information can be observed in the relation between users and their credit restrictions. The approximately 500 nodes identified in blue in Figure 2 were analyzed in search of a common pattern.

Fig. 5. Representation of the quantity of negative certificates by users.

In Figure 5, users are ranked according to the number of negative certificates (occurrences of debts related to their profile) in the database of the credit institution. For example, node (2504), circled in red at the top, has no negative certificates. However, 9 of its 13 followers, i.e. 69%, have negative certificates. Node (5004), circled in red at the bottom, has 9 negative certificates.

Figure 6 shows the number of negative certificates in the neighborhood of each individual. For example, the neighbors of node (5466), circled in red at the top, have 45 negative certificates, while the neighbors of node (3246), circled in red at the bottom, have 266 negative certificates.
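The per-node counts behind Figures 4 to 6 are simple aggregations over a node's neighborhood. The sketch below illustrates them on invented data: the adjacency map, the certificate counts and the three function names are hypothetical and do not come from the credit database.

```python
def followers_with_restrictions(graph, certificates, user):
    """Count neighbors of `user` that have at least one negative certificate."""
    return sum(1 for n in graph[user] if certificates.get(n, 0) > 0)


def neighborhood_certificates(graph, certificates, user):
    """Sum the negative certificates held by the neighbors of `user`."""
    return sum(certificates.get(n, 0) for n in graph[user])


def restricted_share(restricted, followers):
    """Percentage of followers that have negative certificates."""
    return 100.0 * restricted / followers


# Hypothetical neighborhood: one seed user with four followers.
graph = {"seed": {"a", "b", "c", "d"}}
# Hypothetical number of negative certificates per follower ("d" is unknown).
certificates = {"a": 2, "b": 0, "c": 5}

print(followers_with_restrictions(graph, certificates, "seed"))
print(neighborhood_certificates(graph, certificates, "seed"))
# The 9-of-13 example from the text corresponds to about 69%.
print(round(restricted_share(9, 13)))
```

Followers missing from the certificate map are treated as having no negative certificates, which is a simplifying assumption of this sketch.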

Fig. 6. Amount of negative certificates from the neighborhood of 26 users.

Preliminary conclusions of this study indicate that, among the 26 communities found (the 26 initial users and their followers), those with the highest number of people with credit restrictions are communities whose central node has some credit restriction.

All results obtained by this research, i.e., the identification of people with credit restrictions and their relationships with other users, were used to look for a pattern between the 26 people initially identified and their neighborhoods, through the formulations described in Table III. Table IV was generated by applying the bad debt formulas and contains the average degree of bad debt of each of the 26 initial Twitter users and of their identified neighborhoods.

TABLE III. CALCULATIONS OF BAD DEBT

| Metric | Description | Calculation |
|---|---|---|
| Bad debt of users (BDU) | Measures the degree of bad debt of users in the network. | Where: GI_i is the value of bad debt of user i; Qn_i is the number of negative certificates of user i; Qng_i is the number of negative certificates of the group to which i belongs. |
| Bad debt of neighborhood (BDN) | Measures the degree of bad debt in the neighborhood of all users in the network. | Where: GIV_i is the value of bad debt of the neighborhood to which i belongs; QtnV_i is the number of negative certificates of the neighborhood to which i belongs; QtpV_i is the number of users that make up the neighborhood of user i. |

TABLE IV. BAD DEBT INFORMATION

| User | BDU | BDN | User | BDU | BDN |
|---|---|---|---|---|---|
| 916 | 0.194 | 0.174 | 3519 | 0 | 0.000 |
| 923 | 1 | 0.121 | 3712 | 0.578 | 0.040 |
| 1122 | 0.264 | 0.134 | 4253 | 0.001 | 0.225 |
| 1126 | 0.003 | 0.164 | 4412 | 0 | 0.000 |
| 1730 | 0.512 | 0.098 | 4662 | 0.035 | 0.181 |
| 2453 | 0.229 | 0.137 | 4811 | 0.061 | 0.122 |
| 2499 | 0.220 | 0.149 | 5466 | 0.676 | 0.064 |
| 2504 | 0 | 0.000 | 5747 | 0 | 0.000 |
| 2839 | 0.055 | 0.098 | 5810 | 0 | 0.000 |
| 2905 | 0.110 | 0.224 | 6229 | 0.02 | 0.134 |
| 3014 | 0 | 0.000 | 6296 | 0 | 0.000 |
| 3246 | 0.567 | 0.205 | 6915 | 0 | 0.000 |
| 3508 | 0.005 | 0.129 | 6960 | 0.163 | 0.085 |

Users with a Bad Debt of Users (BDU) degree of 0 also have a Bad Debt of Neighborhood (BDN) degree of 0. In addition, of the 18 users with a BDU different from 0, only 7 have a BDU smaller than their BDN, i.e., their bad debt index tends to be smaller than or equal to that of their neighborhoods. In contrast, the other 11 users have a BDU greater than their BDN, i.e., their bad debt index tends to be higher than or equal to that of the other users in their neighborhoods.
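The comparison described above can be reproduced from the printed values. The sketch below is only an illustration: the tuples are transcribed from Table IV (decimal commas written as points), and the function and group names are hypothetical. On the values as printed it yields 8 users with BDU equal to 0, 8 with BDU below BDN and 10 with BDU above BDN, which is close to but not identical to the 7 and 11 reported in the text. The source does not explain the difference.

```python
# (user id, BDU, BDN) transcribed from Table IV as printed.
TABLE_IV = [
    (916, 0.194, 0.174), (923, 1.0, 0.121), (1122, 0.264, 0.134),
    (1126, 0.003, 0.164), (1730, 0.512, 0.098), (2453, 0.229, 0.137),
    (2499, 0.220, 0.149), (2504, 0.0, 0.0), (2839, 0.055, 0.098),
    (2905, 0.110, 0.224), (3014, 0.0, 0.0), (3246, 0.567, 0.205),
    (3508, 0.005, 0.129), (3519, 0.0, 0.0), (3712, 0.578, 0.040),
    (4253, 0.001, 0.225), (4412, 0.0, 0.0), (4662, 0.035, 0.181),
    (4811, 0.061, 0.122), (5466, 0.676, 0.064), (5747, 0.0, 0.0),
    (5810, 0.0, 0.0), (6229, 0.02, 0.134), (6296, 0.0, 0.0),
    (6915, 0.0, 0.0), (6960, 0.163, 0.085),
]


def classify(rows):
    """Group users by how their BDU compares with their BDN."""
    groups = {"zero": [], "below": [], "above": [], "equal": []}
    for user, bdu, bdn in rows:
        if bdu == 0:
            groups["zero"].append(user)
        elif bdu < bdn:
            groups["below"].append(user)
        elif bdu > bdn:
            groups["above"].append(user)
        else:
            groups["equal"].append(user)
    return groups


groups = classify(TABLE_IV)
for name, users in groups.items():
    print(name, len(users), users)
```

Every user in the `zero` group also has a BDN of 0 in the table, which matches the first observation in the text.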

VII. CONCLUSIONS AND FUTURE WORKS

This study collected data from the social network Twitter and compared it with data from a financial institution in order to model the network and analyze their similarities. Three thousand Twitter users were analyzed and 504 were matched with the database of a credit analysis company. The results showed that most of those users have more credit restrictions than their neighbors, and that users with no restrictions normally have neighborhoods with no credit restrictions as well.

We also presented a tool chain that helped us reach all the goals initially proposed, capturing data directly from Twitter through the API described in this work. The flexibility of AppSNA allowed us to use Gephi, which read the GraphML files generated by our tool to produce graphical representations of the network as well as the values of the various SNA metrics.

The initial results obtained with the bad debt rates suggest that, in the future, companies as well as individuals could use these data to identify potential credit risks. It will be possible to detect consumers from bad debt groups and, similarly, to identify groups that are safe with respect to credit because of their low or zero BDU and BDN. Through this research, many other factors related to credit may also be studied in order to define not only the credit profile of each person but also that of the group to which the person belongs.

As future work, we suggest repeating the methodological steps of this study with more users, since we found a low matching rate between the database of the credit company and the data from Twitter. With this we hope to obtain a network with a larger number of nodes, providing a larger sample that may help confirm the conclusions drawn here. It may also be interesting to analyze the content of the posted messages, which would allow the network to be analyzed in new ways. Through the partnership with the credit company, we hope to analyze these users in various aspects within the context of credit, such as the bad debt rate of users and their neighborhoods and the existence of patterns in the SNA metrics that can indicate whether a user is likely to be in bad debt within the group to which the user belongs.

REFERENCES

[1] Wasserman, S., Faust, K. (1994). Social Network Analysis: Methods and Applications. 8th ed. Cambridge University Press, New York, USA.

[2] Chakrabarti, S. (2003). Social Network Analysis. Mining the Web, Morgan Kaufmann, pp. 203-254.

[3] Blatt, A. (1999). Avaliação de risco e decisões de crédito: um enfoque prático. São Paulo: Nobel.

[4] Lampos, V., Cristianini, N. (2010). Tracking the flu pandemic by monitoring the social web. In Cognitive Information Processing: 2nd International Workshop.

[5] Cheng, J., Sun, A., Hu, D., Zeng, D. (2011). An information diffusion-based recommendation framework for micro blogging. Journal of the Association of Information Systems 12(7), 463-486.

[6] Bruns, A., Liang, E. (2012). Tools and methods for capturing Twitter data during natural disasters. First Monday. Available at: <http://migre.me/f3T3I>. Accessed May 2013.

[7] Cheong, F., Cheong, C. (2011). Social media data mining: A social network analysis of tweets during the 2010-2011 Australian floods. Proceedings of the Pacific Asia Conference on Information Systems.

[8] Vis, F. (2012). Reading the riots on Twitter: Who tweeted the riots? Researching Social Media. Available at: <http://migre.me/f3Tar>. Accessed May 2013.

[9] Lotan, G., Ananny, M., Gaffney, D., Boyd, d. (2011). The revolutions were tweeted: Information flows during the 2011 Tunisian and Egyptian revolutions. International Journal of Communication 5, 1375-1405.

[10] Stieglitz, S., Krüger, N. (2011). Analysis of sentiments in corporate Twitter communication: A case study on an issue of Toyota. Proceedings of the 22nd ACIS, Paper 103.

[11] Park, J., Cha, M., Kim, H., Jeong, J. (2011). Managing bad news in social media: A case study on Domino's Pizza crisis. Proceedings of the ICWSM.

[12] Diakopoulos, N. A., Shamma, D. A. (2010). Characterizing debate performance via aggregated twitter sentiment. In Proceedings of the 28th International Conference on Human Factors in Computing Systems (CHI), 1195-1198.

[13] Larsson, A. O., Moe, H. (2011). Studying political microblogging: Twitter users in the 2010 Swedish election campaign. New Media & Society.

[14] Filho, N. R. S. (2012). Monitoramento das redes sociais como forma de relacionamento com o consumidor. O que as empresas estão fazendo. Available at: <http://migre.me/fBr5a>. Accessed November 2012.

[15] Borrasca, C. R., Silva, C. E. L., Souza, O. T. (2012). Uma análise sobre influência das redes sociais na concessão de crédito bancário. Available at: <http://migre.me/fBr36>. Accessed November 2012.

[16] Newman, M. E. J. (2003). The structure and function of complex networks. Cornell University Library. Available at: <http://migre.me/f5eJL>. Accessed February 2013.

[17] Mika, P. (2007). Social Networks and the Semantic Web. Springer.

[18] Simoes, S. J., et al. (2011). Exploring Influence and Interests among Users within Social Networks. Social Networks: Computational Aspects and Mining, Computer and Communication Networks Series, Springer.

[19] Guimarães, F. J. Z., Melo, E. S. (2005). Diagnóstico Utilizando Análise de Redes Sociais. COPPE/UFRJ. Rio de Janeiro, Brasil, pp. 1-60.

[20] Proctor, C. H., Loomis, C. P. (1951). Analysis of sociometric data. In Research Methods in Social Relations.

[21] Shaw, M. (1954). Group structure and the behavior of individuals in small groups. Journal of Psychology 38, 139-149.

[22] Brandes, U. (2001). A Faster Algorithm for Betweenness Centrality. Journal of Mathematical Sociology 25(2), 163-177.

[23] Newman, M., Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review.

[24] Brin, S., Page, L. (1998). The Anatomy of a Large-Scale Hypertextual Web Search Engine. In Proceedings of the Seventh International Conference on the World Wide Web.

[25] Schrickel, W. K. (1994). Análise de Crédito: Concessão e gerência de empréstimos. São Paulo, Atlas.

[26] Silva, J. P. (1988). Análise Financeira das Empresas. São Paulo, Atlas.

[27] Silva, J. P. (1989). Análise e Decisão de Crédito. São Paulo, Atlas.

[28] Stiglitz, J. E. (2000). Money, credit, and business fluctuations. Economic Record, v. 64, n. 187, p. 307-322.

[29] Oliveira, R. F. (2013). Análise Quantitativa das Relações entre Usuários no Twitter. Master's dissertation. Programa de Pós-Graduação em Informática da Universidade Federal da Paraíba.

[30] Centro de Estudos sobre as Tecnologias de Informação e Comunicação (CETIC) (2012). Estudo sobre o uso da internet. Available at: http://www.cetic.br/usuarios/tic/2012/. Accessed May 2013.

[31] Confederação Nacional das Indústrias (CNI) (2013). Retratos da sociedade brasileira: inclusão financeira. Available at: <http://migre.me/fHa6u>. Accessed May 2013.
