Business Intelligence 2011
Business Intelligence
Connecting the dots for discoveries
Ding Li
Can Business be Intelligent?
• Today’s business is in an age of dramatic change. Business Intelligence (BI) is an interactive process that lets corporations promptly discern trends and patterns in business operations, products, services, customers, markets and competitors, and thereby derive insights and draw conclusions.
• Human brains are extremely powerful at integrating separate data (even almost forgotten pieces) with the current scenario to make the best possible decision. Corporations’ decision systems have a long way to go to be even nearly as efficient.
• It requires a combination of technologies, art and human intelligence to efficiently surface the value hidden in the sea of data.
• Contents of the study:
◦ BI System
◦ BI Data Flow Architecture
◦ BI Development Process
◦ BI and Data Preparation
◦ BI and Data Visualization
◦ BI and Dashboard Design
◦ BI and Web Analytics
◦ BI and Social Network
◦ BI and Semantic Technologies
◦ BI and Algorithm
3
Pressures-Responses-Support Model
• Business Environment (pressures and opportunities): globalization, customer demand, market conditions, competition, technology advances, regulations, …
• Organization Responses: strategic planning, new business models, restructuring business processes, choosing new vendors, improving partnership relationships, improving information systems, encouraging innovation, improving customer service, improving communication, improving data access, automating tasks, real-time response, …
• Decision and Support: analysis, predictions and decisions, supported by Business Intelligence.
(Turban, 2010)
4
Brief History of BI
• 1958: Hans Peter Luhn published the paper “A Business Intelligence System” in the IBM Journal of Research and Development, describing intelligence as “the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal.”
• 1983: Teradata sold the first relational database management system (RDBMS) designed specifically for decision support to Wells Fargo.
• 1992: Bill Inmon published the book “Building the Data Warehouse” (Wiley).
• 1995: The Data Warehousing Institute (TDWI) was formed.
• 1996: Ralph Kimball published the book “The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses”. Business units build their own data “marts”, which can be connected with a “bus”.
• 1996: Jim Gray and colleagues published the article “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals”, which supports OLAP (online analytical processing).
(Hammerbacher, 2009)
5
Goal of BI
• Goal: better, quicker business decision-making.
• Views of business operations: historical, current, predictive.
• BI functions: performance management, reporting, analytics, data mining, predictive analytics, benchmarking.
• Internal data: finance, R&D, supply and production, customer and sales, usage.
• External data: industry analysis, competitor status, user analysis, product ranking, technology analysis.
6
Common Pitfalls of Current System
• Reporting data from departments is fragmented (e.g. in Excel or PDF files).
• Manual extraction is prevalent.
• Updating frequency is relatively low, usually monthly.
• Analysts often spend more time on data collection than on data analysis. Developers spend precious time on manual data feeding instead of improving products and services.
• “Information silo problem”: rich information at the source is not easily accessible, or even known, to users.
• If it is a sin to leave useful data unused or underused, then most organizations in the business world, if not all, are guilty of it. The waste is tremendous.
[Diagram labels: Financial, Supply, Production, Subscription, Usage, Industry Analysis, Competitor Status; Executives]
7
Target System
• Automatic and integrated system across departments
• Overnight/real-time data collection
• Data integration, statistical analysis, business metrics calculation
• Categorized, top-down business views: departmental, product, customer relation
• Audiences: executives (strategic analysis), managers (tactical analysis), knowledge workers (operational analysis)
• Internal data: finance, R&D, supply and production, subscription and sales, usage
• External data: industry analysis, competitor status, user analysis, product ranking, technology analysis
8
What BI is Not
• BI is not a panacea for a poor or outdated information system.
◦ If information is incomplete because some pieces are still in text files manipulated manually, it is better to change the business process, move all the information into better data systems and automate the business logic.
◦ If the information is fragmented because there are no unique, well-formatted keys to link it, it is better to improve the production system with well-designed keys.
• BI is not just a collection of charts or tables.
◦ BI is supposed to transform data into information.
◦ BI is supposed to link information together to provide insights and assist discovery.
◦ BI is supposed to support both information aggregation and drilldown.
◦ BI is supposed to support “information retrieval”, i.e. search capability.
◦ Replicating Excel charts or tables in a BI system often results in static or mediocre reports.
9
Where does Intelligence come from in BI?
• A BI system organizes and visualizes information so well that human intelligence can be fully engaged to analyze it efficiently.
• People put analysis methods and knowledge into a BI system so that it can behave like a “smart” expert, following predefined logic.
• In a well-designed BI system, huge volumes of data can be linked together in a data network that manifests underlying relationships otherwise hidden from human eyes.
• In a (near) real-time system, fresh data can arrive at decision makers’ fingertips so quickly that prompt steps can be taken before permanent damage is done, for example to retain customers who have just requested to cancel services.
10
BI Vendor Examples
• QlikTech – QlikView is a flexible, nimble BI solution
• Microsoft – SQL Server + SharePoint + Excel PowerPivot + Silverlight
• Actuate – Business Performance Management (BPM), built on BIRT (an open source BI platform)
• Oracle – comprehensive platform
• SAS – Business Analysis, Forecast, and Data Visualization
• IBM Cognos – Corporate Performance Management (CPM)
• SAP – supports a software-as-a-service infrastructure
• Google – Google Analytics
• Information Builders – Customer Relationship Management (CRM)
11
QlikView
• Pros
◦ The click-driven, visually interactive interface is simple to learn and use.
◦ Based on in-memory associative technology, which is fast.
◦ Flexible data sources (Oracle, SQL, Excel, text files).
◦ Quicker to build compared with traditional BI systems.
• Cons
◦ Needs straightforward relationships among tables, which requires very clean data to link multiple tables.
◦ Its underlying calculation logic, set analysis, is not rigorous and is hard to use for complicated logic.
◦ Its script language is not complete enough to accomplish comprehensive tasks.
◦ All the data needs to be in memory.
12
References
• Gray, J., Bosworth, A., Layman, A. & Pirahesh, H. (1996). Data cube: a relational aggregation operator generalizing group-by, cross-tab, and sub-totals. In Su, S. (Ed.). Proceedings of the 12th International Conference on Data Engineering (pp. 152-159). New York, NY: IEEE.
• Hammerbacher, J. (2009). Information platforms and the rise of the data scientist. In Segaran, T. & Hammerbacher, J. (Eds.). Beautiful Data, chapter 5. Sebastopol, CA: O’Reilly Media.
• Inmon, W. H. (1992). Building the data warehouse. New York, NY: Wiley and Sons.
• Kimball, R. & Ross, M. (1996). The Data Warehouse Toolkit: practical techniques for building dimensional data warehouses. New York, NY: Wiley and Sons.
• Luhn, H.P. (1958). A business intelligence system. IBM Journal of Research and Development, 2(4), 314-319.
• Turban, E., Sharda, R., Delen, D., and King, D. (2010). Business Intelligence: a managerial approach (2nd ed.). Upper Saddle River, NJ: Prentice Hall.
13
BI and Data Source
• Data Warehouse
◦ Description: a repository of an organization’s electronically stored data.
◦ Pros: integrated; validated; logic clearly defined.
◦ Cons: long time to build; expensive.
• Data Mart, Staging Table
◦ Description: a subset of an organizational data store, usually oriented to a specific purpose.
◦ Pros: validated; logic clearly defined; easy to build.
◦ Cons: not fully integrated.
• Production Database
◦ Description: data extracted from the production system directly.
◦ Pros: real time; no extra storage.
◦ Cons: impact on production; data not validated; transformation limited.
• Manually Edited File
◦ Description: files maintained by information workers.
◦ Pros: flexible; cheap.
◦ Cons: prone to human error; lack of details.
14
Data Warehouse and ETL
• Sources: Oracle, MS SQL, Excel files, text files, Web.
• Extract: data is pulled from the sources.
• Transform: cleaning; standardizing primary keys; formatting; translating embedded logic; merge and sort; integration and aggregation; summarization and derivation.
• Load: referential integrity check; indexing.
• Target: the BI data warehouse, which feeds the BI system.
(Moss, 2003)
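The extract, transform and load stages can be sketched in a few lines of Python. This is only a minimal illustration, not the process described by Moss: the two in-memory source lists, the `standardize_key` rule and the `sales_summary` table are all hypothetical stand-ins for real sources and a real warehouse.

```python
import sqlite3

# Hypothetical rows from two sources with inconsistent key formats.
source_a = [{"customer_id": " c001 ", "amount": "120.50"},
            {"customer_id": "C002", "amount": "80"}]
source_b = [{"customer_id": "c-001", "amount": "30.25"},
            {"customer_id": "C003", "amount": ""}]  # missing amount

def standardize_key(raw):
    """Standardize a primary key: trim, upper-case, drop hyphens."""
    return raw.strip().upper().replace("-", "")

def extract():
    """Extract: gather raw rows from every source."""
    return source_a + source_b

def transform(rows):
    """Clean rows, standardize keys, then aggregate amounts per customer."""
    totals = {}
    for row in rows:
        if not row["amount"]:  # cleaning: skip rows without an amount
            continue
        key = standardize_key(row["customer_id"])
        totals[key] = totals.get(key, 0.0) + float(row["amount"])
    return sorted(totals.items())  # sort before loading

def load(records, conn):
    """Load: write aggregated records into a keyed (indexed) table."""
    conn.execute(
        "CREATE TABLE sales_summary (customer_id TEXT PRIMARY KEY, total REAL)"
    )
    conn.executemany("INSERT INTO sales_summary VALUES (?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
summary = conn.execute(
    "SELECT customer_id, total FROM sales_summary ORDER BY customer_id"
).fetchall()
print(summary)
```

Here the two spellings of customer C001 end up in one aggregated row, which is the kind of integration that is hard to achieve when reports query each source separately.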
15
Data Flow Architecture – Building Data Mart
• Each department offers aggregated data in staging tables, or the BI system queries the production/standby tables directly.
• The BI system integrates data and generates reports.
• Pro: quick to build.
• Con: data not fully integrated.
[Diagram: staging tables for Submission, Production, Sales, Usage, Financials and External Data connect to the BI System.]
16
Data Flow Architecture – Building Data Warehouse
• Each department offers raw or aggregated data in staging tables and pushes the data to a central database repository.
• The BI system pulls data and results from the data warehouse and generates reports.
• Pro: deep data integration and complicated analysis can be realized efficiently.
• Con: long time to build.
[Diagram: staging tables for Submission, Production, Sales, Usage, Financials and External Data connect to the Data Warehouse, which connects to the BI System.]
17
Data Flow Architecture – Hybrid Design
• Start with data marts.
• Gradually build the data warehouse.
• Pro: quick to build the data marts, and eventually gain the advantages of a data warehouse.
• Con: complicated process.
[Diagram labels: Central Staging Repository; BI System; Production, Usage, Financials, Submission and External Data staging tables; Sales summary tables.]
18
Facebook’s Dataspace Management with Open Source Tools
• Sources (all data from the enterprise, structured and unstructured): transactional databases, application logs, web crawls (post).
• Hadoop Distributed File System (HDFS): parallelized data processing at massive scale; 15 terabytes of new data per day in 2009.
• Hive: data warehousing framework, with a query language and a query UI (HiPal).
• Tools
◦ Argus: portal for sharing charts and graphs.
◦ Databee: workflow management system.
◦ PyHive: Python script framework for MapReduce.
◦ Cassandra: storage system for serving data to end users.
(Hammerbacher, 2009)
19
References
• Hammerbacher, J. (2009). Information platforms and the rise of the data scientist. In Segaran, T. & Hammerbacher, J. (Eds.). Beautiful Data, chapter 5. Sebastopol, CA: O’Reilly Media.
• Moss, L. T. & Atre, S. (2003). Business intelligence roadmap: the complete project lifecycle for decision-support applications. Boston, MA: Addison-Wesley.
20
Heavyweight Development Process
(Moss, 2003)
21
Agile Development Process
• Iterative cycle
◦ Plan: business goals, KPI
◦ Analysis: data sources, calculation logic
◦ Data ETL: extraction, transform, loading
◦ Design: report layout, data visualization
◦ Validation: data, logic
◦ Feedback: new requirements
• Phased release
◦ Important KPI first.
◦ Well connected data first.
• Quick feedback
◦ Design
◦ Data
◦ Logic
22
Challenges of BI Management
• A BI project cuts across all departments, so winning their cooperative support is the key to its success.
• BI development often encounters unexpected issues. Forcing a deadline may cause low-quality reports; relaxing the due date too much may stall the project.
• A BI system is very efficient at exposing data abnormalities. If data owners and suppliers treat the process as a rare opportunity to fix data at the source, a cleaner data system can be an excellent bonus of a BI project.
23
References
• Moss, L. T. & Atre, S. (2003). Business intelligence roadmap: the complete project lifecycle for decision-support applications. Boston, MA: Addison-Wesley.
24
Data Connection and Naming Issues
• Naming issues when linking data
◦ The same thing with different names
◦ Different things with the same name
• Possible solutions
◦ Matching on multiple fields: choose a set of parameters and create a set of fixed rules that decide whether things match or not.
◦ Collective reconciliation: take advantage of the full network of data for record matching.
(Segaran, 2009)
25
Matching on Multiple Fields
• Set up matching rules
1. First Name, Last Name, Country, Organization, Department
2. Email, Last Name
• Example records to match
◦ Submit Author: Tufte, Ed; Country: US; Organization: Princeton; Department: Politics
◦ Author Profile: Edward R. Tufte; Country: United States; Organization: Princeton University; Department: Political Science
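The two fixed rules can be expressed as a small rule table. The sketch below is only an illustration: the field names, the `normalize` helper and the e-mail addresses are assumptions, since the slide does not show an e-mail value for either record.

```python
def normalize(value):
    """Lower-case a value and collapse whitespace before comparing."""
    return " ".join(value.lower().split())

# Each rule lists the fields that must all agree for a match.
RULES = [
    ("first_name", "last_name", "country", "organization", "department"),
    ("email", "last_name"),
]

def is_match(a, b, rules=RULES):
    """Return True when every field of at least one rule agrees exactly."""
    for fields in rules:
        if all(a.get(f) and b.get(f) and normalize(a[f]) == normalize(b[f])
               for f in fields):
            return True
    return False

submission = {
    "first_name": "Ed", "last_name": "Tufte", "country": "US",
    "organization": "Princeton", "department": "Politics",
    "email": "author@example.org",  # hypothetical address
}
profile = {
    "first_name": "Edward", "last_name": "Tufte", "country": "United States",
    "organization": "Princeton University", "department": "Political Science",
    "email": "Author@Example.org",  # hypothetical address
}
profile_without_email = dict(profile, email="")

print(is_match(submission, profile))                # rule 2 agrees
print(is_match(submission, profile_without_email))  # no rule agrees
```

Rule 1 fails for this pair because "Ed" and "Edward", "US" and "United States", and so on are not exactly equal. Without a shared e-mail address, fixed rules alone cannot link the two records.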
26
Collective Reconciliation
• Even though no single field matches perfectly for the submitting author, we can conclude that this is a match by combining the similarity of multiple fields.
◦ Submit Author: Tufte, Ed; Country: US; Organization: Princeton; Department: Politics
◦ Author Profile: Edward R. Tufte; Country: United States; Organization: Princeton University; Department: Political Science
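One simple way to combine field similarities is to average a per-field score and compare it with a threshold. The sketch below is an assumption-laden illustration, not Segaran's method: the prefix-based token similarity, the acronym shortcut, the equal field weights and the 0.8 threshold are all choices made for this example, and it scores each pair of records in isolation rather than using the full network of data.

```python
import re

def tokens(text):
    """Lower-case a value and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def token_similarity(a, b):
    """Shared-prefix length divided by the shorter token's length."""
    shared = 0
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            break
        shared += 1
    return shared / min(len(a), len(b))

def field_similarity(a, b):
    """Similarity in [0, 1] between two field values."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    short, long_ = (ta, tb) if len(ta) <= len(tb) else (tb, ta)
    # Treat an acronym as a full match, e.g. "US" vs "United States".
    if len(short) == 1 and short[0] == "".join(t[0] for t in long_):
        return 1.0
    best = [max(token_similarity(s, l) for l in long_) for s in short]
    return sum(best) / len(best)

def record_similarity(a, b, fields):
    """Average the field similarities with equal weights."""
    return sum(field_similarity(a[f], b[f]) for f in fields) / len(fields)

FIELDS = ("name", "country", "organization", "department")
THRESHOLD = 0.8  # hypothetical cut-off

submission = {"name": "Tufte, Ed", "country": "US",
              "organization": "Princeton", "department": "Politics"}
profile = {"name": "Edward R. Tufte", "country": "United States",
           "organization": "Princeton University",
           "department": "Political Science"}
other = {"name": "Edward Smith", "country": "United Kingdom",
         "organization": "Oxford University", "department": "Physics"}

score_profile = record_similarity(submission, profile, FIELDS)
score_other = record_similarity(submission, other, FIELDS)
print(score_profile >= THRESHOLD, score_other >= THRESHOLD)
```

No field of the submission equals the corresponding profile field, yet the combined score clears the threshold, while the unrelated profile (a made-up record) stays well below it.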
27
Data Modeling – Event Chain
• Event chain: Submit → Editor Review → Peer Review → Production → Online → Downloads
• Dates tracked: Submit Date, Review Dates, Final Decision Date, Online Date, Download Date
• Separate dates make it easy to trace the history of articles in the system.
• Users can select a period of submit dates, and the charts of accepted articles and published articles will include only the articles submitted in that period.
• The model is suitable for detailed analysis.
28
Data Modeling – Event List
• All the event dates and regions are consolidated in the event table.
• When a journal, a period or a region is selected, all the charts change to reflect the selection.
• The data model is suitable for a high-level overview.
• Event table fields: Event ID, Journal, Event Date, Event Region
• Event types: Submit, Editor Review, Peer Review, Production, Online, Usage
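A consolidated event table makes this kind of selection a single filter. The sketch below assumes a tuple layout of (event ID, journal, event type, event date, event region) and uses invented rows; it only illustrates how one selection can drive every chart.

```python
from collections import Counter
from datetime import date

# Hypothetical event table: (event_id, journal, event_type, event_date, region)
events = [
    ("E1", "Journal A", "Submit", date(2011, 1, 10), "Asia"),
    ("E2", "Journal A", "Peer Review", date(2011, 2, 1), "Asia"),
    ("E3", "Journal A", "Online", date(2011, 4, 15), "Asia"),
    ("E4", "Journal B", "Submit", date(2011, 1, 20), "Europe"),
    ("E5", "Journal A", "Submit", date(2011, 3, 5), "Europe"),
    ("E6", "Journal A", "Usage", date(2011, 5, 1), "Europe"),
]

def summarize(events, journal=None, start=None, end=None, region=None):
    """Count events per type for the selected journal, period and region."""
    counts = Counter()
    for _event_id, ev_journal, ev_type, ev_date, ev_region in events:
        if journal is not None and ev_journal != journal:
            continue
        if start is not None and ev_date < start:
            continue
        if end is not None and ev_date > end:
            continue
        if region is not None and ev_region != region:
            continue
        counts[ev_type] += 1
    return counts

by_journal = summarize(events, journal="Journal A")
by_region_q1 = summarize(events, region="Europe",
                         start=date(2011, 1, 1), end=date(2011, 3, 31))
print(dict(by_journal))
print(dict(by_region_q1))
```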
29
References
• Segaran, T. (2009). Connecting data. In Segaran, T. & Hammerbacher, J. (Eds.). Beautiful Data, chapter 20. Sebastopol, CA: O’Reilly Media.
30
Data Visualization
• Informative
◦ Reveal the intended message clearly with enough data.
◦ Offer different perspectives to facilitate discovery.
• Efficient
◦ Visually emphasize what matters and reveal relationships.
◦ Use axes, color and size to convey meaning.
• Novel
◦ Break the limits of default formats; choose the best format to suit the data.
◦ A fresh look at the data.
◦ A new level of understanding.
• Aesthetic
◦ Appropriate use of graphical construction to offer visual appeal.
(Iliinsky, 2010)
31
1854 Cholera Epidemic in London
The epidemic took the lives of 600 Londoners in September 1854. What was the cause?
Dr. John Snow started by mapping the locations of the incidents.
(Tufte, 2001)
32
Discovery seems so easy when the right information is put together
Then Dr. John Snow linked the incident locations to pump sites.
It was later verified that the Broad Street pump was the cause of the epidemic.
(Tufte, 2001)
33
2008 Electoral Vote Results of Presidential Election
(Nagourney, 2008)
Issue: the geographically accurate map is actually a very inaccurate map of electoral influence.
[Figure: geographically accurate map of electoral votes by state]
34
2008 Electoral Vote Results of Presidential Election - Revision
(Iliinsky, 2010)
Accurate and beautiful: a proportionally weighted electoral vote results map of the United States
[Figure: map with states sized in proportion to their electoral votes]
35
Mining and Visualizing Social Patterns
From public data in a local newspaper: 18 women attending 14 different social events.
The links between women are weighted by the number of events both women attended.
Start with the strongest links to reveal clustering.
(Krebs, 2010)
36
Mining and Visualizing Social Patterns(2)
Gradual inclusion: focus initially on the strongest ties in the structure, then gradually lower the membership threshold to reveal weaker ties in the network.
Very weak links are dismissed as social noise.
(Krebs, 2010)
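The weighting and gradual-inclusion steps can be sketched as follows. The attendance data here is invented (five people, six events) and is not the newspaper dataset Krebs used; the `noise_floor` parameter is an assumed name for the weight below which links are dismissed.

```python
from itertools import combinations

# Hypothetical attendance records: person -> set of events attended.
attendance = {
    "Ann": {"e1", "e2", "e3", "e4"},
    "Beth": {"e1", "e2", "e3"},
    "Cara": {"e2", "e3", "e4"},
    "Dee": {"e4", "e5"},
    "Eve": {"e5", "e6"},
}

def link_weights(attendance):
    """Weight each pair by the number of events both people attended."""
    weights = {}
    for a, b in combinations(sorted(attendance), 2):
        shared = len(attendance[a] & attendance[b])
        if shared:
            weights[(a, b)] = shared
    return weights

def gradual_inclusion(weights, noise_floor):
    """Lower the threshold step by step, keeping links at or above it.

    Links weaker than noise_floor are never included (social noise).
    """
    stages = {}
    for threshold in range(max(weights.values()), noise_floor - 1, -1):
        stages[threshold] = sorted(
            pair for pair, w in weights.items() if w >= threshold
        )
    return stages

weights = link_weights(attendance)
stages = gradual_inclusion(weights, noise_floor=2)
for threshold, links in stages.items():
    print(threshold, links)
```

At the highest threshold only the strongest ties appear; lowering it by one adds the next tier, and the single-event links never enter the picture.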
37
References
• Krebs, V. (2010). Your choices reveal who you are: mining and visualizing social patterns. In Steele, J. & Iliinsky, N. (Eds.). Beautiful Visualization, chapter 7. Sebastopol, CA: O’Reilly Media.
• Iliinsky, N. (2010). On beauty. In Steele, J. & Iliinsky, N. (Eds.). Beautiful Visualization, chapter 1. Sebastopol, CA: O’Reilly Media.
• Nagourney, A., Zeleny, J. & Carter, S. (2008). The electoral map: key states. The New York Times. Retrieved from http://elections.nytimes.com/2008/president/whos-ahead/key-states/map.html.
• Tufte, E. (2001). The Visual Display of Quantitative Information (2nd ed.). Cheshire, CT: Graphics Press.
38
Challenge of Dashboard Design
• “A dashboard is a visual display of the most important information needed to achieve one or more objectives; consolidated and arranged on a single screen so the information can be monitored at a glance.”
• “Most dashboards fail to communicate efficiently and effectively, not because of inadequate technology (at least not primarily), but because of poorly designed implementations.”
• “No matter how great the technology, a dashboard’s success as a medium of communication is a product of design, a result of a display that speaks clearly and immediately.”
• “Dashboards can tap into the tremendous power of visual perception to communicate, but only if those who implement them understand visual perception and apply that understanding through design principles and practices that are aligned with the way people see and think.”
• Unfortunately, most vendors focus their marketing efforts on flash and dazzle that subvert the goals of clear communication. “Once implemented, however, these cute displays lose their spark in a matter of days and become just plain annoying.”
(Few, 2006)
39
Common Measures (KPIs)
• Measures by category
◦ Sales: bookings, billings, sales pipeline, number of orders, order amounts, selling prices
◦ Marketing: market share, campaign success, customer demographics
◦ Finance: revenues, expenses, profits
◦ Web Services: number of visitors, number of page hits, visit durations
• Comparative measures, with examples
◦ The same measure at the same point in time in the past: the same day last year
◦ The same measure at some other point in time in the past: the end of last year
◦ The current target for the measure: a budgeted amount for the current period
◦ A prior prediction of the measure: forecast of where we expected to be today
◦ An extrapolation of the current measure: projection out into the future, e.g. year end
◦ Some measure of the norm for this measure: average, normal range or a benchmark
(Few, 2006)
40
Non-Quantitative Dashboard Data
• Tasks that are behind schedule
• Tasks that need to be completed
• Accomplishments that should be highlighted
• Issues that need to be investigated
(Few, 2006)
41
Utilize Short-Term Memory
• Memory comes in three fundamental types:
◦ Iconic memory (a.k.a. the visual sensory register)
◦ Short-term memory (a.k.a. working memory)
◦ Long-term memory
• Only 3-9 chunks of information can be stored in short-term memory.
• Graphs over text.
◦ Individual numbers are stored as discrete chunks.
◦ One or more lines in a line graph can represent a great deal of information as a single chunk.
• Relevant information on the same screen.
◦ Once information is no longer visible, it is no longer available unless it is one of the few chunks stored in short-term memory.
◦ If everything remains within eye span, users can exchange information in and out of short-term memory at lightning speed.
(Few, 2006)
42
Information in Well-designed Dashboard
• Exceptionally well organized
◦ All important data on one page.
• Condensed, primarily in the form of summaries and exceptions
◦ Summaries: single numbers from sums or averages.
◦ Exceptions: something that falls outside the realm of normality and needs attention.
• Specific to and customized for the dashboard’s audience and objectives
◦ Information should be narrowed to address the objective(s).
◦ Use the audience’s vocabulary.
• Displayed using concise and often small media that communicate the data and its message in the clearest and most direct way possible
◦ Reduce the non-data pixels.
◦ Enhance the data pixels.
(Few, 2006)
43
Reducing the Non-Data Ink
(Few, 2006)
When the non-data ink is removed or reduced, the data stand out more clearly and it is easier to find trends or patterns in them.
44
Emphasize Most Important Data
(Few, 2006)
Different degrees of visual emphasis are associated with different regions of a dashboard.
Information in the center gains emphasis only when it is set apart from what surrounds it.
Recent data often deserve a finer time scale than older historical data.
Visual attributes such as color, size, line width, enclosure and added marks can also be used to make important data stand out.
45
Effective Dashboard Display Media
(Few, 2006)
Easier to spot trends with a line chart
Clean display of related data
Simple symbol or number
(Few, 2006)
Organize the display objects to reveal their intrinsic relationship
47
Sample Sales Dashboard
(Few, 2006)
48
Add Interactivity to Dashboard
Add selection box so users can focus on a subset of data
49
When Dashboard is not Enough
• As soon as a dashboard shows abnormalities, users will often want to know more details about them.
• The responsible individual can be asked to provide the details, and may in turn query the database or ask IT staff to run the query. The process is long and resource-consuming.
• Layered reports can provide top-down views:
◦ Layer 1: one-page dashboard
◦ Layer 2: more detailed aggregation, such as regional reports
◦ Layer 3: data tables with all the details needed
• The data in detail views can be narrowed down from the top views, which offers a natural analysis flow.
50
References
• Few, S. (2006). Information Dashboard Design. Sebastopol, CA: O’Reilly Media.
51
BI and Web Analytics
So much data, still so few insights
The reason for so few actionable insights, even with abundant web click data:
The clickstream tells “what”, but not “why”.
(Kaushik, 2010)
52
Web Analytics 2.0
(Kaushik, 2010)
53
Web Analytics Tools
(Kaushik, 2010)
54
Metrics for Clickstream Analysis
• Visits and Unique Visitors
◦ Counted using the session ID and the persistent cookie ID.
• Time on Page and Time on Site
◦ No leaving time is recorded on the last page, unless an “unloaded” script is used.
• Bounce Rate
◦ People leave the site without a single click.
◦ Useful: bounce rate from top referrers.
• Exit Rate
◦ Useful only in the middle of “sequential” pages.
• Conversion Rate
(Kaushik, 2010)
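These metrics can be computed from session records in a few lines. The sketch below uses invented sessions and makes several assumptions that are not from Kaushik: a bounce is a single-page session, the exit rate of a page is its exits divided by its page views, and the conversion rate is conversions divided by visits (dividing by unique visitors is an equally common choice).

```python
# Hypothetical sessions: one record per visit, keyed by a persistent cookie ID.
sessions = [
    {"session_id": "s1", "cookie_id": "v1",
     "pages": ["home", "product", "checkout"], "converted": True},
    {"session_id": "s2", "cookie_id": "v1",
     "pages": ["home"], "converted": False},
    {"session_id": "s3", "cookie_id": "v2",
     "pages": ["product", "home"], "converted": False},
    {"session_id": "s4", "cookie_id": "v3",
     "pages": ["home", "product"], "converted": False},
]

def clickstream_metrics(sessions, page):
    """Compute visit-level metrics plus the exit rate of one page."""
    visits = len(sessions)
    unique_visitors = len({s["cookie_id"] for s in sessions})
    bounces = sum(1 for s in sessions if len(s["pages"]) == 1)
    views = sum(s["pages"].count(page) for s in sessions)
    exits = sum(1 for s in sessions if s["pages"][-1] == page)
    conversions = sum(1 for s in sessions if s["converted"])
    return {
        "visits": visits,
        "unique_visitors": unique_visitors,
        "bounce_rate": bounces / visits,
        "exit_rate": exits / views if views else 0.0,
        "conversion_rate": conversions / visits,
    }

metrics = clickstream_metrics(sessions, page="product")
for name, value in metrics.items():
    print(name, value)
```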
55
Top Questions to Answer
• How many visitors come to my site?
• Where are visitors coming from?
◦ Direct traffic.
◦ Referring sites.
◦ Search engines: keywords.
◦ Campaigns and paid ads.
• What do I want visitors to do on my site?
• What are visitors actually doing?
◦ Top entry pages.
◦ Top viewed pages.
◦ Site overlay analysis (navigation analysis).
◦ Abandonment analysis.
(Kaushik, 2010)
56
Typical Analysis Flow
(Kaushik, 2010)
• Bounce rate of top search keywords
◦ Search keywords: users’ intent
◦ Bounce: not happy with the findings
◦ Q: ranked for the wrong keywords?
◦ Q: landing pages missing information?
• Site overlay (click density) analysis
◦ % clicks or conversions
• User behavior
◦ Also check days to convert
57
Source of Traffic Analysis
Who sends valued traffic?
(Kaushik, 2010)
58
Module Click Analysis
• Pages using the same layout template share the same modules.
• Click analysis at the module level can reveal which modules are outperforming or underperforming.
• Clicks on link positions within each module can reveal more user behavior patterns.
Many pages, same layout: performance across pages?
59
Scroll Percentage for Long Page
• 0-20% scrolled: 30%
• 20%-40% scrolled: 22%
• 50%-60% scrolled: 11%
• 60%-80% scrolled: 9%
• 80%-100% scrolled: 26%
60
Visitor Segmentation
• What/how are they viewing?
• Why do they leave?
• How to engage them more?
• How to connect them?
• Segments: New Visitors, Casual Visitors, Loyal Visitors, Lapsed Visitors
• Growing the number of loyal visitors is essential to keep the site thriving.
• So it is important to understand their navigation patterns and what they like and dislike.
61
Consumption of Content
62
Navigation Flow Among Top Pages/Content
(Adobe.com)
63
Navigation Flow to a Page
(Adobe.com)
64
Navigation Flow from a Page
(Adobe.com)
65
Markov Chain Analysis
Grouping Page Views for Behavior Analysis
(Gwizdka, 2010)
66
Factors Influencing Satisfaction for Information Retrieval
• System Effectiveness: measures how well a given IR system achieves its objective.
◦ Precision (relevant documents retrieved / total retrieved documents)
◦ Recall (relevant documents retrieved / total relevant documents in database)
• User Effectiveness: measures the accuracy and completeness with which users achieve certain goals.
◦ Number of tasks successfully completed
◦ Number of relevant documents obtained
◦ Time taken by users to complete set tasks
• User Effort: measures users’ effort to get relevant information.
◦ Number of clicks
◦ Number of queries and query reformulations
◦ Rank position accessed
(Al-Maskari, 2010)
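The two system-effectiveness ratios can be checked with a small worked example. The document IDs below are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of document IDs."""
    hits = len(retrieved & relevant)  # relevant documents retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result: 4 documents retrieved, 5 relevant in the database.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d5", "d6", "d7"}

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Two of the four retrieved documents are relevant (precision 2/4), and two of the five relevant documents were found (recall 2/5).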
67
See Users’ Experience
by Visual Replay of HTML Stream
http://www.tealeaf.com/products/real-time-customer-experience-management.php (accessed on Dec 6th, 2011)
Tealeaf is one of the tools that record all the dynamically generated HTML at the network level and store it for later searching and visual replay.
68
See Users’ Joy and Tears
by Visual Replay of What Users Saw and Their Actions
Such case studies can help to understand the reasons behind the summarized numbers.
http://www.tealeaf.com/products/real-time-customer-experience-management.php (accessed on Dec 6th, 2011)
69
Web Detective
Solve the web mysteries
• Post: the third-party payment system received payment for one candy and forwarded the user to the application server to receive a receipt.
• Receipt: the server stored the order as two candies and printed a receipt for two candies.
• Valid
Web Detective
Replaying the web session can reveal the true culprit
Time
10:00 Open Tab 1 and add one candy.
10:05 Open Tab 2 and add a second candy.
10:10 Submit Tab 1.
10:10 Receive the payment in Tab 1.
10:11 Process the order in Tab 2.
71
References
• Adobe training video. Retrieved from https://outv.omniture.com/.
• Al-Maskari, A. and Sanderson, M. (2010). A review of factors influencing user satisfaction in information retrieval. Journal of the American Society for Information Science and Technology, 61: 859–868. Doi: 10.1002/asi.21300
• Gwizdka, J. (2010). Distribution of cognitive load in Web search. Journal of the American Society for Information Science and Technology, 61: 2167–2187. DOI: 10.1002/asi.21385
• Kaushik, A. (2010). Web Analytics 2.0. Indianapolis, IN: Wiley Publishing.
72
BI and Social Network
• Social networks such as LinkedIn, Facebook and Twitter are becoming important means for people, including scientists, to share information, though the academic world has been slow to adopt them. (Curry, 2009)
• The capability to extract this vast, unstructured, time-sensitive information is becoming increasingly important for business analysis.
• The recent development of literature-based scientific social networks is promising.
◦ Sites: BioMedExperts, UniPHY
◦ Unique to the research world: preloaded professional profiles based on publications; preloaded networking based on co-authorship analysis; periodic publication updates sent within each user’s network.
• Effective ways to analyze the content on social networks and to promote scientists’ contributions on them still need to be developed.
73
Effectiveness of Scientific Social Networks
• Academic social networks will soon fall out of favor if they cannot help scientists effectively.
• We need to study whether such networks can improve scientists’ research productivity, increase collaboration among scientists, and increase the traffic to scientific content web sites.
◦ Statistical analysis based on users’ profiles on the site.
◦ Web analytics using tools like Google Analytics.
◦ Scenario analysis using session capture tools like Tealeaf.
◦ Traditional usability tests using tools like Morae.
◦ Surveys.
• Linking users’ activities on academic social networks, their profiles at professional member societies and their clickstreams on academic content sites can help to understand and serve each user efficiently.
◦ Organize the order of content according to the user’s long-term and short-term interests.
◦ Recommend relevant events, such as academic forums/seminars and industry shows.
◦ Let users promote academic content or events that interest them via social networks.
74
Building Users’ Expert Profile Based on Concepts in Publications
(Gunter, 2009)
Document fingerprints aggregated to expert profiles
75
Motivating Contribution in Social Media
• Social Learning
◦ People learn by observation in social situations, and they will begin to act like the people they observe even without external incentives (Bandura, 1977).
◦ Social sites can make it easy for users to observe the behaviors of active users.
• Feedback
◦ Theories of reciprocity (Cialdini, 1984; Gouldner, 1960), reinforcement (Ferster, 1957) and the need to belong (Baumeister, 1995) all suggest that feedback from other users should predict long-term participation of social media users.
◦ Site design and its backend technologies can make it convenient for users to tag and comment.
• Distribution
◦ Reputation is a common motivation for participation in many online environments.
◦ Competitive motivations in the form of reputation and status attainment have been cited as a primary incentive for continued participation in open-source software (Hertel, 2003).
◦ Bloggers cite the intent to affect their professional reputation as being among their top motivations for blogging (Marlow, 2006).
◦ Promoting active users and distributing their influence is an effective social currency to ‘bribe’ key contributing users.
76
Case Study at Facebook: Motivating Newcomer Contribution
• Measures: dependent variable
◦ The number of photos uploaded by the newcomers between their third and fifteenth weeks on the site.
• Measures: independent variables
◦ Learning: the number of photo-uploading stories the newcomers saw in their News Feeds during their first two weeks.
◦ Singling out: whether the newcomer was tagged in a photo during his or her first two weeks.
◦ Feedback: whether the newcomer received any comments on his or her initial photos during the first two weeks.
◦ Distribution: the number of News Feed stories shown to friends about the newcomer’s photos.
(Burke, 2009)
77
Result of Case Study at Facebook: Motivating Newcomer Contribution
• “Design elements which facilitate learning from friends, singling out, feedback, and content distribution can help increase the level of engagement for new users, leading to further content contributions and an overall better user experience.”
• “The most consistent result we found was for learning from friends. An increase in visible photo activity was always predictive of increased newcomer contribution.”
• “Designers of social networking sites should also find ways to support newcomers with varying behavioral patterns.”
“For newcomers who are active, highlighting opportunities for others to leave them feedback and allowing the newcomers to increase the size of their audience may be particularly effective.”
“For newcomers who are relatively inactive, designers might want to encourage their friends to pay more attention to them, whether through singling out in a public fashion or sending more directed private communication.”
(Burke, 2009)
78
References
• Bandura, A. (1977). Social Learning Theory. New York, NY: General Learning Press.
• Baumeister, R. & Leary, M. (1995). The need to belong: desire for interpersonal attachments as a fundamental human motivation. Psychological Bulletin, 117(3), 497-529.
• Burke, M., Marlow, C. & Lento, T. (2009). Feed me: motivating newcomer contribution in social network sites. Proceedings of the 27th international conference on human factors in computing systems (pp. 945-954). Boston, MA: ACM Press.
• Cialdini, R.B. (1984). Influence. New York, NY: William Morrow and Company.
• Curry, R., Kiddle, C. and Simmonds, R. (2009). Social networking and scientific gateways. Proceedings of the 5th Grid Computing Environments Workshop. DOI: 10.1145/1658260.158266.
• Ferster, C. & Skinner, B. (1957). Schedules of Reinforcement. New York, NY: Appleton-Century-Crofts.
• Gouldner, A. (1960). The norm of reciprocity: A preliminary statement. American Sociological Review, 25(2), 161-178.
• Gunter, D. (2009). Semantic Search. Bulletin of the American Society for Information Science and Technology, 36: 36-37.
• Hertel, G., Niedner, S. & Herrmann, S. (2003). Motivation of software developers in open source projects: An internet-based survey of contributors to the Linux kernel. Research Policy, 32(7), 1159-1177.
• Marlow, C. (2006). Linking without thinking: Weblogs readership and online social capital formation. In Proceedings of the International Communication Association, Dresden, Germany.
79
Semantic Technologies, BI and Just-in-Time Discovery
• “Discoverability requires the ability to recall related historical data so that an arriving piece of data can find its place, similar to the way each jigsaw puzzle piece is assessed relative to a work-in-progress puzzle.” (Jonas, 2009)
• Directories for enterprise-wide discoverability
◦ Context-less directories: basic directories to locate information
◦ Semantically reconciled directories: concepts with similar meanings are bundled together
◦ Semantically reconciled and relationship-aware directories: information is linked together in context (context-based discovery)
• Academic publishers can organize the factors and activities of their subscribers, users and authors so that they can be easily pulled together, and place new information in that context to assist business discovery.
80
Semantic Web – Linked Data
(Berners-Lee, 2001)
A relational database schema is too rigid to capture dynamic relationships: new fields and new relationships have to be added to the database all the time, which is inefficient.
A graph database is designed to store dynamic relationships with a simple and flexible schema. Some examples, most of them open source:
• Sesame (http://openrdf.org)
• Jena (http://jena.sourceforge.net)
• AllegroGraph (http://agraph.franz.com)
• Neo4J (http://neo4j.org)
(Segaran, 2009)
81
Semantic Web Elements: URI, RDF, Ontology
Example triples:
Gene 1 – Modifies – Gene 2
Gene 2 – Affects – Disease A
Inferred: Gene 1 – May Affect – Disease A
URI (Uniform Resource Identifier)
• Specifies an entity
• Identical and exchangeable across different documents
RDF (Resource Description Framework)
• Subject – Predicate – Object (triples)
• Expresses the relationships between entities
Ontology
• Collection of URIs and RDF statements
• Collection of inference rules
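The gene example above can be sketched in a few lines of Python. This is only an illustration of triples plus one inference rule: the identifiers (`gene:1`, `modifies`, `may_affect`) and the function `infer_may_affect` are hypothetical names chosen here, not real URIs or part of any RDF library.

```python
# Each fact is a (subject, predicate, object) triple, as in RDF.
# The identifiers are illustrative strings, not real URIs.
triples = {
    ("gene:1", "modifies", "gene:2"),
    ("gene:2", "affects", "disease:A"),
}


def infer_may_affect(facts):
    """Apply one hypothetical rule:
    X modifies Y and Y affects Z  =>  X may_affect Z."""
    inferred = set()
    for x, p1, y in facts:
        if p1 != "modifies":
            continue
        for y2, p2, z in facts:
            if p2 == "affects" and y2 == y:
                inferred.add((x, "may_affect", z))
    return inferred


new_facts = infer_may_affect(triples)
print(new_facts)  # {('gene:1', 'may_affect', 'disease:A')}
```

An ontology in this picture is the set of triples together with rules such as the one coded in `infer_may_affect`.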
82
Dublin Core Metadata Initiative
The Dublin Core is a set of predefined properties for describing documents.
The following example demonstrates the use of some of the Dublin Core properties in an RDF document:
<?xml version="1.0"?>
<!DOCTYPE rdf:RDF PUBLIC "-//DUBLIN CORE//DCMES DTD 2002/07/31//EN" "http://dublincore.org/documents/2002/07/31/dcmes-xml/dcmes-xml-dtd.dtd">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://dublincore.org/">
    <dc:title>Dublin Core Metadata Initiative - Home Page</dc:title>
    <dc:description>The Dublin Core Metadata Initiative Web site.</dc:description>
    <dc:date>2001-01-16</dc:date>
    <dc:format>text/html</dc:format>
    <dc:language>en</dc:language>
    <dc:contributor>The Dublin Core Metadata Initiative</dc:contributor>
    <!-- guesses for the translation of the above titles -->
    <dc:title xml:lang="fr">L'Initiative de métadonnées du Dublin Core</dc:title>
    <dc:title xml:lang="de">der Dublin-Core Metadata-Diskussionen</dc:title>
  </rdf:Description>
</rdf:RDF>
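As a rough sketch of how a BI system might read such metadata, the Python standard library can pull the Dublin Core fields out of an RDF/XML document. The sample below is a shortened copy of the document above (without the DOCTYPE), and `dublin_core_records` is a hypothetical helper name; a production system would more likely use a dedicated RDF parser.

```python
import xml.etree.ElementTree as ET

# Shortened version of the RDF document above (assumption: DOCTYPE omitted).
RDF_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://dublincore.org/">
    <dc:title>Dublin Core Metadata Initiative - Home Page</dc:title>
    <dc:date>2001-01-16</dc:date>
    <dc:format>text/html</dc:format>
    <dc:language>en</dc:language>
  </rdf:Description>
</rdf:RDF>"""

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def dublin_core_records(xml_text):
    """Return a list of (resource URI, {property: [values]}) pairs."""
    root = ET.fromstring(xml_text)
    records = []
    for desc in root.findall("rdf:Description", NS):
        about = desc.get("{%s}about" % NS["rdf"])
        fields = {}
        for child in desc:
            # child.tag looks like "{namespace}localname"
            if child.tag.startswith("{%s}" % NS["dc"]):
                name = child.tag.split("}", 1)[1]
                fields.setdefault(name, []).append(child.text)
        records.append((about, fields))
    return records


for uri, fields in dublin_core_records(RDF_XML):
    print(uri, fields["title"], fields["date"])
```

Values are kept as lists because a property such as `dc:title` may appear more than once, for example once per language.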
83
Semantic Tools: RDFS, OWL, SPARQL
(Shadbolt, 2006)
RDFS (RDF Schema)
• RDFS is an extension to RDF
• Provides the framework to describe application-specific classes and properties
Example:
<rdfs:Class rdf:ID="animal" />
<rdfs:Class rdf:ID="horse">
  <rdfs:subClassOf rdf:resource="#animal"/>
</rdfs:Class>
OWL (Web Ontology Language)
• A family of knowledge representation languages for authoring ontologies
• Expresses and processes information on the web
Example:
Class(a:cat_owner complete intersectionOf(a:person restriction(a:has_pet someValuesFrom (a:cat))))
SubPropertyOf(a:has_pet a:likes)
Class(a:cat_liker complete intersectionOf(a:person restriction(a:likes someValuesFrom (a:cat))))
• Cat owners have cats as pets.
• has_pet is a subproperty of likes, so anything that has a pet must like that pet.
=> Cat owners must like a cat.
SPARQL (an RDF query language)
Example query: What are all the country capitals in Africa?
PREFIX abc: <http://example.com/exampleOntology#>
SELECT ?capital ?country
WHERE {
  ?x abc:cityname ?capital ;
     abc:isCapitalOf ?y .
  ?y abc:countryname ?country ;
     abc:isInContinent abc:Africa .
}
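To show what the SPARQL query above is doing, here is a toy pattern matcher in plain Python. It is not a SPARQL engine: the functions `match` and `query` and the sample facts are assumptions made for illustration, and only the predicate names mirror the query. Terms that start with `?` are variables, and every pattern must agree on the values bound to shared variables.

```python
# Toy data set; predicate names mirror the query, the resource ids are made up.
triples = [
    ("city:Nairobi", "cityname", "Nairobi"),
    ("city:Nairobi", "isCapitalOf", "country:Kenya"),
    ("country:Kenya", "countryname", "Kenya"),
    ("country:Kenya", "isInContinent", "Africa"),
    ("city:Paris", "cityname", "Paris"),
    ("city:Paris", "isCapitalOf", "country:France"),
    ("country:France", "countryname", "France"),
    ("country:France", "isInContinent", "Europe"),
]


def match(pattern, binding, triple):
    """Unify one pattern with one triple; return the extended binding or None."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Bind the variable, or check it against an earlier binding.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new


def query(patterns, facts):
    """Join all patterns, keeping only bindings consistent with every pattern."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for fact in facts:
                result = match(pattern, binding, fact)
                if result is not None:
                    extended.append(result)
        bindings = extended
    return bindings


patterns = [
    ("?x", "cityname", "?capital"),
    ("?x", "isCapitalOf", "?y"),
    ("?y", "countryname", "?country"),
    ("?y", "isInContinent", "Africa"),
]
rows = [(b["?capital"], b["?country"]) for b in query(patterns, triples)]
print(rows)  # [('Nairobi', 'Kenya')]
```

The four patterns correspond one to one to the four triple patterns in the WHERE clause; the shared variables `?x` and `?y` play the role of join keys.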
84
Linked Data for STM Publication
[Figure: linked-data graph for one researcher. Nodes: R. Arlen Price (Faculty); the article “An obesity-related locus in chromosome region 12q23-24”; the journal Diabetes; American Diabetes Association; National Institutes of Health; Ding Li (Author, Student); Faculty Profile with Research Interest (Genetics of Complex Traits, Genetics of Obesity, Behavioral Genetics, Genetic Epidemiology) and Research Techniques (linkage mapping, linkage disequilibrium association analyses, and gene expression profiling); Research Strength. Edge labels: Author, Publication, Funding, Subscribe, Read, Profile, Attend Events, Proposal, Review.]
Linking data helps to serve each researcher’s needs better.
85
Semantic Publishing – Integrate Data in Academic Journals
(Seringhaus, 2007)
Publish machine-readable summary information in XML along with the article.
A BI system can retrieve and organize the metadata.
86
Semantic Publishing – Semantic Enhancement to Research Articles
(Shotton, 2009)
The relevant data can be linked together online.
A BI system can help to retrieve and organize the relationships and data.
87
BI and E-Science
Research is becoming more data-driven and often requires linking data at large scale.
BI can trace the location of data sources, understand the relationships among these academic databases, and provide users with the corresponding data services.
(Luciano, 2007)
Multiple pathway databases are linked to construct the human insulin signaling pathway.
88
References
• Berners-Lee, T., Hendler, J. and Lassila, O. (2001). The Semantic Web. Scientific American, 284(5), 28–37.
• Jonas, J. & Sokol, L. (2009). Data finds data. In Segaran, T. & Hammerbacher, J. (Eds.). Beautiful Data, chapter 7. Sebastopol, CA: O’Reilly Media.
• Luciano, J. and Stevens, R. (2007). e-Science and biological pathway semantics. BMC Bioinformatics, 8(Suppl 3): S3. doi: 10.1186/1471-2105-8-S3-S3.
• Segaran, T. (2009). Connecting data. In Segaran, T. & Hammerbacher, J. (Eds.). Beautiful Data, chapter 20. Sebastopol, CA: O’Reilly Media.
• Seringhaus, M. and Gerstein, M. (2007). Publishing perishing? Towards tomorrow's information architecture. BMC Bioinformatics, 8:17. doi: 10.1186/1471-2105-8-17.
• Shadbolt, N., Berners-Lee, T., and Hall, W. (2006). The Semantic Web Revisited. IEEE Intelligent Systems, 21(3): 96–101.
• Shotton, D., Portwin, K., Klyne, G., and Miles, A. (2009). Adventures in Semantic Publishing: Exemplar Semantic Enhancements of a Research Article. PLoS Comput Biol, 5(4): e1000361. doi: 10.1371/journal.pcbi.1000361.
89
BI and Algorithm by Example
In the US and Canada:
• A list of hospitals
• A list of medical groups
• All with (latitude, longitude)
How to find the nearby hospitals (within 1 mile) for each medical group?
It is too time-consuming to calculate the distance for every combination.
We need to limit the candidates before calculation.
Simple spherical law of cosines formula to calculate distance:
d = R · acos(sin(lat1)·sin(lat2) + cos(lat1)·cos(lat2)·cos(long2 − long1))
where R is the earth’s radius (mean radius = 6,371 km)
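A minimal Python sketch of the formula above, assuming coordinates are given in degrees and the result is wanted in miles. The function name `distance_miles` and the constant name are choices made here; the radius of 3,959 miles is simply the 6,371 km mean radius converted.

```python
import math

EARTH_RADIUS_MILES = 3959.0  # mean radius, about 6,371 km


def distance_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance by the spherical law of cosines (inputs in degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    cos_angle = (math.sin(phi1) * math.sin(phi2)
                 + math.cos(phi1) * math.cos(phi2) * math.cos(dlon))
    # Rounding can push the value slightly outside [-1, 1]; clamp before acos.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.acos(cos_angle) * EARTH_RADIUS_MILES


# One degree of latitude along a meridian is roughly 69.1 miles.
print(round(distance_miles(0.0, 0.0, 1.0, 0.0), 1))  # 69.1
```

Calling this for every hospital and medical group pair is exactly the all-combinations cost the slide wants to avoid, which is why the candidates need to be limited first.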
90
Can We Find a Key?
A database is efficient at judging whether two things are the same, but we need to find a key to compare.
The distance between two points 1° apart at the equator: about 69.1 miles.
NYC: lat 40.714623, long -74.006605
Round(latitude * 70) = 2850
Round(longitude * 70) = -5180
How about using the key ‘2850_-5180’?
The distance between two points 1° apart in longitude depends on latitude, but it is always less than 70 miles, so one cell of this key is never wider than about 1 mile.
Simple spherical law of cosines formula to calculate distance:
d = R · acos(sin(lat1)·sin(lat2) + cos(lat1)·cos(lat2)·cos(long2 − long1))
where R is the earth’s radius (mean radius = 6,371 km, or 3,959 mi)
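The key construction can be sketched as follows. The multiplier 70 and the `lat_long` string format come from the slide; the function names `grid_key` and `build_index` and the sample record layout `(name, lat, lon)` are assumptions for illustration.

```python
from collections import defaultdict


def grid_key(lat, lon, cells_per_degree=70):
    """Round each coordinate to a grid cell and join the two cell numbers."""
    return f"{round(lat * cells_per_degree)}_{round(lon * cells_per_degree)}"


def build_index(points, cells_per_degree=70):
    """Group (name, lat, lon) records by grid key, like an indexed database column."""
    index = defaultdict(list)
    for name, lat, lon in points:
        index[grid_key(lat, lon, cells_per_degree)].append(name)
    return index


print(grid_key(40.714623, -74.006605))  # 2850_-5180
```

With such an index, finding everything in the same cell is an equality lookup on the key instead of a distance calculation against every record.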
91
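Boundary Condition
Two points within the same square are close together, but what about points in adjacent squares?
We produce the 9 keys of the adjacent squares for one group (say hospitals), then compare them with the keys of the other group (say medical groups). The exact distance is then calculated only for the matched candidates.
Because a cell is slightly narrower than 1 mile (and narrower still in longitude away from the equator), pairs close to 1 mile apart can fall two cells apart; a smaller multiplier or a wider neighborhood avoids missing them.
The same approach handles the transition point of longitude, where 179.99 is adjacent to –179.99, provided the longitude key wraps around at ±180°.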
2851,-5181
2851,-5180
2851,-5179
2850,-5181
2850,-5180
2850,-5179
2849,-5181
2849,-5180
2849,-5179
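Putting the pieces together, the following is a hedged sketch of the candidate search, not the author's actual implementation. It assumes records of the form `(name, lat, lon)`, uses tuple keys instead of the `lat_long` string, and confirms each candidate with the exact distance. The `reach` parameter is an addition made here: `reach=1` gives the 9 keys shown above, and a larger value widens the neighborhood for cases where cells are narrower than the search radius. Longitude wrap-around at ±180° is not handled.

```python
import math
from collections import defaultdict

CELLS_PER_DEGREE = 70
EARTH_RADIUS_MILES = 3959.0


def cell(lat, lon):
    """Grid cell of a point as a (row, column) tuple."""
    return (round(lat * CELLS_PER_DEGREE), round(lon * CELLS_PER_DEGREE))


def neighbor_cells(lat, lon, reach=1):
    """The point's own cell plus the surrounding cells (9 cells when reach=1)."""
    row, col = cell(lat, lon)
    return [(row + dr, col + dc)
            for dr in range(reach, -reach - 1, -1)
            for dc in range(-reach, reach + 1)]


def distance_miles(lat1, lon1, lat2, lon2):
    """Spherical law of cosines, inputs in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    cos_angle = (math.sin(phi1) * math.sin(phi2)
                 + math.cos(phi1) * math.cos(phi2) * math.cos(dlon))
    return math.acos(max(-1.0, min(1.0, cos_angle))) * EARTH_RADIUS_MILES


def nearby_pairs(hospitals, groups, max_miles=1.0, reach=1):
    """Return (group, hospital) pairs within max_miles, checking candidates only."""
    index = defaultdict(list)
    for record in groups:
        index[cell(record[1], record[2])].append(record)
    pairs = []
    for h_name, h_lat, h_lon in hospitals:
        for key in neighbor_cells(h_lat, h_lon, reach):
            for g_name, g_lat, g_lon in index.get(key, []):
                if distance_miles(h_lat, h_lon, g_lat, g_lon) <= max_miles:
                    pairs.append((g_name, h_name))
    return pairs


# Made-up sample data around the NYC point used on the slide.
hospitals = [("Hospital X", 40.714623, -74.006605)]
groups = [("Group A", 40.7220, -74.0000), ("Group B", 40.7600, -73.9800)]
print(nearby_pairs(hospitals, groups))  # [('Group A', 'Hospital X')]
```

Group A sits in the cell directly north of the hospital's cell, so it is only found because the neighboring keys are generated; Group B is several cells away and is never compared.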
92
References
• Movable Type Ltd. Calculate distance, bearing and more between Latitude/Longitude points. Retrieved from http://www.movable-type.co.uk/scripts/latlong.html
93
Future Plan
• More on Data Mining
• More on Data Modeling
• BI and User Experience
• BI and Predictive Analysis
• BI and Technology Intelligence
94
Thank You
• Please send your comments, suggestions and discussion to [email protected]
• The file will be updated at: http://www.slideshare.net/dingli2/