Mining the Database Administration data | Stack Exchange

Business Intelligence and Big Data Analytics Project The case of Stack Exchange - Data Administration Lamprini Koutsokera [email protected] Alexandros Lattas [email protected]

Transcript of Mining the Database Administration data | Stack Exchange

Page 1: Mining the Database Administration data | Stack Exchange

Business Intelligence and Big Data Analytics Project: The case of Stack Exchange - Data Administration

Lamprini Koutsokera

[email protected]

Alexandros Lattas

[email protected]

Page 2: Mining the Database Administration data | Stack Exchange

Working Space

Page 3: Mining the Database Administration data | Stack Exchange

Data Acquisition

41.779 Posts

22.390 Users

123.697 Posts History

69.185 Comments

148.425 Votes

42.127 Badges

XML to CSV Converter (online tool)

447.603 rows in total
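The slides mention an online XML to CSV converter. As a rough local alternative, the same conversion can be sketched with the Python standard library. The sketch assumes each table is an XML file whose records are `<row .../>` elements carrying their values as attributes; the function name `xml_rows_to_csv` and the sample records are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET


def xml_rows_to_csv(xml_text: str) -> str:
    """Convert <row .../> elements into CSV text, one column per attribute."""
    root = ET.fromstring(xml_text)
    rows = [dict(row.attrib) for row in root.iter("row")]

    # Attributes may be missing on some rows, so build the header from the
    # union of all attribute names, in first-seen order.
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical sample, not taken from the real data.
sample = """<users>
  <row Id="1" DisplayName="alice" Reputation="101" />
  <row Id="2" DisplayName="bob" Age="30" />
</users>"""

csv_text = xml_rows_to_csv(sample)
print(csv_text)
```

Missing attributes become empty cells, which is one reason a cleansing step is needed afterwards.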

Page 4: Mining the Database Administration data | Stack Exchange

Data Cleansing - Adjustment

Comments & Post History & Posts: Users without Id but with Display Name -> Guest Users

Post History: Users without Id & Display Name -> 10.039 rows deleted

Votes -> 12.207 rows deleted

Badges -> 213 rows deleted -> 73 distinct badges remained

Primary & Foreign keys

5% of data deleted

Varchars to Numerics

postHistoryTypes | postTypes | voteTypes

age | reputation | views

Tables/dimensions creation

(1)

(2)

(3)
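The guest-user rule and the row deletion described on this slide could look roughly like the sketch below. The column names `UserId` and `UserDisplayName`, the `UserKind` label and the sample rows are assumptions made for illustration; the slides do not show the actual implementation. The last lines only re-check the "5% of data deleted" figure from the row counts given on the slides.

```python
def label_users(rows, id_key="UserId", name_key="UserDisplayName"):
    """Label guest users and drop rows that reference no user at all.

    rows: list of dicts, one per record (hypothetical column names).
    Returns (kept_rows, number_of_deleted_rows).
    """
    kept, deleted = [], 0
    for row in rows:
        if row.get(id_key):
            kept.append({**row, "UserKind": "registered"})
        elif row.get(name_key):
            # No Id but a display name: keep the row as a guest user.
            kept.append({**row, "UserKind": "guest"})
        else:
            # Neither Id nor display name: the row is deleted.
            deleted += 1
    return kept, deleted


# Hypothetical sample rows.
sample = [
    {"Id": "1", "UserId": "42"},
    {"Id": "2", "UserDisplayName": "user1234"},
    {"Id": "3"},
]
kept, deleted = label_users(sample)
print([row["UserKind"] for row in kept], deleted)

# Deleted rows reported on the slide, relative to the 447.603 acquired rows.
share_deleted = (10_039 + 12_207 + 213) / 447_603
print(f"{share_deleted:.1%}")
```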

Page 5: Mining the Database Administration data | Stack Exchange

Star - Snowflake Schema

Fact Metrics

• Total Comment Score
• Posts Edits
• Users Participated
• Score
• View Count
• Answer Count
• Comment Count
• Favorite Count
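To make the fact/dimension split concrete, here is a minimal star-schema sketch using an in-memory SQLite database. The table and column names (`fact_post`, `dim_post_type`, `dim_user`) and all inserted values are hypothetical; the slides do not show the schema definition or the database product used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_post_type (post_type_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_user (
    user_id INTEGER PRIMARY KEY, age INTEGER, reputation INTEGER, views INTEGER
);
-- The fact table holds the numeric metrics plus keys into the dimensions.
CREATE TABLE fact_post (
    post_id INTEGER PRIMARY KEY,
    post_type_id INTEGER REFERENCES dim_post_type(post_type_id),
    owner_user_id INTEGER REFERENCES dim_user(user_id),
    score INTEGER, view_count INTEGER, answer_count INTEGER,
    comment_count INTEGER, favorite_count INTEGER
);
""")

# Hypothetical rows, for illustration only.
conn.executemany("INSERT INTO dim_post_type VALUES (?, ?)",
                 [(1, "Question"), (2, "Answer")])
conn.executemany("INSERT INTO dim_user VALUES (?, ?, ?, ?)",
                 [(10, 29, 500, 40), (11, 41, 90, 7)])
conn.executemany("INSERT INTO fact_post VALUES (?, ?, ?, ?, ?, ?, ?, ?)", [
    (100, 1, 10, 5, 300, 2, 1, 1),
    (101, 2, 11, 3, 0, 0, 0, 0),
    (102, 1, 11, 1, 120, 0, 2, 0),
])

# Aggregate two fact metrics along the post-type dimension.
totals = conn.execute("""
    SELECT t.name, SUM(f.score), SUM(f.view_count)
    FROM fact_post AS f
    JOIN dim_post_type AS t USING (post_type_id)
    GROUP BY t.name
    ORDER BY t.name
""").fetchall()
print(totals)
```

Grouping the fact rows by a dimension attribute, as in the final query, is the same operation a cube performs when a measure is sliced by a dimension.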

Page 6: Mining the Database Administration data | Stack Exchange

Cube Creation

Dimensions: Users (Age, Reputation, Views), Badge Types, Post Types, Post History Types, Creation Date, Votes Types and Tags

Measurements

Bridge Tables: Posts, Post History, Bridge Tags, Votes, Badges

Fact Table + Posts

Posts

Bridge Tags Tags Post History Post History Types

Votes Votes Types

Users Badges Badges Types

Dimension Usage

Page 7: Mining the Database Administration data | Stack Exchange

Stack Exchange in Metrics

Top 10 Tags

Wednesday 3:00 p.m.

Age Group 25-34
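If the "Wednesday 3:00 p.m." figure is read as the busiest posting slot (an interpretation; the slide does not say how it was derived), it could be computed from post creation timestamps along these lines. The ISO-style timestamp format and the sample values are assumptions.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def busiest_slot(creation_dates):
    """Return the (weekday name, hour) pair with the most posts."""
    slots = Counter()
    for text in creation_dates:
        ts = datetime.fromisoformat(text)
        slots[(WEEKDAYS[ts.weekday()], ts.hour)] += 1
    return slots.most_common(1)[0][0]


# Hypothetical timestamps, not taken from the real data.
sample_dates = [
    "2014-01-08T15:10:00.000",
    "2014-01-15T15:45:12.500",
    "2014-01-09T09:00:00.000",
]
slot = busiest_slot(sample_dates)
print(slot)
```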

Page 8: Mining the Database Administration data | Stack Exchange

Posts through months


Page 9: Mining the Database Administration data | Stack Exchange

Posts through countries

United States: 3.525 posts

India: 1.648 posts

United Kingdom: 1.857 posts

Canada: 1.473 posts

Page 10: Mining the Database Administration data | Stack Exchange

Data Transformation

postid  firebird  checkpoint  warning  oracle-apex  aggregation  subquery
16956   0         0           0        1            0            0
21733   0         0           0        0            0            0
35756   0         0           0        0            0            0
44484   1         0           0        0            0            0
43484   0         0           0        0            0            0
40422   0         0           0        0            0            0
44726   0         0           0        0            0            0
35932   0         0           0        0            0            1

13.608 Posts – 694 Tags

Tag separation into distinct words

<sql-server><aggregation>
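The transformation from a tag string such as `<sql-server><aggregation>` to a 0/1 matrix with one column per tag can be sketched as follows. The function names, the post ids and the second tag string are hypothetical; only the `<sql-server><aggregation>` example comes from the slide.

```python
import re

TAG_PATTERN = re.compile(r"<([^<>]+)>")


def split_tags(tag_string):
    """'<sql-server><aggregation>' -> ['sql-server', 'aggregation']"""
    return TAG_PATTERN.findall(tag_string)


def one_hot(posts):
    """posts: dict mapping post id -> tag string.

    Returns (columns, rows): the sorted tag names and, per post id,
    a list of 0/1 flags aligned with those columns.
    """
    parsed = {post_id: split_tags(tags) for post_id, tags in posts.items()}
    columns = sorted({tag for tags in parsed.values() for tag in tags})
    rows = {
        post_id: [1 if column in tags else 0 for column in columns]
        for post_id, tags in parsed.items()
    }
    return columns, rows


# Hypothetical post ids; the second tag string is made up.
posts = {1: "<sql-server><aggregation>", 2: "<mysql><subquery>"}
columns, rows = one_hot(posts)
print(columns)
print(rows)
```

With 694 tags the resulting matrix is very sparse, as the sample rows on the slide already suggest.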

Page 11: Mining the Database Administration data | Stack Exchange

Data Mining

Clustering

Scalable EM

30% testing set – 70% training set

default number of clusters: 10

Association Rules

min. support 0.01

min. confidence 0.1
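To show what the minimum support and minimum confidence thresholds mean, the sketch below computes both measures for rules between pairs of tags by brute-force counting. This is not the algorithm or tool used in the project (the slides do not name one for association rules); the function name `pair_rules` and the sample transactions are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.01, min_confidence=0.1):
    """Return rules (lhs, rhs, support, confidence) between pairs of tags.

    support    = share of posts containing both tags
    confidence = posts containing both tags / posts containing lhs
    """
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for tags in transactions:
        unique = sorted(set(tags))
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair yields two candidate rules, one per direction.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


# Hypothetical transactions: the tag set of each post.
sample = [
    ["mysql", "replication"],
    ["mysql", "performance"],
    ["mysql", "replication", "backup"],
    ["sql-server", "backup"],
]
# A high support threshold is used here only because the sample is tiny.
rules = pair_rules(sample, min_support=0.5, min_confidence=0.1)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

The thresholds on the slide (0.01 and 0.1) are low, which is plausible for a sparse matrix where most tags appear in only a small share of posts.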

Page 12: Mining the Database Administration data | Stack Exchange

Cluster Mapping – Posts View

13.608 Posts

One row per cluster shown on the slide:

Score   Edits    Views       Favorites  Users participated
3.343   6.556    1.035.024   609        8.847
8.700   13.654   1.695.060   1.065      20.637
7.999   12.364   2.067.306   1.028      19.521
6.436   15.655   2.818.903   1.391      18.741
5.078   7.016    948.036     1.038      11.936
3.294   6.939    1.538.607   497        8.914

Page 13: Mining the Database Administration data | Stack Exchange

Cluster Mapping – Users View

6.534 Users

One row per cluster shown on the slide:

Badges  Reputation  Views    Upvotes  Downvotes  Age group
11.347  475.314     42.600   56.657   2.907      25-34
29.844  1.605.644   131.913  205.183  9.812      25-34
27.052  1.355.876   128.337  177.444  6.503      25-34
25.612  1.005.826   81.750   75.049   2.308      25-34
13.754  709.640     55.846   90.959   3.421      25-34
21.083  1.332.268   81.289   163.349  6.008      25-34

Page 14: Mining the Database Administration data | Stack Exchange

Association Rules

backup

sqlserver

index

mysql

replication

performance

optimization

database-design

Page 15: Mining the Database Administration data | Stack Exchange

Map Reduce

Cleansing

XML Files: Posts & Users

(&).*?(;)

^((?!AboutMe=).)*$

Reducer

Mapper #1

Mapper #2
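The two regular expressions on this slide can be read as follows: `(&).*?(;)` matches entity-like spans such as `&amp;`, and `^((?!AboutMe=).)*$` matches whole lines that do not contain `AboutMe=`. The sketch below applies them and then runs a word count in map/reduce style. The slides show two mappers and one reducer without describing them, so the single mapper, the word-count logic, the function names and the sample lines are all assumptions.

```python
import re
from collections import Counter

ENTITY = re.compile(r"(&).*?(;)")                 # pattern from the slide
NO_ABOUT_ME = re.compile(r"^((?!AboutMe=).)*$")   # pattern from the slide
ABOUT_ME_VALUE = re.compile(r'AboutMe="([^"]*)"')  # assumed attribute layout


def cleanse(lines):
    """Drop lines without AboutMe= and blank out entity-like spans."""
    for line in lines:
        if NO_ABOUT_ME.match(line):
            continue
        yield ENTITY.sub(" ", line)


def mapper(line):
    """Emit (word, 1) for every word in the AboutMe attribute value."""
    found = ABOUT_ME_VALUE.search(line)
    text = found.group(1) if found else ""
    for word in re.findall(r"[a-z#+]+", text.lower()):
        yield word, 1


def reducer(pairs):
    """Sum the counts emitted by the mapper per word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts


# Hypothetical lines in the style of a users XML file.
sample_lines = [
    '<row Id="1" AboutMe="Java &amp; SQL developer" />',
    '<row Id="2" AboutMe="PHP developer, Linux" />',
    '<row Id="3" DisplayName="x" />',
]
pairs = (pair for line in cleanse(sample_lines) for pair in mapper(line))
counts = reducer(pairs)
print(counts.most_common(3))
```

In a real Hadoop-style job the mapper and reducer would run as separate tasks over file splits; here they are plain functions chained in one process to keep the sketch self-contained.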

Page 16: Mining the Database Administration data | Stack Exchange

Map Reduce Results

Posts (Body)

• Key
• Value
• Default
• Clustering
• Slave
• Physical
• Node
• Logging
• Relationship
• C
• Dynamic
• Language

Tags’ description enhancement

DBs’ problem solving

Graph DBs

Programming Languages

Visualization

Users (About Me)

• Developer
• Software
• Web
• Programming
• Server
• Engineer
• SQL
• Java
• C#
• PHP
• Microsoft
• Linux

Users’ background exploration

Skills Knowledge

Interests Knowledge

Job recommendation

Posts further analysis: “without”

• Without Time Zone
• Without Restarting
• Without using SQL

Timestamp type without losing timezone information.

Related to Oracle and PostgreSQL. MySQL automatically deals with it.

Page 17: Mining the Database Administration data | Stack Exchange

Practical Implications

Insights for Solutions & Improvements

Targeted Marketing actions per DB Product

Insights on customer behavior per DB Product

Improve the SE data-driven decision-making process

Improve descriptive tags quality

Page 18: Mining the Database Administration data | Stack Exchange