Data Mining User Behavior

download Data Mining User Behavior

of 9

Transcript of Data Mining User Behavior

  • 8/12/2019 Data Mining User Behavior

    1/9

    Data Mining User Behavior

    Tom Wilson

    1 Introduction

    In order to test a developed system, we must understand how the systems users will interact with it. Someuser behaviors that we will be investigating are (1) user arrival rate, (2) session duration, and (3) think time. Onebehavior that we will not be investigating here is workload, which is the mix of the activities that the users chooseto perform. For the data that we will be looking at, we will use both science and art in our analyses.

    We will use visualization to assist in communicating the analyses. A few graphs will use interval sampling tosummarize large amounts of data. Interval sampling consists of counting events during a time interval. It has itslimitations ([Wil10a]), but often is sufficient in helping us see how data are organized. Some graphs will summarizedata with boxplots. Boxplots have several useful features: (1) the top and bottom of the box correspond to the75th and 25th percentiles, (2) the median, or 50th percentile, is shown by a line within the box, (3) values in the

    interquartile range are represented above and/or below the box by a whisker (a dashed line, which can be up to1.5 times the size of the box), and (4) potential outliers are shown as points above and/or below the whiskers. Inthis paper, boxplots are clipped near the upper whisker so we can see the details of the box better. The boxplotquickly shows us how much the data are clustered or spread. We will also plot the mean when drawing the boxplot.

    Histograms are also a graphical method of summarizing data, but suffer from being data sensitive. Our presen-tation of a histogram is modified slightly in that it is drawn using a line rather than a series of bars because ourchoices for bins often result in small widths.1 So only the heights of the bins are shown as the points on the linethey create. We will then add the mean and the 25th, 50th, and 75th percentiles to the histogram in order to relateit to the boxplot. In this paper, the distribution is clipped like the boxplot so as not to compress the majority of thedata.

    As we analyze each data set, we will comment on some of its content. Such discussion usually focuses on theextremes (why are some values very small or very large) since we should question if these values are valid. Hopefully,the analyses presented here can strengthen your analysis.

    2 Source Data

    The proprietary system providing example data supports logistics and maintenance for military equipment. Forthe purpose of anonymity, some general details are provided and some specific details have been changed. Minordetails of the system are scattered throughout the discussion.

    Figure1 graphs the number of users and logins over an 8-month period. Both graphs have a similar pattern of a5-day series of peaks (i.e., the work week) followed by a 2-day extended valley (i.e., the weekend). The login graphhas several sharp spikes that are probably not a function of time. Besides a steady growth in each parameter overthe interval, a couple of small dips occur that are a result of holiday periods (months 2 and 6). We will take a closerlook at these two metrics before turning our attention to the three user behaviors that we are interested in.

    2.1 User CountsUsers access the system as part of their jobs with the bulk of the users working during a first-shift weekday

    concept. So, user load is highly influenced by the time of day, the day of the week, and seasonal events (e.g.,holidays). Figure2ashows the minimum, average, and maximum number of users per hour for every hour of eachday of the week. The statistics are computed for the entire 8-month interval. The weekday minimums are certainlyaffected by holidays; if holidays were excluded, the minimums would be about 100 higher.

    Figure2bshows a boxplot of the same data, but constrains the day of the week to Tuesday. The limited intervalallows more information to be presented clearly. As before, eliminating holidays would drive the minimums up by

    Published in CMG MeasureIT, September 2010.1The resulting graph looks like a density plot, but it is not.

    1

  • 8/12/2019 Data Mining User Behavior

    2/9

    Month

    1 2 3 4 5 6 7 8

    sers

    0

    500

    1,000

    1,500

    Month

    1 2 3 4 5 6 7 8

    ogins

    0

    500

    1,000

    1,500

    Figure 1: The charts show the number of users (top) and the number of logins (bottom) over an 8-month period. Thecharts use the same y-axis limits for better visual correlation. It is apparent that several login spikes have occurred.

    Users by Hour of Day of Week

    Day of Week

    Users

    0

    100

    200

    300

    400

    500

    600

    700

    Sun Mon Tue Wed Thu Fri Sat

    MaxAvgMin

    (a)

    0 2 4 6 8 10 1 2 1 4 1 6 1 8 2 0 2 2

    0

    100

    200

    300

    400

    500

    600

    Users by Hour of Day (Tuesday)

    Hour

    Users

    (b)

    Figure 2: Chart (a) shows the maximum, average, and minimum number of users for every hour of the week. Thesummary contains 8 months of data. The chart suggests that the number of users is sensitive to the time of day and dayof week. Chart(b) shows a boxplot of the users for every hour of all Tuesdays. Means are added to the boxplot with aline connecting them.

    about 100. The location of the mean below the median (during most hours) suggests that the distribution is leftskewed. We did not plot this distribution because it is mostly uninteresting.

    It is quite easy to see which hours constitute the workday for most users (i.e., 6 to 16). The typical lunch-time

    2

  • 8/12/2019 Data Mining User Behavior

    3/9

    lull is also present, giving a double-hump appearance to most descriptive statistics on the graph.

    2.2 Login Counts

    Initially, we might think that login and user counts are equivalent, but they are not. A user can log in severaltimes during a sampling interval, yet only count as one user (this assumes that a user cannot be logged in multipletimes). Likewise, a user can persist over several sampling intervals without logging in during any of the intervals.So, we expect only a moderate-to-high correlation. This was demonstrated in [Wil10b]. Users can log in and out

    frequently because, for many users, most of their jobs consist of work away from the system. Limited numbers ofterminals may also be a factor.

    Figure3a shows the number of logins per hour for each day and hour of the week. These data contain holidays,which are likely to impact weekday minimums. The spikes (large maximums) are notable since they are significantlyhigher than the averages. A downtime event is one possible explanation since, when the system is restored, manyusers are waiting to login at the same time. Such possible outliers could be removed, but they truly are valid datawhen our explanation is correct.

    Logins by Hour of Day of Week

    Day of Week

    Logins

    0

    500

    1,000

    1,500

    Sun Mon Tue Wed Thu Fri Sat

    MaxAvg

    Min

    (a)

    0 2 4 6 8 10 12 14 16 18 20 22

    Logins by Hour of Day (Tuesday)

    Hour

    0

    500

    1,000

    1,500

    Logins

    (b)

    Figure 3: Chart (a) shows the maximum, average, and minimum number of logins for every hour of the week. Thesummary contains 8 months of data. The chart suggests that the number of logins is sensitive to the time of day and dayof week. Chart(b) shows a boxplot of the logins for every hour of all Tuesdays.

    Figure3bshows only the Tuesday login data as a boxplot. The spikes now appear as circles and are significantlydistant from the whiskers. They are also few in number.

    2.3 Interarrival Time Data

    Interarrival time is computed from the login times. The time between successive logins (for any users) defines aninterarrival time. There are almost 325,000 interarrival times in the source data. As illustrated in Figure3, thesetimes appear to be sensitive to the time of day and day of week, although no correlation has been performed. Otherfactors, such as holidays, are also important.

    The Original column of Figure4ashows the distribution of the interarrival times using a boxplot and Figure 4bshows the distribution using a (line) histogram. It appears from the histogram that the data are exponentiallydistributed. Because the data are right-skewed, the median might be a better location estimator than the mean.There is a slight bulge at the median; we cannot offer an explanation for this.

    3

  • 8/12/2019 Data Mining User Behavior

    4/9

    Original Filtered

    0

    10

    20

    30

    40

    50

    60

    Interarrival Time Boxplot

    Minutes

    Mean

    Mean

    (a)

    0 10 20 30 40 50 60

    Interarrival Time (Line) Histogram

    Minutes

    Frequency

    0%

    2%

    4%

    6%

    8%

    25th Pct.

    Median

    75th Pct.

    Mean

    (b)

    Figure 4: Chart(a) shows a boxplot of the interarrival times for both the original and filtered data, while chart (b)shows the original data as a (line) histogram. Each chart also shows important statistics such as the median and mean.

    During heavily-loaded time periods, users can arrive very closely together. This contributes to the large numberof small values. During lightly-loaded time periods, there can be a signficant gap between users arriving. An eventthat brings the system down will likely result in a large cluster of logins as soon as the system is back up. If so, thenumber of such events is small.

    So, what do the data look like if we consider the busier hours of workdays? We will define these data as thosefor hours 6 through 16 of weekdays that are not holidays, and refer to them as the filtereddata. How do important

    statistics change? Figure4ashows the boxplot of the filtered data. The change in most descriptive statistics is smallbut apparent. The mean has changed considerably. Approximately 28,000 times were eliminated by the filteringcriteria. The histogram for the filtered data is not shown since it is so similar to the original data.

    If we were going to simulate users arriving, we would use the interarrival times to create a distribution model.However, our testing did not include any test of a long enough duration to use this information.

    2.4 Session Duration Data

    Session duration measures the amount of time that a user is logged into the system. This is the time betweenthe log in and log out. We should not associate any level of activity with the measurement. It is very feasible tohave one or more periods of inactivity during the session. Sessions can easily span several days if time-outs do notterminate sessions, which is the case here. There are over 400,000 times in the source data.

    The Original column of Figure 5ashows the distribution of the session durations using a boxplot. Figure5b

    shows the distribution using a (line) histogram. It is appears from the histogram that the data are exponentiallydistributed. As before, because the data are right-skewed, the median might be a better location estimator than themean.

    We would like to eliminate large session durations since this is unlikely to represent someone continuously workingon the system. So, any session longer than 10 hours was eliminated. This only reduces the data set by about 8,500times. The boxplot of the filtered data are shown in Figure5a. As before, there is a slight change in most of thedescriptive statistics, except for the mean, which changed significantly. Also, the corresponding histogram is notshown because of its similarity to the original graph.

    Many questions could be asked about the session duration data. We do not have answers to many of thesequestions since we did not perform any tests requiring a session duration to be computed. We will mention themnonetheless.

    4

  • 8/12/2019 Data Mining User Behavior

    5/9

    Original Filtered

    0

    20

    40

    60

    80

    100

    120

    Session Duration Boxplot

    Minutes

    Mean

    Mean

    (a)

    0 20 40 60 80 100 120

    Session Duration (Line) Histogram

    Minutes

    Frequency

    0%

    2%

    4%

    6%

    8%

    10%

    25th Pct.

    Median

    75th Pct.Mean

    (b)

    Figure 5: Chart(a) shows a boxplot of the session durations for both the original and filtered data, while chart (b)shows the original data as a (line) histogram. Each chart also shows important statistics such as the median and mean.

    Is session duration sensitive to the time of day, day of week, or seasonal periods? Any result becomes difficult tointerpret because of the long sessions that are mainly idle. In order to assess productivity of a user, how long of anidle period should be allowed? It may be sufficient to shorten the session duration by subtracting sufficiently longidle periods. This would give a better estimate for the rate of work (e.g., transactions/hour). Long idle periods cansignificantly lower this number. A highly-detailed, long-term test could allow idleness with a long think time, butthis is rarely the objective of running a test. There is some merit to the perspective since an idle session still ties up

    some resources.What value do extremely short sessions have? A session of 0.5 seconds hardly seems worth implementing: Log

    in then log out! This is easy to implement and does affect some resources, but do we really want such sequencesto happen with a high frequency? The last question seems to imply that the exponential distribution might not bethe desired distribution. A lognormal distribution has a similar right-skew property, but can significantly reduce thefrequency of extremely small times. We are not promoting this, but it could be studied further. The data say thatthere are a lot of them. Are they anomalies?

    Other difficult questions are: Is session duration sensitive to user count or system load? Are sessions lengthenedbecause it is taking longer to get responses, which is causing users to multitask away from the terminal? Are sessionsshortened because users are frustrated with response times and resolve to come back at a later time?

    2.5 Think Time Data

    Think time is the time the user spends between tasks. It might be possible to categorize the times based uponthe activity being performed. Examples include: selecting from a menu, entering data into a text field, reviewing alist of results before selecting one or more of them, navigating to a completely new task. Significant instrumentationis required to provide this resolution. What cannot be determined is whether or not the user is truly interacting withthe system. As with session duration, idle times impact think times. Thesholds could be established for eliminatingoutliers that probably contain idle intervals.

    There are over 13 million think times in the source data. Almost 110,000 times are less than or equal to 0.Negative think times can occur when two or more transactions are generated by the same user action. In such cases,the earlier one finishes after the next one has already started. Sometimes, a user submits another transaction beforethe previous has finished. This is often due to a slow response time, but can also be a result of abandoning theprevious transaction due to change of mind. We have chosen to filter these times out of our analysis since they

    5

  • 8/12/2019 Data Mining User Behavior

    6/9

    cannot be used in a test. We refer to these data as Filtered 1.Very short think times greater than 0 can be a result of anticipation. Because of familiarity with the application,

    the user does not review the result and quickly submits the next transaction to move on. We do this when wemanually batch processdoing the same thing over and over again. These times are valid.

    The Filtered 1 column of Figure6ashows the distribution of the think times using a boxplot. Figure6b showsthe distribution using a (line) histogram. It appears from the histogram that the data have the shape of a lognormaldistribution (although we did not fit the data to this distribution). Because the data are right-skewed, the medianmight be a better location estimator than the mean.

    Filtered 1 Filtered 2

    0

    50

    100

    150

    Think Time Boxplot

    Seconds

    Mean

    Mean

    (a)

    0 50 100 150

    Think Time (Line) Histogram

    Seconds

    Frequency

    0%

    1%

    2%

    3%

    4%

    5%

    6%

    25th Pct.

    Median

    75th Pct.

    Mean

    (b)

    Figure 6: Chart(a) shows a boxplot of the think times for both filtered data sets, while chart (b) shows the same data

    as a (line) histogram. Each chart also shows important statistics such as the median and mean.

    If we were to conduct a short test, we would not want very large times. We removed times over an hour andwere still left with over 13 million times. The Filtered 2 column of Figure 6a shows the distribution of these thinktimes using a boxplot.

    3 Modeling Think Time

    The point of doing analysis is to either make decisions based on it or feed it into a model. In the cases ofinterarrival time and session duration, the analyses were just exercises since we did not perform tests that could usethe results. However, the think time analysis was used in our performance testing.

    Our testing is for a new system, while the data we have is from its predecessor system. Each generation adds new

    functionality and supports more users. We expect the user to behave in a consistent manner on the new system. So,the think time data are applicable.

    More fidelity is needed when modeling think times. The time spent thinking varies for different actions. Insertinga random time from one distribution for all actions is not accurate. We need a think times distribution for eachaction. Table1 lists the actions pertinent to our system. We have chosen to associate a range with each action, andbase the think times for each action on a normal distribution. The testing tool can generate uniformly distributedrandom numbers, and we can use those generated numbers to index a table containing the normally distributedrange. The table can be updated without any impact to the scripts.

    Understanding the think time types is not critical, but we will give a short description of each. Tabis used whenthe user selects one of several tabs, each leading to a different dialog on the screen. Menu is used when the userselects an item from a pull-down menu. Search Options is used when the user selects or checks options associated

    6

  • 8/12/2019 Data Mining User Behavior

    7/9

    Table 1: Think Time Types and Ranges (in Seconds)

    Type Min Max

    1 - Tab 1 52 - Menu 3 93 - Search Options 5 154 - View 8 245 - Update 10 30

    6 - Search Results 15 457 - Create 15 608 - Inter-task 30 90

    with searching. Viewis used when the user looks at data on various screens besides search results. Update is usedwhen the user modifies retrieved data. Search Resultsis used when the user reviews results from a search. Createisused when the user enters new data. Inter-taskis used between iterations of a script.

    Each think time type has a table of 100 values. The distribution of each table is shown in Figure7a. The hope isthat, when all of the times are collected, a distribution results that compares to the original data. This is difficult topredict without knowing the frequency of each type. Those frequencies could be computed if all of the scripts weredone.

    0 20 40 60 80

    Think Times by Type

    Seconds

    Frequency

    0%

    10%

    20%

    30%

    40%

    50%TabMenuSearch OptionsViewUpdateSearch ResultsCreateIntertask

    (a)

    0 20 40 60 80 100 120

    Think Time Comparison

    Seconds

    Frequency

    0%

    1%

    2%

    3%

    4%

    5%

    6%

    7%

    Actual

    Test

    (b)

    Figure 7: Chart(a) shows the think time distributions for each type. Each distribution is normal with the intent thatthe accumulation will result in a lognormal distribution. Chart(b)shows the filtered source think time data and the

    filtered test think time data for comparison using (line) histograms.

    After a test is run, think times are computed from the transactions. We encountered issues in computing thinktimes with the new system similar to those encountered with the current system. False times result from twoconsecutive transactions that are generated by one user action. However, the transactions do not overlap, and thetime between the transactions is small (less than 1 second). So, these times are filtered out.

    Figure 7b shows a comparison between the source data, filtered to include only times less than 125 seconds,and the filtered test data. In the test data, times can be larger than 90 seconds because some scripts contain twoconsecutive think times. One think time might be a view type; the second is always an inter-task type. The resultsare close at a high level. The times in Table 1 could be modified to improve the overall shape, but this task will bepostponed until the entire workload is present (i.e., do not waste time tuning an incomplete system).

    7

  • 8/12/2019 Data Mining User Behavior

    8/9

    4 Use R!

    The graphs in this paper were generated with R ([R D09]), although they could be created with another tool,such as Microsoft Excel. Since some non-standard R graphs were created, we thought we would demonstrate howthey were made. Script1shows an R script that generates a figure similar to Figure 4(only the original data areplotted in the boxplot). Basic R facilities are used, but extra work is done to make the graphs more presentable.Similar scripts generate Figure5 and Figure6.

    Script 1 Interarrival Time R Script

    # read the data from a CSV file. the column were interested in is "Duration"

    w = read.csv("interarrival_times.csv")

    m = mean(w$Duration) # compute the mean to plot on the graphs

    b = boxplot(w$Duration, plot=FALSE) # create a boxplot without drawing it

    # compute the y-axis limit. b$stats[5] is the top whisker of the boxplot

    ymax = max(b$stats[5], m)

    n = ceiling(ymax) # convert the limit to an integer for the histogram

    # create a histogram without drawing it. the "breaks" run from 0 to n with an

    # additional break created at the largest value. were not going to plot

    # anything in the last interval

    h = hist(w$Duration, breaks=c(0:n, ceiling(max(w$Duration))), plot=FALSE)

    x = h$breaks[2:(n+1)]y = h$intensities[1:n]

    # compute where the 25th, 50th, and 75th percentiles and the mean fall on the

    # histogram. approx does the math so that the points are on the line

    pts = approx(x, y, xout=c(b$stats[2:4], m))

    #

    # draw the boxplot

    #

    pdf("interarrival_time_boxplot.pdf") # create a PDF file

    # draw the previously computed boxplot (b), but limit the y-axis so that the

    # whisker and mean can be drawn. since there are so many points above the

    # whisker, suppress the points by using "NA_integer_". this prevents the PDF

    # file from becoming large

    bxp(b, ylim=c(0, ymax), ylab="Minutes", las=1, pch=NA_integer_,

    main="Interarrival Time Boxplot")

    # "fake" the points above the whisker by drawing a thick line -- looks the same

    lines(c(1, 1), c(b$stats[5], par("usr")[4]), lwd=7)

    points(m, pch=23, bg="yellow") # plot the mean. pch=23 is a filled diamond

    dev.off() # close the file

    #

    # draw the (line) histogram

    #

    pdf("interarrival_time_distr.pdf") # create a PDF file

    # plot the histogram as a line. the last bin is not drawn. suppress the y-axis

    plot(x, y, type="l", yaxt="n", xlab="Minutes", ylab="Frequency",

    main="Interarrival Time (Line) Histogram")yticks = axTicks(2) # compute the standard y-axis ticks

    # draw the y-axis, but make it prettier by using a 0%-100% format

    axis(2, at=yticks, labels=sprintf("%d%%", yticks*100), las=1)

    # plot the percentiles and mean. pch=23 is a filled diamond

    points(pts, pch=23, bg=c("red", "red", "red", "yellow"))

    # print text next to the points

    text(pts, labels=c("25th Pct.", "Median", "75th Pct.", "Mean"),

    pos=c(4, 4, 4, 3), offset=1)

    dev.off() # close the file

    8

  • 8/12/2019 Data Mining User Behavior

    9/9

    R provides the facility to compute the data for a graph (e.g., boxplot or histogram) without drawing it. Thedata can then be further manipulated or used to improve the graphs appearance. In Script1, the boxplots upperwhisker is compared to the mean so that the larger determines the upper limit of the graph. This same cut-off pointis used to limit the histogram. The histogram actually has one more big bin to hold all of the data that is off thechart. The histograms breaks provide the x-values, while the intensities provide the y-values. These valueswhich define the histogram line are also used to determine where the predictive statistics fall since the x-values willprobably not be integers.

    The boxplot contains a large number of values that exceed the upper whisker. There are so many that they

    appear as a thick line. To save space in the generated file, the points are suppressed and are simulated by a thickline. The y-axis of the histogram is labeled with formatted percentages rather than decimal numbers less than 1. Incontrast to the next example, these percentages fit in the space provided and can be drawn with the axiscommand.

    Figure 1 and Figure 3 create a pretty y-axis with a comma separator for numbers over 1000 and with thenumbers written horizontally. Script2 is an excerpt of a script that demonstrates how to create the axis tick labels.The main problem that is solved is avoiding the y-axis label colliding with the tick labels. Two things must be doneto overcome this: The margin must be made larger and the axis label must be written as margin text (i.e., it cannotbe created by the plotcommand or the axiscommand).

    Script 2 R Script Excerpt Demonstrating Other Techniques

    #num_hours, maxs, avgs, and mins are set earlier in the real script

    par(mai=c(1.02, 1.02, 0.82, 0.42)) # set the margins

    plot(c(0, num_hours), c(0, max(maxs)), type="n", ylab="", xlab="Day of Week",

    main="Users by Hour of Day of Week", xaxt="n", yaxt="n")

    mtext("Users", side=2, line=4) # put the axis label in the margin

    yticks = axTicks(2) # compute the standard y-axis ticks

    # draw the y-axis, but make it prettier by adding commas

    axis(2, at=yticks, labels=prettyNum(yticks, big.mark=","), las=1)

    #

    polygon(c(1, 1:num_hours, num_hours), c(0, maxs, 0), col="pink", border="red3")

    polygon(c(1, 1:num_hours, num_hours), c(0, avgs, 0), col="lightblue", border="blue3")

    polygon(c(1, 1:num_hours, num_hours), c(0, mins, 0), col="lightgreen", border="green4")

    The day of week charts in Figure 2 and Figure 3 are drawn using polygon commands. The last few lines ofScript2 demonstrates drawing them. Smaller polygons overlay the larger polygons.

    5 Conclusions

    This paper examines some common user metrics mined from a real transaction system. Some characteristics ofeach metric were discussed. Some questions were posed, but with only a few answers given. The distributions ofinterarrival time, session duration, and think time data were shown.

    In the case of think time, the data were used for testing in the new system. However, the test required finer detailthan the available data provided. The inability to reverse-engineer the different types of think times left us guessingas to how to apply it in a test. Normal distributions were applied to the different think time categories with thehope that the accumulation of times would result in a distribution similar to the source data.

    References

    [R D09] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation forStatistical Computing, Vienna, Austria, 2009.

    [Wil10a] Tom Wilson. Principles of Performance Measurement. CMG MeasureIT, June 2010.

    [Wil10b] Tom Wilson. Workload Correlation and Visualization. CMG 10 International Conference, December2010.

    9