Getting started with R when analysing GitHub commits
![Page 1: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/1.jpg)
Getting started with R when analysing GitHub events
Barbara Fusinska
barbarafusinska.com
![Page 2: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/2.jpg)
About me
Programmer
Math enthusiast
Sweet tooth
@BasiaFusinska
https://github.com/BasiaFusinska/RTalk
![Page 3: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/3.jpg)
Agenda
• R ecosystem
• R basics
• Analysing GitHub events
• Data sources
• Code… a lot of code
![Page 4: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/4.jpg)
Why R?
• Ross Ihaka & Robert Gentleman
• Name:
  • First letter of names
  • Play on the name of S
• S-PLUS – commercial alternative
• Open source
• No. 1 for statistical computing
![Page 5: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/5.jpg)
R Environment
• R project
  • console environment
  • http://www.r-project.org/
• IDE
  • Any editor
  • RStudio http://www.rstudio.com/products/rstudio/download/
![Page 6: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/6.jpg)
RStudio
Editor
Console
Environment variables
Plots
Files
Help
Packages
![Page 7: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/7.jpg)
R Basics
![Page 8: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/8.jpg)
Basics - Types
> myChar <- "a"
> myChar
[1] "a"
> typeof(myChar)
[1] "character"

> myNum <- 10
> myNum
[1] 10
> typeof(myNum)
[1] "double"

> # Dynamic
> myNum <- "some text"
> typeof(myNum)
[1] "character"
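The "double" result above can surprise newcomers. A minimal sketch of the distinction between doubles and integers follows; the variable names are illustrative and not taken from the talk.

```r
# Numeric literals are stored as doubles unless marked with the L suffix
myDouble <- 10
myInteger <- 10L

typeof(myDouble)    # "double"
typeof(myInteger)   # "integer"

# The colon operator produces integers
typeof(1:10)        # "integer"

# Both are numeric and compare equal by value
is.numeric(myInteger)
myDouble == myInteger
```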
![Page 9: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/9.jpg)
Vectors
> myVector <- c("a", "b", "c")
> myVector
[1] "a" "b" "c"
> typeof(myVector)
[1] "character"

myVector <- 1:10
myVector <- double(0)

myVector <- c(2, 5:10, 20)
myVector <- letters[1:5]

myVector[5]
![Page 10: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/10.jpg)
Lists
> myList <- list("a", "b", "c")
> myList
[[1]]
[1] "a"

[[2]]
[1] "b"

[[3]]
[1] "c"

> typeof(myList)
[1] "list"
![Page 11: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/11.jpg)
Named elements
> myVector <- c(a="a", b="b", c="c")
> myVector
  a   b   c 
"a" "b" "c" 

> myList <- list(a="a", b="b", c="c")
> myList
$a
[1] "a"

$b
[1] "b"

$c
[1] "c"
![Page 12: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/12.jpg)
Accessing elements
> myVector[1]
  a 
"a" 
> myVector[[1]]
[1] "a"
> myVector['a']
  a 
"a" 
> myVector[['a']]
[1] "a"

> myList[1]
$a
[1] "a"

> myList[[1]]
[1] "a"
> myList['a']
$a
[1] "a"

> myList[['a']]
[1] "a"
> myList$a
[1] "a"
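The difference between single and double brackets is easiest to see by inspecting what each one returns. The sketch below reuses the named list from the slide; `subList` and `element` are illustrative names.

```r
myList <- list(a = "a", b = "b", c = "c")

# Single brackets keep the container: the result is a list of length 1
subList <- myList["a"]
class(subList)    # "list"

# Double brackets (and $) extract the element itself
element <- myList[["a"]]
class(element)    # "character"

# Single brackets can also select several elements at once
length(myList[c("a", "c")])
```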
![Page 13: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/13.jpg)
Data frames
> dataFrame <- data.frame(col1=c(1,2,3), col2=c(4,5,6))
> dataFrame
  col1 col2
1    1    4
2    2    5
3    3    6
> typeof(dataFrame)
[1] "list"
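That `typeof` reports "list" reflects how a data frame is stored: as a list of equally long columns. A short sketch of what follows from that, using the same data frame:

```r
dataFrame <- data.frame(col1 = c(1, 2, 3), col2 = c(4, 5, 6))

typeof(dataFrame)   # "list": the underlying storage
class(dataFrame)    # "data.frame": the class used for printing and indexing

# Because it is a list of columns, list tools work column by column
columnMeans <- sapply(dataFrame, mean)

dim(dataFrame)      # number of rows, then number of columns
str(dataFrame)      # compact overview of the column types
```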
![Page 14: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/14.jpg)
Summary
> summary(dataFrame)
      col1          col2    
 Min.   :1.0   Min.   :4.0  
 1st Qu.:1.5   1st Qu.:4.5  
 Median :2.0   Median :5.0  
 Mean   :2.0   Mean   :5.0  
 3rd Qu.:2.5   3rd Qu.:5.5  
 Max.   :3.0   Max.   :6.0  
![Page 15: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/15.jpg)
Summary statistics
mean(dataFrame$col1)
max(dataFrame$col1)
min(dataFrame$col1)
sum(dataFrame$col1)
median(dataFrame$col1)
quantile(dataFrame$col1)
![Page 16: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/16.jpg)
Filtering vectors and lists
> a <- 1:10
> a[a > 4]
[1]  5  6  7  8  9 10

> select <- function(x) { x > 4 }
> a[select(a)]
[1]  5  6  7  8  9 10

> Filter(select, a)
[1]  5  6  7  8  9 10
![Page 17: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/17.jpg)
Filtering data frames
dataFrame <- data.frame(
  age=c(20, 15, 31, 45, 17),
  gender=c('F', 'F', 'M', 'M', 'F'),
  smoker=c(TRUE, TRUE, FALSE, TRUE, FALSE))

> dataFrame
  age gender smoker
1  20      F   TRUE
2  15      F   TRUE
3  31      M  FALSE
4  45      M   TRUE
5  17      F  FALSE
![Page 18: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/18.jpg)
Filtering by rows
> dataFrame$age[ dataFrame$gender == 'F']
[1] 20 15 17

> dataFrame[2:4, ]
  age gender smoker
2  15      F   TRUE
3  31      M  FALSE
4  45      M   TRUE

> dataFrame[ dataFrame$age < 30, ]
  age gender smoker
1  20      F   TRUE
2  15      F   TRUE
5  17      F  FALSE

> dataFrame[ dataFrame$gender == 'M', ]
  age gender smoker
3  31      M  FALSE
4  45      M   TRUE
![Page 19: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/19.jpg)
Filtering by columns
> dataFrame[, 3]
[1]  TRUE  TRUE FALSE  TRUE FALSE

> dataFrame[, c(1,3)]
  age smoker
1  20   TRUE
2  15   TRUE
3  31  FALSE
4  45   TRUE
5  17  FALSE

> dataFrame[, c(3,2)]
  smoker gender
1   TRUE      F
2   TRUE      F
3  FALSE      M
4   TRUE      M
5  FALSE      F

> dataFrame[, c('age', 'smoker')]
  age smoker
1  20   TRUE
2  15   TRUE
3  31  FALSE
4  45   TRUE
5  17  FALSE
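Note that selecting a single column returned a plain vector, while selecting two columns returned a data frame. A minimal sketch of how to keep the data frame shape and how to combine row and column filters, using the same example data (`asVector`, `asFrame` and `youngSmokers` are illustrative names):

```r
dataFrame <- data.frame(
  age = c(20, 15, 31, 45, 17),
  gender = c('F', 'F', 'M', 'M', 'F'),
  smoker = c(TRUE, TRUE, FALSE, TRUE, FALSE))

# A single column is simplified to a plain vector by default
asVector <- dataFrame[, 3]
is.data.frame(asVector)   # FALSE

# drop = FALSE keeps the data frame structure
asFrame <- dataFrame[, 3, drop = FALSE]
is.data.frame(asFrame)    # TRUE

# Row and column filters can be combined in one expression
youngSmokers <- dataFrame[dataFrame$age < 30 & dataFrame$smoker, c('age', 'gender')]
```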
![Page 20: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/20.jpg)
Goal: Language distribution
![Page 22: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/22.jpg)
Google BigQuery
![Page 23: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/23.jpg)
Language information
• Only Pull Request event types have language information
• Data source: 1 hour of events from 01.01.2015, 3 PM
  • ~11k events
  • ~500 pull requests
![Page 24: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/24.jpg)
Gender bias?
• 4,037,953 GitHub user profiles
• 1,426,121 identified (35.3%)

http://arstechnica.com/information-technology/2016/02/data-analysis-of-github-contributions-reveals-unexpected-gender-bias/

         Open      Closed
Women    8,216     111,011
Men      150,248   2,181,517
![Page 25: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/25.jpg)
![Page 26: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/26.jpg)
Reading data from files - csv
> sizes <- read.csv(sizesFile)
> sizes
   category length width
1         B   20.0   3.0
2         A   23.0   3.6
3         B   75.0  18.0
4         B   44.0  10.0
5         C    2.5   6.0
6         B    7.2  27.0
7         A   45.8  34.0
8         C   12.0   2.0
9         A    5.0  13.0
10        A   68.0  14.5
![Page 27: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/27.jpg)
Reading data from files - lines
> lines <- readLines(sizesFile)
> lines
 [1] "category,length,width" "B,20,3"
 [3] "A,23,3.6"              "B,75,18"
 [5] "B,44,10"               "C,2.5,6"
 [7] "B,7.2,27"              "A,45.8,34"
 [9] "C,12,2"                "A,5,13"
[11] "A,68,14.5"
![Page 28: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/28.jpg)
Writing data to csv file
write.csv(sizes, file=outputFile)
write.csv(sizes, file=outputFile, row.names = FALSE)
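The effect of `row.names = FALSE` shows up when the file is read back. The sketch below uses a two-row stand-in for `sizes` and a temporary file, both of which are assumptions made for illustration.

```r
# Hypothetical data and a temporary file, for illustration only
sizes <- data.frame(category = c("B", "A"), length = c(20, 23), width = c(3, 3.6))
outputFile <- tempfile(fileext = ".csv")

# Default: row names are written as an extra, unnamed first column,
# which read.csv brings back as a column called "X"
write.csv(sizes, file = outputFile)
withRowNames <- names(read.csv(outputFile))

# row.names = FALSE writes only the data columns
write.csv(sizes, file = outputFile, row.names = FALSE)
withoutRowNames <- names(read.csv(outputFile))

unlink(outputFile)  # remove the temporary file
```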
![Page 29: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/29.jpg)
Applying operation across elements
> myVector <- c(1, 4, 9, 16, 25)

> sapply(myVector, sqrt)
[1] 1 2 3 4 5

> lapply(myVector, sqrt)
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

[[4]]
[1] 4

[[5]]
[1] 5
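`sapply` decides on its own whether to return a vector, a matrix or a list. When the result type is known in advance, `vapply` is a stricter alternative; a small sketch with the same vector (`roots` and `aboveFive` are illustrative names):

```r
myVector <- c(1, 4, 9, 16, 25)

# vapply works like sapply, but the type and length of each result are declared
roots <- vapply(myVector, sqrt, numeric(1))

# sqrt is already vectorised, so a direct call gives the same values here
identical(roots, sqrt(myVector))

# An anonymous function works the same way
aboveFive <- vapply(myVector, function(x) x > 5, logical(1))
```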
![Page 30: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/30.jpg)
Read GitHub Archive events
library("rjson")

readEvents <- function(file, eventNames) {
  lines <- readLines(file)
  jsonEvents <- lapply(lines, fromJSON)
  specificEvents <- Filter(
    function(e) { e$type %in% eventNames },
    jsonEvents)

  return(specificEvents)
}
![Page 31: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/31.jpg)
Missing data
# Missing values
> a <- c(1,2,NA,3,4,5)
> a
[1]  1  2 NA  3  4  5

# Checking if missing data
> is.na(a)
[1] FALSE FALSE  TRUE FALSE FALSE FALSE
> anyNA(a)
[1] TRUE

# Setting missing values
> is.na(a) <- c(2,4)
> a
[1]  1 NA NA NA  4  5

# Setting null values
> a <- NULL
> is.null(a)
[1] TRUE
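Missing values also affect the summary functions shown earlier. A brief sketch of the usual ways to handle them, starting from the same vector (`observed` is an illustrative name):

```r
a <- c(1, 2, NA, 3, 4, 5)

mean(a)                 # NA: a missing value propagates through the calculation
mean(a, na.rm = TRUE)   # 3: the missing value is skipped

# Keep only the observed values
observed <- a[!is.na(a)]

# Count the missing entries
sum(is.na(a))
```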
![Page 32: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/32.jpg)
Read pull requests
pullRequestEvents <- readEvents(fileName, "PullRequestEvent")

select <- function(x) {
  id <- x$payload$pull_request$base$repo$id
  language <- x$payload$pull_request$base$repo$language

  if (!is.null(language)) {
    c(ID=id, Language=language)
  } else {
    c(ID=id, Language="")
  }
}

pullRequests <- sapply(pullRequestEvents, select)
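It helps to know what shape `sapply` returns here. The sketch below runs the same `select` function on two hand-made events that contain only the fields it reads; the event values are invented for illustration and do not come from the GitHub Archive data.

```r
# Two hand-made events with only the fields that select() reads (hypothetical values)
events <- list(
  list(payload = list(pull_request = list(base = list(
    repo = list(id = 1, language = "R"))))),
  list(payload = list(pull_request = list(base = list(
    repo = list(id = 2, language = NULL))))))

select <- function(x) {
  id <- x$payload$pull_request$base$repo$id
  language <- x$payload$pull_request$base$repo$language
  if (!is.null(language)) c(ID = id, Language = language) else c(ID = id, Language = "")
}

# Each call returns a named vector of length 2, so sapply binds the results
# into a character matrix with one row per name and one column per event
result <- sapply(events, select)
dim(result)
result["Language", ]
```

Because the result is a matrix with rows named ID and Language, a whole field can be pulled out with row indexing such as `result["ID", ]`.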
![Page 33: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/33.jpg)
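Some solutions
# Both versions assume pullRequests is a list of named vectors (e.g. built with lapply)
for (x in pullRequests) {
  # version 1: grow the data frame row by row (the result must be assigned)
  dataFrame <- rbind(dataFrame, x)

  # version 2: grow one vector per column
  idColumn <- c(idColumn, x["ID"])
  languageColumn <- c(languageColumn, x["Language"])
}

# version 2: build the data frame from the collected columns
dataFrame <- data.frame(
  id=idColumn,
  language=languageColumn)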
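Growing a data frame or vector inside a loop copies the object on every iteration. One common alternative, sketched here under the assumption that the input is a list of named character vectors (the values are hypothetical), is to bind everything in a single call:

```r
# Hypothetical input: a list of named character vectors
pullRequests <- list(
  c(ID = "1", Language = "R"),
  c(ID = "2", Language = "Python"),
  c(ID = "3", Language = ""))

# Bind all vectors into a matrix in one call, then convert once
rows <- do.call(rbind, pullRequests)
dataFrame <- data.frame(
  id = rows[, "ID"],
  language = rows[, "Language"],
  stringsAsFactors = FALSE)
```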
![Page 34: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/34.jpg)
Prepare data
reposLanguages <- data.frame(
  id=pullRequests["ID",],
  language=pullRequests["Language",])

head(reposLanguages)
summary(reposLanguages)
![Page 35: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/35.jpg)
Little look on the data
> head(reposLanguages)
        id   language
1  3542607        C++
2 10391073     Python
3 28668460     Python
4 28608107       Ruby
5  5452699 JavaScript
6 19777872         C#

> summary(reposLanguages)
        id            language  
 28648149: 12   Ruby      : 66  
 28688863:  8   PHP       : 55  
 20413356:  5   Python    : 53  
 28668553:  5             : 51  
 10160141:  4   JavaScript: 47  
 206084  :  4   C++       : 30  
 (Other) :436   (Other)   :172  
![Page 36: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/36.jpg)
Duplicated data
> myData <- c(1,2,3,4,3,2,5,6)
> duplicated(myData)
[1] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
> anyDuplicated(myData)
[1] 5

> unique(myData)
[1] 1 2 3 4 5 6
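The same functions work on data frames, where whole rows are compared. A minimal sketch with a made-up `repos` data frame:

```r
# Hypothetical repository data with two repeated rows
repos <- data.frame(
  id = c(1, 2, 1, 3, 2),
  language = c("R", "Python", "R", "Ruby", "Python"))

# A row counts as duplicated when every column matches an earlier row
rowIsDuplicate <- duplicated(repos)

# unique() keeps the first occurrence of each row
uniqueRepos <- unique(repos)

# Equivalent filter written with duplicated()
sameResult <- repos[!duplicated(repos), ]
```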
![Page 37: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/37.jpg)
Unique repositories data
> reposLanguages <- unique(reposLanguages)
> summary(reposLanguages)
        id            language  
 25994257:  2   Python    : 36  
 28528325:  2   JavaScript: 35  
 10126031:  1   Ruby      : 35  
 10160141:  1   PHP       : 34  
 10344201:  1             : 27  
 10391073:  1   Java      : 22  
 (Other) :297   (Other)   :116  
![Page 38: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/38.jpg)
Distribution tables
> collection <- c('A','C','B','C','B','C')

> oneWayTable <- table(collection)

> oneWayTable
collection
A B C 
1 2 3 

> attributes(oneWayTable)
$dim
[1] 3

$dimnames
$dimnames$collection
[1] "A" "B" "C"
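A table can be indexed by name and converted to shares or to a data frame. A short sketch with the same `collection` vector (`shares` and `asFrame` are illustrative names):

```r
collection <- c('A', 'C', 'B', 'C', 'B', 'C')
oneWayTable <- table(collection)

# Individual counts can be looked up by name
oneWayTable["C"]

# Shares instead of counts
shares <- prop.table(oneWayTable)

# Convert to a data frame (columns: collection, Freq) for further processing
asFrame <- as.data.frame(oneWayTable)
```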
![Page 39: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/39.jpg)
Language distribution
> languages <- table(reposLanguages$language)
> head(languages)

             ActionScript     Bluespec            C 
          27            1            1            9 
          C#          C++ 
          11           20 

> languages <- sort(languages, decreasing=TRUE)
> head(languages)

      Python   JavaScript         Ruby          PHP 
          36           35           35           34 
                     Java 
          27           22 
![Page 40: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/40.jpg)
Recognised languages
reposLanguages <- reposLanguages[reposLanguages$language != "",]

languages <- table(reposLanguages$language)
languages <- sort(languages, decreasing=TRUE)
![Page 41: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/41.jpg)
Language names
> languagesNames <- names(languages)
> languagesNames
 [1] "Python"       "JavaScript"   "Ruby"
 [4] "PHP"          "Java"         "C++"
 [7] "CSS"          "C#"           "C"
[10] "Go"           "Shell"        "CoffeeScript"
[13] "Objective-C"  "Puppet"       "Scala"
[16] "Lua"          "Rust"         "Clojure"
[19] "Emacs Lisp"   "Haskell"      "Julia"
[22] "Makefile"     "Perl"         "VimL"
[25] "ActionScript" "Bluespec"     "DM"
[28] "Elixir"       "F#"           "Haxe"
[31] "Matlab"       "Swift"        "TeX"
[34] ""
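The empty name at position 34 is still listed even though the rows without a language were filtered out. A likely explanation, assuming the language column is a factor (the `data.frame` default before R 4.0), is that filtering rows does not remove factor levels. The sketch below reproduces this with a small made-up factor and shows `droplevels` as one way to discard the unused level.

```r
# Hypothetical factor column with one empty language
languageColumn <- factor(c("Python", "", "Ruby", "Python"))

# Removing the "" entries does not remove the "" level
recognised <- languageColumn[languageColumn != ""]
countsBefore <- table(recognised)   # "" is still listed, with a count of 0

# droplevels() discards levels that no longer occur
recognised <- droplevels(recognised)
countsAfter <- table(recognised)
names(countsAfter)
```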
![Page 42: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/42.jpg)
Plotting
languages2Display <- languages[languages > 5]
barplot(languages2Display)
![Page 43: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/43.jpg)
Summary
• GitHub Archive
• Introduction to R
• Data types
• Filtering
• I/O
• Applying operations
• Missing values & duplicates
• Binding data
• Distribution tables
• Plotting (barplot)
![Page 44: Getting started with R when analysing GitHub commits](https://reader036.fdocuments.in/reader036/viewer/2022062400/58a773f91a28ab99238b6359/html5/thumbnails/44.jpg)
Thank you
[email protected]
@BasiaFusinska
barbarafusinska.com
https://github.com/BasiaFusinska/RTalk
Questions?